What Makes a Data Scientist - Part 1

Really, what is a Data Scientist?

Posted by Seth Dobrin on November 9, 2015

Reposted from LinkedIn.

In conversations about data and analytics, the new fad is hiring Data Scientists. But I find that most people don’t know what that means, and most of the people we call Data Scientists are not. Typically, individuals who are called Data Scientists fall into one of three camps. The first and least prevalent is the real Data Scientist. The second camp is composed of database engineers, architects, or data analysts. And the third is composed of statisticians and/or mathematicians. That’s not to say that the two latter camps can’t become Data Scientists, but to do so requires a willingness to learn and take risks, as well as an investment of time by the individual and her employer.

So what is a Data Scientist? She is an individual who possesses domain knowledge in a relevant area and a spectrum of skills and experiences that range from raw data architecture to what I will call data journalism. Perhaps even more importantly, this breadth of skills and experiences needs to be with modern, relevant technologies and techniques. In future posts I will dig into what each of these skills and experiences means and why some combination of them is required. The specific categories are as follows: database architecture & engineering, data intake & transformation, ontology & metadata, statistical transformation & analytics foundation, descriptive analytics & reporting, model-based analytics, and finally data journalism. Data Scientists are the data and analytics version of an Agile Development team member with E-shaped skills. This means she should have a depth of skills in 2-3 areas, a breadth of skills that spans most of the gamut listed above, and the ability to execute. If you have separate data and analytics teams in your organization, you need Data Scientists on both teams. The depth of her skills in descriptive analytics & reporting determines which group she belongs in, a data team or an analytics team.

As mentioned above, the skills and experiences need to be with relevant and modern technologies and techniques. What do I mean by this? A Data Scientist possesses an understanding of legacy-type platforms, such as relational database management systems (RDBMS) and data warehouses (DW), and she is able to manage processes such as extract, transform, and load (ETL). Additionally, she will have a foundation in traditional supervised and unsupervised statistical methods. While these are valuable skills and an important part of her toolbox, they are only the beginning. Her toolbox should also include an understanding of, and depth of skill in, the so-called “Not only SQL”, or NoSQL, platforms. These include columnar data stores, document stores, key-value stores, graph databases, and finally multi-model databases. More modern analytics skills include techniques such as machine learning, neural networks, and operational research. To leverage most of these modern tools, she needs to know at least one programming language, and preferably more, such as Python, R, Java, or Scala. All of these skills and experiences should be on web-scale, cloud-based environments.

Wow, what a list and what expectations! It is important to remember that very few people, if any, will master the entire spectrum of skills, platforms, and techniques. What is important is that anyone you call a Data Scientist has a breadth of skills and experiences that spans from data architecture to data journalism, with depth in at least 2-3 areas. As technologies change, it is important that she be a learner so she can maintain her understanding of relevant technologies.
