And yet, data science (and data engineering) is still a young field that is not easily parsed by recruiters, managers who make departmental budgeting decisions, or even data scientists themselves. Add into the mix the latest buzz about artificial intelligence and deep learning, and there are a lot of misconceptions about what data science is and what data scientists do. Here are four common data science myths I've run into over the past several years.
1) The Data Scientist Is a Swiss Army Knife
Many companies believe that a data scientist is some kind of unicorn who can write production-quality engineering code, produce PhD-level statistical analysis, create Pulitzer Prize-winning visualizations, and understand the business logic underpinning it all, with equal aplomb.
It's true that, in theory, data science is an interdisciplinary job where tech and business knowledge meet so that data can be processed and made sense of.
But the high expectations stemming from the original data science Venn diagram have led companies and recruiters to believe that unless a candidate can write Java code AND do Bayesian methods AND create charts in D3, then the person is no good.
In reality, data scientists' skills tend to be T-shaped, meaning they're broadly exposed to a variety of skills and tools but focus on a few. Most practitioners fall somewhere on the spectrum between what's known as A and B - analysis and building - and are better either at creating pipelines or at working with existing pipelines and surfacing insights.
What kind of role your business needs will depend on what you need to do in a data pipeline. All these roles are slightly different and require different levels of skill and expertise, depending on the kind of data you have and what kind of shape it's in.
I haven't seen anyone who is good at everything in the diagram. Either the practitioner knows stats well, or they grok distributed systems, or they can present well to executives, or they manage some combination of two of these. Someone who excels at all three is unaffordable and unnecessary at most companies.
2) A Data Scientist Works Alone
Along with the myth that a good data scientist knows everything, there is a myth that any given organization needs one data scientist, working in isolation, to do everything.
Since data scientists specialize (like most people do), many businesses will need a solid team of people who complement each other's skill sets if they hope to build a successful data product or analysis pipeline.
Ideally, you'll have a data engineer, a data scientist, and a product owner/UX specialist (this role includes communicating with the business, documentation, and architecture), or some variation of those three roles, on a team of no more than five people.
A data scientist needs to be in touch with other parts of the business as much as possible, sitting in on business meetings and trying to understand where the questions about the business are coming from. Organizations that see data scientists as an asset, instead of putting them in a corner and handing them one-off questions to answer, are organizations that do well.
3) All Businesses Need a Data Scientist
This brings me to the question of whether companies need a data scientist at all. Organizations typically hire a data scientist because they want to gain value from their datasets by answering such questions as: Should we build out Product X? Will City Y be a competitive market for us? How many widgets do we make each day, and what is the optimal quantity?
If a business can't figure out whether it has enough data available to answer such questions, and can't determine whether the data is well-organized enough to support the research, then a data scientist won't be able to do so either. Or rather, a data scientist will be able to produce answers, but the person will spend most of his or her time in the role doing janitorial data anthropology, which is extremely important work, but not why you hired a data scientist.
If that's the kind of work your business needs, make it clear up front. If you don't, your data scientist, who was expecting to solve important business problems, will become extremely frustrated trying to figure out why he or she is labeling training data for the fifth week in a row. Instead, hire a data engineer or architect to make sense of and organize your data before the data scientist steps in.
4) You Need a Data Scientist to Work on Deep Learning with Big Data
This is the most pervasive myth of the hype cycle. Just as you don't always need a software developer to work with Go and microservices on containers, you don't always need a machine learning engineer specializing in Julia to work on neural nets.
Although many companies don't understand what they need from a data scientist, I often see employment ads for data scientists who hold a PhD in statistics or machine learning. Unless your company is doing groundbreaking research for SpaceX or CERN, you probably don't need a PhD. Most problems revolve around simple questions such as these:
- How do we understand our customers better?
- How do we get more people to click on thing X or Y?
- Should we move our data to Hadoop? How? And how much?
- How do we count - and increase - the number of products we sell?
- We have data in two places; how do we get it into one place?
An academic will understand the deep statistics, but will not necessarily understand the business problem you're trying to solve, unless they've had previous industry experience. To answer these questions, it's important to have someone who has a breadth of experience across industries, or a depth of experience in a single industry (e.g., expertise in ecommerce or healthcare). You also don't always need complicated algorithms to solve routine business problems.
As a point of comparison: I find that many companies that are considering entering the world of big data express interest in implementing a Hadoop cluster. After all, it's trendy. But for many companies, it's also unnecessary. The same can be said of hiring a highly credentialed data scientist. Hands-on technical experience may be of far greater value than impressive academic credentials.
In general, the hype cycle around data science has made it seem much more glamorous and complicated than the work actually is, and it's important to separate the real work that needs to be done from the haze of buzzwords. Within the data industry and the data science profession, it's important to clear up the misconceptions so we can get to the real work of solving business problems.