Are Data Scientists Real Scientists?

TL;DR—No.

***

Data scientist is The Sexiest Job of the 21st Century, said Harvard Business Review in 2012.

Data Scientist vs Data Engineer Google Trends
Google Search Trends for Data Scientist vs. Data Engineer
Borrowed from the Analytics Association of the Philippines (AAP)

Seven years later, and we haven’t quite worked it out: Are data scientists real scientists? What do they really do? Is the work really as sexy as it sounds?

Much like Fatal Attraction, we begin our data science journey with doe-eyed infatuation and all kinds of butterflies—until we realize that data science is a psycho bitch who kidnaps our kid, boils our pet bunny (fluffy fur and all), and hacks us bloody with a blunt kitchen knife; we try to drown her in a tub but she…Just. Won’t. Die.

But I’m getting way ahead of myself.

Basically, the point I’m trying to make is that everyone wants to be a data scientist (or hire one) without fully understanding what data science is and what we really want out of it.

I think Dan Ariely, a psychology and behavioral economics professor at Duke University, hits the nail on the head:

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…”

Part of the reason could be that data science is still in its confused adolescence. The “data scientist” job title has only been around since 2008, and the field has a ways to go in terms of defining and organizing itself.

It certainly doesn’t help that the majority of data projects fail (see: 85-Percent of Big Data Projects Fail) and companies don’t know how to utilize data science talent (see: Why Data Scientists Are Leaving Their Jobs).

There are two issues I want to address: first, that data science has more to do with algorithms and statistical techniques than formal scientific work; and second, that most data scientist jobs are less sexy than people imagine.

Science, schmience

Neil deGrasse Tyson and Bill Nye Do You Even Science Meme
Do you even??

The dividing line between data science and real science is research methods: if you design experiments, and propose and test formal hypotheses, your work is closer to a scientific role. Even generalizing conclusions from empirical data using algorithms can qualify as science when used to augment research.

However, let’s not confuse that with people who draw pretty charts and run Python or R scripts for a living, without any research involved.

This isn’t to be exclusionary or pedantic about what a “real scientist” is, but the abuse of the term “data scientist” for recruitment, staffing, or marketing purposes does place a stain on the practice.

Let’s get real and call a spade a spade.

The majority of a data scientist’s working hours are spent on arguably the least scientific parts of the job: cleaning and organizing data (60%) and collecting data sets (19%).

Incidentally, these are also the least enjoyable parts of the job.

What data scientists spend the most time doing infographic
Grab a magnifying glass to see the “science” part
Borrowed from the Analytics Association of the Philippines (AAP)
What is the least enjoyable part of data science infographic
Real data science isn’t as sexy as it sounds
Borrowed from the Analytics Association of the Philippines (AAP)

The sexy “science” stuff comes much, much later.

More marriage than sex

The idea of data science is sexy—especially when we hear stories about the Ubers, Netflixes, and Amazons of the world. But real data science work is a lot less so.

When data scientists take on jobs and companies hire data scientists, the expectation is to produce mind-blowing insight from vast amounts of data (read: Big Data). Truth be told, the road to data-driven disruption is not as straightforward as many think.

It takes a lot of work to get data that works.

Before a data scientist can even arrive at said mind-blowing insight, they need to make it through the mundane: hours and hours of ensuring the data sets are available and prepped for analysis, scripts are thoroughly debugged, and libraries are compatible—not to mention the professional Googling and GitHubbing involved before we can find the right algorithms.

As the charts above show, data scientists actually spend most of their time on the least enjoyable parts of the job (see comparison below).

Tasks data scientists spend the most time on vs Least enjoyable tasks
Oh, so it’s just like a real job…*sad*

Anyone looking to pursue data science or employ a data scientist ought to consider the work, especially the mundane, mind-numbing aspects of it, before leaping. If you’re not crazy about data, data science will drive you crazy.

The Future of Data Science

This isn’t to discourage aspiring “data scientists”—quite the contrary. As the global ocean of data expands, we need more data experts with the skills and knowledge to navigate it.

That said, we still have a ways to go in terms of defining the whos, hows, and whats of data work. In the process of figuring it all out, we should also avoid sugar-coating data job titles lest we get disappointed (this goes for both employers and potential employees).

A decade from now, we may well witness the extinction of the data scientist. Data jobs will only get increasingly specific, and catchall data scientists (the ones who juggle 8 different programming languages and 3 different job functions) may no longer be enough to meet the narrower and deeper demands of future data work. This is the same reason we rarely see job openings for “computer expert” or “business manager” anymore.

As data continues to stretch the limits of our imagination, we can’t possibly expect a handful of people to handle all of it.

Not even real scientists do that.

***

If you’re less concerned about job titles and more concerned about real, applicable skills, why not book a Business Analytics Masterclass? It’s not exactly science, but it works!

Data Scientist or Know-It-All?

The importance of domain expertise in data practice.

There’s a short yet wonderful story that perfectly encapsulates how many of today’s businesses use data. It’s a simple parable that all of us can learn from, regardless of our background, level of experience, or field of practice. In fact, if you’re already working with data, you might have a similar story to share. It goes like this:

A data scientist holds up a chart.
Everyone believes him.
End of story.

In today’s data-supercharged world, data is the law and the data practitioner is taken as the de facto expert. Ignore the fact that Ben just got hired last week: he has an MA in Statistics and a PhD in Machine Learning, so he must have all the answers, right?

(To clarify, said Ben is a hypothetical person. If you happen to know a Ben or are one, we apologize in advance. If you happen to be a woman, please don’t take our usage of a traditionally male name as a vote in favor of the patriarchy. This is purely for emphasis. We support all women, especially women in data. With everything cleared up, let’s get back to the matter at hand…)

Of course people will listen to the data guy. Numbers are compelling, especially when presented in chart form. Who are we mortals to question an interactive, multi-colored bubble chart? What power does one man hold over a regression line with an R-squared well over 0.90?

In the modern boardroom, data is gospel truth. Everything else is mere conjecture.

We seldom stop to consider whether the data is flawed, or if the data guy understands the subject matter enough to draw insights or conclusions. Maybe the regression model is accurate, but what if it uses the wrong variables, or maps out the wrong features? What if the chart displays absolute figures in places where a logarithmic scale is more appropriate? What if the time series shows periods that are either too long or too short? What if the final analysis is inconsequential to the use case at hand?
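To see how a model can look accurate while using the wrong variables, here is a minimal sketch in plain Python. The two series are made-up numbers for illustration only (the names `umbrella_sales` and `support_tickets` are hypothetical); they share nothing except an upward trend, yet a one-variable linear fit between them still scores a high R-squared.

```python
def r_squared(xs, ys):
    """R-squared of a simple linear regression of ys on xs.

    For a one-variable fit with an intercept, this equals the
    squared Pearson correlation between xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Made-up monthly figures for two unrelated quantities that both
# happen to grow over time (illustrative data, not real measurements).
umbrella_sales = [20, 24, 29, 33, 38, 41, 47, 52, 55, 61]
support_tickets = [3, 5, 6, 8, 11, 12, 14, 15, 18, 19]

score = r_squared(umbrella_sales, support_tickets)
print(f"R-squared: {score:.2f}")
```

The score comes out well above 0.90, but nothing in the arithmetic says whether umbrellas have anything to do with support tickets. That judgment has to come from someone who knows the subject matter.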

When all is said and done, data can be just as flawed as the people who work with it.

Consider the infamous case of NASA’s $125 million Mars Climate Orbiter. A simple conversion mishap, the failure to convert pound-force (lbf) to newtons (N), had the spacecraft flying within 37 miles of the Martian surface, dangerously below the 53-mile minimum. What followed was an epic fail of astronomical (no pun intended) proportions: Mars’ atmospheric friction burned the poor thing to a crisp before hurling its ashes deep into a cratery abyss.

Eyewitness reports allege the fire started with a contentious bar chart

Crash and burn—or, rather, burn then crash.

Mind you, this blunder happened with Lockheed Martin’s and NASA’s top brass, arguably the best domain experts in their respective fields, on the job. If even they can make mistakes like this, what makes us think we regular folk are exempt?
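To get a feel for how large that unit mismatch is, here is a rough sketch of the conversion. The conversion factor is the standard definition of the pound-force; the function name `lbf_to_newtons` and the thrust figure are hypothetical and are not taken from the actual mission software or data.

```python
# 1 lbf is defined as 0.45359237 kg * 9.80665 m/s^2 = 4.4482216152605 N
LBF_TO_NEWTONS = 4.4482216152605


def lbf_to_newtons(force_lbf):
    """Convert a force in pound-force (lbf) to newtons (N)."""
    return force_lbf * LBF_TO_NEWTONS


reported = 10.0                     # hypothetical value, produced in lbf
correct = lbf_to_newtons(reported)  # what the value means in newtons
misread = reported                  # the same number, wrongly read as newtons

print(f"correct: {correct:.2f} N, misread: {misread:.2f} N")
# correct: 44.48 N, misread: 10.00 N
print(f"off by a factor of {correct / misread:.2f}")
# off by a factor of 4.45
```

A number read in the wrong unit is off by a factor of roughly 4.45, and nothing about the number itself gives the error away.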

The next example is more down-to-earth… literally.

Applying domain knowledge could be as simple as choosing between a FIFO (first-in-first-out) and LIFO (last-in-first-out) approach, as detailed in this SuperDataScience podcast.

To explain FIFO and LIFO briefly: If element A arrives first, B second, and C last, FIFO dictates that they leave in that same order. LIFO is the complete opposite, wherein the last element, in this case C, leaves first, followed by B then A.
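The A, B, C example can be sketched in a few lines of Python. This is only an illustration; the function names `ship_fifo` and `ship_lifo` are made up here, and a real inventory system would track far more than arrival order.

```python
from collections import deque


def ship_fifo(arrivals):
    """Ship items in the order they arrived (first in, first out)."""
    queue = deque(arrivals)
    shipped = []
    while queue:
        shipped.append(queue.popleft())  # take from the front of the line
    return shipped


def ship_lifo(arrivals):
    """Ship the most recent arrival first (last in, first out)."""
    stack = list(arrivals)
    shipped = []
    while stack:
        shipped.append(stack.pop())  # take from the top of the pile
    return shipped


arrivals = ["A", "B", "C"]  # A arrives first, C last
print(ship_fifo(arrivals))  # ['A', 'B', 'C']
print(ship_lifo(arrivals))  # ['C', 'B', 'A']
```

The code for either approach is trivial. Knowing which one the business actually needs is the part that takes domain knowledge.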

As you might already predict, the “right” choice varies greatly among industries.

For example, a business dealing in perishable goods like vegetables or fresh meat might prefer a FIFO approach, wherein an earlier element, say Monday’s shipment, is sent out before Tuesday’s or Wednesday’s. Conversely, a steel manufacturer may opt for convenience and use a LIFO approach wherein the steel bars at the top of the pile (i.e. the last ones in) get shipped out first. Caveat: we are experts in neither the perishable goods nor steel industries, so this is, again, purely for illustration purposes.

Yes, data skills can be applied to nearly every domain. However, we cannot discount the fact that data practitioners need domain expertise in order to truly be effective (or at least to avoid $125 million blunders). Data in retail can differ from data in healthcare or economics or agriculture or any other industry.

This is no different from other jobs. Just as we demand industry experience from management professionals and sub-specializations from doctors and engineers, we need to push for domain expertise and domain knowledge in the data practice.

What does this entail?

For the data practitioner, this means building years of experience and knowledge in a specific domain. Go deep rather than broad.

For the company looking to fill a data position, this means hiring a data practitioner with an industry background, or grooming one from the existing workforce (the second is an option we highly encourage).

For schools and institutions offering data courses, this means creating industry-specific courses and tracks, or encouraging students to pursue a minor in a field of interest.

Parting thoughts

Data does not exist in a vacuum and neither do data experts. To make data impactful, we need to encourage data practitioners to look beyond the spreadsheet and out into the real world.

Doing so might just save all of us from another epic crash and burn.