Productivity in data science isn’t a matter of output in any quantitative sense. It’s more an issue of the quality of what data scientists produce.
In a data-science context, quality refers to the validity and relevance of the insights that statistical models are able to distill from the data. As I stated recently, the more data you have, the more stories data scientists can tell with it, though many of those narratives may be entirely (albeit inadvertently) fictitious.
Given the paramount importance of actionable insights, the productivity of data science teams can’t be neatly reduced to throughput or any other quantitative metric. Data scientists can easily pump up their aggregate output along myriad dimensions, such as more sources, more data, more pipeline processes, more variables, more iterations, and more visualizations. But that doesn’t necessarily get them any closer to delivering high-quality analytics for predictive, prescriptive, and other uses.
Likewise, you can’t always assume that throwing more data scientists at a problem will boost the quality of their collective output. Different data scientists may rely on different data sources; aggregate, cleanse, and sample them in different ways; incorporate different feature sets into their models; use different algorithmic and visualization approaches; employ different metrics of model fitness and predictive capability; and so on. A lack of standardized approaches and consistent documentation may frustrate third parties' efforts to determine whose results, if anyone's, are most valid when several data scientists work the same problem.
As the number and variety of data scientists at work on a problem grows, they may be surfacing more spurious correlations than causal insights. And if they use different methodologies to produce their various results, it may get more difficult to identify who, if anyone, has delivered the highest-quality insights.
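The multiple-comparisons effect behind this is easy to demonstrate. The following is a minimal sketch (names and thresholds are my own, not from the post): generate a purely random target and a thousand purely random candidate features, and count how many features correlate with the target strongly enough to look "significant" at a small sample size.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed with the stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n_samples, n_features = 30, 1000

# Target and features are all independent Gaussian noise: by construction,
# every correlation found below is spurious.
target = [random.gauss(0, 1) for _ in range(n_samples)]

spurious = 0
for _ in range(n_features):
    feature = [random.gauss(0, 1) for _ in range(n_samples)]
    # |r| > 0.35 looks impressively "predictive" at n=30, yet arises by chance.
    if abs(pearson(feature, target)) > 0.35:
        spurious += 1

print(f"{spurious} of {n_features} random features 'correlate' with the target")
```

Run it and a few dozen of the thousand random features clear the bar. Scale the candidate pool up, as large teams of data scientists effectively do, and the count of impressive-looking accidents grows with it.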
If you throw “citizen data scientists” into the mix, the quality problem can easily deteriorate unless you implement safeguard procedures (to be discussed later in this post). Citizen data scientist refers to the new generation of statistical explorers who lack the traditional academic and work backgrounds of established data scientists. Typically, these newbies are self-taught, self-starting, self-sufficient, and use self-service cloud-based statistical modeling and data engineering tools. They tend to use idiosyncratic methods, which may frustrate subsequent efforts by others to assess the validity of their findings.
So what happens when you automate the work of all these data scientists, both established and nouveau? And what if you democratize the data science field even further by providing anyone who wants to be a citizen data scientist with the self-service, cloud-based, open analytics tools and data they need to get going? Well, of course, the sheer quantity of data science artifacts being produced will grow by leaps and bounds. But what about the quality? Overwhelmed by a glut of predictive analytics, machine-learning models, deep-learning applications, and other fruits of this new age, it’s not clear how we will distinguish the junk from the output that has value.
There’s no stopping the automation of the data science field. Commercial tools—for both established and citizen data scientists—are enabling automation of many pipeline tasks relating to data engineering (e.g., cleansing, normalization, skewness removal, transformation) and modeling (champion model selection, feature selection, algorithm selection, fitness metric selection). Some of the key pipeline processes that will continue to require manual methods (to varying degrees) include data-engineering tasks such as cluster analysis and exception handling, as well as data-modeling tasks such as feature engineering and missing-data imputation.
How will data science teams maintain quality standards in the face of advancing automation? To some degree, manual quality-assurance methods will remain essential. At the very least, data science teams need to review the automated output of their tools to ensure the validity and actionability of their results. This is analogous to how high-throughput manufacturing facilities dedicate personnel to test samples of their production runs before they're shipped to the customer.
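The manufacturing analogy translates directly into a sampling step in the deployment workflow. A minimal sketch, with hypothetical names (`model_outputs`, `sample_for_review`) and a review rate chosen purely for illustration:

```python
import random

def sample_for_review(model_outputs, rate=0.05, seed=0):
    """Select a random fraction of automated outputs for human QA review."""
    rng = random.Random(seed)  # fixed seed makes the audit sample reproducible
    k = max(1, int(len(model_outputs) * rate))
    return rng.sample(model_outputs, k)

# Stand-in for a batch of automated scoring results.
model_outputs = [{"id": i, "score": random.random()} for i in range(1000)]
to_review = sample_for_review(model_outputs, rate=0.05)
print(f"Reviewing {len(to_review)} of {len(model_outputs)} outputs")
```

In practice the sampling would be stratified—oversampling low-confidence or high-impact predictions—but even a flat random sample gives reviewers a defensible quality signal without inspecting every result.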
With respect to the output (manual and automated) of citizen data scientists in the enterprise, established data scientists will need to perform manual reviews prior to putting those assets into production. I discussed this latter requirement in detail in this recent post.
The practical limits of data-science automation are qualitative. We dare not automate these data-science pipeline and development processes any further than we can vouch for the quality of their output. And that involves having human data scientists and/or subject matter experts manually review algorithmically generated data-driven results before they're put into production.
Without that manual review step, the risks of entirely automating the data-science pipeline may prove unacceptable to society at large. Considering how thoroughly the algorithmic fruits of data science are being implemented in our lives, this is the essence of prudence.
If you’re a working data scientist, data engineer, or data application developer, register here to attend the IBM DataFirst Launch Event on Tuesday, September 27 in New York. Engage with open-source community leaders and practitioners and learn how to drive greater productivity from your data science teams without compromising the quality of the mission-critical business assets they produce.