By Sandro Saitta, Nestlé Nespresso.
Editor's note: This blog post was an entrant in the recent KDnuggets Automated Data Science and Machine Learning blog contest, where it received an honorable mention.
Data Science automation has been a hot topic recently, with several articles about it (here and here for example). Most of them discuss the so-called "automation" tools (see here and here). Too often, vendors claim that their tools can automate the Data Science process. This gives the impression that combining these tools with a Big Data architecture can solve any business problem.
The misconception comes from confusing the whole Data Science process (see for example CRISP-DM) with the sub-tasks of data preparation (feature extraction, etc.) and modeling (algorithm selection, hyper-parameter tuning, etc.), which I call Machine Learning. This issue is amplified by the recent success of platforms such as Kaggle and DrivenData. Competitors are provided with a clear problem to solve and clean data. Choosing and tuning a machine learning algorithm is the main task. Participants are evaluated using metrics such as test set accuracy. In industry, data scientists are evaluated on the value added to the business rather than on algorithm accuracy. A project with 99% classification accuracy that isn't deployed in production brings no value to the company.
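To make concrete which part is easy to automate, here is a minimal sketch of hyper-parameter tuning by grid search. Everything in it is an assumption made for illustration, not something taken from the tools or competitions mentioned above: the toy one-dimensional data, the simple k-nearest-neighbours classifier, and the function names `knn_predict`, `accuracy` and `grid_search` are all hypothetical.

```python
def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k nearest training points."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(train, valid, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, candidate_ks):
    """Try every candidate k and return (best_k, best_accuracy)."""
    scores = {k: accuracy(train, valid, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first best candidate wins ties
    return best_k, scores[best_k]


# Hypothetical toy data: (feature, label) pairs, with one mislabeled point at 1.6.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (1.6, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
valid = [(0.5, 0), (1.5, 0), (3.5, 1), (4.5, 1)]

best_k, best_score = grid_search(train, valid, [1, 3, 5])
print(best_k, best_score)
```

The loop needs a clean dataset, a fixed metric and a list of candidates, which is exactly what a competition provides. Nothing in it decides which problem to solve, where the data comes from, or whether the result is ever deployed.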
I recently read how the winner of a Kaggle competition, Gert Jacobusse, spent his time solving the challenge: “I spent 50% on feature engineering, 40% on feature selection plus model ensembling, and less than 10% on model selection and tuning”. This is very far from what I have experienced in industry, where the split is usually closer to 10% for data preparation and modeling and 90% for the rest. I will explain below what I mean by “the rest”. People with no industry experience who read news about tools that automate Data Science and about Data Science competitions may be confused and think that Data Science is only modeling and can be fully automated.
Back in 2006, 2010 and 2013, I discussed Data Science automation on my blog. I listed the different Data Science steps and discussed which ones can be automated. The most complex and time-consuming tasks, such as defining problems to solve, getting data, exploring data, deploying the project, and debugging and monitoring solutions, can't be fully automated. In a recent study from MIT, researchers said their tool bested more than 600 teams out of 900. Wow! But wait, what's the benchmark? Clearly defined, closed-world Kaggle competitions. Such challenges don't represent the heart of data scientists' activities. It's not that the available tools are useless; on the contrary, they can free up time for the data scientist. Still, they don't automate Data Science.
Don't get me wrong: Kaggle and the like are really good places to start learning about Machine Learning algorithms, and they will certainly improve your feature engineering and modeling skills. However, you won't learn the main aspects of Data Science within these competitions: business problem definition, data gathering and cleaning, deployment, stakeholder management, email communications, presentation skills… well, “the rest”. A recent article mentions that data science will be automated within a few years. Machine Learning, as defined above, can be automated, but we are far from automating the whole Data Science process. Even for Machine Learning, we need specialists to develop new algorithms adapted to our business challenges, people who will move the field forward. The main reason Data Science is difficult to automate is that business challenges are by definition ill-posed, open-world problems. To the often-asked question "Will machines replace Data Scientists?", my answer is "Yes, just after every other job in the world".
Sandro Saitta is currently a Data Scientist at Nestlé Nespresso. He holds a Master's and a PhD in Computer Science from EPFL. He has experience in applying Predictive Analytics in industries such as Civil Engineering, Telco, Security Solutions and Travel. He blogs at dataminingblog.com and is co-creator of the Swiss Association for Analytics.