Automated machine learning is getting more and more exposure, yet there still seems to be some confusion about what it actually is. Is it the same thing as automated data science?
Let's start by looking at what data science and machine learning are, as they are defined independently of one another.
Data science is the application of the scientific method to the very broad concept of extracting knowledge and insight from data. Just think about how broad and inclusive this description is. It is so inclusive, in fact, that there is no agreed-upon definition or scope for data science. Sure, lots of attempts at both can be found, and there may be some agreement on where the very loose boundaries are, but no one will convince me that there exists a definition, to any notable degree of specificity, that a majority of data scientists would accept. Also, what exactly is a data scientist?
Machine learning, according to Tom Mitchell, is "concerned with the question of how to construct computer programs that automatically improve with experience." Slightly more formally, a computer program is "said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (again, Mitchell).
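To make Mitchell's E, T, and P concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration (the hidden rule, the `fit_threshold` learner, the example data); it is not drawn from Mitchell or from any library. The task T is deciding whether an integer is at least 50, the experience E is a growing list of labelled examples, and the performance measure P is accuracy on a fixed test set.

```python
def fit_threshold(examples):
    """Learn a cut-off from labelled (x, label) pairs: the midpoint between
    the largest x labelled 0 and the smallest x labelled 1."""
    largest_negative = max(x for x, label in examples if label == 0)
    smallest_positive = min(x for x, label in examples if label == 1)
    return (largest_negative + smallest_positive) / 2


def accuracy(threshold, test_set):
    """Performance measure P: fraction of test points classified correctly."""
    correct = sum((x >= threshold) == bool(label) for x, label in test_set)
    return correct / len(test_set)


# Task T: label each integer 0..99 by a rule the learner never sees (x >= 50).
test_set = [(x, int(x >= 50)) for x in range(100)]

# Experience E: labelled examples that arrive over time (hypothetical data).
experience = [(5, 0), (75, 1), (20, 0), (65, 1), (40, 0), (56, 1), (49, 0), (51, 1)]

scores = []
for n in range(2, len(experience) + 1, 2):
    threshold = fit_threshold(experience[:n])
    scores.append(accuracy(threshold, test_set))
    print(f"examples seen: {n}  threshold: {threshold}  accuracy: {scores[-1]}")
```

With this particular data the accuracy climbs from 0.9 to 1.0 as more examples are seen, which is all the definition asks for: performance at T, as measured by P, improves with E.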
I (and many others) would argue that machine learning is a relatively well-defined and comparatively narrow field, while "data science" is most definitely not. Simply adding the term "automated" in front of these two separate, distinct concepts does not somehow make them equivalent. Machine learning and data science are not the same thing, just as automated machine learning is not the same thing as automated data science. Machine learning is but one of many tools that a data scientist has at their disposal.
Switching gears, automated machine learning is "the automation of automating automation." Silly wordplay? Not at all. Comparing "regular" computer programming to machine learning, Sebastian Raschka has said that computer programming is about automation, and machine learning is "all about automating automation." If that's true, then automated machine learning is "the automation of automating automation." Programming relieves us by managing rote tasks; machine learning allows computers to learn how to best perform these rote tasks; automated machine learning allows for computers to learn how to optimize the outcome of learning how to perform these rote actions.
Since, in my view, the practice of machine learning comes down to two overarching tasks, a restrained practical definition would put the automated optimization of 1) feature engineering and/or selection and 2) hyperparameters at the core of automated machine learning. Semantically, the training of a machine learning model, while the result of these automated steps, is incidental to the automated machine learning process, and automated steps such as model evaluation and model selection are ancillary to the core. See Figure 1.
Figure 1. The automated machine learning process (core automation processes in red, ancillary processes in yellow)
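As a rough illustration of those two core tasks, the sketch below searches jointly over feature subsets and a single hyperparameter (k for a k-nearest-neighbours classifier), scoring each configuration with leave-one-out cross-validation. It uses only the Python standard library, and the toy data, the helper names (`knn_predict`, `loo_accuracy`), and the exhaustive search strategy are all assumptions chosen for brevity. Real tools search far larger spaces with smarter strategies.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical toy data: column 0 is misleading noise, column 1 carries the signal.
X = [
    (10, 1.0), (30, 1.2), (50, 0.8), (70, 1.1),  # class 0
    (11, 3.0), (31, 3.2), (51, 2.8), (71, 3.1),  # class 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def knn_predict(train_X, train_y, query, k, features):
    """Majority vote among the k nearest training points, using squared
    Euclidean distance computed on the selected feature columns only."""
    def dist(row):
        return sum((row[f] - query[f]) ** 2 for f in features)

    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(k, features):
    """Leave-one-out cross-validation accuracy for one configuration."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += knn_predict(train_X, train_y, X[i], k, features) == y[i]
    return correct / len(X)


# Search space: every non-empty feature subset crossed with every candidate k.
feature_subsets = [s for r in (1, 2) for s in combinations(range(2), r)]
k_values = [1, 3, 5]

best_score, best_features, best_k = -1.0, None, None
for features, k in product(feature_subsets, k_values):
    score = loo_accuracy(k, features)
    if score > best_score:  # strict '>' keeps the first of any tied configurations
        best_score, best_features, best_k = score, features, k

print(f"best features: {best_features}  best k: {best_k}  accuracy: {best_score}")
```

On this data the search discards the noisy column and keeps only the informative one. Note that model training and evaluation happen inside the loop as a by-product of the search, which matches the incidental and ancillary roles described above.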
So what is automated data science? Automated data science would encompass attempts to automate any portion of the data science process (as described by models such as CRISP-DM, Figure 2), that is, the process of extracting knowledge and insight from data. This would include machine learning, as one of the data scientist's tools. Automated machine learning can therefore be used as a tool in the context of automated data science, but it is not equivalent to automated data science.
But automated data science would extend beyond this, and could include either the full or partial automation of tasks such as exploratory data analysis and data visualization, for example. You may also consider automated data collection, data wrangling, or data preparation to be included under this umbrella.
Some aspects of data science are obviously more difficult to automate than others (hypothesis testing, communication, domain knowledge, formulating data-based strategies, etc.), which is why "data science" can no more be fully automated than can the scientific method applied to any other domain (there is a difference between automating the entire process and automating the tools used within the process). Human involvement is, for the foreseeable future, paramount, not only for overseeing and course-correcting any level of automation, but also for kicking off the search for insight. We may be able to automate exploratory investigations into which questions the data science process might help answer, and even have this phase augmented by facts and figures, but the human element will still need to make nuanced decisions about which courses of action are worth pursuing.
Figure 2. The Cross-Industry Standard Process for Data Mining (CRISP-DM) model, often used to guide the data science process in practice
Randy Olson, automated machine learning advocate and developer of the automated machine learning tool TPOT, has stated that such automation tools should not be seen as replacements for data scientists, but as data science assistants (TPOT is billed as "your data science assistant," a nod to the fact that machine learning may not be equivalent to data science, but is often an integral part of a data science pipeline). Such tools eliminate repetitive tasks, such as running experiments on vast combinations of model hyperparameters and selected features, and instead allow the humans in the process to focus on more important, guiding issues.
This quote from Sandro Saitta sums up much of the confusion from both within and beyond the data science community when it comes to this topic:
"When you read news about tools that automate Data Science and Data Science competitions, people with no industry experience may be confused and think that Data Science is only modeling and can be fully automated."
And so I say again, automated machine learning is not the same thing as automated data science.