No matter how you view it, data preparation and preprocessing tasks constitute a high percentage of any data-centric operation, be it of a descriptive or predictive nature. They can also be among the most frustrating tasks for a data practitioner, though this is often driven by a fundamental lack of understanding of the importance of preparatory measures. Whether from academia or industry, data preparation is the one thing we all have in common; the great equalizer, if you will.
In an effort to shed some light on the importance and ubiquity of data preparation, we have asked a trio of experts to provide some insight into the subject:
Sebastian Raschka is a 'Data Scientist' (his quotes) and Machine Learning enthusiast, author of the authoritative book 'Python Machine Learning,' and a PhD candidate in Computational Biology at Michigan State University.
Clare Bernard holds a PhD in Experimental High-Energy Physics and is a Product Lead at Tamr, an organization which "transforms dark, dirty, and disparate data into clean, connected data that can be delivered quickly and repeatedly throughout your organization."
Joe Boutros is a computer scientist and Director of Product Engineering at data.world, which aims to build "the most meaningful, collaborative, and abundant data resource in the world."
What follows are the answers our experts provided to a few somewhat open-ended questions related to the data preparation process. (Keep in mind that responses from our experts are to the question being posed, and not to the previous response(s), as they were solicited separately and compiled after-the-fact.)
Matthew Mayo: Why is it that data preparation is often described as 80% of the work involved in data-related tasks, and do you think this is an accurate generalization?
Sebastian Raschka: 80%? I often hear >90%! Joking aside, I think it's really true that data preparation makes up most of the work in typical data-related projects.
For simplicity, let's use "data preparation" as a category that summarizes tasks such as data acquisition, data storage and handling, data cleaning, and maybe even early-stages of feature engineering.
First, we start with the question we want to answer, or a problem that we want to solve. And in order to address this problem, we typically -- not always -- need to *get* the data! This means that we have to search and ask for datasets and resources that are relevant, trustworthy, relatively up to date (depending on the task), and in a format that we may be able to work with. If we are lucky, there are APIs or maybe even curated datasets out there. For instance, for a soccer-prediction hobby project, I ended up writing crontab-powered Python web scrapers for dozens of websites.
Now that we have our data, we may want to do several sanity checks: checking for missing data, formatting issues, or other problems. Usually, we have to come back to this step a couple of times during our data exploration stage ...
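As an illustration of the kind of sanity check described here (not code from the interview), a minimal Python sketch might look like the following. The sample data, the column names, and the `sanity_check` helper are all hypothetical, and real projects would likely check far more than missing and non-numeric values.

```python
import csv
import io

# Hypothetical sample data: one missing value and one badly formatted score.
RAW = """club,goals,date
Arsenal,2,2016-05-01
Chelsea,,2016-05-01
Everton,three,2016-05-02
"""

def sanity_check(text, numeric_columns):
    """Return a list of (row_number, column, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            # Empty or whitespace-only cells count as missing.
            if value is None or value.strip() == "":
                problems.append((row_number, column, "missing"))
            # Columns expected to be numeric must parse as a float.
            elif column in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append((row_number, column, "not numeric"))
    return problems

issues = sanity_check(RAW, numeric_columns={"goals"})
for issue in issues:
    print(issue)
```

Because such a check returns a plain list of problems rather than raising an error, it is cheap to rerun each time the exploration stage turns up something new.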
Often, we do not collect data from just a single resource, and we have to come up with ways to combine data in a meaningful way. Back to my soccer prediction example, one particular challenge was that each website spelled the (English Premier League) soccer clubs or players differently. Some relied on ASCII characters only while others used all kinds of fancy, special alphanumeric characters you could possibly think of.
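One common way to reconcile such spellings, sketched here as an assumption rather than as the approach used in the project above, is to reduce every name to an accent-free, case-folded key before matching. The `normalize_name` helper and the example spellings are hypothetical.

```python
import unicodedata

def normalize_name(name):
    """Reduce a name to a lowercase, accent-free key for matching."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace.
    return " ".join(stripped.casefold().split())

# Hypothetical spellings of the same player from two different sites.
site_a = "Mesut Özil"
site_b = "mesut  ozil"
print(normalize_name(site_a) == normalize_name(site_b))
```

This only handles characters that decompose into a base letter plus an accent. Letters without such a decomposition, and abbreviations or nicknames for clubs, would still need an explicit lookup table.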
After we put our data through some sort of early cleaning stage, we may also want to think about suitable storage formats, especially if we are working with large datasets that don't fit into memory and that we need to query frequently. For example, in a recent virtual screening for small molecules (related to drug discovery), I was working with 5 TB of data: 200 conformations of ~15 million molecules stored as 3-dimensional coordinates, plus atom type and charge information. Here, I used SQL(ite) for maintaining a database of basic property information (such as off-the-shelf vendor availability) and an HDF5 database of structural information in an attempt to store this data in a fairly efficient way, also considering that I needed to retrieve and analyze certain subsets of molecules frequently.
Storing data in a database is one thing. When I want to use common tools, such as my favorite machine learning library (scikit-learn), I need to represent the data as arrays in memory, which is yet another data processing (or "data preparation") step -- getting the relevant subset of data "out" again. This is why it's also important to think about data storage carefully, since going back and rebuilding databases can be a huge time sink as well.
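To make the "getting data out again" step concrete, here is a small sketch using Python's built-in `sqlite3` module. It is not the schema from the project described above: the `molecules` table, its columns, and the `load_features` helper are hypothetical, and a plain list of lists stands in for an in-memory array to keep the example dependency-free.

```python
import sqlite3

# In-memory database standing in for a property database.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecules ("
    "id INTEGER PRIMARY KEY, weight REAL, charge REAL, purchasable INTEGER)"
)
conn.executemany(
    "INSERT INTO molecules (id, weight, charge, purchasable) VALUES (?, ?, ?, ?)",
    [(1, 180.2, 0.0, 1), (2, 342.3, -1.0, 0), (3, 151.2, 1.0, 1)],
)

def load_features(connection, purchasable_only=True):
    """Return (ids, X) where X is a list of [weight, charge] rows."""
    query = "SELECT id, weight, charge FROM molecules"
    if purchasable_only:
        # Let the database do the subsetting instead of filtering in Python.
        query += " WHERE purchasable = 1"
    query += " ORDER BY id"
    rows = connection.execute(query).fetchall()
    ids = [row[0] for row in rows]
    X = [[row[1], row[2]] for row in rows]
    return ids, X

ids, X = load_features(conn)
print(ids)
print(X)
```

Keeping the identifiers alongside the feature rows makes it possible to trace a model's output back to the records in the database.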
That's my brief take on "data preparation," and as you can see, there are a lot of steps involved before we can move on to the really exciting parts of a data-related project: descriptive, exploratory, inferential, and predictive analyses (or maybe even causal and mechanistic studies).
Clare Bernard: This is spot on from everything we’ve seen. Over the past 20 years, companies have invested an estimated $3-4 trillion in IT systems to automate and optimize key business processes. These systems, which are largely dedicated to a single business function or geography, generate enormous amounts of disparate data that is typically stored in one or more data lakes or warehouses. Now, with billions being invested in Big Data storage and access and next-generation analytics platforms, companies are beginning the analytic prosecution of the data stored in these centralized systems. However, the variety of data collected leads to natural silos, which are rapidly becoming a bottleneck for analysis. This is how and where you get to the 80% data preparation number. Organizations are quickly discovering that while data lakes may help their ability to manage information by placing data in one location, without proper attention to the curation of the data, these lakes can turn into expensive, unproductive data swamps. We’ve built Tamr specifically to address this -- a sustainable, scalable data integration system capable of automating the collection, organization and preparation of enterprise-wide data (supplier, customer, product, financial, etc.).
Joe Boutros: During the course of the last year, our team has spoken with hundreds of data professionals at all levels, from students to hardcore quants. As data.world is focused on helping with first-mile data problems, the topic of data preparation comes up quite often. The 80% figure comes up so frequently that it has become somewhat of an inside joke in our office. There is an oft-cited 2014 New York Times article about "data janitor work" that mentions "interviews and expert estimates" claiming data preparation takes 50%-80% of a data scientist's time. But I've seen traces of this claim from before this article was published. I am left to conclude that:
- Yes, data preparation truly is a mammoth part of the data practitioner’s job (80% is just such a nice sounding number!)
- Data preparation can be tedious, repetitive, fraught with error and misunderstanding, so it isn’t always the most fun part of the job. But misery loves company and it is fun to complain about!
Data preparation is quite a broad concept that encompasses things such as:
- collection methodology
- formatting (strings, numbers, delimiters, etc.)
- normalization (acronyms, typos, formatting, encoding, null/empty, etc.)
- filling in / removing missing values
- merging multiple datasets / sources
- time series expansion
- specific preparation methods depending on analytical technique
Whew! No wonder it’s such a big part of the job!
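Two of the items in that list, null/empty normalization and merging multiple sources, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `NULL_TOKENS` set, and the `clean` and `merge_on_id` helpers are hypothetical, and real data would usually call for a dedicated library.

```python
# Hypothetical records from two sources, keyed by a shared "id".
source_a = [{"id": 1, "city": "Austin"}, {"id": 2, "city": ""}]
source_b = [
    {"id": 1, "revenue": "10.5"},
    {"id": 2, "revenue": "N/A"},
    {"id": 3, "revenue": "7"},
]

# Spellings of "no value" that should all be treated the same way.
NULL_TOKENS = {"", "n/a", "null", "none"}

def clean(value):
    """Map the various spellings of 'no value' to None."""
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value

def merge_on_id(left, right):
    """Outer-join two lists of dicts on 'id', normalizing nulls."""
    merged = {}
    for record in left + right:
        target = merged.setdefault(record["id"], {"id": record["id"]})
        for key, value in record.items():
            if key != "id":
                target[key] = clean(value)
    return [merged[key] for key in sorted(merged)]

rows = merge_on_id(source_a, source_b)
# Fill fields a source never supplied so every row has the same columns.
columns = ["id", "city", "revenue"]
rows = [{col: row.get(col) for col in columns} for row in rows]
for row in rows:
    print(row)
```

Representing every kind of missing value as `None` before merging means later steps only have one case to handle, whether they fill the gap or drop the row.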