By Abubakar Abid, Stanford.
Here’s a story familiar to anyone who does research in data science or machine learning: (1) you have a brand-new idea for a method to analyze data. (2) You want to test it, so you start by generating a random dataset or finding a dataset online. (3) You apply your method to the data, but the results are unimpressive. (4) So you tweak the dataset, perhaps by reducing its dimensionality to make it easier to work with, and you introduce a hyperparameter into your method so that you can fine-tune it, until (5) the method eventually starts producing gorgeous results.
However, in taking these steps, you have developed a fragile method, one that is sensitive to the choice of dataset and customized hyperparameters. Rather than developing a more general and robust method, you have made the problem easier. Furthermore, you have run many experiments and (perhaps even subconsciously) focused on those trials that appear favorable. When you have your final results, you forget all the assumptions you’ve made and biases you’ve accumulated, and present your method in an unrealistically optimistic light.
I first realized how fragile my research was when I was working on a research project and I showed some results to my colleague Jamal. For context, we were working on a way to visualize high-dimensional data, a generalization of PCA. Our insight was that we can choose a better basis than the principal components by using information from a second, background dataset that does not have the patterns we’re looking for. We developed a method called contrastive PCA to perform this analysis.
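The core idea can be sketched in a few lines. This is a minimal illustration of the contrastive-PCA idea, not the authors' actual implementation; the function name, defaults, and the choice of `alpha` are mine:

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Find directions with high variance in the target data but low
    variance in the background data, by eigendecomposing the
    difference of the two covariance matrices."""
    # Center each dataset
    target = target - target.mean(axis=0)
    background = background - background.mean(axis=0)
    # Empirical covariance matrices
    cov_t = target.T @ target / (len(target) - 1)
    cov_b = background.T @ background / (len(background) - 1)
    # alpha trades off target variance against background variance;
    # alpha = 0 recovers ordinary PCA on the target data
    vals, vecs = np.linalg.eigh(cov_t - alpha * cov_b)
    # eigh returns eigenvalues in ascending order; take the top ones
    components = vecs[:, ::-1][:, :n_components]
    return target @ components
```

The parameter `alpha` is exactly the knob that the rest of this story is about.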
When I first ran contrastive PCA on a simulated dataset, I generated a bunch of plots by tuning a parameter that I’ll call alpha. For one value of alpha, contrastive PCA found a basis that beautifully separated the four clusters within the data. When I showed Jamal the figure, I thought he would be blown away. Instead, he skeptically waved his hands at my results. Here’s why:
“Why did you pick alpha equals 2.73?” he asked, rightly pointing out that I had no a priori reason to pick this value. For all we knew, I might have gotten lucky by choosing a parameter that just happened to work for this dataset. Dropping a cold towel on my enthusiasm, Jamal also gave me some other advice: “Throw your research against the wall, because if you don’t, someone else will.”
In other words, don’t just look for datasets or settings that make your method work. Stress-test your work. When does it fail? Why? By asking these questions, you’ll have a more thorough understanding of the method. Over time, I’ve changed the way I do research to help me get this clearer perspective. Here are three steps I recommend:
1. Run At Least 10 Simulations at Once, and Only Look at the Aggregate Results
The human mind is good at finding patterns. If you run just one simulation at a time, you’ll start to build a mental model of whether your method is working from the first few simulations, or from a handful of runs that were extremely good or dramatically bad. You might hallucinate patterns that aren’t there. This is especially true if you’re comparing your novel method with a standard baseline. If there’s variation in the results, it can seem that your method is often beating the baseline even when, on average, it isn’t.
To avoid making imaginary inferences, generate lots of random, independent datasets and run your method with them all. Only display aggregate results (means and standard deviations), and plot the results. Graphs reveal “the whole truth” better than raw numbers.
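A minimal sketch of this workflow, where `run_trial` is a placeholder for your real experiment and the scores are stand-ins for whatever metric you care about:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(rng):
    """Placeholder experiment: generate a fresh random dataset and
    return a score for the new method and for a baseline."""
    data = rng.normal(size=(100, 5))
    new_score = data.var()        # stand-in metric for "your method"
    baseline_score = data.std()   # stand-in metric for the baseline
    return new_score, baseline_score

# Run at least 10 independent trials before looking at anything
results = np.array([run_trial(rng) for _ in range(10)])

# Report only aggregates, never hand-picked individual runs
means = results.mean(axis=0)
stds = results.std(axis=0)
print(f"new method: {means[0]:.3f} +/- {stds[0]:.3f}")
print(f"baseline:   {means[1]:.3f} +/- {stds[1]:.3f}")
```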
If your method requires working with a real dataset that cannot be simulated, bootstrap your data: pick samples from your dataset (randomly, with replacement) to generate new datasets of the same size as your original. Then, run the method across all bootstrapped data.
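Bootstrapping takes only a few lines. Here `my_method` is a placeholder for the analysis under test, and the dataset is simulated for the sake of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))   # stand-in for your real dataset

def my_method(dataset):
    """Placeholder for the method being evaluated."""
    return dataset.mean()

n_boot = 10
scores = []
for _ in range(n_boot):
    # Sample row indices randomly, with replacement, to build a
    # bootstrapped dataset of the same size as the original
    idx = rng.integers(0, len(data), size=len(data))
    scores.append(my_method(data[idx]))

# Again, report only the aggregate across bootstrapped datasets
print(f"score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```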
Do this from the very beginning: it seems like extra work, but the payoff is that you won’t go bumbling down the wrong path.
2. Have a Negative Control Dataset
Say you’ve invented a new clustering technique. You generate a bunch of datasets with two clusters of points. As you bring the clusters closer together, you observe that existing clustering techniques fail to resolve the clusters in the data, but your technique continues to work. Success? Not so fast!
What if you’re accidentally feeding the ground-truth labels into your clustering algorithm? Or making a subtler mistake? As a sanity check for any algorithm, generate a negative control dataset: for example, generate the two clusters from identical distributions, or randomly shuffle the labels on a classification task.
Is your method still working on the negative control? If so, you have a problem.
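Here is one way to set up such a check. Both "clusters" below are drawn from the same distribution, so no honest method should separate them; the nearest-centroid classifier is a stand-in for whatever method you are testing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Negative control: all points come from one distribution, and the
# labels are arbitrary, so there is no real structure to find
X = rng.normal(size=(200, 2))
labels = rng.integers(0, 2, size=200)

# A simple nearest-centroid classifier as a stand-in for "your method"
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
accuracy = (pred == labels).mean()

# On a negative control, accuracy should hover near chance (0.5);
# anything far above chance means something is leaking
print(f"accuracy on negative control: {accuracy:.2f}")
```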
3. Look Out for Free Parameters
You must be particularly careful if your method has free (hyper)parameters. It’s easy to fool yourself into adjusting the hyperparameter just so until your method is working perfectly on a particular dataset. Then, when you try your method on a very different dataset, the algorithm fails miserably, needing to be re-tuned like an old guitar.
To clarify, the mere presence of hyperparameters isn’t a problem. If you find that the algorithm isn’t too sensitive to the choice of parameters across its range of operation, then there’s nothing to worry about (take t-SNE, for example). Even if the algorithm is sensitive to the parameters, but those parameters are well-understood, then you can build a subroutine to automatically choose suitable values for hyperparameters or advise users how to set the values themselves.
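A quick way to probe this is to sweep the free parameter across its plausible range and watch how much the result moves, instead of reporting only the single value that looked best. A toy sketch, where `my_method` and its score are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))   # stand-in dataset

def my_method(dataset, alpha):
    """Placeholder method whose score depends on a free parameter."""
    return dataset.var() + 0.01 * alpha

# Sweep alpha across its range rather than cherry-picking one value;
# a method whose score swings wildly here is fragile
for alpha in [0.1, 0.5, 1.0, 2.0, 5.0]:
    print(f"alpha={alpha:>4}: score={my_method(data, alpha):.3f}")
```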
In summary: adopt an antagonistic mindset when you’re doing research. Try to get your method to fail. By finding the weaknesses in your algorithm, you’ll be forced to find ways to strengthen it and develop a method that is useful in the wild – that is, to other people who are using it outside the carefully curated conditions of your lab.
Bio: Abubakar Abid is a PhD student in the EE Department at Stanford University.