By Simon Whittick, Geckoboard
To avoid falling for this fallacy, define your hypothesis upfront before analyzing data or testing for statistical significance.
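To see why the hypothesis needs to come first, here is a minimal sketch, assuming the fallacy in question is testing many relationships and keeping whichever one looks significant. Everything in it is hypothetical: the metrics and the outcome are pure random noise, and the significance check uses the Fisher z approximation for a correlation coefficient rather than an exact test.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
n_days = 50        # observations per metric (hypothetical)
n_metrics = 200    # number of unrelated metrics we "dredge" through
Z_CRITICAL = 1.96  # two-sided 5% threshold for a standard normal

# The outcome and every metric are independent noise: no real relationship exists.
outcome = [rng.gauss(0, 1) for _ in range(n_days)]

false_positives = 0
for _ in range(n_metrics):
    metric = [rng.gauss(0, 1) for _ in range(n_days)]
    r = pearson(metric, outcome)
    # Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly standard normal
    # when the true correlation is zero.
    z = math.atanh(r) * math.sqrt(n_days - 3)
    if abs(z) > Z_CRITICAL:
        false_positives += 1

print(f"{false_positives} of {n_metrics} random metrics look 'significant' at the 5% level")
```

With a 5% threshold, roughly one in twenty of these meaningless metrics should pass by chance, which is why a result found by searching carries much less weight than one predicted in advance.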
False causality is also known as “cum hoc ergo propter hoc”, which is Latin for “with this, therefore because of this”. It is the false assumption that when two events occur together, one must have caused the other. Correlation does not imply causation.
For example, global temperatures have steadily risen over the past 150 years and the number of pirates has declined at a comparable rate. No one would reasonably claim that the reduction in pirates caused global warming or that more pirates would reverse it.
But it’s not usually this clear-cut. Correlations between two things often tempt us to believe that one caused the other, when in fact the correlation is a coincidence or a third factor is causing both of the effects you’re seeing. In our pirates and global warming example, the cause of both is industrialization. There are many more examples of false causality, and Tyler Vigen does a great job of highlighting spurious correlations.
Never assume causation because of correlation alone – always gather more evidence and consider additional variables that might be causing both movements.
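The third-factor case can be sketched with simulated data. This is only an illustration under stated assumptions: the three series are invented random numbers, the variable names borrow from the pirates example, and "adjusting" for the third factor is done with a simple straight-line fit.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, zs):
    """What is left of ys after removing a least-squares line fitted on zs."""
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [y - (my + slope * (z - mz)) for y, z in zip(ys, zs)]

rng = random.Random(0)

# Hypothetical third factor that drives both of the other series.
industrialization = [rng.gauss(0, 1) for _ in range(2000)]
temperature = [z + rng.gauss(0, 0.5) for z in industrialization]
pirates = [-z + rng.gauss(0, 0.5) for z in industrialization]

# Temperature and pirates never influence each other in this simulation,
# yet they are strongly correlated.
raw_r = pearson(temperature, pirates)

# Once the third factor is accounted for, the correlation largely vanishes.
adjusted_r = pearson(residuals(temperature, industrialization),
                     residuals(pirates, industrialization))

print(f"raw correlation:      {raw_r:.2f}")
print(f"adjusted correlation: {adjusted_r:.2f}")
```

The raw correlation is strongly negative even though neither series affects the other; controlling for the shared cause brings it close to zero.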
A more complex explanation will often describe your data better than a simple one. However, a simpler explanation is usually more representative of the underlying relationship. This is, in essence, what overfitting is: creating a model that’s overly tailored to the data you have and not representative of the general trend.
When looking at data, you’ll want to understand what the underlying relationships are. To do this, you create a model that describes them mathematically. The problem is that a more complex model will fit your initial data better than a simple one. However, complex models tend to be very brittle: they work well for the data you already have, but try too hard to explain random variations. As soon as you add more data, they break down. Simpler models are usually more robust and better at predicting future trends.
For more specific examples of overfitting and how to remedy them, there’s a great overview and discussion here. Generally speaking, though, when first creating models, try to find the simplest possible hypothesis and avoid explaining random variations in your model.
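The brittleness of an overly tailored model can be shown in a small simulation. This is a sketch under assumptions, not a recipe: the data come from a made-up linear trend with noise, and a "memorizer" that simply returns the value of the nearest point it has already seen stands in for the overly complex model.

```python
import random

rng = random.Random(1)

def sample(n):
    """Draw n points from a hypothetical trend y = 2x plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Simple model: a least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Overly tailored model: return the y of the closest point seen so far."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = sample(40)   # the data you already have
test_x, test_y = sample(400)    # new data that arrives later

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} error on existing data: {mse(model, train_x, train_y):.2f}  "
          f"error on new data: {mse(model, test_x, test_y):.2f}")
```

The memorizer reproduces the existing data perfectly because it has also memorized the noise, so its error on new data is larger than that of the straight line.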
Simpson’s paradox is a statistical phenomenon in which a trend appears in different groups of data but disappears or reverses when the groups are combined.
For example, in the 1970s, the University of California, Berkeley was accused of sexism because female applicants were less likely to be accepted than male ones. However, when trying to identify the source of the problem, investigators found that for individual subjects the acceptance rates were generally better for women than for men. The paradox was caused by a difference in which subjects men and women were applying for. A greater proportion of the female applicants were applying to highly competitive subjects, where acceptance rates were much lower for both genders. There are many more examples of this paradox, some of which are included in this video.
It’s important to be aware of this paradox so that you can identify when it appears in your data. When you do see it happening, you need to get more context and look beyond the statistics for other variables that are causing it. In the above example, the hidden variable was that women were applying for more competitive subjects than men.
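The reversal is easy to reproduce with a few numbers. The counts below are invented for illustration and are not the real Berkeley figures; they only mirror the pattern described above, with most women applying to the subject that is hard to get into.

```python
# Hypothetical (applied, admitted) counts, invented for illustration only.
applications = {
    "less competitive subject": {"men": (800, 480), "women": (200, 130)},
    "highly competitive subject": {"men": (200, 40), "women": (800, 200)},
}

def rate(applied, admitted):
    """Acceptance rate as a fraction of applicants."""
    return admitted / applied

# Acceptance rate per subject, while accumulating totals per group.
totals = {"men": [0, 0], "women": [0, 0]}
for subject, groups in applications.items():
    for group, (applied, admitted) in groups.items():
        totals[group][0] += applied
        totals[group][1] += admitted
        print(f"{subject:28s} {group:6s} {rate(applied, admitted):.0%}")

# Acceptance rate once the subjects are combined.
overall = {group: rate(applied, admitted)
           for group, (applied, admitted) in totals.items()}
for group, value in overall.items():
    print(f"{'all subjects combined':28s} {group:6s} {value:.0%}")
```

In these made-up figures women have the higher acceptance rate in each subject (65% vs 60%, and 25% vs 20%), yet the lower rate overall (33% vs 52%), purely because of where they applied.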
Always be aware!
When analyzing data or running tests, be aware of these fallacies. To reduce your chances of falling victim to them, keep the following points in mind whenever you work with data:
- Ensure you have a hypothesis upfront before testing or analyzing data.
- Question the data you’re looking at. How has it been gathered? Could the way it was gathered introduce bias or otherwise distort your conclusions?
- Consider what data or other variables you’re not seeing. Is there other research that might contradict what you’re seeing? Are there additional variables that aren’t being considered in your data?
- Consider that you might not get the same results if you were to gather your data again. Random variation could be affecting your data.
- Consider the shape of your data by visualizing it rather than solely relying on summary metrics.
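The last point in the list can be illustrated with two tiny data sets. The numbers are made up for the purpose: they were chosen so that the mean and standard deviation match exactly while the shapes differ, and a text histogram stands in for a proper chart.

```python
import statistics
from collections import Counter

# Two hypothetical data sets with identical summary metrics.
single_peak = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]

def text_histogram(values):
    """One row per integer value, one '#' per occurrence."""
    counts = Counter(values)
    return "\n".join(f"{v:2d} | {'#' * counts[v]}"
                     for v in range(min(values), max(values) + 1))

for name, values in [("single_peak", single_peak), ("two_clusters", two_clusters)]:
    print(f"{name}: mean={statistics.mean(values)}, "
          f"std dev={statistics.pstdev(values)}")
    print(text_histogram(values))
    print()
```

Both sets have a mean of 5 and a (population) standard deviation of 2, but the histograms show one clustered around the middle and the other split into two groups with nothing in between.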
Bio: Simon Whittick is VP Marketing at Geckoboard, which makes live TV dashboard software. Geckoboard is on a mission to make data more approachable by providing educational content on data fundamentals that’s accessible to everyone, no matter their experience with data.
Want to understand more data fallacies?
Learn about all of them here.