By Sebastian Flennerhag, Machine Learning Researcher
Stacking models in Python efficiently
Ensembles have rapidly become one of the hottest and most popular methods in applied machine learning. Virtually every winning Kaggle solution features them, and many data science pipelines rely on them.
Example schematic of an ensemble. An input array X is fed through two preprocessing pipelines and then to a set of base learners f(i). The ensemble combines all base learner predictions into a final prediction array P.
In this post, we'll take you through the basics of ensembles — what they are and why they work so well — and provide a hands-on tutorial for building basic ensembles. By the end of this post, you will:
- understand the fundamentals of ensembles
- know how to code them
- understand the main pitfalls and drawbacks of ensembles
Predicting Republican and Democratic donations
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

### Import data

# Always good to set a seed for reproducibility
SEED = 222
np.random.seed(SEED)

df = pd.read_csv('input.csv')

### Training and test set
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def get_train_test(test_size=0.95):
    """Split data into train and test sets."""
    y = 1 * (df.cand_pty_affiliation == "REP")
    X = df.drop(["cand_pty_affiliation"], axis=1)
    X = pd.get_dummies(X, sparse=True)
    X.drop(X.columns[X.std() == 0], axis=1, inplace=True)
    return train_test_split(X, y, test_size=test_size, random_state=SEED)

xtrain, xtest, ytrain, ytest = get_train_test()

# A look at the data
print("\nExample data:")
df.head()
```
The figure above shows the data underlying Ben's claim. Indeed, between Democrats and Republicans, about 75% of all contributions go to Democrats. Let's go through the features at our disposal: we have data about the donor, the transaction, and the recipient.
To measure how well our models perform, we use the ROC-AUC score, which trades off the true positive rate against the false positive rate (if these concepts are new to you, see the Wikipedia entry on the receiver operating characteristic for a quick introduction). If you haven't used this metric before, a random guess scores 0.5 and a perfect classifier scores 1.0.
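To get a feel for those two endpoints, here's a quick sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Scores that rank every positive above every negative: perfect, AUC = 1.0
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0

# Constant scores carry no information, equivalent to guessing: AUC = 0.5
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```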
What is an ensemble?
Imagine that you are playing Trivial Pursuit. When you play alone, there might be some topics you are good at and some you know next to nothing about. To maximize your Trivial Pursuit score, you would build a team that covers all topics. This is the basic idea of an ensemble: combining predictions from several models averages out idiosyncratic errors and yields better overall predictions.
An important question is how to combine predictions. In our Trivial Pursuit example, it is easy to imagine team members making their case and a majority vote deciding which answer to pick. Machine learning is remarkably similar in classification problems: taking the most common class label prediction is equivalent to a majority voting rule. But there are many other ways to combine predictions, and more generally we can use a model to learn how best to combine them.
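As a minimal sketch of that voting rule (the three models' label predictions below are hypothetical):

```python
import numpy as np

# Hypothetical class label predictions from three models, one row per model
# (1 = Republican, 0 = Democrat), for four observations
preds = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# Majority vote: pick the label most models agree on for each observation
majority = (preds.mean(axis=0) > 0.5).astype(int)
print(majority)  # [1 1 1 0]
```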
Basic ensemble structure: data is fed to a set of models, and a meta learner combines the models' predictions.
Understanding ensembles by combining decision trees
To illustrate the machinery of ensembles, we'll start off with a simple, interpretable model: a decision tree, which is a tree of if-then rules. If you're unfamiliar with decision trees or would like to dive deeper, check out the decision trees course on Dataquest. The deeper the tree, the more complex the patterns it can capture, but the more prone to overfitting it will be. Because of this, we need an alternative way of building complex models out of decision trees, and an ensemble of different decision trees is one such way.
We'll use the below helper function to visualize our decision rules:
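One way to write such a helper, sketched here under the assumption that graphviz and pydotplus are installed, is to wrap Scikit-learn's export_graphviz:

```python
import pydotplus  # assumed installed alongside graphviz
from IPython.display import Image
from sklearn.tree import export_graphviz

def print_graph(clf, feature_names):
    """Render a fitted decision tree as an inline image."""
    dot_data = export_graphviz(
        clf,
        label="root",
        proportion=True,
        impurity=False,
        out_file=None,
        feature_names=feature_names,
        class_names=["D", "R"],
        filled=True,
        rounded=True,
    )
    graph = pydotplus.graph_from_dot_data(dot_data)
    return Image(graph.create_png())
```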
Let's fit a decision tree with a single node (one decision rule) on our training data and see how it performs on the test set:
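A sketch of that fit (a DecisionTreeClassifier with max_depth=1; print_graph is the helper sketched above):

```python
from sklearn.tree import DecisionTreeClassifier

t1 = DecisionTreeClassifier(max_depth=1, random_state=SEED)
t1.fit(xtrain, ytrain)

p = t1.predict_proba(xtest)[:, 1]
print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t1, xtrain.columns)
```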
Decision tree ROC-AUC score: 0.672
Each of the two leaves registers its share of the training samples, the class distribution within that share, and the class label prediction. Our decision tree bases its prediction on whether the size of the contribution is above 101.5, but then it makes the same prediction regardless! This is not too surprising given that 75% of all donations go to Democrats, but it means the tree isn't making use of the data we have. Let's use three levels of decision rules and see what we get:
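A sketch of the deeper fit, simply raising max_depth to 3:

```python
t2 = DecisionTreeClassifier(max_depth=3, random_state=SEED)
t2.fit(xtrain, ytrain)

p = t2.predict_proba(xtest)[:, 1]
print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t2, xtrain.columns)
```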
Decision tree ROC-AUC score: 0.751
This model is not much better than the simple decision tree: a measly 5% of all donations are predicted to go to Republicans, far short of the 25% we would expect. A closer look shows that the decision tree uses some dubious splitting rules. A whopping 47.3% of all observations end up in the left-most leaf, while another 35.9% end up in the second leaf from the right. The vast majority of leaves are therefore irrelevant, and making the model deeper just causes it to overfit.
Holding depth fixed, a decision tree can be made more complex by increasing its "width", that is, by creating several decision trees and combining them. In other words, an ensemble of decision trees. To see why such a model would help, consider how we might force a decision tree to investigate patterns other than those in the tree above. The simplest solution is to remove features that appear early in the tree. Suppose, for instance, that we remove the transaction amount feature (transaction_amt), the root of the tree. Our new decision tree would look like this:
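A sketch of that fit; the xtrain_slim and xtest_slim names are introduced here for the reduced feature set:

```python
drop = ["transaction_amt"]

xtrain_slim = xtrain.drop(drop, axis=1)
xtest_slim = xtest.drop(drop, axis=1)

t3 = DecisionTreeClassifier(max_depth=3, random_state=SEED)
t3.fit(xtrain_slim, ytrain)

p = t3.predict_proba(xtest_slim)[:, 1]
print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t3, xtrain_slim.columns)
```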
Decision tree ROC-AUC score: 0.740
The ROC-AUC score is similar, but the share of Republican donations increased to 7.3%. Still too low, but higher than before. Importantly, in contrast to the first tree, where most of the rules related to the transaction itself, this tree is more focused on the residency of the candidate. We now have two models that by themselves have similar predictive power but operate on different rules. Because of this, they are likely to make different prediction errors, which we can average out with an ensemble.
Interlude: why averaging predictions works
Why would we expect averaging predictions to work? Consider a toy example with two observations that we want to generate predictions for. The true label for the first observation is Republican, and the true label for the second observation is Democrat. In this toy example, suppose model 1 is prone to predicting Democrat while model 2 is prone to predicting Republican, as in the below table:
| Model | Observation 1 | Observation 2 |
| --- | --- | --- |
| Model 1 prediction: P(R) | 0.4 | 0.1 |
| Model 2 prediction: P(R) | 0.9 | 0.6 |
If we use the standard 50% cutoff for making a class prediction, each decision tree gets one observation right and one wrong. We create an ensemble by averaging the models' class probabilities, which amounts to a majority vote weighted by the strength (probability) of each model's prediction. In our toy example, model 2 is certain of its prediction for observation 1, while model 1 is relatively uncertain. Weighting their predictions, the ensemble favors model 2 and correctly predicts Republican. For the second observation, the tables are turned and the ensemble correctly predicts Democrat:
| Model | Observation 1 | Observation 2 |
| --- | --- | --- |
| Model 1 prediction: P(R) | 0.4 | 0.1 |
| Model 2 prediction: P(R) | 0.9 | 0.6 |
| Ensemble average: P(R) | 0.65 | 0.35 |
With more than two decision trees, the ensemble predicts in accordance with the majority. For that reason, an ensemble that averages classifier predictions is known as a majority voting classifier. When an ensemble averages predicted probabilities (as above), we refer to it as soft voting; averaging final class label predictions is known as hard voting.
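Scikit-learn bundles both schemes in its VotingClassifier; a minimal sketch, using two depth-limited trees as stand-ins for any set of base learners:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

vote = VotingClassifier(
    estimators=[
        ("shallow", DecisionTreeClassifier(max_depth=1, random_state=SEED)),
        ("deep", DecisionTreeClassifier(max_depth=3, random_state=SEED)),
    ],
    voting="soft",  # average predicted probabilities; "hard" votes on labels instead
)
vote.fit(xtrain, ytrain)

p = vote.predict_proba(xtest)[:, 1]
print("Voting ensemble ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
```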
Of course, ensembles are no silver bullet. You might have noticed in our toy example that for averaging to work, prediction errors must be uncorrelated: if both models made the same incorrect predictions, the ensemble would have no way to correct them. Moreover, in the soft voting scheme, a single model that makes an incorrect prediction with high confidence can overwhelm the ensemble. Generally, ensembles don't get every observation right, but in expectation they will do better than the underlying models.
A forest is an ensemble of trees
Returning to our prediction problem, let's see if we can build an ensemble out of our two decision trees. We first check error correlation: highly correlated errors make for poor ensembles.
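As a simple proxy, we can correlate the two models' predicted probabilities; a sketch, assuming t2 and t3 are the two fitted trees from above:

```python
p1 = t2.predict_proba(xtest)[:, 1]       # tree with all features
p2 = t3.predict_proba(xtest_slim)[:, 1]  # tree without transaction_amt

# Correlation between the two prediction vectors
pd.DataFrame({"full_data": p1, "red_data": p2}).corr()
```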
There is some correlation, but not overly so: there's still a good deal of prediction variance to exploit. To build our first ensemble, we simply average the two models' predictions.
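The averaging itself is a one-liner, reusing the probability vectors from the correlation check:

```python
p = (p1 + p2) / 2
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
```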
Average of decision tree ROC-AUC score: 0.783
Indeed, the ensemble procedure leads to an increased score. But maybe if we had more diverse trees, we could get an even greater gain. How should we choose which features to exclude when designing the decision trees?
A fast approach that works well in practice is to randomly select a subset of features, fit one decision tree on each draw, and average their predictions. This process is known as bootstrap aggregating (usually abbreviated bagging), and when applied to decision trees, the resultant model is a Random Forest. Let's see what a random forest can do for us. We'll use the Scikit-learn implementation and build an ensemble of 10 decision trees, each fitted on a subset of 3 features.
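A sketch with RandomForestClassifier and the stated settings (10 trees, 3 features):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=10,   # number of trees in the forest
    max_features=3,    # features considered at each split
    random_state=SEED,
)
rf.fit(xtrain, ytrain)

p = rf.predict_proba(xtest)[:, 1]
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
```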
Average of decision tree ROC-AUC score: 0.844
The Random Forest yields a significant improvement upon our previous models. We're on to something! But there is only so much you can do with decision trees. It's time to expand our horizons.