Introduction to Python Ensembles

In this post, we'll take you through the basics of ensembles — what they are and why they work so well — and provide a hands-on tutorial for building basic ensembles.

By Sebastian Flennerhag, Machine Learning Researcher

Stacking models in Python efficiently

Ensembles have rapidly become one of the hottest and most popular methods in applied machine learning. Virtually every winning Kaggle solution features them, and many
Example schematics of an ensemble. An input array X is fed through two preprocessing pipelines and then to a set of base learners f(i). The ensemble combines all base learner predictions into a final prediction array P. Source

In this post, we'll take you through the basics of ensembles — what they are and why they work so well — and provide a hands-on tutorial for building basic ensembles. By the end of this post, you will:

  • understand the fundamentals of ensembles
  • know how to code them
  • understand the main pitfalls and drawbacks of ensembles

Predicting Republican and Democratic donations

To illustrate how ensembles work, we'll use a

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

### Import data
# Always good to set a seed for reproducibility
SEED = 222

df = pd.read_csv('input.csv')

### Training and test set
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def get_train_test(test_size=0.95):
    """Split data into train and test sets."""
    y = 1 * (df.cand_pty_affiliation == "REP")
    X = df.drop(["cand_pty_affiliation"], axis=1)
    X = pd.get_dummies(X, sparse=True)
    X.drop(X.columns[X.std() == 0], axis=1, inplace=True)
    return train_test_split(X, y, test_size=test_size, random_state=SEED)

xtrain, xtest, ytrain, ytest = get_train_test()

# A look at the data
print("\nExample data:")

cand_pty_affiliation cand_office_st cand_office cand_status rpt_tp transaction_tp entity_tp state classification cycle transaction_amt
REP US P C Q3 15 IND NY Engineer 2016.0 500.0
1 DEM US P C M5 15E IND OR Math-Stat 2016.0 50.0
2 DEM US P C M3 15 IND TX Scientist 2008.0 250.0
3 DEM US P C Q2 15E IND IN Math-Stat 2016.0 250.0
4 REP US P C 12G 15 IND MA Engineer 2016.0 184.0

    kind="bar", title="Share of No. donations")


The figure above is the data underlying Ben's claim. Indeed, between Democrats and Republicans, about 75% of all contributions are made to democrats. Let's go through the features at our disposal. We have data about the donor, the transaction, and the recipient:


To measure how well our models perform, we use the ROC-AUC score, which trades off having high precision and high recall (if these concepts are new to you, see the Wikipedia entry on precision and recall for a quick introduction). If you haven't used this metric before, a random guess has a score of 0.5 and perfect recall and precision yields 1.0.

What is an ensemble?

Imagine that you are playing trivial pursuit. When you play alone, there might be some topics you are good at, and some that you know next to nothing about. If we want to maximize our trivial pursuit score, we need build a team to cover all topics. This is the basic idea of an ensemble: combining predictions from several models averages out idiosyncratic errors and yield better overall predictions.

An important question is how to combine predictions. In our trivial pursuit example, it is easy to imagine that team members might make their case and majority voting decides which to pick. Machine learning is remarkably similar in classification problems: taking the most common class label prediction is equivalent to a majority voting rule. But there are many other ways to combine predictions, and more generally we can use a model to learn how to best combine predictions.

ensemble network

Basic ensemble structure. data is fed to a set of models, and a meta learner combine model predictions. Source

Understanding ensembles by combining decision trees

To illustrate the machinery of ensembles, we'll start off with a simple interpretable model: a decision tree, which is a tree of if-then rules. If you're unfamiliar with decision trees or would like to dive deeper, check out the decision trees course on Dataquest. The deeper the tree, the more complex the patterns it can capture, but the more prone to overfitting it will be. Because of this, we will need an alternative way of building complex models of decision trees, and an ensemble of different decision trees is one such way.

We'll use the below helper function to visualize our decision rules:

import pydotplus  # you can install pydotplus with: pip install pydotplus 
from IPython.display import Image
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

def print_graph(clf, feature_names):
    """Print decision tree."""
    graph = export_graphviz(
        class_names={0: "D", 1: "R"},
    graph = pydotplus.graph_from_dot_data(graph)  
    return Image(graph.create_png())

Let's fit a decision tree with a single node (decision rule) on our training data and see how it perform on the test set:

t1 = DecisionTreeClassifier(max_depth=1, random_state=SEED), ytrain)
p = t1.predict_proba(xtest)[:, 1]

print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t1, xtrain.columns)


Decision tree ROC-AUC score: 0.672

Each of the two leaves register their share of training samples, the class distribution within their share, and the class label prediction. Our decision tree bases its prediction on whether the the size of the contribution is above 101.5: but it makes the same prediction regardless! This is not too surprising given that 75% of all donations are to Democrats. But it's not making use of the data we have. Let's use three levels of decision rules and see what we can get:

t2 = DecisionTreeClassifier(max_depth=3, random_state=SEED), ytrain)
p = t2.predict_proba(xtest)[:, 1]

print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t2, xtrain.columns)


Decision tree ROC-AUC score: 0.751

This model is not much better than the simple decision tree: a measly 5% of all donations are predicted to go to Republicans–far short of the 25% we would expect. A closer look tells us that the decision tree uses some dubious splitting rules. A whopping 47.3% of all observations end up in the left-most leaf, while another 35.9% end up in the leaf second to the right. The vast majority of leaves are therefore irrelevant. Making the model deeper just causes it to overfit.

Fixing depth, a decision tree can be made more complex by increasing "width", that is, creating several decision trees and combining them. In other words, an ensemble of decision trees. To see why such a model would help, consider how we may force a decision tree to investigate other patterns than those in the above tree. The simplest solution is to remove features that appear early in the tree. Suppose for instance that we remove the transaction amount feature (transaction_amt), the root of the tree. Our new decision tree would look like this:

drop = ["transaction_amt"]

xtrain_slim = xtrain.drop(drop, 1)
xtest_slim = xtest.drop(drop, 1)

t3 = DecisionTreeClassifier(max_depth=3, random_state=SEED), ytrain)
p = t3.predict_proba(xtest_slim)[:, 1]

print("Decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))
print_graph(t3, xtrain_slim.columns)


Decision tree ROC-AUC score: 0.740

The ROC-AUC score is similar, but the share of Republican donation increased to 7.3%. Still too low, but higher than before. Importantly, in contrast to the first tree, where most of the rules related to the transaction itself, this tree is more focused on the residency of the candidate. We now have two models that by themselves have similar predictive power, but operate on different rules. Because of this, they are likely to make different prediction errors, which we can average out with an ensemble.

Interlude: why averaging predictions work

Why would we expect averaging predictions to work? Consider a toy example with two observations that we want to generate predictions for. The true label for the first observation is Republican, and the true label for the second observation is Democrat. In this toy example, suppose model 1 is prone to predicting Democrat while model 2 is prone to predicting Republican, as in the below table:

Model Observation 1 Observation 2
True label R D
Model prediction: P(R)
Model 1 0.4 0.2
Model 2 0.8 0.6

If we use the standard 50% cutoff rule for making a class prediction, each decision tree gets one observation right and one wrong. We create an ensemble by averaging the model's class probabilities, which is a majority vote weighted by the strength (probability) of model's prediction. In our toy example, model 2 is certain of its prediction for observation 1, while model 1 is relatively uncertain. Weighting their predictions, the ensemble favors model 2 and correctly predicts Republican. For the second observation, tables are turned and the ensemble correctly predicts Democrat:

Model Observation 1 Observation 2
True label R D
Ensemble 0.6 0.4

With more than two decision trees, the ensemble predicts in accordance with the majority. For that reason, an ensemble that averages classifier predictions is known as a majority voting classifier. When an ensembles averages based on probabilities (as above), we refer to it as soft voting, averaging final class label predictions is known as hard voting.

Of course, ensembles are no silver bullet. You might have noticed in our toy example that for averaging to work, prediction errors must be uncorrelated. If both models made incorrect predictions, the ensemble would not be able to make any corrections. Moreover, in the soft voting scheme, if one model makes an incorrect prediction with a high probability value, the ensemble would be overwhelmed. Generally, ensembles don't get every observation right, but in expectation it will do better than the underlying models.

A forest is an ensemble of trees

Returning to our prediction problem, let's see if we can build an ensemble out of our two decision trees. We first check error correlation: highly correlated errors makes for poor ensembles.

p1 = t2.predict_proba(xtest)[:, 1]
p2 = t3.predict_proba(xtest_slim)[:, 1]

pd.DataFrame({"full_data": p1,
              "red_data": p2}).corr()

full_data red_data
full_data 1.000000 0.669128
red_data 0.669128 1.000000

There is some correlation, but not overly so: there's still a good deal of prediction variance to exploit. To build our first ensemble, we simply average the two model's predictions.

p1 = t2.predict_proba(xtest)[:, 1]
p2 = t3.predict_proba(xtest_slim)[:, 1]
p = np.mean([p1, p2], axis=0)
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Average of decision tree ROC-AUC score: 0.783

Indeed, the ensemble procedure leads to an increased score. But maybe if we had more diverse trees, we could get an even greater gain. How should we choose which features to exclude when designing the decision trees?

A fast approach that works well in practice is to randomly select a subset of features, fit one decision tree on each draw and average their predictions. This process is known as bootstrapped averaging (often abbreviated bagging), and when applied to decision trees, the resultant model is a Random Forest. Let's see what a random forest can do for us. We use the Scikit-learn implementation and build an ensemble of 10 decision trees, each fitted on a subset of 3 features.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
), ytrain)
p = rf.predict_proba(xtest)[:, 1]
print("Average of decision tree ROC-AUC score: %.3f" % roc_auc_score(ytest, p))

Average of decision tree ROC-AUC score: 0.844

The Random Forest yields a significant improvement upon our previous models. We're on to something! But there is only so much you can do with decision trees. It's time we expand our horizon.