Intro to XGBoost: Predicting Life Expectancy with Supervised Learning

Today we’ll use XGBoost Boosted Trees for regression over the official Human Development Index dataset. XGBoost is a framework that allows us to train Boosted Trees while exploiting multicore parallelism.

By Luciano Strika, MercadoLibre

A beautiful forest. So random! Source: Pixabay

Today we’ll use XGBoost Boosted Trees for regression over the official Human Development Index dataset. Who said Supervised Learning was all about classification?

XGBoost: What is it?

XGBoost is a Python framework that allows us to train Boosted Trees while exploiting multicore parallelism. It is also available in R, though we won’t be covering that here.

The task: Regression

Boosted Trees are a Machine Learning model we can use for regression: given a set of inputs and numeric labels, they estimate the function that maps each input to its label.

Unlike classification though, we are interested in continuous values, and not a discrete set of classes, for the labels.

For instance, we may want to predict a person’s height given their weight and age, as opposed to, say, labeling them as male, female or other.

For each decision tree, we will start at the root, and move to the left or right child depending on the decision’s result. In the end, we’ll return the value at the leaf we reach.
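To make that traversal concrete, here is a toy regression tree in plain Python. The node layout, feature names and thresholds are all invented for illustration; real libraries store trees far more compactly.

```python
def predict_one(node, x):
    # Leaves carry the value to return; internal nodes carry a decision.
    while 'value' not in node:
        branch = 'left' if x[node['feature']] < node['threshold'] else 'right'
        node = node[branch]
    return node['value']

# A made-up height-from-weight-and-age tree, echoing the earlier example.
tree = {
    'feature': 'weight', 'threshold': 70.0,
    'left':  {'value': 165.0},
    'right': {'feature': 'age', 'threshold': 30.0,
              'left':  {'value': 180.0},
              'right': {'value': 175.0}},
}

predict_one(tree, {'weight': 80.0, 'age': 25.0})  # returns 180.0
```

Starting at the root, weight 80 ≥ 70 sends us right, age 25 < 30 sends us left, and we return the value stored at that leaf.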
 

XGBoost’s model: What are Gradient Boosted Trees?

Boosted trees are similar to random forests: they are an ensemble of decision trees. In our case, each leaf will return a number (or vector) in the space we are predicting.

For classification, we will usually return the most common class among the training set elements that fall on each leaf. In regression, we will usually return the average of their labels.

On each non-leaf node however, the tree will be making a decision: a numerical comparison between a certain feature’s value and a threshold.

So far, this would just be a regression forest. Where’s the difference?

Boosted Trees vs Random Forest: The difference

When training a Boosted Tree, unlike with random forests, we change the labels every time we add a new tree.

For every new tree, we update the labels by subtracting the sum of the previous trees’ predictions, multiplied by a certain learning rate.
This way, each tree will effectively be learning to correct the previous trees’ mistakes.

Consequently, on the prediction phase, we will simply return the sum of the predictions of all of the trees, multiplied by the learning rate.

This also means, unlike with random forests or bagged trees, this model will overfit if we increase the quantity of trees arbitrarily. However, we will learn how to account for that.
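The loop described above can be sketched in a few lines. This is a conceptual illustration of the idea only, not XGBoost’s actual implementation, using scikit-learn’s DecisionTreeRegressor as the base learner:

```python
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    trees, residual = [], y.astype(float).copy()
    for _ in range(n_trees):
        # Fit the next tree to what the ensemble still gets wrong...
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        # ...then subtract its (scaled) predictions from the targets.
        residual -= learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_boosted_trees(trees, X, learning_rate=0.1):
    # Prediction is the learning-rate-scaled sum over all the trees.
    return learning_rate * sum(tree.predict(X) for tree in trees)
```

Each tree sees as its "labels" whatever error the trees before it left behind, which is exactly the correction mechanism described above.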

To learn more about Boosted Trees, I strongly recommend reading the official docs for XGBoost. They taught me a lot, and they explain the basics better and with nicer pictures.

If you want to dive even deeper into the model, the book Introduction to Statistical Learning was the one with which these concepts finally clicked for me, and I can’t recommend it enough.

Using XGBoost with Python

XGBoost’s API is pretty straightforward, but we will also learn a bit about its hyperparameters. First of all however, I will show you today’s mission.

Today’s Dataset: HDI Public Data

The HDI Dataset contains a lot of information about most countries’ development level, looking at many metrics and areas, over the decades.

For today’s article, I decided to only look at the data from the latest available year: 2017. This is just to keep things recent.

I also had to perform a bit of data reshaping and cleaning to make the original dataset into a more manageable and, particularly, consumable one.

The GitHub repository for this article is available here, and I encourage you to follow along with the Jupyter Notebook. However, as always, I will include the most relevant snippets here.

Preprocessing the data with Pandas

First of all we will read the dataset into memory. Since it contains a whole column for every year, and a row for each country and metric, it is pretty cumbersome to manage.

We will reshape it into something along the lines of:

{country: {metric1: value1, metric2: value2, ...}
   for country in countries}


So that we can feed it into our XGBoost model. Also since all these metrics are numerical, no more preprocessing will be needed before training.
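As a side note, pandas can produce that shape directly with pivot. Here is a sketch on a tiny synthetic frame that mimics the dataset’s layout (the country codes and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'iso3': ['ARG', 'ARG', 'BRA', 'BRA'],
    'indicator_name': ['metric1', 'metric2', 'metric1', 'metric2'],
    '2017': [1.0, 2.0, 3.0, 4.0],
})
# One row per country, one column per metric:
wide = toy.pivot(index='iso3', columns='indicator_name', values='2017')
```

We’ll do the reshape by hand below anyway, since it also lets us pick out just the indicators we care about.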

import pandas as pd
import xgboost as xgb

df = pd.read_csv("2018_all_indicators.csv")
df = df[['dimension','indicator_name','iso3','country_name','2017']]
df.shape # (25636, 5)
df = df.dropna() # getting rid of NaNs

df.shape #(12728, 5)
df.iso3.unique().shape # 195 countries


This snippet gets rid of the rows that have a NaN value, and also tells us we have 195 different countries.

If any of these operations are new for you, or you’re having any trouble following along, try reading my introduction to pandas before continuing.

To see all the available indicators (metrics) the dataset provides, we will use the unique method.

df['indicator_name'].unique()


There are loads (97) of them, so I won’t list them here. Some are related to health, some to education, some to economics, and then a few are about women’s rights.

I thought the most interesting one for our label would be life expectancy, since I think it’s a pretty telling metric about a country.

Of course, feel free to try this same code out with a different label, and hit me up with the results!

Also from the huge list of indicators, I just chose a few by hand that I thought would be correlated with our label (or not), but we could have chosen other ones.

Here’s the code to convert our weirdly shaped dataset into something more palatable.

def add_indicator_to(dictionary, indicator):
    indicator_data = df[df['indicator_name'] == indicator]
    zipped_values = zip(list(indicator_data['iso3']), list(indicator_data['2017']))
    for k, v in zipped_values:
        # Create the country's entry on first use, then add the metric to it.
        dictionary.setdefault(k, {})[indicator] = v

indic = {}
indicators = [
    'Share of seats in parliament (% held by women)',
    'Infants lacking immunization, measles (% of one-year-olds)',
    'Youth unemployment rate (female to male ratio)',
    'Expected years of schooling (years)',
    'Expected years of schooling, female (years)',
    'Expected years of schooling, male (years)',
    'Unemployment, total (% of labour force)',
    'Unemployment, youth (% ages 15–24)',
    'Vulnerable employment (% of total employment)',
    'Life expectancy at birth (years)',  # our label
]
for indicator in indicators:
    add_indicator_to(indic, indicator)


Finally, after converting our dictionary into a DataFrame, I realized it was transposed: its columns were what should have been its rows, and vice versa. Here’s how I fixed it.

final_df = pd.DataFrame(indic).transpose()


Features Correlation Analysis

I hypothesized the features I picked would be good for regression, but that assumption was based on my intuition only.

So I decided to check whether it held up to Statistics. Here’s how I made a correlation matrix.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12, 9))  # figsize belongs to the figure, not to savefig
ax = sns.heatmap(final_df.corr())
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")


And here’s the result:

[Figure: correlation heatmap of the selected indicators]

As expected, life expectancy at birth is highly correlated with vaccination (take that, anti-vaxxers!), lack of vulnerable employment, and education.

I would expect the model to use these same features (the ones showing either the highest or lowest correlations) to predict the label.

Thankfully, XGBoost even provides us with feature importance utilities, to check what the model is basing its predictions on.

This is a huge advantage of Boosted Trees over, for instance, Neural Networks, where explaining the models’ decisions to someone can be very hard, and even convincing ourselves that the model is being reasonable is difficult. But more on that later.

Training the XGBoost Model

It’s finally time to train! After all that processing, you’d expect this part to be complicated, right? But it’s not, it’s just a simple method call.

We will first make sure there is no overlap between our training and testing data.

# The features we use to train the model obviously can't contain the label.
life_expec = 'Life expectancy at birth (years)'  # the column holding our label
train_features = [feature for feature in list(final_df) if feature != life_expec]

train_df = final_df.iloc[:150][train_features]
train_label = final_df.iloc[:150][life_expec]

test_df = final_df.iloc[150:][train_features]
test_label = final_df.iloc[150:][life_expec]
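One caveat: the iloc split above assumes row order carries no signal. A shuffled split avoids any ordering bias; here is a sketch with scikit-learn on a stand-in frame (the column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'feat': range(100), 'label': range(100)})
# Shuffle, then hold out 25% of the rows for testing; no row lands in both.
train, test = train_test_split(data, test_size=0.25, random_state=0)
```

Since our rows are countries sorted by ISO code, the plain iloc split is probably fine here, but it’s worth keeping in mind.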


And now, here’s what you’ve been waiting for. This is the code to train an XGBoost model.

import xgboost as xgb

dtrain = xgb.DMatrix(train_df, label=train_label)
dtest = xgb.DMatrix(test_df, label=test_label)
# specify parameters via dict
param = {'max_depth':6,
         'eta':1,  # no shrinkage (note: XGBoost's default eta is 0.3)
         'colsample_bytree':1  # this is the default value anyway
        }

num_round = 15
initial_trees = xgb.train(param, dtrain, num_round)
first_preds = initial_trees.predict(dtest)  # used for evaluation below


You’ll notice the code to actually train the model, and the code to generate the predictions, are pretty simple. Before that, however, there’s a bit of setup. The values you see there are the model’s hyperparameters, which specify certain behaviors when training or predicting.

XGBoost Hyperparameters Primer

max_depth refers to the maximum depth allowed to each tree in the ensemble. If this parameter is bigger, the trees tend to be more complex, and will usually overfit faster (all other things being equal).

eta is our learning rate. As I said earlier, it will multiply the output of each tree before fitting the next one, and also the sum when doing predictions afterwards. When set to 1, it does nothing (note that XGBoost’s actual default is 0.3). If set to a smaller number, the model will take longer to converge, but each tree’s contribution is damped, which usually improves how well it fits the data. It behaves similarly to a Neural Network’s learning rate.

colsample_bytree refers to how many of the columns will be available to each tree when generating the branches.

Its default value is 1, meaning “all of them”. Potentially, we may want to set it to a lower value, so that all of the trees can’t just use the best feature over and over. This way, the model becomes more robust to changes in the data distribution, and also overfits less.

Lastly, num_round refers to the number of training rounds: instances where we check whether to add a new tree. If early stopping is enabled, training will also stop when the objective hasn’t improved in a given number of rounds.

Evaluating our results

Now let’s see if the model learned!

import math

def msesqrt(test_label, preds):
    # Root-mean-squared error between labels and predictions.
    squared_errors = [diff * diff for diff in (test_label - preds)]
    return math.sqrt(sum(squared_errors) / len(preds))

msesqrt(test_label, first_preds) # 4.02


Life expectancy had a standard deviation slightly over 7 in 2017 (I checked with Pandas’ describe), so a root-mean-squared error of 4 is nothing to laugh at! The training was also super fast, though this small dataset doesn’t really take advantage of XGBoost’s multicore capabilities.

However, I feel like the model is still underfitting: that means it hasn’t reached its full potential yet.

Hyperparameter Tuning: Let’s Iterate

Since we think we are underfitting, let’s try reducing the learning rate (eta = 0.1), making each tree slightly shallower (max_depth = 5), letting each tree sample the columns (colsample_bytree = 0.75), and increasing the number of training rounds to 40.

param = {'max_depth':5,
         'eta':.1, 
         'colsample_bytree':.75 
        }
num_round = 40
new_tree = xgb.train(param, dtrain, num_round)
# make prediction
new_preds = new_tree.predict(dtest)


This time, using the same metric, I’m pretty happy with the results! We reduced the error on our test set to 3.15! That’s less than half the standard deviation of our label, which is a solid result.

Imagine predicting someone’s life expectancy with an error margin of 3 years, based on a few statistics about their country.
(Of course, that interpretation would be wrong, since deviation in life expectancy inside a single country is definitely non-zero.)

Understanding XGBoost’s decisions: Feature Importance

The model seems to be pretty accurate. However, what is it basing its decisions on? To come to our aid, XGBoost gives us the plot_importance method. It will make a chart with all of our features (or the top N most important, if we pass it a value) ranked by their importance.

But how is importance measured?

The default criterion (‘weight’) counts how often each feature is used in the trees’ decisions (the number of nodes splitting on a certain feature over the total), but there are other options, and they all work differently.

I hadn’t really understood this until I read this very clear article on Medium, which I will just link, instead of trying to come up with another explanation for the same thing.

In our case, the first model gave this plot:

[Figure: feature importance plot for the first model]

Which means our hypothesis was true: the features with the highest or lowest correlations are also the most important ones.

This is the feature importance plot for the second model, which performed better:

[Figure: feature importance plot for the second model]

So both models were using the first three features the most, though the first one seems to have relied too much on expected years of schooling. Neat. This kind of analysis would’ve been kinda hard with other models.

Conclusions

XGBoost provided us with a pretty good regression, and even helped us understand what it was basing its predictions on.

Feature correlation analysis helped us theorize which features would be most important, and reality matched our expectations.

Finally, I think it is important to remark how fast training this kind of model can be, even though this particular example was not too good at demonstrating that.

In the future, I’d like to try XGBoost on a much bigger dataset. If you can think of any, let me know! Also, this dataset looks like it would be fun for some Time Series Analysis, but I don’t have that much experience with those subjects. Are there any books, articles, or any other sources you could recommend me on that? Please let me know in the comments!

Follow me on Medium or Twitter to know when the next article comes out. If you learned something today, please consider supporting my website by helping me pay for my site’s hosting.

 
Bio: Luciano Strika is a computer science student at Buenos Aires University, and a data scientist at MercadoLibre. He also writes about machine learning and data on www.datastuff.tech.

Original. Reposted with permission.

Related:

  • Modeling Price with Regularized Linear Model & XGBoost
  • XGBoost Algorithm: Long May She Reign
  • XGBoost: A Concise Technical Overview