Or a less clickbaity title: Model based optimization of machine learning models with mlr and mlrMBO.
I recently participated in the #TEFDataChallenge a datathon organized by Wayra.
The first price was a drone for every team member, which is a pretty awesome price.
So what exactly is a datathon?
At a datathon multiple teams consisting out of computer engineers, data
scientists and experts from other fields come together to work on a case for 24
hours straight, where the purpose is to study and analyze data related to the
case. Methods well known in AI will be used. A datathon is derived from a
socalled hackathon, where in this case the focus is more on data rather than on
innovation. Although with the upcoming field of big data bigger and more
complicated problems can be solved, there is still a lack of real data scientists.
Therefor a datathon is the solution: In a fun and challenging way people are
triggered to come up with the best ideas! The Team with the most outstanding
patterns or the best predictive models will win the datathon!
Source
I already took part in some hackathons, but never in a datathon so this was a bit new for me. But
only 24 hours to create a prediction model as well as an interesting visualization seems quite hard.
Since I signed a NDA, I won’t talk about the data used in the challenge. Only so much (that’s the information you can find on their website):
The aim is to build a prediction model for customer churn, a model to predict if a
customer will or will not cancel a mobile contract.
This is a typical binary classification problem. Fortunately the data was already nicely preprocessed.
No missing values or data garbage, they even aggregated the data set already so that only one observation per costumer remained.
This makes the problem a bit easier: find the best prediction model in a short amount of time, but
most importantly: We don’t want to spend hours coding it, so that we can use most of our time to
create an awesome visualization and create/scrape additional features.
Typically in these kind of challenges stacking results in the winning model. But this problematic in this case because
a) Stacking multiple models takes an extremely long time and/or huge computation power. We only had one
EC2 instance.
b) The judges wanted to have simple models, the simpler the model the better, while still having extremely good prediction accuracy.
So what to do when we cannot stack a hugely complected model ensemble? We use gradient boosting with trees.
This has multiple advantages, we have a (rather) simple model with good prediction accuracy, we can handle
categorical variables with large number of classes and we get variable importance measures which can be
used in the visualization of the model. (We can also create partial dependence plots to gain even more insights in the effects). We use xgboost which is currently the fastest implementation for gradient boosting with trees.
Here is our learner definition with the paramset we want to optimize.
Overall we have 5 parameters (number of trees, learning rate, tree depth, bagging fraction and subsampling fraction) that we vary to find our optimal model.
For standard optimization techniques like grid search this is already a real problem because the numbers of points to search increases exponentially with every additional hyperparameter.
Since all hyperparameter are numeric it would be possible to use evolutionary algorithms like cmaes.
The problem with that is, that these methods generally take a large number of function evaluation to find the optimal model, but we don’t have too much time and (crossvalidated) model fits are quite expensive.
This is basically the perfect situation for model based optimization. We fit a surrogate model over
the space of hyperparameters and search promising points. Having only numeric hyperparameter makes the optimization
even nicer because we can use a Gaussian process as our surrogate model. In the figure you can see an example of model based optimization in 1 dimension. The upper figure shows evaluated points, the estimated model (dashed line) and its variance (gray area). The lower picture shows how interesting every possible point is for future exploration of the space.
For an indepth introduction to modelbased optimization you can read Jones(1998).
We use mlrMBO as a general blackbox optimization toolkit with is already nicely connected to mlr. The optimization definition looks like this:
Some notes on this:

We save our model (actually the optimization path) at different iterations of the process, which is quite useful if anything crashes.

We do 100 sequential iterations after the initial design (for which we used the default setting).

We use kriging as our surrogate model. Actually we don’t even need to specify this, since it is the default surrogate model of mlrMBO.

parallelMap is used to parallelize the crossvalidation over all 10 available cores on the EC2 instance.
The resulting model was in the end the best model on the hidden test set and even beat a large stacking ensemble of
multiple boosting models, random forests and deep neural networks.
One other team used a quite similar approach, but instead of model based optimization they used irace to tune a gradient boosting model. This can also be done in mlr quite easily: