Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by the application of an estimator.


Are you familiar with Scikit-learn Pipelines?

They are an extremely simple yet very useful tool for managing machine learning workflows.
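To give a sense of how little code is involved, here is a minimal sketch of a two-step pipeline. The Iris dataset, the choice of `StandardScaler` and `LogisticRegression`, and the step names `scale` and `clf` are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: the Iris dataset bundled with Scikit-learn
X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) tuple; every step but the last must be
# a transformer, and the last step is the final estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() scales the data and then trains the classifier; predict() applies
# the already-fitted scaler before calling the classifier
pipe.fit(X, y)
predictions = pipe.predict(X[:5])
print(predictions)
```

The pipeline behaves like a single estimator, so the scaling step can never be accidentally skipped at prediction time.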

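A typical machine learning task generally involves some degree of data preparation, such as scaling or transforming features, before an estimator is fit.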

 $ python3

Logistic Regression pipeline test accuracy: 0.933
Support Vector Machine pipeline test accuracy: 0.900
Decision Tree pipeline test accuracy: 0.867
Classifier with best accuracy: Logistic Regression
Saved Logistic Regression pipeline to file

So there you have it: a simple implementation of Scikit-learn pipelines. In this particular case, our logistic regression-based pipeline with default parameters scored the highest test accuracy.
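A workflow that produces output of this shape might look roughly like the sketch below. It is a sketch under stated assumptions rather than the exact script behind the numbers above: the Iris dataset, the 80/20 split, the `random_state` values, the single scaling step, and the file name `best_pipeline.pkl` are all hypothetical choices, so the accuracies it prints will not necessarily match.

```python
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed data and split; not taken from the original script
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One pipeline per classifier, each with the same preprocessing step
pipelines = {
    "Logistic Regression": Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "Support Vector Machine": Pipeline([
        ("scale", StandardScaler()),
        ("clf", SVC()),
    ]),
    "Decision Tree": Pipeline([
        ("scale", StandardScaler()),
        ("clf", DecisionTreeClassifier(random_state=42)),
    ]),
}

# Fit each pipeline on the training data and score it on the held-out set
scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name} pipeline test accuracy: {scores[name]:.3f}")

# Pick the pipeline with the highest test accuracy and persist it
best_name = max(scores, key=scores.get)
print(f"Classifier with best accuracy: {best_name}")

output_path = Path("best_pipeline.pkl")  # hypothetical file name
joblib.dump(pipelines[best_name], output_path)
print(f"Saved {best_name} pipeline to file")
```

Because the whole pipeline is saved, `joblib.load(output_path)` returns an object that applies the fitted scaler and the classifier together.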

As mentioned above, however, these results likely don't represent our best efforts. What if we did want to test out a series of different hyperparameters? Can we use grid search? Can we incorporate automated methods for tuning these hyperparameters? Can AutoML fit into this picture somewhere? What about using cross-validation?
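As a brief preview of the grid search and cross-validation questions, a pipeline can be passed directly to Scikit-learn's `GridSearchCV`. The sketch below is only illustrative: the dataset, the candidate values for `C`, and the 5-fold setting are assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 5-fold cross-validation; the scaler is re-fit on each training fold,
# so no information from the validation fold leaks into preprocessing
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```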

Over the next couple of posts we will take a look at these additional issues, and see how these simple pieces fit together to make pipelines much more powerful than they may first appear to be given our initial example.

