Managing Machine Learning Workflows with Scikit-learn Pipelines Part 1: A Gentle Introduction

Scikit-learn's Pipeline class is designed as a manageable way to apply a series of data transformations followed by a final estimator.


Are you familiar with Scikit-learn Pipelines?

They are an extremely simple yet very useful tool for managing machine learning workflows.

A typical machine learning task generally involves

 $ python3 pipelines.py


Logistic Regression pipeline test accuracy: 0.933
Support Vector Machine pipeline test accuracy: 0.900
Decision Tree pipeline test accuracy: 0.867
Classifier with best accuracy: Logistic Regression
Saved Logistic Regression pipeline to file


So there you have it: a simple implementation of Scikit-learn pipelines. In this particular case, our logistic regression-based pipeline with default parameters scored the highest accuracy.
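For reference, a script along the following lines could print output in the format shown above. This is only a minimal sketch and not necessarily the original pipelines.py: the iris dataset, the StandardScaler and PCA steps, the 80/20 split, the random seeds, and the file name best_pipeline.pkl are all assumptions made for illustration, so the accuracies it prints may differ from those listed.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset and split; the original script may use something else.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def build_pipeline(classifier):
    """Chain two (assumed) transformers and a final estimator."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=2)),
        ("clf", classifier),
    ])


# One pipeline per classifier, each with default hyperparameters.
pipelines = {
    "Logistic Regression": build_pipeline(LogisticRegression()),
    "Support Vector Machine": build_pipeline(SVC()),
    "Decision Tree": build_pipeline(DecisionTreeClassifier(random_state=42)),
}

# Fit every pipeline on the training data and score it on the test data.
scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name} pipeline test accuracy: {scores[name]:.3f}")

# Pick the pipeline with the highest test accuracy and persist it.
best_name = max(scores, key=scores.get)
print(f"Classifier with best accuracy: {best_name}")

joblib.dump(pipelines[best_name], "best_pipeline.pkl")  # hypothetical file name
print(f"Saved {best_name} pipeline to file")
```

Because each pipeline bundles its transformers with its estimator, a single call to fit or score runs every step in order, and the saved file contains the fitted preprocessing along with the model.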

As mentioned above, however, these results likely don't represent our best efforts. What if we did want to test out a series of different hyperparameters? Can we use grid search? Can we incorporate automated methods for tuning these hyperparameters? Can AutoML fit into this picture somewhere? What about using cross-validation?
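As a brief preview of the grid search and cross-validation questions, the sketch below passes a whole pipeline to scikit-learn's GridSearchCV. The dataset, the step names ("scale", "clf"), and the parameter values are assumptions chosen purely for illustration. The one convention worth noting is that a parameter of a pipeline step is addressed as the step name, two underscores, then the parameter name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed dataset and split, for illustration only.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", SVC()),
])

# "<step name>__<parameter>" targets a hyperparameter of one pipeline step.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

# 5-fold cross-validation over every combination in the grid; the scaler is
# refit inside each fold, so no test-fold statistics leak into training.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```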

Over the next couple of posts we will take a look at these additional issues, and see how these simple pieces fit together to make pipelines much more powerful than they may first appear to be given our initial example.

 
Related:

  • 7 Steps to Mastering Data Preparation with Python
  • Machine Learning Workflows in Python from Scratch Part 1: Data Preparation
  • Machine Learning Workflows in Python from Scratch Part 2: k-means Clustering