Python Data Preparation Case Files: Group-based Imputation

The second part in this series addresses group-based imputation for dealing with missing

This is the second post in this series on Python
Figure 1: Mock bank dataset for our exercise (after exercise 1).

The Process

 
The process is pretty cut and dry this time out, as we only have a single task.

Imputing Missing Account Balances

We will assume that our domain experts have concluded that missing account balances are best replaced with the mean account balances of customers in the same state (lending to why we had to drop instances with missing state values). This will be done for both checking and savings accounts.

First, let's have a look at the current state of missing values in our dataset.

import pandas as pd
import numpy as np

# Importing the dataset
dataset_filename = 'mock_bank_data_original.csv'
df = pd.read_csv(dataset_filename)

# Summarize missing values
print 'Null values by variable:'
print '------------------------'
df.isnull().sum()


Null values by variable:
------------------------
sex                108
age                107
state                0
cheq_balance        23
savings_balance     91
credit_score         0
special_offer        0
dtype: int64


Our checking account balance (pardon the 'q'; I'm Canadian) and savings account balances have 23 and 91 missing values, respectively.

The Code

 
First, let's compare the by-state balance means with the overall mean (prior to imputation), for interest's sake.

print 'Overall cheq_balance mean:', df['cheq_balance'].mean().round(2)
print 'Overall savings_balance mean:', df['savings_balance'].mean().round(2)

Overall cheq_balance mean: 4938.91
Overall savings_balance mean: 9603.64


print 'cheq_balance mean by state:'
print '---------------------------'
print df.groupby(['state']).mean().groupby('state')['cheq_balance'].mean().round(2)

cheq_balance mean by state:
---------------------------
state
CA    4637.23
FL    4993.99
NY    4932.80
TX    5175.78
Name: cheq_balance, dtype: float64


print 'savings_balance mean by state:'
print '------------------------------'
print df.groupby(['state']).mean().groupby('state')['savings_balance'].mean().round(2)

savings_balance mean by state:
------------------------------
state
CA     9174.56
FL     9590.59
NY    10443.61
TX     9611.70
Name: savings_balance, dtype: float64


Though perhaps not as dramatic as hoped, it should be clear to see why such group-based imputation is a valid approach to a problem such as this. The mean savings account balance difference between California ($9174.56) and New York ($10443.61), for example, is nearly $1270. Taking an overall mean of $9603.64 to fill in missing values would not provide the most accurate picture. This goes to goodness of fit.

Let's go ahead and fill in these missing values by using the Pandas 'groupby' and 'transform' functionality, along with a lambda function. We then round the result in the line of code beneath.

# Replace cheq_balance NaN with mean cheq_balance of same state
df['cheq_balance'] = df.groupby('state').cheq_balance.transform(lambda x: x.fillna(x.mean()))
df.cheq_balance = df.cheq_balance.round(2)

# Replace savings_balance NaN with mean savings_balance of same state
df['savings_balance'] = df.groupby('state').savings_balance.transform(lambda x: x.fillna(x.mean()))
df.savings_balance = df.savings_balance.round(2)


That's it. Our pertinent missing values are now filled in with the mean values of those in the same state.

Checking the results:

Null values by variable:
------------------------
sex                108
age                107
state                0
cheq_balance         0
savings_balance      0
credit_score         0
special_offer        0
dtype: int64


Bingo!

Now let's save the dataset for next time.

# Output modified dataset to CSV
df.to_csv('mock_bank_data_original_PART2.csv', index=False)


That's it for this exercise. We still have to deal with 'sex' and 'age' variable missing values. Next time we will look at regression-based imputation to help deal with one of these.

 
Related:

  • 7 Steps to Mastering data Preparation with Python
  • Machine Learning Workflows in Python from Scratch Part 1: data Preparation
  • Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders