Adversarial Validation, Explained

This post proposes and outlines adversarial validation, a method for selecting the training examples most similar to the test examples and using them as a validation set, and walks through a practical scenario that shows when it is useful.

By Zygmunt Zając, FastML.

Many data science competitions suffer from a test set that is markedly different from the training set (a violation of the i.i.d. assumption), which makes it hard to construct a representative validation set. A simple way to check for this is to train a classifier to tell train and test examples apart. First, we load both files and label each example by its origin:

import pandas as pd

train = pd.read_csv( 'data/train.csv' )
test = pd.read_csv( 'data/test.csv' )

# 1 = train, 0 = test
train['TARGET'] = 1
test['TARGET'] = 0


Then we concatenate both frames and shuffle the examples:
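A minimal sketch of this step, assuming pandas and numpy:

import numpy as np

# stack train on top of test, then shuffle the rows
data = pd.concat(( train, test ))
data = data.iloc[ np.random.permutation( len( data ))]
data.reset_index( drop = True, inplace = True )

# features and the train/test-origin label
x = data.drop( 'TARGET', axis = 1 ).values   # drop an ID column too, if present
y = data.TARGET.values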


Finally we create a new train/test split:

train_examples = 100000

x_train = x[:train_examples]
x_test = x[train_examples:]
y_train = y[:train_examples]
y_test = y[train_examples:]


Come to think of it, there’s a shorter way (and there’s no need to shuffle the examples beforehand, either):

from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older scikit-learn versions

x_train, x_test, y_train, y_test = train_test_split( x, y, train_size = train_examples )


Now we’re ready to train and evaluate.
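A minimal sketch of this step, assuming scikit-learn; the model settings are guesses, apart from the tree counts quoted below:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

for clf in ( LogisticRegression(),
             RandomForestClassifier( n_estimators = 10 ),
             RandomForestClassifier( n_estimators = 100 )):
    clf.fit( x_train, y_train )
    p = clf.predict_proba( x_test )[:,1]
    print( roc_auc_score( y_test, p ))

Here are the scores: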

# logistic regression / AUC: 49.82%
# random forest, 10 trees / AUC: 50.05%
# random forest, 100 trees / AUC: 49.95%


All the scores hover around 50% AUC, which is exactly what random guessing gets you. Train and test are like two peas in a pod, like Tweedledum and Tweedledee, indistinguishable to our models.

Below is a 3D interactive visualization of the combined train and test sets, in red and turquoise. They overlap almost completely. Click the image to view the interactive version (it might take a while to load; the data file is ~8MB).

UPDATE: The hosting provider, which shall remain unnamed, has taken down the account with the visualizations. We plan to re-create them on Cubert. In the meantime, you can do so yourself.
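For instance, here’s a minimal sketch of a static version, assuming scikit-learn and matplotlib (the three-component PCA matches the image below; everything else is a guess):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection on older matplotlib

# project the combined set onto its first three principal components
p = PCA( n_components = 3 ).fit_transform( x )

fig = plt.figure()
ax = fig.add_subplot( 111, projection = '3d' )
ax.scatter( p[:,0], p[:,1], p[:,2], c = y, s = 1 )   # color by train/test origin
plt.show()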

[Image: Santander training set after PCA]

UPDATE: It’s a-live! Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.

Let’s see if validation scores translate into leaderboard scores, then. We train and validate logistic regression and a random forest, this time predicting the actual TARGET rather than the train/test origin.
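A minimal sketch, assuming scikit-learn; since TARGET was overwritten earlier, the training set is reloaded, and the split ratio and model settings are guesses:

train = pd.read_csv( 'data/train.csv' )   # restore the original labels

x = train.drop( 'TARGET', axis = 1 ).values
y = train.TARGET.values

x_train, x_val, y_train, y_val = train_test_split( x, y, test_size = 0.2 )

for clf in ( LogisticRegression(), RandomForestClassifier( n_estimators = 100 )):
    clf.fit( x_train, y_train )
    p = clf.predict_proba( x_val )[:,1]
    print( roc_auc_score( y_val, p ))

LR gets 58.30% AUC, RF 75.32% (subject to randomness).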

On the private leaderboard LR scores 61.47% and RF 74.37%. These numbers correspond pretty well to the validation results.

The code is available on GitHub.

Originals: Part 1. Part 2. Reposted with permission.