By Zygmunt Zając, FastML.
import pandas as pd

train = pd.read_csv( 'data/train.csv' )
test = pd.read_csv( 'data/test.csv' )

train['TARGET'] = 1
test['TARGET'] =
Then we concatenate both frames and shuffle the examples:
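A minimal sketch of this step, using tiny made-up frames instead of the CSV files. The `feature` column, the label 0 for test examples and the fixed random seed are assumptions for illustration, not the original code:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the labelled frames (assumption: test rows get label 0)
train = pd.DataFrame({'feature': np.arange(5.0), 'TARGET': 1})
test = pd.DataFrame({'feature': np.arange(5.0, 10.0), 'TARGET': 0})

# Stack the two frames on top of each other
data = pd.concat([train, test], ignore_index=True)

# Shuffle the rows; sampling 100% of the frame without replacement is a permutation
data = data.sample(frac=1, random_state=0).reset_index(drop=True)
```

After this, `data` holds every example from both frames exactly once, in random order.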
Finally we create a new train/test split:
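One way to do this on an already shuffled frame is plain slicing. The sketch below runs on synthetic data; the 80/20 ratio and the names `new_train` and `new_test` are assumptions, not taken from the original code:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined, already shuffled frame
rng = np.random.RandomState(0)
data = pd.DataFrame({'feature': rng.rand(100), 'TARGET': rng.randint(0, 2, 100)})

# Assumed ratio: first 80% of the rows for training, the rest for testing
train_fraction = 0.8
n_train = int(len(data) * train_fraction)

new_train = data.iloc[:n_train]
new_test = data.iloc[n_train:]

# Separate the features from the train-vs-test label
x_train, y_train = new_train.drop(columns='TARGET'), new_train['TARGET']
x_test, y_test = new_test.drop(columns='TARGET'), new_test['TARGET']
```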
Come to think of it, there’s a shorter way (no need to shuffle examples beforehand, too):
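A sketch of the shorter route with scikit-learn's `train_test_split`, which shuffles by default. The synthetic data and the 20% test size are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined frame; note the rows are NOT shuffled
rng = np.random.RandomState(0)
data = pd.DataFrame({'feature': rng.rand(100), 'TARGET': np.repeat([1, 0], 50)})

x = data.drop(columns='TARGET')
y = data['TARGET']

# train_test_split shuffles before splitting (shuffle=True is the default)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
```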
Now we’re ready to train and evaluate. Here are the scores:
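For readers who want to run this kind of evaluation themselves, here is a minimal sketch on synthetic data rather than the competition files. The choice of logistic regression, five-fold cross-validation and the made-up features are all assumptions. Both halves are drawn from the same distribution, so the classifier should have little to go on and the AUC should land near chance level:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features: "train" and "test" rows come from the same distribution
rng = np.random.RandomState(0)
x = rng.normal(size=(400, 5))

# Label 1 marks a train example, 0 a test example (assumed convention)
y = np.repeat([1, 0], 200)

# Cross-validated AUC of a classifier trying to tell train from test
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, x, y, scoring='roc_auc', cv=5)
mean_auc = scores.mean()

print(f'AUC: {mean_auc:.2%}')
```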
Train and test are like two peas in a pod, like Tweedledum and Tweedledee - indistinguishable to our models.
Below is a 3D interactive visualization of the combined train and test sets, in red and turquoise. They very much overlap. Click the image to view the interactive version (might take a while to load, the data file is ~8MB).
UPDATE: The hosting provider which shall remain unnamed has taken down the account with visualizations. We plan to re-create them on Cubert. In the meantime, you can do so yourself.
UPDATE: It’s a-live! Now you can create 3D visualizations of your own data sets. Visit cubert.fastml.com and upload a CSV or libsvm-formatted file.
Let’s see if validation scores translate into leaderboard scores, then. We train and validate logistic regression and a random forest. LR gets 58.30% AUC, RF 75.32% (subject to randomness).
On the private leaderboard LR scores 61.47% and RF 74.37%. These numbers correspond pretty well to the validation results.
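The validation step can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the competition data, and the hyperparameters and hold-out size are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data (assumption)
x, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Hold out 20% of the examples for validation
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=0)

models = {
    'LR': LogisticRegression(max_iter=1000),
    'RF': RandomForestClassifier(n_estimators=100, random_state=0),
}

val_auc = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # AUC needs scores, so use the predicted probability of the positive class
    p = model.predict_proba(x_val)[:, 1]
    val_auc[name] = roc_auc_score(y_val, p)
    print(f'{name}: {val_auc[name]:.2%} AUC')
```

Fixing `random_state` makes the random forest score repeatable between runs.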
The code is available on GitHub.
Originals: Part 1. Part 2. Reposted with permission.