Why you need to improve your training data, and how to do it

This article examines why you need to improve your training data, and how to go about it.

By Pete Warden.


sleep_lost

Photo by Lisha Li

Andrej Karpathy showed this slide as part of his talk at Train AI and I loved it! It captures the difference between deep learning research and production perfectly. Academic papers are almost entirely focused on new and improved models, with datasets usually chosen from a small set of public archives. By contrast, everyone I know who uses deep learning as part of an actual application spends most of their time worrying about the training data.

What action you’ll take depends on what you find, but you should always do this kind of inspection before you do any other data cleanup, since an intuitive knowledge of what’s in the set will help you make decisions on the rest of the steps.

Pick a Model Fast

Don’t spend very long choosing a model. If you’re doing image classification, check out AutoML. Otherwise, look at something like TensorFlow’s model repository or Fast.AI’s collection of examples to find a model that solves a problem similar to your product’s. The important thing is to begin iterating as quickly as possible, so you can try out your model with real users early and often. You’ll always be able to swap in an improved model down the road, and maybe see better results, but you have to get the data right first. Deep learning still obeys the fundamental computing law of “garbage in, garbage out”, so even the best model will be limited by flaws in your training set. By picking a model and testing it, you’ll be able to understand what those flaws are and start fixing them.
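One simple way to turn a quick test run into a list of data problems is to tally errors per label. The sketch below is only an illustration: the function names, the 0.2 threshold, and the sample labels are all made up for the example, and it assumes you already have the true and predicted label for each evaluation example.

```python
from collections import Counter


def per_label_error_rates(true_labels, predicted_labels):
    """Return {label: fraction of that label's examples the model got wrong}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must be the same length")
    totals = Counter(true_labels)
    wrong = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    return {label: wrong[label] / totals[label] for label in totals}


def labels_needing_attention(true_labels, predicted_labels, threshold=0.2):
    """List labels whose error rate exceeds the threshold, worst first."""
    rates = per_label_error_rates(true_labels, predicted_labels)
    flagged = [label for label, rate in rates.items() if rate > threshold]
    return sorted(flagged, key=lambda label: rates[label], reverse=True)


# Hypothetical evaluation results, invented for this example.
true_labels = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird", "bird"]
predicted_labels = ["cat", "cat", "dog", "dog", "dog", "cat", "dog", "bird", "cat"]

print(labels_needing_attention(true_labels, predicted_labels))  # ['bird', 'cat']
```

The labels at the top of that list are the first places to look for mislabeled, ambiguous, or missing training examples.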

To iterate even faster, try to start with a model that’s been pre-trained on a large existing dataset and use transfer learning to fine-tune it with the (probably much smaller) set of data you’ve gathered. This usually gives much better results than training only on your smaller dataset, and is much faster, so you can quickly get a feel for how you need to adjust your data gathering strategy. The most important thing is that you are able to incorporate feedback from your results into your collection process, adapting it as you learn, rather than running collection as a separate phase before training.
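The core idea of transfer learning can be shown with a toy sketch: keep a pre-trained feature extractor frozen and train only a small new "head" on your own data. Everything below is an assumption made for illustration. In particular, `pretrained_features` is a hypothetical stand-in for a real network trained on a large dataset, and the six training points are invented. In practice you would use a framework’s pre-trained model rather than hand-written Python.

```python
import math


def pretrained_features(x):
    """Hypothetical stand-in for a frozen, pre-trained feature extractor."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, labels, epochs=200, learning_rate=0.5):
    """Train a logistic-regression head on top of the frozen features."""
    # Features are computed once because the extractor is never updated.
    features = [pretrained_features(x) for x in examples]
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            prediction = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            error = prediction - y
            # Gradient step on the head only; the extractor stays fixed.
            weights = [w - learning_rate * error * v for w, v in zip(weights, f)]
            bias -= learning_rate * error
    return weights, bias


def predict(x, weights, bias):
    f = pretrained_features(x)
    score = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
    return 1 if score >= 0.5 else 0


# A deliberately tiny, invented dataset standing in for "the data you've gathered".
examples = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.3), (0.7, 0.9), (0.0, 0.4), (1.0, 0.6)]
labels = [0, 1, 0, 1, 0, 1]

weights, bias = train_head(examples, labels)
print([predict(x, weights, bias) for x in examples])
```

Because only the small head is trained, each run is quick, which is what makes it practical to retrain every time you change how you gather data.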

Next, we examine why you need to fake it before you make it.