By Shabaz Patel
These problems gave us countless headaches, and after talking to our friends, we knew we weren't alone. We wanted something that would not only keep track of configuration and results during an experiment but also allow data scientists to reproduce any experiment by rerunning it!
We initially built it as an internal solution for tracking our experiments, making them reproducible, and simplifying environment setup. As we grew it, we strove for a tool with an open, simple interface that integrated seamlessly with the way we were already doing machine learning: generic with respect to frameworks, yet powerful enough to provide complete reproducibility. Basically, something we could give to our friends so that they could run their experiments with a few commands on the command line and still repeat them reliably.
After building and using it ourselves, we decided to provide it as an open source tool called datmo.
Here’s how datmo works!
After the initial project setup, all it takes is a single command to run the experiment, and another command at the end to analyze the results!
This all seems good, but what happens when we have multiple experiments? datmo gets even more useful here, since we can use it to compare and analyze results and rerun previous experiments at a later point in time. Here's how you can use **datmo** for that!
Now, let’s get our hands dirty with this example. With datmo, we’ve taken the complexity out while providing a way to get everything off the ground very quickly.
For this example, we're going to train a simple classifier on the classic Fisher Iris dataset.
Let's first make sure we have the prerequisites for datmo. Docker is the main one, so let's ensure Docker is installed (and running!) before starting. You can find instructions for your OS here: macOS, Windows, Ubuntu.
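A quick way to confirm Docker is ready (these are standard Docker CLI commands, not datmo-specific):

```shell
# Check that the Docker CLI is installed
docker --version

# Check that the Docker daemon is actually running (this errors out if it isn't)
docker info
```

If `docker info` fails, start Docker Desktop (macOS/Windows) or the Docker service (Linux) before continuing.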
You can then install datmo from your terminal with the following:
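datmo is distributed on PyPI, so installation is a one-liner (assuming you already have Python and pip set up):

```shell
pip install datmo
```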
1. Clone this GitHub project,
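The clone step looks like the following; the repository URL below is a placeholder, so substitute the example project's actual URL:

```shell
# Placeholder URL -- use the example project linked above
git clone https://github.com/<your-org>/<example-project>.git
cd <example-project>
```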
2. In your project, initialize the datmo client using the CLI
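Initialization turns the current directory into a datmo project (this is the datmo CLI's `init` command; run it from the root of the cloned project):

```shell
# Run from the root of the cloned project
datmo init
```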
Then, respond to the following prompts:
Next, you’ll be asked if you’d like to set up your environment.
Select y and choose the following options when prompted sequentially:
3. Now, run your first experiment using the following command,
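Assuming the training script in the example project is `script.py`, a run looks something like this (the exact script name is an assumption; adjust it to match the project):

```shell
# Execute the script inside the datmo-managed environment and record it as a run
datmo run "python script.py"
```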
Let’s see the list of all runs,
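Listing runs is a single command; it shows each run's id along with its tracked configuration and results:

```shell
# List all recorded runs with their ids, configs, and results
datmo ls
```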
4. Now let’s change the script for a new run,
We'll change the script.py file. Let's uncomment the following line in the script and remove the other config dictionary:
5. Now that we have updated the environment and config in our script, let’s run another experiment,
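The second run uses the same command as before; datmo records it as a separate run with the updated environment and config, which we can then compare against the first:

```shell
# Record a second run with the updated script
datmo run "python script.py"

# Compare the two runs side by side
datmo ls
```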
6. Once that completes, we will have two tracked experiments that can be rerun on any machine.
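To reproduce an earlier experiment, pass its run id from `datmo ls` to the `rerun` command (the id below is a placeholder):

```shell
# Re-execute a previous run, rebuilding its recorded environment and config
datmo rerun <run-id>
```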
Congrats, you’ve now successfully reproduced a previous experiment run!
Previously, this process resulted in wasted time and effort on troubleshooting and headaches! With datmo, we ran experiments, tracked them, and reran them in 4 commands. Now you can share your experiments using this common standard without worrying about reproduction, whether it's a teammate reproducing your work or deploying a model to production. This is just a small sample, but you can go try other flows yourself, like spinning up a TensorFlow Jupyter notebook in 2 minutes!
Check us out on GitHub and give us your feedback at @datmoAI ✌️
Bio: Shabaz Patel is a cofounder at Datmo, building developer tools to help make data scientists more efficient. He has built and deployed Computer Vision and NLP based algorithms in production for companies and was a researcher at the Stanford AI Lab.