By Titash Neogi, Chief Architect and Entrepreneur In Residence at kontikilabs
As chief architect at Kontiki Labs, I wear two hats. The first is as an AI researcher, looking at new developments in AI and bringing them into our company's core capabilities as needed. The second is an AI evangelist / product management role, where I work with businesses to understand their needs or problems and suggest the right AI-powered solutions for them.
Needless to say, I am constantly toggling between developer and business roles and looking for workflows to optimise my available dev time. During business travel, I tend to use my Sundays for some lock-down research and development around ML or AI.
This is the narrative of a typical AI Sunday, on which I decided to build a sequence-to-sequence (seq2seq) chatbot using some already available sample code and data from the Cornell movie database. Seq2seq models, typically built from recurrent neural networks (RNNs), are very well suited to problems such as chatbots and machine translation.
In this specific instance, my focus was to get the seq2seq model working by starting the training of the RNN. The larger goal was to figure out the optimal workflow, in terms of cost and training time, for training a seq2seq model. When I started, I had three options: use our own GPU-powered desktop, a Google Cloud Platform (GCP) GPU instance, or an Amazon EC2 GPU instance.
It was an overcast Sunday and the clouds were building up - the perfect day to stay indoors and put your RNN through its paces.
I had the dataset downloaded, cleaned up and converted into the format that the seq2seq model needed for training. All of this went reasonably well in my Anaconda development environment. No glitches so far. Time spent: 45 minutes.
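For readers who want a feel for the conversion step, here is a minimal sketch of turning movie dialogue into (prompt, reply) training pairs. It is not the code I used that day. It assumes the layout of the publicly distributed Cornell corpus files (movie_lines.txt and movie_conversations.txt, with fields separated by " +++$+++ "), and the function names are hypothetical.

```python
import ast

# Assumption: fields in the corpus files are separated by this token.
SEPARATOR = " +++$+++ "


def parse_lines(raw_lines):
    """Map each line ID (e.g. 'L1') to its utterance text."""
    id_to_text = {}
    for raw in raw_lines:
        fields = raw.rstrip("\n").split(SEPARATOR)
        # Assumed field order: line ID, user ID, movie ID, character, text.
        if len(fields) == 5:
            id_to_text[fields[0]] = fields[4].strip()
    return id_to_text


def parse_conversations(raw_conversations):
    """Return one list of line IDs per conversation."""
    conversations = []
    for raw in raw_conversations:
        fields = raw.rstrip("\n").split(SEPARATOR)
        # Assumed field order: user ID, user ID, movie ID, list of line IDs.
        if len(fields) == 4:
            conversations.append(ast.literal_eval(fields[3]))
    return conversations


def build_pairs(id_to_text, conversations):
    """Pair every utterance with the one that follows it in a conversation."""
    pairs = []
    for line_ids in conversations:
        for prompt_id, reply_id in zip(line_ids, line_ids[1:]):
            prompt = id_to_text.get(prompt_id)
            reply = id_to_text.get(reply_id)
            # Skip pairs with a missing or empty side.
            if prompt and reply:
                pairs.append((prompt, reply))
    return pairs
```

In practice you would read the two files from disk, pass their lines to parse_lines and parse_conversations, and then write the pairs out in whatever format your seq2seq code expects.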
Time to go seq2seq. Let’s get you all fired up. I had 300 USD of credit from GCP, so I figured I would use that and leverage Google’s GPU power. I headed over to the Google Cloud Platform dashboard page. The first thing I needed to do was upgrade my account from trial to paid and add a credit card, since Google doesn’t allow GPU usage on trial accounts.
The next step was to choose a GCP instance with vCPUs and GPU quota. Before I launched the GPU, I got curious to see how long the model would take to train on a high-end instance with multiple vCPUs and large memory.
My model was built on TensorFlow 1.0.0, so I headed over to tensorflow.org and downloaded the correct .whl for 1.0.0 on Python 2.7. Once done, I uploaded the code to my GCP instance and started the training. I had a few venv issues that needed fixing, but after another 10 minutes, the model was training.
I had set up the model to train for 100 epochs with a batch size of 4000, and I noticed from the average step time in the logs that this would probably take more than 18 days. Not a good idea, and clearly no benefit from a large multi-core CPU instance. I killed the process and decided to add GPUs into the mix.
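The back-of-the-envelope estimate behind that decision can be sketched as follows. This is only an illustration: it assumes one logged step corresponds to one batch, the function name is hypothetical, and the numbers in the usage note are placeholders rather than the measurements from my run.

```python
import math

SECONDS_PER_DAY = 86400.0


def estimate_training_days(num_examples, batch_size, epochs, seconds_per_step):
    """Estimate wall-clock training time in days from the average step time.

    Assumes each step processes exactly one batch, and that a final
    partial batch still costs a full step.
    """
    steps_per_epoch = int(math.ceil(num_examples / float(batch_size)))
    total_steps = steps_per_epoch * epochs
    return total_steps * seconds_per_step / SECONDS_PER_DAY
```

For example, with a placeholder dataset of 10,000 examples, a batch size of 100, 10 epochs and an average of 2 seconds per step, estimate_training_days(10000, 100, 10, 2.0) works out to 1,000 steps, or 2,000 seconds of training. Running the same arithmetic a few minutes into a real job is a cheap way to decide whether to let it continue.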
The GCP process for GPUs is a little roundabout. You have to ensure you create the instance in the right zone (the list of zones is available in the documentation). After several attempts to choose the right vCPU / GPU combination in the right zone, I finally realised that you can’t launch a GPU-backed instance straight away. You have to apply to Google for a quota from the dashboard page, justify why you need to use GPUs, and then possibly wait for them to enable the quota.
By this time I had already spent a good hour on GCP and a total of 2.5 hours of a sleepy Sunday, and I had not really made much progress towards training the model. I decided to head over to AWS and see how Amazon does things. Even though I had no credits there, I had a hunch AWS might make it faster for me to set up the GPU.
Amazon indeed makes it very simple to start a GPU-backed instance. They call them Deep Learning Instances, and they come in various combinations with CUDA pre-installed and a lot of other ML libraries prebaked into the instance.
Amazon’s pricing is also simpler, which makes it straightforward to do the math and predict costs. I was pretty excited because it appeared I would be on my way to GPU-backed training fairly soon.
I launched the cheapest version of the GPU-backed instance. One weird thing I noticed at this stage was the large gap between Amazon’s base ML instance and the next tier - a difference of almost 7 USD per hour, which is a lot.
With my instance launched, I ssh’ed into it, uploaded my data and code, and tried installing TensorFlow from its binary URL for the version I wanted. At this stage I realised that there was a conflict between the version of Python installed on the instance by default (3.6) and the Python and TensorFlow versions I wanted to install. I spent another 20 minutes setting up a venv and then figuring out the correct way to import the modules within it.
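In hindsight, a small pre-flight check at the top of the training script would have surfaced this mismatch immediately. The sketch below is one way to do it, not what I actually ran. The helper names are hypothetical, and it relies on the imported package exposing a __version__ attribute, which TensorFlow does.

```python
import importlib
import sys


def python_matches(required, actual=None):
    """Return True when the interpreter's (major, minor) equals `required`."""
    actual = sys.version_info if actual is None else actual
    return tuple(actual[:2]) == tuple(required)


def installed_version(module_name):
    """Return the module's __version__, or None if it cannot be imported
    or does not expose a __version__ attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def environment_problems(required_python, required_packages, actual_python=None):
    """Collect readable mismatches instead of failing on the first one."""
    problems = []
    if not python_matches(required_python, actual_python):
        found = sys.version_info if actual_python is None else actual_python
        problems.append("Python %d.%d required, found %d.%d" % (
            required_python[0], required_python[1], found[0], found[1]))
    for name, wanted in sorted(required_packages.items()):
        found = installed_version(name)
        if found != wanted:
            problems.append("%s %s required, found %s" % (name, wanted, found))
    return problems
```

Calling environment_problems((2, 7), {"tensorflow": "1.0.0"}) inside the venv returns an empty list when everything lines up, and otherwise a list of messages you can print before exiting.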
Once this was sorted, my TensorFlow code started executing. I did get warnings about TensorFlow not being optimally compiled for the GPU and CPU, but I ignored them for the moment - my objective was to get the training going reasonably fast.
One of the things I have learnt about machine learning (and in fact it holds good for all programming) is that at an early stage you should not try to solve or achieve too many things at the same time, but focus on getting one thing working well at a time and knowing the best way to do it. In this case, I was focussed on understanding the best cloud GPU choice for running my TensorFlow-based RNN, so I ignored the smaller aspects of optimisation.
At this stage, nearly 4 hours into my TensorFlow Sunday, I realised that I had done very little actual TensorFlow work and a lot of data preprocessing, cloud GPU selection and environment troubleshooting. I did uncover a ton of interesting information in the process, ranging from discussions on how to run an RNN model on your Mac using the Radeon GPU (and the fact that it’s a bad idea in terms of your Mac’s life) to the correct instructions for building TensorFlow from scratch on your system.
Ok, and finally, as the Sunday drew to a close, my model was training at a rapid pace thanks to the AWS GPU-backed instance - each step taking under 1 minute and the total training time looking like it would come to under a day. Not bad for a million-line content base.
As the evening lights came on and I decided to call it a day, my final takeaway was the excitement of seeing my seq2seq model in action soon, and the fact that Amazon AWS still beats Google hands down when it comes to ease of firing up infrastructure and getting straight to work.
Bio: Titash Neogi is a seasoned technology professional with 10 years of experience in equal mixes of technical product management, consumer-internet user-behaviour, engineering and research skills, user experience design and startup team building. He is Chief Architect and Entrepreneur In Residence at kontikilabs.