Generating text is one of those projects that looks like a lot of fun to machine learning and NLP beginners, but it can also be pretty daunting. Or, at least it was for me.
Thankfully, there are all sorts of great materials online for learning how RNNs can be used for generating text, ranging from the theoretical to the technically in-depth to those decidedly focused on the practical. There are also some very good posts which cover it all and are now considered canon in this space. All of these materials share one thing in particular: at some point along the way, you have to build and tune an RNN to do the work.
While this is obviously a worthwhile undertaking, especially for the sake of learning, what if you are OK with a much higher level of abstraction, whatever your reason may be? Enter textgenrnn, a project by Max Woolf, a data scientist at BuzzFeed and former Apple Software QA Engineer.
textgenrnn is built on top of Keras and TensorFlow, and can be used to generate both character-level and word-level text (character level is the default). The network architecture uses attention-weighting and skip-embedding for accelerated training and improved quality, and allows for the tuning of a number of hyperparameters, such as RNN size, RNN layers, and the inclusion of bidirectional RNNs. You can read more about textgenrnn and its features and architecture at its GitHub repo or in this introductory blog post.
Since the "Hello, World!" for text generation (at least, in my mind) seems to be generating Trump tweets, let's go with that. textgenrnn's default pretrained model can be trained on new texts easily -- though you can also use textgenrnn to train a new model (just add new_model=True to any of its train functions) -- and since we want to see how quickly we can start generating tweets, let's stick with the pretrained model.
Acquiring the Data
I grabbed a selection of Donald Trump's tweets -- Jan 1, 2014 - Jun 11, 2018 (yesterday, at time of writing), which includes tweets from both before and after his inauguration as President of the United States -- from Trump Twitter Archive, a site which makes querying and downloading tweets from the President painless. I chose to grab only the text of the tweets in that date range, since I don't care about any of the metadata, and saved it to a text file I appropriately called trump-tweets.txt.
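For anyone reproducing this step, a minimal sketch of turning an exported archive into a plain text file might look like the following. It assumes, hypothetically, a JSON export that is a list of objects with a `text` field; the actual export format of Trump Twitter Archive may differ, and the function name `write_tweet_texts` is made up for illustration. Each tweet is flattened onto a single line so that one line equals one training text.

```python
import json
from pathlib import Path


def write_tweet_texts(export_json: str, out_path: str) -> int:
    """Write one tweet per line to out_path and return how many were written."""
    # Assumed input: a JSON array of objects, each with a "text" field.
    tweets = json.loads(export_json)
    lines = []
    for tweet in tweets:
        # Collapse newlines and repeated whitespace so each tweet stays on one line.
        text = " ".join(tweet.get("text", "").split())
        if text:  # skip tweets with no text at all
            lines.append(text)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling `write_tweet_texts(raw_export, "trump-tweets.txt")` with the downloaded export as a string would then produce the training file, metadata discarded.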
Training the Model
Let's see how uncomplicated it is to generate text with textgenrnn. Four lines are all we need to import the library, create a text generation object, train the model on the trump-tweets.txt file for 10 epochs, and then generate some sample tweets.
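As a sketch, those steps map onto textgenrnn's calls roughly as follows. The wrapper function `train_and_sample` and its parameter names are my own additions, there so that nothing starts training at import time; the calls inside it (`textgenrnn()`, `train_from_file`, `generate`) are the library's, though exact defaults may vary between versions.

```python
def train_and_sample(path="trump-tweets.txt", epochs=10, n_samples=5):
    """Fine-tune textgenrnn's default pretrained model on a text file, then sample from it."""
    # Imported lazily so this snippet loads even where textgenrnn/TensorFlow is not installed.
    from textgenrnn import textgenrnn

    textgen = textgenrnn()                            # create the generator with the pretrained weights
    textgen.train_from_file(path, num_epochs=epochs)  # train on the file, one text per line
    return textgen.generate(n_samples, return_as_list=True)  # sample some new texts
```

Running `train_and_sample()` from the directory containing the tweets file performs the whole import, create, train, generate sequence in one go.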
After about 30 minutes, here's what's generated (on the 10th epoch):
Leaving politics aside, and given that we are only using ~12K tweets for training in a mere 10 epochs, these generated tweets are not... terrible. Want to play with temperature (the textgenrnn default is 0.5) to get some more creative tweets? Let's try it out:
Well, that's less convincing. How about something more conservative, which the model is more confident of:
Well now, some of these are seemingly more legible.
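Why does temperature have this effect? In textgenrnn the value is passed as the `temperature` argument to `generate`. The sketch below shows the usual formulation of temperature scaling in plain Python: divide the log-probabilities by the temperature and renormalize. It illustrates the general technique rather than textgenrnn's exact internals, and the probabilities are made up.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution (all entries > 0) by a sampling temperature."""
    # Divide log-probabilities by the temperature, then renormalize with a softmax.
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


next_char_probs = [0.6, 0.3, 0.1]  # made-up probabilities for three candidate characters
for t in (0.2, 0.5, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(next_char_probs, t)])
```

A temperature of 1.0 leaves the distribution unchanged. Lower values pile probability onto the most likely character, which is why low-temperature samples look safer and more repetitive; higher values flatten the distribution and let unlikely characters through more often.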
Of course, this isn't perfect. There are all sorts of other things we could have tried, and the good news is that, if you don't want to implement your own solution, textgenrnn can handle many of them (again, see the GitHub repo):
- Train our own model from scratch
- Train with more sample data for a greater number of iterations
- Tune other hyperparameters
- Preprocess the data a bit (at the very least to eliminate the fake URLs)
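As an example of the last item, here is a minimal sketch of stripping URLs from the tweets before training, so the model never learns to emit them. The pattern and the name `strip_urls` are illustrative assumptions and not part of textgenrnn; the regex is deliberately simple and will not catch every kind of link.

```python
import re

# Matches http(s) links and bare t.co short links.
URL_PATTERN = re.compile(r"(?:https?://|\bt\.co/)\S+")


def strip_urls(tweet: str) -> str:
    """Remove URLs from a tweet and tidy up the leftover whitespace."""
    without_urls = URL_PATTERN.sub("", tweet)
    # Re-join on single spaces to remove the gaps the deleted URLs leave behind.
    return " ".join(without_urls.split())
```

Applying `strip_urls` to each line of the training file before handing it to textgenrnn would be one way to do this cleanup.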
Kind of fun. I'm interested in seeing how a default textgenrnn model performs out-of-the-box against a custom, well-tuned model. Maybe something for next time.