By Sabber Ahamed, Computational Geophysicist and Machine Learning Enthusiast
For the last couple of days, I have been thinking about writing up my recent experience using raw bash commands and regex to mine text. Of course, there are more sophisticated tools and libraries available for processing text without writing so many lines of code. For example, Python has a built-in regex module, ‘re,’ with many rich features for processing text. ‘BeautifulSoup,’ on the other hand, has nice built-in features for web scraping and cleaning raw web pages. I also use these tools for faster processing of large text corpora, or when I feel too lazy to write code.
I always prefer the command line; I feel at home there when working on text processing and file management. In this tutorial, I use raw bash commands and regex to process a raw, messy JSON file and a raw HTML page. The tutorial helps us understand the text-processing mechanisms under the hood. I assume readers have basic familiarity with regex and bash commands.
In the first part of the tutorial, I show how bash commands like ‘grep,’ ‘sed,’ ‘tr,’ ‘column,’ ‘sort,’ ‘uniq,’ and ‘awk,’ combined with regex, can be used to process raw and messy text and get insight into the data. As an example, I use the Complete Works of William Shakespeare provided by Project Gutenberg, in cooperation with World Library, Inc.
The complete works of Shakespeare can be downloaded from the internet. I downloaded the Complete Works of William Shakespeare and saved it in a text file: ‘shakes.txt.’ All right, let’s get started by looking at the file size:
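A minimal version of this check, assuming the file sits in the current directory as ‘shakes.txt,’ might look like:

```shell
# -l: long listing, -a: include hidden files, -h: human-readable sizes
ls -lah shakes.txt
```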
The ‘ls’ command lists the files and folders in a directory. The ‘-l’ flag displays file type, owner, group, size, date, and filename. The ‘-a’ flag displays all files, including hidden ones. The ‘-h’ flag, one of my favorites, displays file sizes in a human-readable format. The size of shakes.txt is 5.6 megabytes.
Okay, now let’s read the file to see what’s in it. I use the ‘less’ and ‘tail’ commands to explore parts of the file. The names of the commands hint at their functionality. ‘less’ is used to view the contents of a text file one screen at a time. It is similar to ‘more’ but has the extended capability of allowing both forward and backward navigation through the file. The ‘-N’ flag displays line numbers. Similarly, ‘tail’ shows the last few lines of the file.
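A sketch of exploring the file with these two commands (again assuming the file is named ‘shakes.txt’):

```shell
# Page through the file with line numbers (press q to quit)
less -N shakes.txt

# Show the last 10 lines of the file
tail shakes.txt
```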
It looks like the first couple of lines are not Shakespeare’s work but some information about Project Gutenberg. Similarly, the tail of the file contains some lines unrelated to Shakespeare’s work. So I delete the unnecessary tail part first, then the header part of the file, using the ‘sed’ command as below:
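Using the line numbers I found for this particular copy of the file, the deletion can be sketched as follows (note: ‘sed -i’ edits the file in place with GNU sed; on macOS you would write ‘sed -i ""’):

```shell
# Delete the unwanted tail first (lines 149260-149689), so the earlier
# line numbers are not shifted, then delete the 141 header lines
sed -i '149260,149689d' shakes.txt
sed -i '1,141d' shakes.txt
```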
The above code snippet deletes lines 149260 to 149689 at the tail, then deletes the first 141 lines. The unwanted lines include information about legal rights, Project Gutenberg, and the table of contents. All right, now let’s compute some statistics on the file using a pipe and ‘awk’.
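One way to sketch this pipeline (the labels in the awk print are my own choice):

```shell
# wc prints lines, words, and characters; awk labels the three fields
cat shakes.txt | wc | awk '{print "lines:", $1, "words:", $2, "chars:", $3}'
```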
In the above code, I first extract the entire text of the file using ‘cat’ and then pipe it into ‘wc’ to count the number of lines, words, and characters. Finally, I use ‘awk’ to display the information. The counting and displaying can be done in many other ways; feel free to explore the options.
All right, it’s time to clean and process the text for further analysis. Cleaning includes converting the text to lowercase and removing all digits, punctuation, and high-frequency words (stop words). Processing is not limited to these steps; it depends on the purpose. Since I only want to show the necessary processing, I focus on the steps mentioned above. First, I convert all uppercase characters to lowercase, then remove all digits and punctuation. To do this, I use the well-known bash command ‘tr,’ which translates or deletes characters in a text stream.
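A sketch of the cleaning pipeline (the output file name ‘shakes_clean.txt’ is my own choice):

```shell
# Lowercase everything, then delete punctuation, then delete digits
cat shakes.txt \
  | tr 'A-Z' 'a-z' \
  | tr -d '[:punct:]' \
  | tr -d '[:digit:]' > shakes_clean.txt
```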
The code snippet first converts the entire text to lowercase and then removes all punctuation and digits.
Tokenization is one of the basic preprocessing steps in natural language processing. Tokenization can be done at the word or sentence level. Here, I show how to tokenize the file into words. Tokenization on the command line can be performed with various commands like ‘sed,’ ‘awk,’ and ‘tr.’ I find ‘tr’ the easiest. In the code below, I first extract the cleaned text. Then I use ‘tr’ with its two flags, ‘-s’ and ‘-c,’ to put every word on its own line. Details of ‘tr’ and its various functionalities can be found in this StackExchange answer.
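A sketch of the tokenization step: ‘-c’ takes the complement of the letter set and ‘-s’ squeezes repeats, so every run of non-letter characters becomes a single newline (the file names are my own assumptions):

```shell
# One word per line: replace every run of non-letters with a newline
cat shakes_clean.txt | tr -sc 'a-z' '\n' > shakes_tokens.txt
```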
Now that we have all the words tokenized, we can perform some analysis to answer questions like: what are the most and least frequent words in the entire Shakespeare corpus? In the code below, I first use the ‘sort’ command to sort all the words, then the ‘uniq’ command with the ‘-c’ flag to find the frequency of each word. ‘uniq -c’ is similar to ‘groupby’ in Pandas or SQL. Finally, I sort the words by frequency in either ascending (least frequent first) or descending (most frequent first) order.
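A sketch of the frequency pipeline, assuming the tokenized words live in ‘shakes_tokens.txt’ (one word per line):

```shell
# Most frequent words: count duplicates, then sort counts descending
sort shakes_tokens.txt | uniq -c | sort -nr | head

# Least frequent words: same counts, sorted ascending
sort shakes_tokens.txt | uniq -c | sort -n | head
```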
The above results reveal some interesting observations. For example, the ten most frequent words are pronouns, prepositions, or conjunctions. If we want more abstract information about the document, we need to remove all the stop words: prepositions, pronouns, conjunctions, modal verbs, and so on. It also depends on the purpose; one might be interested only in prepositions, in which case it’s fine to keep them. On the other hand, the least frequent words are ‘abandoner,’ ‘abatements,’ and ‘abashd.’ A linguistics or literature student may draw better intuitions from these simple analytics.
In the next step, I show how to use ‘awk’ to remove all the stop words on the command line. In this tutorial, I use NLTK’s list of English stop words, to which I have added a couple more words. Details of the following code can be found in this StackOverflow answer. Details of the different options and variables of awk can also be found in its manual (‘man awk’ on the command line).
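A sketch of the idea, assuming the stop words sit one per line in a file I call ‘stop_words.txt’ and the tokens in ‘shakes_tokens.txt’: while reading the first file (FNR==NR), awk stores each stop word as an array key; while reading the second, it prints only tokens that are not keys.

```shell
# First file: build a lookup table of stop words.
# Second file: print a token only if it is not in the table.
awk 'FNR==NR { stop[$1]; next } !($1 in stop)' \
    stop_words.txt shakes_tokens.txt > shakes_filtered.txt
```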
All right, after removing the stop words, let’s sort the words in ascending and descending order as above.
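This is the same frequency pipeline as before, now run on the filtered tokens (the file name ‘shakes_filtered.txt’ is my own assumption):

```shell
# Most frequent words after stop-word removal
sort shakes_filtered.txt | uniq -c | sort -nr | head
```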
After removing the stop words, we see that the most frequent word Shakespeare uses in this corpus is ‘lord,’ followed by ‘good.’ The word ‘love’ is also among the most frequent words. The least frequent words remain the same.
Now that we are done with the necessary processing and cleaning, in the next tutorial I will discuss how to perform some analytics on this data. Until then, if you have any questions, feel free to ask. Please leave a comment if you see any typos or mistakes, or if you have better suggestions. You can reach out to me:
Bio: Sabber Ahamed is the founder of xoolooloo.com, a computational geophysicist, and a machine learning enthusiast.
Original. Reposted with permission.