Spam Detection

How To Build a Simple Spam-Detecting Machine Learning Classifier

In this tutorial we will begin by laying out a problem and then proceed to show a simple solution to it using a Machine Learning technique called a Naive Bayes Classifier. This tutorial requires a little bit of programming and statistics experience, but no prior Machine Learning experience is required.

You work as a software engineer at a company which provides email services to millions of people. Lately, spam has been a major problem and has caused your customers to leave. Your current spam filter only filters out emails that have previously been marked as spam by your customers. However, spammers have become more clever, and in order to stop your customers from leaving, you are given the assignment of predicting whether an email is spam.

In order to create an algorithm for this, you need to teach your program what a spam email looks like (and what non-spam emails look like). Luckily, you have all the previous emails that have been marked as spam by your customers. You also need a way to test the accuracy of your spam filter. One idea would be to test it on the same data that you used for training. However, this can lead to a major problem in ML called overfitting, which means that your model is too biased towards the training data and may not work as well on elements outside of that training set. One common way to avoid this is to split your labeled data 70/30 for training/testing. This ensures you test on different data than you trained on. It's important to note that you need a mix of both spam and non-spam data in your data sets, not just the spam ones. You really want your training data to be as similar to real email data as possible; I've linked a good data set at the bottom of this post.
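
As a rough sketch of what that split might look like in code (assuming your labeled examples already live in a list, called emails here purely for illustration):

# a minimal sketch of a 70/30 train/test split; `emails` is a
# hypothetical list of labeled emails, shuffled so the split is random
import random

random.shuffle(emails)
splitIndex = int(len(emails) * 0.7)
trainData = emails[:splitIndex]   # used to train the classifier below
testData = emails[splitIndex:]    # held out to measure accuracy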

Bayes' theorem is mathematically expressed as: P(A|B) = P(B|A) * P(A) / P(B)

It essentially gives us a trick to calculate conditional probabilities in situations where it wouldn't be feasible to measure them directly. For instance, if you want to calculate someone's chance of having cancer given their age, instead of doing a nationwide study, you can just take existing statistics about age distribution and cancer and plug them into Bayes' theorem. Don't worry if the statistics theory is confusing; it will make more sense when it is translated into code. I do, however, recommend going back and trying to understand it later, as failure to understand Bayes' theorem is at the root of many logical fallacies.
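
As a purely illustrative example (the numbers here are made up just to show the mechanics): suppose 1% of people have cancer, 30% of cancer patients are over 70, and 15% of the overall population is over 70. Then P(cancer | over 70) = P(over 70 | cancer) * P(cancer) / P(over 70) = 0.30 * 0.01 / 0.15 = 0.02, or a 2% chance, without ever having to survey the over-70 population directly.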

For our problem, we can set A to be the event that the email is spam and B to be the contents of the email. If P(A|B) > P(¬A|B), we classify the email as spam; otherwise, we classify it as not spam. Note that since Bayes' theorem results in the same divisor P(B) in both cases, we can remove it from the equation for our comparison. This leaves: P(A)*P(B|A) > P(¬A)*P(B|¬A). Calculating P(A) and P(¬A) is trivial: they are simply the fractions of your training set that are spam and not spam, respectively:

# runs once on training data
def train():
    global pA, pNotA
    total = 0
    numSpam = 0
    for email in trainData:
        if email.label == SPAM:
            numSpam += 1
        total += 1
    pA = numSpam / float(total)
    pNotA = (total - numSpam) / float(total)

The more difficult part is calculating P(B|A) and P(B|¬A). In order to calculate these, we are going to use the bag of words model. This is a pretty simple model which treats a piece of text as a bag of individual words, paying no attention to their ordering. For each word, we calculate the percentage of times it shows up in spam emails as well as in non-spam emails. We call this probability P(B_i|A_x). As a concrete example, to calculate P(free | spam), we would count the number of times the word free occurs across all the spam emails combined and divide it by the total number of words in all spam emails combined. Since these are static values, we can calculate them in our training phase.

# global word counts populated during training
trainPositive = {}   # word -> count in spam emails
trainNegative = {}   # word -> count in non-spam emails
positiveTotal = 0    # total number of words across all spam emails
negativeTotal = 0    # total number of words across all non-spam emails

# runs once on training data
def train():
    global pA, pNotA
    total = 0
    numSpam = 0
    for email in trainData:
        if email.label == SPAM:
            numSpam += 1
        total += 1
        processEmail(email.body, email.label)
    pA = numSpam / float(total)
    pNotA = (total - numSpam) / float(total)

# counts the words in a specific email
def processEmail(body, label):
    global positiveTotal, negativeTotal
    for word in body:
        if label == SPAM:
            trainPositive[word] = trainPositive.get(word, 0) + 1
            positiveTotal += 1
        else:
            trainNegative[word] = trainNegative.get(word, 0) + 1
            negativeTotal += 1

# gives the conditional probability p(B_i | A_x)
def conditionalWord(word, spam):
    if spam:
        return trainPositive[word] / float(positiveTotal)
    return trainNegative[word] / float(negativeTotal)

To get P(B|A_x) for an entire email, we simply take the product of the P(B_i|A_x) values for every word i in the email. Note that this is done at classification time, not during the initial training.

# gives the conditional probability p(B | A_x)
def conditionalEmail(body, spam):
    result = 1.0
    for word in body:
        result *= conditionalWord(word, spam)
    return result

We finally have all the components needed to put it all together. The last piece is the classifier, which gets called for every email and uses our previous functions to classify it.

# classifies a new email (given as its list of words) as spam or not spam
def classify(email):
    isSpam = pA * conditionalEmail(email, True)       # proportional to P(A|B)
    notSpam = pNotA * conditionalEmail(email, False)  # proportional to P(¬A|B)
    return isSpam > notSpam

Congratulations! You’ve successfully coded a Naive Bayes Classifier from scratch!

However, there are some changes you’d need to make to get it working optimally and bug-free:

Laplace Smoothing
One thing we haven't mentioned is what happens if a word in the email you're classifying wasn't in your training set. With the code above, an unseen word would throw a KeyError, and even with a default count of zero it would zero out the entire product. In order to handle this case, we need to add a smoothing factor. This is best exemplified in the modified code below, where a smoothing factor alpha is added:

# gives the conditional probability p(B_i | A_x) with smoothing
def conditionalWord(word, spam):
    if spam:
        return (trainPositive.get(word, 0) + alpha) / float(positiveTotal + alpha * numWords)
    return (trainNegative.get(word, 0) + alpha) / float(negativeTotal + alpha * numWords)
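
The snippet above also assumes two values that haven't been defined yet: alpha, the smoothing factor (1 is the classic Laplace choice), and numWords, the number of distinct words seen during training. One reasonable way to set them, given the dictionaries built in train:

# assumed setup for the smoothed version above; alpha = 1 is classic
# Laplace smoothing, and numWords is the size of the training vocabulary
alpha = 1.0
numWords = len(set(trainPositive) | set(trainNegative))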

Log-Space
Our current implementation relies heavily on floating point multiplication. Multiplying many very small probabilities quickly underflows to zero, so one usually takes the logarithm of the equation, turning all the multiplications into additions. I didn't implement this in my sample code, but it is strongly recommended in practice.
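
For reference, here is one possible log-space version, reusing the functions defined earlier (a sketch of the idea rather than a definitive implementation; it assumes the smoothed conditionalWord, so no probability is ever zero):

import math

# gives log p(B | A_x) by summing log-probabilities instead of
# multiplying probabilities, which avoids floating point underflow
def logConditionalEmail(body, spam):
    result = 0.0
    for word in body:
        result += math.log(conditionalWord(word, spam))
    return result

# same comparison as classify, but done in log-space
def classifyLog(email):
    logSpam = math.log(pA) + logConditionalEmail(email, True)
    logNotSpam = math.log(pNotA) + logConditionalEmail(email, False)
    return logSpam > logNotSpam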

TF-IDF
Overall, the bag of words model for text classification is fairly naive and could be improved upon with a weighting scheme such as TF-IDF.
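
To give a rough sense of the idea (a simplified sketch, not wired into the classifier above), TF-IDF weights a word by how frequent it is within one email, discounted by how many emails contain it at all:

import math

# simplified TF-IDF weight of one word in one email, where allEmails is
# a list of emails each represented as a list of words (illustrative only)
def tfidf(word, emailWords, allEmails):
    tf = emailWords.count(word) / float(len(emailWords))
    docsWithWord = sum(1 for e in allEmails if word in e)
    idf = math.log(len(allEmails) / float(1 + docsWithWord))
    return tf * idf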

N-Grams
Another improvement we could make is to not just count individual words. N-grams is a technique in which one considers sequences of N consecutive words and uses those to calculate the probabilities. This makes sense because in English the 1-gram ‘good’ conveys something different than the 2-gram ‘not good’.
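
As a small sketch, 2-grams can be generated from a tokenized email like this, and then counted exactly the same way individual words were counted above:

# builds the list of 2-grams (pairs of consecutive words) for an email,
# e.g. ['not', 'good', 'at', 'all'] -> ['not good', 'good at', 'at all']
def bigrams(words):
    return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]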

Tokenization
One interesting thing to play around with is how you decide what counts as a distinct word. For instance, are Free, free, and FREE the same word? What about punctuation?
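
For example, one simple choice (and by no means the only one) is to lowercase everything and keep only alphanumeric runs, so that Free, free, and FREE all become the same token:

import re

# one possible tokenizer: lowercase the text and split it into
# alphanumeric "words", dropping punctuation entirely
def tokenize(text):
    return re.findall(r'[a-z0-9]+', text.lower())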

Please note that the sample code is written for optimal teaching instead of performance. There are some clear, trivial changes that could drastically improve its performance.

Sample Dataset: https://spamassassin.apache.org/publiccorpus/

If you enjoyed this article, please recommend it and follow the author to help others see it!