Monday, April 7, 2014

How is probability used to classify emails as spam or non-spam?
Imagine that every mail message is a series of words. While writing the message, the author keeps pulling words from their dictionary and writes them one after the other.
Now, suppose another person has looked at many messages and labeled each one as spam or non-spam.
What we want to do is build an automatic classifier that can read all the words in a mail message and automatically determine if it is spam.
This is an application of conditional probability, specifically Bayes' theorem:
P(Spam = true | Document = d) = P(Document = d | Spam = true) * P(Spam = true) / P(Document = d)

A document is a collection of words. Here we assume that the words in the document are drawn independently given the class, i.e. the occurrence of one word does not influence the probability of subsequent words occurring.
P(Document = d | Spam = true) = P(word1 = w1, word2 = w2, ... | Spam = true) = P(word1 = w1 | Spam = true) * P(word2 = w2 | Spam = true) * ...
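
The product above can be sketched in a few lines of Python. This is only an illustration: the per-word probabilities and the helper name document_likelihood are made up for the example and do not come from any real data set.

```python
from math import prod

# Hypothetical values of P(word | Spam = true); illustrative numbers only.
p_word_given_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001}

def document_likelihood(words, p_word_given_class):
    """Multiply the per-word probabilities, assuming the words are independent."""
    return prod(p_word_given_class[w] for w in words)

# P(Document = "free money" | Spam = true) under the independence assumption
likelihood = document_likelihood(["free", "money"], p_word_given_spam)
print(likelihood)
```

With these assumed numbers the result is 0.05 * 0.04, one factor per word in the message.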

P(Document = d) is the same whether we compute the probability of spam or of non-spam, so it does not affect the comparison.
So, we only compute the numerator for each class, compare the two values, and choose the class with the higher one.
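
Putting the pieces together, a minimal sketch of such a classifier might look like the following. Several things here are assumptions added for the example rather than part of the description above: the tiny training set is invented, the function names (train, log_numerator, classify) are hypothetical, add-one smoothing is used so that a word never seen in a class does not force the product to zero, and logarithms are summed instead of multiplying probabilities to avoid underflow on long messages.

```python
import math
from collections import Counter

def train(labeled_messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in labeled_messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts

def log_numerator(words, is_spam, word_counts, class_counts):
    """log of P(Document = d | class) * P(class), with add-one smoothing (an assumption)."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    total = sum(word_counts[is_spam].values())
    # log P(class), estimated from the share of labeled messages in that class
    score = math.log(class_counts[is_spam] / sum(class_counts.values()))
    for w in words:
        # log P(word | class), one term per word (independence assumption)
        score += math.log((word_counts[is_spam][w] + 1) / (total + len(vocab)))
    return score

def classify(text, word_counts, class_counts):
    """Return True (spam) or False (non-spam), whichever has the larger numerator."""
    words = text.lower().split()
    return max((True, False),
               key=lambda c: log_numerator(words, c, word_counts, class_counts))

# Invented training data, standing in for the person who labeled messages by hand.
training = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("meeting agenda for monday", False),
    ("lunch meeting on monday", False),
]
word_counts, class_counts = train(training)
print(classify("free money", word_counts, class_counts))
print(classify("monday meeting", word_counts, class_counts))
```

Note that P(Document = d) never appears in the code, since it is the same for both classes and cannot change which numerator is larger.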