Monday, April 7, 2014

How probability is used in classifying emails as spam or non-spam?
Imagine that every mail message is a series of words. So, while writing the message, the author keeps pulling words from their dictionary and writes them one next to the other.
Now, suppose we have another person who looked at lots of messages and classified them as spam or non-spam.
What we want to do is build an automatic classifier that can read all the words in a mail message and automatically determine if it is spam.
This is an example of conditional probability
P(Spam=true | Document = d) = P(Document = d | Spam = true) x P(Spam=true) /P(Document = d)

Document is a collection of words. Here we assume that the series of words in the document are drawn independently i.e. occurence of one word does not influence the probability of subsequent words occurring.
P(Document =d | Spam = true) = P(word1 = w1,word2=w2...| spam=true) = P(word1=w1| spam) * P(word2=w2|spam)..

P(Document = d) is the same for both a spam probability and non-spam probability
So, we only compute the numerator and use that for comparing classifications and choose the one with the highest probability

Friday, July 27, 2012

Simple combinatorics

A simple explanation for the following result
NCr=N-1Cr+N-1Cr-1

Suppose you have 100 people and you need to choose 10 people. How many ways can you do this? This is equal to NCr

You can break this into two parts.

1. How many ways can you choose 10 people from 99 and excluding yourself from the 10? This is the same as N-1Cr

2. How many ways can you choose 9 people from 99 and keeping yourself as the 10th? This is the same as
  N-1Cr-1

It is clear that summing these two parts gives us the number of ways to choose 10 people from 100.
I heard this in a machine learning course offered by Caltech. Machine Learning lecture

Thursday, July 12, 2012

Basic algebra

I found a new interpretation (atleast new to me) for (a+b)n ..if we have n items and we want to partition them into two parts, then (a+b)n gives us a clue..
(a+b)n can be interpreted as follows.
Imagine that you have n slots and each slot can take either a or b.
Let us come up with the number of ways in which these slots can be filled :
Assume n=3 for simplicity
In this case we have the following options
aaa
aab
aba
abb
baa
bab
bba
bbb

Now let us group this based on number  of a values and number of b values
aaa 

aab
baa
aba

abb
bab
bba

bbb

So we notice the following
a3b0 is occuring 1 time
a2b1 is occurring 3 times
a1b2 is occurring 3 times
a0b3 is occuring 1 time

(a+b)3= 1xa3b0+3xa2b1+3xa1b2+1xa0b3

the number of times any combination aibj occurs is equal to the number of ways in which we can choose a i times and b j times while ensuring that number of a values and number of b values sum to n i.e. i+j=n

Apply the well known combination formula nCr = n!/(n-r)!r!
(a+b)3= 3C0xa3b0+3C1xa2b1+3C2xa1b2+3C3xa0b3


So (a+b)n = Σ r=0 to n nCr x arb(n-r)
Note that the coefficients are symmetric in the sense that a=r and b=n-r has the same coefficient as b=r and a=n-r
Suppose we set a=1 and b=1 then we get

2=  Σ i=0 to n nCr


       OR  we can write this as 


2n=1+nC1+nC2+nC3+..nCn


    OR we can write this as 


2n = 1+ n+n(n-1)/2 + n(n-1)(n-2)/6 + n(n-1)(n-2)(n-3)/24.... 


Let us now see how we can do the same thing for
(a+b+c)n


here we have to find the number of ways in which 3 kinds of items are selected
example we have apples, bananas and cherries
We need to select 3 fruit - how many ways can we do it
3 apples and 0 bananas and 0 cherries
2 apples and 0 bananas and 1 cherry
2 apples and 1 banana and 0 cherries
1 apple and 0 bananas and 2 cherries
1 apple and 1 banana and 1 cherry
1 apple and 2 bananas and 0 cherries
0 apples and 0 bananas and 3 cherries
0 apples and 1 banana and 2 cherries
0 apples and 2 bananas and 1 cherry
0 apples and 3 bananas and 0 cherries

This is like choosing r apples and s bananas and (n-r-s) cherries
Replace apples, bananas and cherries with a, b and c and we get the various terms in the equation
r can take value 0 to 3
s can take value 0 to 3
Replace 3 with n and we can get the solution for (a+b+c)n
So it is a double summation
(a+b+c)n = Σ r=0 to n Σ s=0 to n nCr,s x arbsb(n-r-s)

Now we can apply this same logic to find (a+b+c..+k)n where instead of choosing only a and b and c we
have to choose k items
This means we need to partition into k sets with set 1 having n1 items, set 2 having n2 items..and so on till set k-1 having nk-1 items and the remaining items in set k
In this case each term in the summation is of the following form
an1bn2cn3....jnjk(n-(n1+n2...nj)) and the coeffecient is n!/(n1! x n2! x n3!...(n-(n1+n2...+nj))!


Another interesting derivation

Suppose a=-1 and b = 1 and we substitute this in
(a+b) then we get the following results
when n is even
0 = 1 - n + n(n-1)/2 - n(n-1)(n-2)/6 + n(n-1)(n-2)(n-3)/24.... -n(n-1)/2 + n - 1
means
 n - n(n-1)/2 + n(n-1)(n-2)/6 -  n(n-1)(n-2)(n-3)/24.... =  1

when n is odd
0 = -1 + n - n(n-1)/2 + n(n-1)(n-2)/6 - n(n-1)(n-2)(n-3)/24.... 
means
n - n(n-1)/2 + n(n-1)(n-2)/6 - n(n-1)(n-2)(n-3)/24....  = 1

Each term will appear once with a positive sign and once with a negative sign. As a result it sums to zero.