├── README.md
└── naive_bayes.py


/README.md:
--------------------------------------------------------------------------------
 1 | # naive_bayes_classifier
 2 | This is the code for "Probability Theory - The Math of Intelligence #6" by Siraj Raval on YouTube.
 3 | 
 4 | 
 5 | ## Coding Challenge - Due Date, Thursday July 27 2017 at 12 PM PST
 6 | 
 7 | Write your own Naive Bayes Classifier for any text dataset. Any improvements on my code from the video (for example, using n-grams instead of counting individual words, or TF-IDF instead of bag of words) will earn bonus points. Good luck!
 8 | 
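As one hedged illustration of the n-gram idea mentioned in the challenge, the sketch below turns a list of tokens into word n-grams that could be counted in place of single words. The `ngrams` helper and its arguments are hypothetical names chosen for this example; they are not part of the code from the video.

```python
def ngrams(tokens, n=2):
    """Return the n-grams (as tuples) found in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so a list shorter than n yields no n-grams
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "win a free prize now".split()
print(ngrams(tokens, 2))
# [('win', 'a'), ('a', 'free'), ('free', 'prize'), ('prize', 'now')]
```

Each tuple can be used as a dictionary key in the same way a single word would be, so the counting logic does not need to change.
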
 9 | ## Overview
10 | 
11 | This is the code for [this](https://youtu.be/PrkiRVcrxOs) video on YouTube by Siraj Raval as part of The Math of Intelligence course. It uses a Naive Bayes classifier to label text as spam or not spam, trained on [this](https://www.kaggle.com/uciml/sms-spam-collection-dataset) Kaggle dataset.
12 | 
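Multiplying many small word probabilities, as `conditionalEmail` in `naive_bayes.py` does, can underflow to 0.0 for long messages. A common remedy is to add logarithms instead of multiplying probabilities. The sketch below is self-contained and is not the code from the video; `log_score`, its parameters, and the toy counts are hypothetical and only illustrate the idea.

```python
import math


def log_score(words, counts, total, vocab_size, prior, alpha=1.0):
    """Log of (prior * smoothed likelihood of `words`) for one class."""
    score = math.log(prior)
    for word in words:
        # Laplace-smoothed word probability, accumulated in log space
        p = (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)
        score += math.log(p)
    return score


# Toy word counts, for illustration only
spam_counts = {"free": 3, "prize": 2}    # 5 words seen in spam
ham_counts = {"meeting": 4, "free": 1}   # 5 words seen in non-spam
message = ["free", "prize"]

spam = log_score(message, spam_counts, total=5, vocab_size=3, prior=0.5)
ham = log_score(message, ham_counts, total=5, vocab_size=3, prior=0.5)
print(spam > ham)  # True: the message scores higher under the spam class
```

Because the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the two products, as long as the products have not already underflowed.
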
13 | ## Dependencies
14 | 
15 | * numpy
16 | 
17 | Install numpy using [pip](https://pip.pypa.io/en/stable/): `pip install numpy`.
18 | 
19 | 
20 | ## Usage
21 | 
22 | Type `python naive_bayes.py` to run the code. 
23 | 
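The script does not show how the training data is read. As a rough sketch, assuming the Kaggle download is a CSV file with a header row and with the label and the message text in the first two columns, the messages could be tokenized like this. The `load_messages` helper, the tokenizing rule, and the two sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io
import re


def load_messages(lines):
    """Yield (label, tokens) pairs from CSV rows of the form label,text."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        label, text = row[0], row[1]
        # lowercase the text and keep runs of letters, digits and apostrophes
        yield label, re.findall(r"[a-z0-9']+", text.lower())


# Tiny in-memory stand-in for the real file, for illustration only
sample = io.StringIO('v1,v2\nham,"Ok, see you at 5"\nspam,WIN a FREE prize now\n')
for label, tokens in load_messages(sample):
    print(label, tokens)
```

With the real download, a file object such as `open("spam.csv", encoding="latin-1", newline="")` could be passed in place of `sample`; the file name and encoding are assumptions about the Kaggle download, so check them against your copy.
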
24 | 
25 | ## Credits
26 | 
27 | Credits go to [AlanBuzdar](https://github.com/alanbuzdar). I've merely created a wrapper to get people started. 
28 | 


--------------------------------------------------------------------------------
/naive_bayes.py:
--------------------------------------------------------------------------------
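# Naive Bayes spam classifier (bag of words with Laplace smoothing)
from collections import namedtuple

# Assumed container for one message: `body` is a list of word tokens and
# `label` is SPAM or anything else. The original code left this undefined.
Email = namedtuple("Email", ["body", "label"])

SPAM = "spam"  # assumed label value for spam messages
alpha = 1.0    # assumed smoothing strength (1.0 is classic Laplace smoothing)

# model state, filled in by train()
trainPositive = {}  # word -> count in spam messages
trainNegative = {}  # word -> count in non-spam messages
positiveTotal = 0   # total number of words seen in spam messages
negativeTotal = 0   # total number of words seen in non-spam messages
numWords = 0        # vocabulary size (distinct words seen in training)
pA = 0.0            # prior P(spam)
pNotA = 0.0         # prior P(not spam)


# creating a spam model
# runs once on the training data (an iterable of Email-like objects)
def train(trainData):
    global pA, pNotA, numWords
    total = 0
    numSpam = 0
    for email in trainData:
        if email.label == SPAM:
            numSpam += 1
        total += 1
        processEmail(email.body, email.label)
    pA = numSpam / float(total)
    pNotA = (total - numSpam) / float(total)
    numWords = len(set(trainPositive) | set(trainNegative))


# reading words from a specific email and updating the word counts
def processEmail(body, label):
    global positiveTotal, negativeTotal
    for word in body:
        if label == SPAM:
            trainPositive[word] = trainPositive.get(word, 0) + 1
            positiveTotal += 1
        else:
            trainNegative[word] = trainNegative.get(word, 0) + 1
            negativeTotal += 1


# gives the conditional probability P(B | A_x) as a product over the words
def conditionalEmail(body, spam):
    result = 1.0
    for word in body:
        result *= conditionalWord(word, spam)
    return result


# classifies a new email (a list of words) as spam or not spam
def classify(email):
    isSpam = pA * conditionalEmail(email, True)       # proportional to P(A | B)
    notSpam = pNotA * conditionalEmail(email, False)  # proportional to P(¬A | B)
    return isSpam > notSpam


# Laplace smoothing for the words not present in the training set
# gives the conditional probability P(B_i | A_x) with smoothing
def conditionalWord(word, spam):
    if spam:
        return (trainPositive.get(word, 0) + alpha) / float(positiveTotal + alpha * numWords)
    return (trainNegative.get(word, 0) + alpha) / float(negativeTotal + alpha * numWords)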


--------------------------------------------------------------------------------