├── README.md
└── naive_bayes.py


/README.md:
--------------------------------------------------------------------------------
# naive_bayes_classifier
This is the code for "Probability Theory - The Math of Intelligence #6" by Siraj Raval on YouTube.


## Coding Challenge - Due Date: Thursday, July 27, 2017 at 12 PM PST

Write your own Naive Bayes classifier for any text dataset. Any improvements on my code in the video (for example, using n-grams instead of counting individual words, or TF-IDF instead of bag of words) will earn bonus points. Good luck!

## Overview

This is the code for [this](https://youtu.be/PrkiRVcrxOs) video on YouTube by Siraj Raval, part of The Math of Intelligence course. It uses a Naive Bayes classifier to label text as spam or not spam, trained on [this](https://www.kaggle.com/uciml/sms-spam-collection-dataset) Kaggle dataset.

## Dependencies

* numpy

Install numpy using [pip](https://pip.pypa.io/en/stable/).


## Usage

Type `python naive_bayes.py` to run the code.


## Credits

Credits go to [AlanBuzdar](https://github.com/alanbuzdar). I've merely created a wrapper to get people started.
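
## Sketch: n-gram features

The coding challenge above suggests using n-grams instead of counting individual words. One possible way to build such features is sketched below. The helper names `ngrams` and `features` are hypothetical and not part of this repository, and the tokenization (lowercasing and splitting on whitespace) is an assumption made only for illustration.

```python
def ngrams(text, n=2):
    """Return the word n-grams of `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    # Slide a window of n words over the text; texts shorter than n yield [].
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def features(text, max_n=2):
    """Return unigrams followed by all n-grams up to `max_n` words long."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(text, n))
    return feats


print(features("Win a FREE prize"))
# ['win', 'a', 'free', 'prize', 'win a', 'a free', 'free prize']
```

A list like this could stand in for the list of words that `naive_bayes.py` iterates over, so that each n-gram is counted as if it were a single word.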
--------------------------------------------------------------------------------


/naive_bayes.py:
--------------------------------------------------------------------------------
# Naive Bayes spam classifier (bag of words with Laplace smoothing).
# Each training email needs a `body` (a list of words) and a `label`.

SPAM = "spam"          # label value that marks an email as spam
alpha = 1.0            # Laplace smoothing strength

trainPositive = {}     # word -> count over all spam emails
trainNegative = {}     # word -> count over all non-spam emails
positiveTotal = 0      # total number of words seen in spam emails
negativeTotal = 0      # total number of words seen in non-spam emails
numWords = 0           # vocabulary size (distinct words in the training data)
pA = 0.0               # prior P(spam)
pNotA = 0.0            # prior P(not spam)


# creating a spam model
# runs once on the training data
def train(trainData):
    global pA, pNotA, numWords
    total = 0
    numSpam = 0
    for email in trainData:
        if email.label == SPAM:
            numSpam += 1
        total += 1
        processEmail(email.body, email.label)
    pA = numSpam / float(total)
    pNotA = (total - numSpam) / float(total)
    numWords = len(set(trainPositive) | set(trainNegative))


# reading words from a specific email
def processEmail(body, label):
    global positiveTotal, negativeTotal
    for word in body:
        if label == SPAM:
            trainPositive[word] = trainPositive.get(word, 0) + 1
            positiveTotal += 1
        else:
            trainNegative[word] = trainNegative.get(word, 0) + 1
            negativeTotal += 1


# gives the conditional probability p(B | A_x) as the product of p(B_i | A_x)
# note: the product can underflow for very long texts
def conditionalEmail(body, spam):
    result = 1.0
    for word in body:
        result *= conditionalWord(word, spam)
    return result


# classifies a new email (a list of words) as spam or not spam
def classify(email):
    isSpam = pA * conditionalEmail(email, True)       # proportional to P(A | B)
    notSpam = pNotA * conditionalEmail(email, False)  # proportional to P(¬A | B)
    return isSpam > notSpam


# Laplace smoothing for the words not present in the training set
# gives the conditional probability p(B_i | A_x) with smoothing
def conditionalWord(word, spam):
    if spam:
        return (trainPositive.get(word, 0) + alpha) / float(positiveTotal + alpha * numWords)
    return (trainNegative.get(word, 0) + alpha) / float(negativeTotal + alpha * numWords)

--------------------------------------------------------------------------------