├── .gitignore ├── spam.csv ├── README.md ├── naive-bayes-explained.ipynb ├── spam-filtering-with-naive-bayes.ipynb └── spam-filtering-with-gensim.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | *.pyc 3 | 4 | -------------------------------------------------------------------------------- /spam.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AndersonJo/text-classification-tutorial/master/spam.csv -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Text Classification Tutorial 2 | 3 | This repository is a set of tutorials on probabilistic models and has the following contents. 4 | 5 | 1. [Naive Bayes Explained](naive-bayes-explained.ipynb) 6 | 2. [Spam Filtering with Naive Bayes](spam-filtering-with-naive-bayes.ipynb) 7 | 3. [20 News Classification](20-news-classification.ipynb) 8 | 9 | ## Preprocessing Methods 10 | 11 | 1. Word Frequency based vectorization 12 | 2. TF-IDF 13 | 14 | 15 | ## Models 16 | 17 | I used the following models. 18 | 19 | 1. Naive Bayes 20 | 2. Logistic Regression 21 | 3. Support Vector Machine (Linear) 22 | 4. 
Deep Learning (Fully-Connected Layers) 23 | 24 | 25 | ## Accuracy 26 | 27 | | Data | Preprocessing | Model | Precision | Recall | F1-Score | 28 | |:------|:--------------|:------|:----------|:-------|:---------| 29 | | Spam Filtering | Word frequency vectorization | Naive Bayes | 0.97 | 0.97 | 0.96 | 30 | | Spam Filtering | Word frequency vectorization | Logistic Regression | 0.95 | 0.95 | 0.95 | 31 | | 20 News | Word frequency vectorization | Naive Bayes | 0.83 | 0.82 | 0.81 | 32 | | 20 News | Word frequency vectorization | SVM (Linear) | 0.85 | 0.85 | 0.85 | 33 | | 20 News | Word frequency vectorization | SVM (Linear) | 0.85 | 0.85 | 0.85 | 34 | | 20 News | Word frequency vectorization | Deep Learning (FC Network) | 0.84 | 0.83 | 0.83 | 35 | 36 | # Code Contribution 37 | 38 | We need your contribution! 39 | 40 | Please send a PR! 41 | -------------------------------------------------------------------------------- /naive-bayes-explained.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction\n", 8 | "\n", 9 | "Many machine learning algorithms have been introduced, yet Naive Bayes remains one of the most widely used.
\n", 10 | "This article covers Multinomial Naive Bayes, working from the theory through to an implementation with examples." 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# Simple Text Example\n", 18 | "\n", 19 | "A dataset for classifying emails as spam or ham (email that is not spam) based on their content looks like the following.\n", 20 | "\n", 21 | "\n", 22 | "| Text | Tag | \n", 23 | "|:-----|:----|\n", 24 | "| free message | Spam |\n", 25 | "| send me a message | Ham |\n", 26 | "| are you free tomorrow? | Ham |\n", 27 | "| where is tesseract? | Spam |\n", 28 | "| where are you now? | Ham |\n", 29 | "| buy awesome tv | Spam |\n", 30 | "\n", 31 | "For example, given a sentence such as `I cooked a salmon`, the goal is to use a Naive Bayes classifier to estimate the probability that it is Spam or Ham.\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "# Tokenizing Text\n", 39 | "\n", 40 | "To use text as a machine learning feature, we use **word frequencies**.
\n", 41 | "In this case, information about word order and sentence structure is lost.
\n", 42 | "This may seem too simple, but the less data you have, the more it pays to simplify, and the approach works quite well.\n", 43 | "\n", 44 | "An alternative is a deep learning approach that turns each word into a vector using Word2Vec or GloVe.
\n", 45 | "That approach also learns sentence structure and word order, but when data is scarce it overfits very easily during training.
\n", 46 | "Dropout, L2 regularization, simpler layers, early stopping, and ensembles can mitigate this to some extent, but accuracy still suffers. \n", 47 | "\n", 48 | "In any case, word-frequency vectorization is a simple yet quite useful form of feature engineering." 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "# Bayes Theorem \n", 56 | "\n", 57 | "For a detailed explanation of Bayes' theorem, see [this post](http://incredible.ai/statistics/2014/03/01/Bayes-Theorem/).
\n", 58 | "Put simply, Bayes' theorem lets us compute a conditional probability from its reversed condition, which often makes the problem easier to solve.\n", 59 | "\n", 60 | "Bayes' theorem is stated as follows.\n", 61 | "\n", 62 | "$$ P(A|B) = \\frac{P(A \\cap B )}{P(B)} = \\frac{P(A) P(B|A)}{P(B)} $$\n", 63 | "\n", 64 | "For example, the spam filtering problem can be formulated as follows.\n", 65 | "\n", 66 | "$$ P(\\text{Spam} | \\text{Email}) = \\frac{P(\\text{Spam}) P(\\text{Email} | \\text{Spam})}{P(\\text{Email})} $$\n", 67 | "\n", 68 | "$ P(\\text{Email}) $ is the normalizing term, $ P(\\text{Email}) = \n", 69 | "P(\\text{Spam}) P(\\text{Email} | \\text{Spam}) + P(\\text{Ham}) P(\\text{Email} | \\text{Ham}) $, but because we only compare which class has the higher probability, $ P(\\text{Email}) $ does not need to be computed. The two expressions to compare are therefore the following.\n", 70 | "\n", 71 | "$$ P(\\text{Spam}) P(\\text{Email} | \\text{Spam}) $$\n", 72 | "\n", 73 | "$$ VS $$\n", 74 | "\n", 75 | "$$ P(\\text{Ham}) P(\\text{Email} | \\text{Ham}) $$\n", 76 | "\n", 77 | "\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "# Naive Bayes\n", 85 | "\n", 86 | "The problem with applying Bayes' theorem directly is that dependencies within the data make the amount of computation grow rapidly.
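\n", 87 | "For example, the order of the words in an email can change the probability that other words appear, which means each word depends on the others.
\n", 88 | "For instance, the word \"Viagra\" makes prescription-related words, or words from advertisements aimed at men with erectile dysfunction, more likely to appear.\n", 89 | "\n", 90 | "Naive Bayes ignores these real-world dependencies and **assumes that all words (or features) are independent**.
\n", 91 | "This does not hold in reality, yet the assumption reduces the amount of computation and still works well.
\n", 92 | "\n", 93 | "Many statisticians have studied why the method works so well even though its assumption is wrong, and one explanation resonated with me personally.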
\n", 94 | "If every spam message is filtered out correctly, does it matter whether the estimated probability was 51% or 99%?
\n", 95 | "In other words, if the classification results themselves are correct, computing the probabilities very precisely matters little. \n", 96 | "\n", 97 | "\n", 98 | "## Formula\n", 99 | "\n", 100 | "The Naive Bayes formula is as follows. \n", 101 | "\n", 102 | "$$ P(C_L | F_1, ..., F_n) = \\frac{1}{Z} P(C_L) \\prod^n_{i=1} P(F_i | C_L) $$\n", 103 | "\n", 104 | "- $ C_L $ : the class, which is Spam or Ham in this example \n", 105 | "- $ F_i $ : the features, which are the individual words in this example\n", 106 | "- $ \\frac{1}{Z} $ : a scaling factor that turns the computed value into a probability\n", 107 | "\n", 108 | "\n", 109 | "## Example\n", 110 | "\n", 111 | "For example, whether the sentence \"where is tesseract\" is Spam or Ham is decided with the following expression.\n", 112 | "\n", 113 | "$$ P(Spam) P(Email | Spam) = P(Spam)\\ P(where | Spam) \\\n", 114 | "P(is | Spam) \\ P(tesseract | Spam) $$\n", 115 | "\n", 116 | "This value must be compared with the corresponding value for Ham. The class with the higher value decides whether the message is Spam or Ham.\n", 117 | "\n", 118 | "$$ P(Ham) P(Email | Ham) = P(Ham)\\ P(where | Ham) \\\n", 119 | "P(is | Ham) \\ P(tesseract | Ham) $$\n", 120 | "\n", 121 | "## Laplace Smoothing\n", 122 | "\n", 123 | "There is another problem. Suppose, for example, that the word `tesseract` never appears in any text labeled Ham.
\n", 124 | "In that case its probability is 0, and the expression becomes, for example:\n", 125 | "\n", 126 | "$$ P(Ham) P(Email | Ham) = P(Ham)\\ P(where | Ham) \\\n", 127 | "P(is | Ham) \\ \\times 0 $$\n", 128 | "\n", 129 | "Because of the multiplication by 0, the result is 0 regardless of the probabilities of the other words.
\n", 130 | "To prevent this, we add 1 to every word count.
\n", 131 | "The divisor is then the total number of words in the class plus the number of unique words $V$ plus 1, matching the formula that follows.
\n", 132 | "The result is therefore always a valid, nonzero probability.\n", 133 | "\n", 134 | "The Laplace smoothing formula is as follows.\n", 135 | "\n", 136 | "$$ P(w | c) = \\frac{\\text{count}(w, c) + 1}{\\text{count}(c) + V + 1} $$\n", 137 | "\n", 138 | "- count(w, c) : the number of times the word appears in the class (duplicates included)\n", 139 | "- count(c) : the total number of words in the class (duplicates included, so if apple appears 12 times across several sentences it counts as 12) \n", 140 | "- V : the number of unique words overall (no duplicates)\n", 141 | "\n", 142 | "For example, `where` appears once in the Spam texts, which contain 8 words in total, and the six example texts contain 15 unique words, so Laplace smoothing gives the following.\n", 143 | "\n", 144 | "$$ P( where | Spam ) = \\frac{1 + 1}{8 + 15 + 1} $$" 145 | ] 146 | } 147 | ], 148 | "metadata": { 149 | "kernelspec": { 150 | "display_name": "Python 3", 151 | "language": "python", 152 | "name": "python3" 153 | }, 154 | "language_info": { 155 | "codemirror_mode": { 156 | "name": "ipython", 157 | "version": 3 158 | }, 159 | "file_extension": ".py", 160 | "mimetype": "text/x-python", 161 | "name": "python", 162 | "nbconvert_exporter": "python", 163 | "pygments_lexer": "ipython3", 164 | "version": "3.6.4" 165 | } 166 | }, 167 | "nbformat": 4, 168 | "nbformat_minor": 2 169 | } 170 | -------------------------------------------------------------------------------- /spam-filtering-with-naive-bayes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "Populating the interactive namespace from numpy and matplotlib\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "%pylab inline\n", 18 | "import numpy as np \n", 19 | "import pandas as pd\n", 20 | "import seaborn as sns\n", 21 | "import string\n", 22 | "\n", 23 | "from nltk.corpus import stopwords\n", 24 | "from sklearn.model_selection import train_test_split\n", 25 | "from sklearn.pipeline import Pipeline\n", 26 | "from sklearn.feature_extraction.text import CountVectorizer\n", 27 | "from sklearn.feature_extraction.text 
import TfidfTransformer\n", 28 | "from sklearn.naive_bayes import MultinomialNB\n", 29 | "from sklearn.linear_model import LogisticRegression\n", 30 | "from sklearn.metrics import classification_report,confusion_matrix" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# Data Preprocessing\n", 38 | "\n", 39 | "- Only about 13.4% of the messages are labeled spam, so the data has a class imbalance problem.\n", 40 | "- The unique counts show that duplicate texts exist." 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 4, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/html": [ 51 | "
\n", 52 | "\n", 65 | "\n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | "
classtext
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", 101 | "
" 102 | ], 103 | "text/plain": [ 104 | " class text\n", 105 | "0 ham Go until jurong point, crazy.. Available only ...\n", 106 | "1 ham Ok lar... Joking wif u oni...\n", 107 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 108 | "3 ham U dun say so early hor... U c already then say...\n", 109 | "4 ham Nah I don't think he goes to usf, he lives aro..." 110 | ] 111 | }, 112 | "metadata": {}, 113 | "output_type": "display_data" 114 | }, 115 | { 116 | "data": { 117 | "text/html": [ 118 | "
\n", 119 | "\n", 136 | "\n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | "
text
countuniquetopfreq
class
ham48254516Sorry, I'll call later30
spam747653Please call our customer service representativ...4
\n", 174 | "
" 175 | ], 176 | "text/plain": [ 177 | " text \n", 178 | " count unique top freq\n", 179 | "class \n", 180 | "ham 4825 4516 Sorry, I'll call later 30\n", 181 | "spam 747 653 Please call our customer service representativ... 4" 182 | ] 183 | }, 184 | "metadata": {}, 185 | "output_type": "display_data" 186 | } 187 | ], 188 | "source": [ 189 | "data = pd.read_csv('spam.csv', encoding='latin-1', usecols=(0, 1), names=('class', 'text'), skiprows=1)\n", 190 | "display(data.head())\n", 191 | "display(data.groupby('class').describe())" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "As the histograms below show, most Ham messages are under 100 characters long, while Spam messages cluster around 150 characters.
\n", 199 | "In other words, the lengths differ, and Spam messages are longer." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 5, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "image/png": "\n", 210 | "text/plain": [ 211 | "" 212 | ] 213 | }, 214 | "metadata": {}, 215 | "output_type": "display_data" 216 | } 217 | ], 218 | "source": [ 219 | "data['length'] = data['text'].apply(len)\n", 220 | "ax = data.hist('length', by='class', bins=50, figsize=(15, 6))" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "# Tokenizing\n", 228 | "\n", 229 | "process_text first removes punctuation such as !,\` from the text.
\n", 230 | "It then removes stopwords.\n", 231 | "\n", 232 | "Stopwords are frequently used words such as `a`, `the`, `of`, and `at`.
\n", 233 | "These frequent words carry no particular meaning for a probabilistic model based on word frequencies, so they are removed." 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 6, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "0 [Go, jurong, point, crazy, Available, bugis, n...\n", 245 | "1 [Ok, lar, Joking, wif, u, oni]\n", 246 | "2 [Free, entry, 2, wkly, comp, win, FA, Cup, fin...\n", 247 | "3 [U, dun, say, early, hor, U, c, already, say]\n", 248 | "4 [Nah, dont, think, goes, usf, lives, around, t...\n", 249 | "Name: text, dtype: object" 250 | ] 251 | }, 252 | "execution_count": 6, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "def process_text(text):\n", 259 | " \n", 260 | " # Remove Punctuations\n", 261 | " nopunc = [char for char in text if char not in string.punctuation]\n", 262 | " nopunc = ''.join(nopunc)\n", 263 | " \n", 264 | " # Remove stopwords\n", 265 | " cleaned_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]\n", 266 | " return cleaned_words\n", 267 | "\n", 268 | "data['text'].apply(process_text).head()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "# Naive Bayes Model" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 5, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "pipeline = Pipeline([\n", 285 | " ('vectorization', CountVectorizer(analyzer=process_text)), # Convert strings to frequency vectors\n", 286 | " ('tfidf', TfidfTransformer()), # Convert vectors to weighted TF-IDF scores\n", 287 | " ('classifier', MultinomialNB())\n", 288 | "])" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "## Train" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 7, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "data": { 305 | "text/plain": [ 306 | 
"Pipeline(memory=None,\n", 307 | " steps=[('vectorization', CountVectorizer(analyzer=,\n", 308 | " binary=False, decode_error='strict', dtype=,\n", 309 | " encoding='utf-8', input='content', lowercase=True, max_df=1.0,\n", 310 | " max_features=None, min_df=1, ngram_range=(1, 1), prepr...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" 311 | ] 312 | }, 313 | "execution_count": 7, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "train_x, test_x, train_y, test_y = train_test_split(data['text'], data['class'], test_size=0.2)\n", 320 | "pipeline.fit(train_x, train_y)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "## Predict" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 9, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "name": "stdout", 337 | "output_type": "stream", 338 | "text": [ 339 | " precision recall f1-score support\n", 340 | "\n", 341 | " ham 0.96 1.00 0.98 963\n", 342 | " spam 1.00 0.75 0.86 152\n", 343 | "\n", 344 | "avg / total 0.97 0.97 0.96 1115\n", 345 | "\n" 346 | ] 347 | }, 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "" 352 | ] 353 | }, 354 | "execution_count": 9, 355 | "metadata": {}, 356 | "output_type": "execute_result" 357 | }, 358 | { 359 | "data": { 360 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWAAAAD8CAYAAABJsn7AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFVNJREFUeJzt3Xm4V1W5wPHvezigFgJqiggqTlcjm7zk1RxyyhQHTIubU5QUhXivc1naLW/ebFLSHvORqxZ6ccChQCRTUVJSTDA1FQfE6TA5xGBawjln3T/Ohg5yOIP8zln8dt+Pz3r47b3X3r91Hg+vr+9ea+9IKSFJ6no1uQcgSf+sDMCSlIkBWJIyMQBLUiYGYEnKxAAsSZkYgCUpEwOwJGViAJakTGo7+wtWvD7XpXZaw0Zb7ZN7CFoP1S+fF+t6jY7EnO4f2H6dv29dmAFLUiadngFLUpdqbMg9gnYzAEsql4b63CNoNwOwpFJJqTH3ENrNACypXBoNwJKUhxmwJGXiTThJysQMWJLySM6CkKRMvAknSZlYgpCkTLwJJ0mZmAFLUibehJOkTLwJJ0l5pGQNWJLysAYsSZlYgpCkTMyAJSmThhW5R9BuBmBJ5WIJQpIysQQhSZmYAUtSJgZgScojeRNOkjKxBixJmViCkKRMzIAlKRMzYEnKxAxYkjKp94HskpSHGbAkZVJFNeCa3AOQpIpKje1vbYiI0yPiyYh4IiKuj4gNI2K7iHgoIuZExI0R0aPou0GxPac4PrCt6xuAJZVLY2P7Wysioj/wn8DglNKuQDfgC8CPgDEppR2BxcCI4pQRwOJi/5iiX6sMwJLKpYIZME1l2o0iohZ4H7AAOAC4uTg+Djiq+Dy02KY4fmBERGsXNwBLKpf6+na3iBgZETObtZErL5NSmgf8FHiZpsC7FJgFLEkprZxqUQf0Lz73B14pzq0v+m/W2lC9CSepXFLqQNc0Fhjb0rGI2ISmrHY7YAlwE3BIBUa4igFYUrlUbhbEQcALKaXXACLiVmAvoE9E1BZZ7gBgXtF/HrA1UFeULHoDb7T2BZYgJJVLhW7C0VR62CMi3lfUcg8EngLuBT5X9BkOTCw+Tyq2KY7fk1Lr6bgZsKRyqdBCjJTSQxFxM/AIUA/8iaZyxe3ADRFxQbHvquKUq4BrI2IO8BeaZky0ygAsqVwaGip2qZTSd4Hvvmv3XGD3Fvr+Hfh8R65vAJZULlW0Es4ALKlcDMCSlIkP45GkPFJj++cB52YAllQuliAkKZMKzoLobAZgSeVSRRmwK+HWwbUTfsNRJ3ydocd/jWtv/PWq/eNvmsgRx36Vocd/jYsua5qj/eennuGY4aM5Zvhojh5+Mnf//g8tXrNu/kKO/eppHDrsJM78zoWsWLGiS34Wdb7PHLwfTz5xH08/NZ1vnD16jeM9evTguvGX8/RT03lg+m1su+2ADKMsgcqthOt0ZsDv0XNzX+SWSXdw/ZU/o3ttd75+5nl8aq9/Y+Gi17h3+gxuGXcZPXr04I3FSwDYcfttufGqS6mt7cZrr/+FY4afzH577UFtbbfVrjvm8qs58d+PYshB+3H+j3/OLZN/xxc+e3iOH1EVVFNTw6WX/A+HDDmWuroFzHhwCrdNvpPZs59b1eekLx/L4sVL2WXQ3gwbdiQX/uBcjjt+VMZRV6kOPIwntzYz4IjYJSK+GRGXFu2bEfHBrhjc+mzui6/w4Q/tzEYbbkhtbTcGf+zD3P37P3Djb25nxAnD6NGjBwCbbdIHYFU/gHeWL4cWHhOaUuKhWY9x8H77ADB0yEHcc9+DXfQTqTPt/omP8/zzL/LCCy+zYsUKJkyYyJFHfGa1PkcecTDXXnsTALfccjsH7L93jqFWvyrKgFsNwBHxTeAGIIA/Fi2A6yPinM4f3vprx+235ZHHnmTJ0mX87e9/5/4HH2bhotd48eV5zHrsCY796ml8afTZ/Hn2M6vOefzJpxl
6/Nf47BdH8V9nn7JG9rtk6TI27vn+Vfv7bv4BXn2t1YcpqUps1X9LXqmbv2q7bt4Cttpqy7X2aWhoYOnSZWy22SZdOs5SaEztb5m1VYIYAXwopbRaITIiLgaeBH7YWQNb3+0wcBtOOv7zjDz9XDbacEN23ml7ampqaGhoYNmyN7lu7BiemP0sZ33nQu646ZdEBB/50C5MHH8Fz7/4MudecBH77PEJNtigR+4fRSqXKpoF0VYJohHYqoX9/YpjLWr+lPkrr7l+Xca3XjvmiM8w4eqfM+4XP6HXxhszcJsB9N3iAxz0qb2ICD48aGcigsVLlq523g4Dt+F9G23Ec3NfXG1/n969ePOvb1Ff3/QLtOi119li81YfqK8qMX/eQrYe8I+/SgP692P+/IVr7dOtWzd69+7FG28s7tJxlkFqbGx3y62tAHwaMDUifhsRY4t2BzAVOHVtJ6WUxqaUBqeUBn/li8dWcrzrlZU32BYsfJWpv/8DQz69Hwfssyd/fOQxAF58uY4V9fVs0qc3dfMXrgqs8xcu4oWXXqF/v76rXS8i2H23j3DntPsBmDjlbg7YZ88u/InUWR6e+Sg77rgdAwduTffu3Rk2bCi3Tb5ztT63Tb6TE09sepjWMcccxr3TWp4pozaUpQSRUrojIv6FpkevrXzv0Tzg4ZRS9eT5neT0b1/AkmXLqK2t5dwzT6bXxj05+vCDOe8HYzjqhK/TvXstPzjvTCKCRx5/kquunUBtbS01NcF5Z41mkz69ARh15nc4/5zT2GLzzTh91Emc/d0f8vOx1/DBf9mBow8/OPNPqUpoaGjg1NPOY8rt19GtpoZfjbuRp556lu999yxmznqMyZPv4upf3sC4X13K009NZ/HiJRx3wsm5h12dquhZENHGA9vX2YrX5+b/z4zWOxtttU/uIWg9VL98XqtvEW6Pt/77+HbHnPf/1/h1/r514TxgSeVSXz3/c24AllQuVVSCMABLKpf14OZaexmAJZXK+jC9rL0MwJLKxQxYkjIxAEtSJlW0FNkALKlUfCecJOViAJakTJwFIUmZmAFLUiYGYEnKIzVYgpCkPMyAJSkPp6FJUi4GYEnKpHpKwAZgSeWS6qsnAhuAJZVL9cRfA7Ckcqmmm3BtvZZekqpLYwdaGyKiT0TcHBFPR8TsiNgzIjaNiLsi4rniz02KvhERl0bEnIh4PCJ2a+v6BmBJpZIaU7tbO1wC3JFS2gX4KDAbOAeYmlLaCZhabAMcCuxUtJHA5W1d3AAsqVwqlAFHRG9gX+AqgJTS8pTSEmAoMK7oNg44qvg8FLgmNZkB9ImIfq19hwFYUqmk+va3iBgZETObtZHNLrUd8Brwy4j4U0RcGRHvB/qmlBYUfRYCfYvP/YFXmp1fV+xbK2/CSSqVjryVPqU0Fhi7lsO1wG7Af6SUHoqIS/hHuWHl+Ski3vNdPzNgSeVSuZtwdUBdSumhYvtmmgLyopWlheLPV4vj84Ctm50/oNi3VgZgSaWSGtvfWr1OSguBVyJi52LXgcBTwCRgeLFvODCx+DwJ+GIxG2IPYGmzUkWLLEFIKpWOlCDa4T+A8RHRA5gLfJmmxHVCRIwAXgKGFX2nAEOAOcDbRd9WGYAllUpqiMpdK6VHgcEtHDqwhb4JGN2R6xuAJZVKhTPgTmUAllQqqbFyGXBnMwBLKhUzYEnKJCUzYEnKwgxYkjJprOAsiM5mAJZUKt6Ek6RMDMCSlEmqnhdiGIAllYsZsCRl4jQ0ScqkwVkQkpSHGbAkZWINWJIycRaEJGViBixJmTQ0Vs+b1gzAkkrFEoQkZdLoLAhJysNpaJKUiSWIZrYYeHBnf4Wq0KBNt8k9BJWUJQhJysRZEJKUSRVVIAzAksrFEoQkZeIsCEnKpIpeimwAllQuCTNgScqi3hKEJOVhBixJmVgDlqRMzIAlKRMzYEnKpMEMWJLyqKI3EhmAJZVLYxVlwNXz2CBJaofUgdYeEdEtIv4UEZOL7e0
i4qGImBMRN0ZEj2L/BsX2nOL4wLaubQCWVCqNHWjtdCowu9n2j4AxKaUdgcXAiGL/CGBxsX9M0a9VBmBJpdIY0e7WlogYABwGXFlsB3AAcHPRZRxwVPF5aLFNcfzAov9aGYAllUpDB1pEjIyImc3ayHdd7mfAN/hHwrwZsCSlVF9s1wH9i8/9gVcAiuNLi/5r5U04SaXSkVkQKaWxwNiWjkXE4cCrKaVZEbFfRQb3LgZgSaVSwVkQewFHRsQQYEOgF3AJ0CciaossdwAwr+g/D9gaqIuIWqA38EZrX2AJQlKpVGoWRErpWymlASmlgcAXgHtSSscD9wKfK7oNByYWnycV2xTH70mp9Xc0G4AllUpjtL+9R98EzoiIOTTVeK8q9l8FbFbsPwM4p60LWYKQVCqd8SyIlNI0YFrxeS6wewt9/g58viPXNQBLKpWG6lkIZwCWVC4+DU2SMjEAS1ImVfRKOAOwpHIxA5akTBpyD6ADDMCSSsUHsktSJpYgJCkTA7AkZdLeN12sDwzAkkrFGrAkZeIsCEnKpLGKihAGYEml4k04ScqkevJfA7CkkjEDlqRM6qN6cmADsKRSqZ7wawCWVDKWICQpE6ehSVIm1RN+DcCSSsYShCRl0lBFObABWFKpmAFLUibJDFiS8qimDLgm9wDKYIMNenD3tFu4/8HbeODh33LOuacCsO9+ezJt+kTue2ASv73zBrbbftsWzz/9zK8z67Gp/PGROzngwH26cuiqsPPHnMu0J27n1mn/t2rfp484gFt/P55H5/+BQR/dZY1ztuzflxnPT2X4qONavGb/bfoxfsqVTH7wJn58xfep7W7e1JpGUrtbbgbgCnjnneUMPexE9tnzCPbd8wgOPGgfBn/iY1w05r8ZOeIM9v3kkdx8022c9Y2T1zh351125OjPHcaenziUz332JH465nxqavzXUq0m3Xg7o449fbV9c55+njNO+hazZjza4jlnn/+fTL9nxlqvedp5o7n2ihs4fM/Ps2zJmxx93BEVHXPZpA603PybXiFvvfU2AN2719K9e3dSSqSU2HjjngD06rUxCxe8usZ5Qw47iFtvvp3ly5fz8kt1zJ37Ev86+KNdOnZVzqwZj7J0ybLV9r3w3Eu8+PzLLfbf/5B9mffyAp5/Zu5ar7n7Xv/KXZPvBWDShCnsf8i+lRtwCdWT2t1ye88BOCK+XMmBVLuamhrue2ASz77wENPumc6smY9x6infZsItV/LEM9MZduxR/OziK9Y4r99WfZlXt2DV9vx5C+m3Vd+uHLoy2eh9G3HSKSdw+U+vWmufPpv25s1lf6Whoek9D4sWvErffpt31RCrUurAP7mtSwZ8/toORMTIiJgZETPfWbFsbd1KpbGxkX0/eSQf2nlvdhv8UT44aCdGnfJlhh3zFXbdeW+uu/ZmLrjw27mHqfXIyWd/hWvH3sjf3v5b7qGUSmMHWm6tVvMj4vG1HQLWmqallMYCYwE26blj/v/MdKFlS9/k/vtmcNCnP8Wuu36QWTMfA+DXt9zOTb/55Rr9F8xfRP8B/VZtb9V/SxbMX9Rl41U+H/74IA46fH9O/85oNu7Vk9SYeOed5dxw9c2r+iz5y1I27tWTbt260dDQQN9+W7BowWsZR73+Wx8y2/Zq63ZqX+AzwOJ37Q/ggU4ZURXa7AObsmLFCpYtfZMNN9yA/Q/Yi0suHkuv3j3ZYceBPD/nRfY7YG+efWbOGuf+dspU/vfqi7ns51ezZb8t2GGHbVcFbZXbl44aterzqLNG8PZbf1st+K708AOP8OnD9+eOiXdz5LAhTPvd/V05zKqzPmS27dVWAJ4M9EwprXH7NiKmdcqIqtCWfTfnF2N/QrduNdTU1PDrW6fwuzvu5dRTzuWa8ZfR2NjIkiXLOGXUOQAcOuRAPrbbrlx4wSU8Pfs5fnPrFGbMvIP6+nrOPuN7NDZW06+QmvvR5ecz+JO70WfTPtz1yER+8ZMrWbpkGd/6nzP
YZLM+XPZ/F/H0E8+uMVPi3S4bfxHfO+NCXlv0OmO+fxk/vuL7nHLO13j6iWe59brbuuinqU4NqXoy4EidPNh/thKE2mfrnt5I0poeX/hgrOs1jtv2s+2OOde99Ot1/r514YxuSaVSTTVg5wFLKpVKzYKIiK0j4t6IeCoinoyIU4v9m0bEXRHxXPHnJsX+iIhLI2JORDweEbu1NVYDsKRSqeBS5HrgzJTSIGAPYHREDALOAaamlHYCphbbAIcCOxVtJHB5W19gAJZUKpVaiJFSWpBSeqT4/CYwG+gPDAXGFd3GAUcVn4cC16QmM4A+EdGPVlgDllQqnTELIiIGAh8HHgL6ppRWLl9dyD/WRPQHXml2Wl2xbwFrYQYsqVQ6UoJovmq3aCPffb2I6AncApyWUlptaW9qmkb2niO+GbCkUunILPrmq3ZbEhHdaQq+41NKtxa7F0VEv5TSgqLEsPIpW/OArZudPqDYt1ZmwJJKpVI14IgI4Cpgdkrp4maHJgHDi8/DgYnN9n+xmA2xB7C0WamiRWbAkkqlgg9a3ws4EfhzRKxcDfxt4IfAhIgYAbwEDCuOTQGGAHOAt4E2nxhpAJZUKpVa3ZtSmk7Tc29acmAL/RMwuiPfYQCWVCq+ll6SMlkf3vXWXgZgSaXS2Q8YqyQDsKRSMQOWpEyq6WloBmBJpVJND2Q3AEsqFUsQkpSJAViSMnEWhCRlYgYsSZk4C0KSMmlIHXkgZV4GYEmlYg1YkjKxBixJmVgDlqRMGi1BSFIeZsCSlImzICQpE0sQkpSJJQhJysQMWJIyMQOWpEwaUkPuIbSbAVhSqbgUWZIycSmyJGViBixJmTgLQpIycRaEJGXiUmRJysQasCRlYg1YkjIxA5akTJwHLEmZmAFLUibOgpCkTKrpJlxN7gFIUiWllNrd2hIRh0TEMxExJyLOqfRYDcCSSiV14J/WREQ34DLgUGAQcGxEDKrkWA3Akkqlghnw7sCclNLclNJy4AZgaCXHag1YUqlUsAbcH3il2XYd8G+Vujh0QQBe/Nc50dnfUS0iYmRKaWzucWj94u9FZdUvn9fumBMRI4GRzXaN7cp/F5YgutbItrvon5C/F5mklMamlAY3a82D7zxg62bbA4p9FWMAlqSWPQzsFBHbRUQP4AvApEp+gTVgSWpBSqk+Ik4Bfgd0A65OKT1Zye8wAHct63xqib8X66mU0hRgSmddP6pp3bQklYk1YEnKxADcRTp7SaOqT0RcHRGvRsQTuceiPAzAXaArljSqKv0KOCT3IJSPAbhrdPqSRlWflNJ9wF9yj0P5GIC7RktLGvtnGouk9YQBWJIyMQB3jU5f0iip+hiAu0anL2mUVH0MwF0gpVQPrFzSOBuYUOkljao+EXE98CCwc0TURcSI3GNS13IlnCRlYgYsSZkYgCUpEwOwJGViAJakTAzAkpSJAViSMjEAS1ImBmBJyuT/AaOiNcGkDQkOAAAAAElFTkSuQmCC\n", 361 | "text/plain": [ 362 | "" 363 | ] 364 | }, 365 | "metadata": {}, 366 | "output_type": "display_data" 367 | } 368 | ], 369 | "source": [ 370 | "pred_y = pipeline.predict(test_x)\n", 371 | "\n", 372 | "print(classification_report(test_y, pred_y))\n", 373 | "sns.heatmap(confusion_matrix(test_y, pred_y), fmt='.1f', annot=True)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "# Logistic Regression Model" 381 | ] 382 | }, 383 | { 384 | 
"cell_type": "code", 385 | "execution_count": 7, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "pipeline = Pipeline([\n", 390 | " ('vectorization', CountVectorizer(analyzer=process_text)), # Convert strings to frequency vectors\n", 391 | " ('tfidf', TfidfTransformer()), # Convert vectors to weighted TF-IDF scores\n", 392 | " ('classifier', LogisticRegression())\n", 393 | "])" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "## Train" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": 8, 406 | "metadata": {}, 407 | "outputs": [ 408 | { 409 | "data": { 410 | "text/plain": [ 411 | "Pipeline(memory=None,\n", 412 | " steps=[('vectorization', CountVectorizer(analyzer=,\n", 413 | " binary=False, decode_error='strict', dtype=,\n", 414 | " encoding='utf-8', input='content', lowercase=True, max_df=1.0,\n", 415 | " max_features=None, min_df=1, ngram_range=(1, 1), prepr...ty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 416 | " verbose=0, warm_start=False))])" 417 | ] 418 | }, 419 | "execution_count": 8, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | ], 424 | "source": [ 425 | "train_x, test_x, train_y, test_y = train_test_split(data['text'], data['class'], test_size=0.2)\n", 426 | "pipeline.fit(train_x, train_y)" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "# Predict" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 9, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | " precision recall f1-score support\n", 446 | "\n", 447 | " ham 0.95 1.00 0.97 958\n", 448 | " spam 1.00 0.66 0.80 157\n", 449 | "\n", 450 | "avg / total 0.95 0.95 0.95 1115\n", 451 | "\n" 452 | ] 453 | }, 454 | { 455 | "data": { 456 | "text/plain": [ 457 | "" 458 | ] 459 | }, 460 | "execution_count": 9, 461 | 
"metadata": {}, 462 | "output_type": "execute_result" 463 | }, 464 | { 465 | "data": { 466 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAD8CAYAAABJsn7AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFVNJREFUeJzt3Xm8VlW9+PHP9wAqDoASINN1SK9Dlmlcr2WaQiriAA5JakppcX/mL+fSrPTeMjO7qWHm74eioZWKkoFIqDE4pJlDWTmUZJnMiAxmeeWcs+4fZ4OHPKM856zz7D5vX+vFs/deez/rkePX7/k+a+0dKSUkSZ2vJvcAJOmflQFYkjIxAEtSJgZgScrEACxJmRiAJSkTA7AkZWIAlqRMDMCSlEn3jn6Dta+86FI7vU3PQfvnHoK6oNo3F8bGXqM9MafHu3bc6PfbGGbAkpRJh2fAktSp6utyj6DNDMCSyqWuNvcI2swALKlUUqrPPYQ2MwBLKpd6A7Ak5WEGLEmZ+CWcJGViBixJeSRnQUhSJn4JJ0mZWIKQpEz8Ek6SMjEDlqRM/BJOkjLxSzhJyiMla8CSlIc1YEnKxBKEJGViBixJmdStzT2CNjMASyoXSxCSlIklCEnKxAxYkjIxAEtSHskv4SQpE2vAkpSJJQhJysQMWJIyMQOWpEzMgCUpk1pvyC5JeVRRBlyTewCSVFH19W1vrYiIcyLimYj4XUTcGhGbRcQOEfFYRMyPiNsjYpOi76bF9vzi+PatXd8ALKlcUn3bWwsiYjBwJjAspbQH0A34OPBN4KqU0k7ASuC04pTTgJXF/quKfi0yAEsqlwpmwDSUaXtGRHdgc2AxMBy4szg+GRhTvB5dbFMcHxER0dLFDcCSyqUdGXBEjI+IJxq18esvk9JC4L+Bv9AQeFcDTwKrUkrrvulbAAwuXg8GXi7OrS36921pqH4JJ6lc2jELIqU0EZjY1LGI2JqGrHYHYBVwBzCyAiNczwxYUrmk1PbWso8Cf0opLU8prQV+DOwH9ClKEgBDgIXF64XAUIDieG9gRUtvYACWVC6VqwH/Bdg3IjYvarkjgGeBucBxRZ9xwLTi9fRim+L4nJRajvKWICSVS4WWIqeUHouIO4GngFrgVzSUK+4BbouIS4t9k4pTJgG3RMR84FUaZky0yAAsqVwquBAjpXQJcMk/7H4R2KeJvm8AH2vP9Q3Aksqlri73CNrMACypXLwbmiRlYgCWpEyq6GY8BmBJpZLqW53f22UYgCWViyUIScrEWRCSlIkZ8D+HW6b8hKnTZ5FS4rijRnLy2KO5dtIPmDp9Flv36Q3AWf8xjgM+tA9ra2u55BtX89wf/khtXR1HjRzBZ04Z+7ZrLli0hM9fcjmrVq9h91125vKLz6dHjx6d/dHUAQ495ECuvPKrdKup4cabbuWKb127wfFNNtmE79/0Hfbe6728+upKTjjpdF56aUGm0VaxKgrA3gviHXrhxT8zdfosbr3haqZO/h4PPPJL/rJgEQAnjx3D1MnXMnXytRzwoYYFM/fNeYg3167lrluuY8qNE7hj2kwWLl76tutedd2NnDx2DD+dciO9ttqSqTPu7dTPpY5RU1PDhO98nSOO/ATv3fMgxo4dw2677bxBn1M/dQIrV65m190/zNUTrucbl30p02irXOVuxtPhWg3AEbFrRFwQEROKdkFE7NYZg+vKXvzzy7z3PbvQc7PN6N69G8Pe/15+9sDPm+0fEfz9jTeora3jf/7nTXr06MGWW2y+QZ+UEo89+TSHHLg/AKNHfZQ5Dz7aoZ9DnWOff9uLP/7xz/zpT39h7dq1TJkyjaOOPHSDPkcdeQi33HIHAFOn3sPwgz6cY6jVr7I3Z
O9QLQbgiLgAuA0I4JdFC+DWiLiw44fXde2043Y89fQzrFq9hr+/8QYPPfo4S5YuB+DWqXdz9Cmn8+XLrmT1mtcAOPigD9Nzs804aPSJHHzMKXzyhGPo3WurDa65avUattpyC7p37wbAgH7vYtnyFu9mpyoxaPC2vFz8hgSwYOFiBg3attk+dXV1rF69hr59t+7UcZZCfWp7y6y1GvBpwHuKe2GuFxFXAs8Al3fUwLq6d2//L5x60scYf86X6LnZZuyy847U1NQw9ujD+T+fPIGI4Jrrb+Zb372eSy86l98++3u61dQwZ9oPWfPaXxl3+vnsO2wvhg4emPujSOVSRbMgWitB1AODmtg/sDjWpMaP+bjh5ls3Znxd2rFHHsqUG69h8ve+Ra+ttmL7fxnCu7bZmm7dulFTU8NxRx3G7579AwAz75/HfvsOo0f37vTdug/vf9/uPPP8Cxtcr0/vXrz219eprW34AVq6/BX692vxiSaqEosWLmHokLf+UxoyeCCLFi1ptk+3bt3o3bsXK1as7NRxlkGqr29zy621AHw2MDsifhoRE4s2C5gNnNXcSSmliSmlYSmlYZ8+5YRKjrdLWbFyFQCLlyxj9gM/Z9TBB7L8lVfXH5/9wCPstON2AAwc0I9fPvk0AH/7+xv85pnn2WG7oRtcLyLYZ+/3cd+8hwCYNvNnDN//g53xUdTBHn/i1+y00w5sv/1QevTowfHHj+buGfdt0OfuGfdx8skNdzM89tjDmTuv+e8U1IKylCBSSrMi4l9puPflugfPLQQeTylVT57fQc656FJWrVlD9+7d+dJ5n6XXVlty4VXf4vcvvAgBg7cdwCVfOBOAE445ki9fdiWjT/oPEokxow5hl512AOD0877Cf114Nv379eWc00/l85dczjUTb2a3f303xxxxSM6PqAqpq6vjrLO/zMx7fkS3mhq+P/l2nn32D/znJefzxJNPM2PG/dx4021M/v4Enn/2YVauXMWJn/hs7mFXpyq6F0S08sSMjbb2lRfz/29GXU7PQfvnHoK6oNo3F7b4GPe2eP2rJ7U55mxx8Q83+v02hgsxJJVLbfX8cm4AllQuVVSCMABLKpcu8OVaWxmAJZVKV5he1lYGYEnlYgYsSZkYgCUpkypaimwAllQqPhNOknIxAEtSJs6CkKRMzIAlKRMDsCTlkeosQUhSHmbAkpSH09AkKRcDsCRlUj0lYAOwpHJJtdUTgQ3AksqleuJvq09FlqSqkupTm1trIqJPRNwZEc9HxHMR8cGI2CYi7o+IF4o/ty76RkRMiIj5EfGbiNi7tesbgCWVS307Wuu+A8xKKe0K7Ak8B1wIzE4p7QzMLrYBDgN2Ltp44LrWLm4AllQqlcqAI6I3cAAwCSCl9GZKaRUwGphcdJsMjClejwZuTg1+AfSJiIEtvYcBWFK5VC4D3gFYDtwUEb+KiBsiYgtgQEppcdFnCTCgeD0YeLnR+QuKfc0yAEsqlVTb9hYR4yPiiUZtfKNLdQf2Bq5LKe0FvM5b5YaG90opAe944rGzICSVSnueSp9SmghMbObwAmBBSumxYvtOGgLw0ogYmFJaXJQYlhXHFwJDG50/pNjXLDNgSeVSoRJESmkJ8HJE7FLsGgE8C0wHxhX7xgHTitfTgVOK2RD7AqsblSqaZAYsqVTakwG3weeAH0bEJsCLwKdoSFynRMRpwEvA8UXfmcAoYD7wt6JviwzAkkqlkgE4pfRrYFgTh0Y00TcBZ7Tn+gZgSaWS6iL3ENrMACypVCpcguhQBmBJpZLqzYAlKQszYEnKJCUzYEnKwgxYkjKpdxaEJOXhl3CSlIkBWJIySdXzUGQDsKRyMQOWpEychiZJmdQ5C0KS8jADlqRMrAFLUibOgpCkTMyAJSmTuvrqedSlAVhSqViCkKRM6p0FIUl5OA1NkjKxBNHIkHeP6ui3UBXauc/g3ENQSVmCkKRMnAUhSZlUUQXCACypXCxBSFImzoKQpEyq6
KHIBmBJ5ZIwA5akLGotQUhSHmbAkpSJNWBJysQMWJIyMQOWpEzqzIAlKY8qeiIR1XPXCklqg3qiza0tIqJbRPwqImYU2ztExGMRMT8ibo+ITYr9mxbb84vj27d2bQOwpFJJ7WhtdBbwXKPtbwJXpZR2AlYCpxX7TwNWFvuvKvq1yAAsqVTq29FaExFDgMOBG4rtAIYDdxZdJgNjiteji22K4yOK/s2yBiypVOpbjnntdTXwBWCrYrsvsCqlVFtsLwDWPV1gMPAyQEqpNiJWF/1fae7iZsCSSqWuHS0ixkfEE43a+HXXiYgjgGUppSc7aqxmwJJKpT2zIFJKE4GJzRzeDzgqIkYBmwG9gO8AfSKie5EFDwEWFv0XAkOBBRHRHegNrGjp/c2AJZVKpWZBpJS+mFIaklLaHvg4MCeldBIwFziu6DYOmFa8nl5sUxyfk1LLjwg1AEsqlQ6YBfGPLgDOjYj5NNR4JxX7JwF9i/3nAhe2diFLEJJKpSMWYqSU5gHzitcvAvs00ecN4GPtua4BWFKpeC8IScqkroqWIhuAJZWKGbAkZWIAlqRMquiRcAZgSeViBixJmdTlHkA7GIAllUo13ZDdACypVCxBSFImBmBJymQj7vHQ6QzAkkrFGrAkZeIsCEnKpL6KihAGYEml4pdwkpRJ9eS/BmBJJWMGLEmZ1Eb15MAGYEmlUj3h1wAsqWQsQUhSJk5Dk6RMqif8GoAllYwlCEnKpK6KcmADsKRSMQOWpEySGbAk5VFNGXBN7gGUxeO/mc28R6Yz+6G7uHfenQBc8KUzmfvzacx+6C5uv2sSA7bt3+S5x58whkefmsWjT83i+BPGdOawVWGXXv1lHn5mFtMfuHX9vt59ejHpjmuY9Ys7mXTHNfTqvdUG5+zx/t347aJHOOSI4U1ec/f37cq0eT9i1mNTuejr53Xo+MugntTmlpsBuIKOOeIURux/NIceeBwA106YxEH7jWbE/kdz/6x5nHfBZ992Tp+te3P+hWdw2IixjBx+POdfeAa9+/Tq7KGrQn5y2z2M//hZG+z7zJnjePTBxxm573E8+uDjfObMceuP1dTUcN5XPscj8x5r9pqXXHEBF593GSP//Vi223Eo+w//YIeNvwxSO1puBuAO9NfXXl//evMtepLS2//KDxr+YR6Y+wirVq5m9ao1PDD3EYaP2L8zh6kKeuIXv2LVqjUb7Bs+8gCm3X4PANNuv4cRh31k/bFPfPp47r9nDiteWdnk9fr178uWW23B00/+ruH8KTMZMeojTfZVg1pSm1tu7zgAR8SnKjmQ6pe4/SeTuO+BqZz8yePX7/3iV87mqWfmcuzHjuCKr09421nbDhrAogWL128vWriEbQcN6JQRq3P07bcNy5etAGD5shX07bcNAP237cdHRx3IrTdNbfbc/gP7s3TxsvXbSxcta7aUpQapHf/ktjEZ8H81dyAixkfEExHxxN/fXLURb1E9jjz0RA4+4FhOPPYzfOrTJ7Lvh4YB8I2vXc3e7zmIqXfM4NTxn8g8SnUF634T+uKl5/Ltr323yd+M9M7Vt6Pl1mIAjojfNNN+CzSbpqWUJqaUhqWUhvXcpE/FB90VLSmylFdeeZWZM37GXh943wbHp065myOOOvjt5y1ayqAhA9dvDxq8LUsWLe3YwapTrVj+Kv369wUaSgqvFuWGPfbcjW///0v52RM/4ZAjh3PxN7+wQXkCYNniZQwY+FbGO2BQf5YuWYaaV6YMeABwCnBkE21Fxw6temy+eU+22HKL9a8PHL4fzz/7B3bYcbv1fUaOGsELL/zpbefOnfMwBw7fj959etG7Ty8OHL4fc+c83GljV8ebc++DjB57OACjxx7OnFkPAnDwv43ho8Ma2n13z+GrF1zB7J8+sMG5y5et4K+vvc6eH9ij4fzjRzHnpw927geoMtWUAbc2D3gGsGVK6df/eCAi5nXIiKpQv/59uekH3wWgW/du3HXnDObOfphJt0xgp
522p74+seDlRXz+nEsA2HOvPRh36ljO/dxXWLVyNVde8T3unXsHAN/+5vdYtXJ1ts+ijfPf/+9r7LPfB+izTR/m/vpuvnvF9dww4WauvP4yjjvpKBYtWMI5n76o1ev8eM4POGZ4Q8nqqxdcwTcmXMymPTflodmP8ODsRzr6Y1S1uioq6URH158G9N61ev5tqNNss6lT7fR2zy37ZWzsNU7c7ug2x5wfvXTXRr/fxnAlnKRS6Qq13bZyHrCkUqlUDTgihkbE3Ih4NiKeiYiziv3bRMT9EfFC8efWxf6IiAkRMb+YrLB3a2M1AEsqlQouRa4Fzksp7Q7sC5wREbsDFwKzU0o7A7OLbYDDgJ2LNh64rrU3MABLKpVKTUNLKS1OKT1VvH4NeA4YDIwGJhfdJgPrbuAyGrg5NfgF0CciBtICA7CkUqlLqc2t8aKxoo1v6poRsT2wF/AYMCCltG756hLeWhMxGHi50WkLin3N8ks4SaXSnrucpZQmAhNb6hMRWwJTgbNTSmsi3po4kVJKEfGOv/UzA5ZUKpVciBERPWgIvj9MKf242L10XWmh+HPd0sSFwNBGpw8p9jXLACypVCpVA46GVHcS8FxK6cpGh6YD6+4pOg6Y1mj/KcVsiH2B1Y1KFU2yBCGpVCp4o/X9gJOB30bEutXAFwGXA1Mi4jTgJWDd7Q9nAqOA+cDfgFbvGGkAllQqlVrdm1J6GGhupdyIJvon4Iz2vIcBWFKp+Fh6ScqkKzzrra0MwJJKpZpucG8AllQqZsCSlEk13Q3NACypVKrphuwGYEmlYglCkjIxAEtSJs6CkKRMzIAlKRNnQUhSJnWpLTea7BoMwJJKxRqwJGViDViSMrEGLEmZ1FuCkKQ8zIAlKRNnQUhSJpYgJCkTSxCSlIkZsCRlYgYsSZnUpbrcQ2gzA7CkUnEpsiRl4lJkScrEDFiSMnEWhCRl4iwIScrEpciSlIk1YEnKxBqwJGViBixJmTgPWJIyMQOWpEycBSFJmVTTl3A1uQcgSZWUUmpza01EjIyI30fE/Ii4sNJjNQBLKpXUjn9aEhHdgGuBw4DdgRMiYvdKjtUALKlUKpgB7wPMTym9mFJ6E7gNGF3JsVoDllQqFawBDwZebrS9APj3Sl0cOiEAL139fHT0e1SLiBifUpqYexzqWvy5qKzaNxe2OeZExHhgfKNdEzvz78ISROca33oX/RPy5yKTlNLElNKwRq1x8F0IDG20PaTYVzEGYElq2uPAzhGxQ0RsAnwcmF7JN7AGLElNSCnVRsT/Be4FugE3ppSeqeR7GIA7l3U+NcWfiy4qpTQTmNlR149qWjctSWViDViSMjEAd5KOXtKo6hMRN0bEsoj4Xe6xKA8DcCfojCWNqkrfB0bmHoTyMQB3jg5f0qjqk1J6EHg19ziUjwG4czS1pHFwprFI6iIMwJKUiQG4c3T4kkZJ1ccA3Dk6fEmjpOpjAO4EKaVaYN2SxueAKZVe0qjqExG3Ao8Cu0TEgog4LfeY1LlcCSdJmZgBS1ImBmBJysQALEmZGIAlKRMDsCRlYgCWpEwMwJKUiQFYkjL5X//ocwyceT4HAAAAAElFTkSuQmCC\n", 467 | "text/plain": [ 468 | "" 469 | ] 470 | }, 471 | "metadata": {}, 472 | "output_type": "display_data" 473 | } 474 | ], 475 | "source": [ 476 | "pred_y = pipeline.predict(test_x)\n", 477 | "\n", 478 | "print(classification_report(test_y, pred_y))\n", 479 | "sns.heatmap(confusion_matrix(test_y, pred_y), fmt='.1f', annot=True)" 480 | ] 481 | } 482 | ], 483 | "metadata": { 484 | 
"kernelspec": { 485 | "display_name": "Python 3", 486 | "language": "python", 487 | "name": "python3" 488 | }, 489 | "language_info": { 490 | "codemirror_mode": { 491 | "name": "ipython", 492 | "version": 3 493 | }, 494 | "file_extension": ".py", 495 | "mimetype": "text/x-python", 496 | "name": "python", 497 | "nbconvert_exporter": "python", 498 | "pygments_lexer": "ipython3", 499 | "version": "3.6.4" 500 | } 501 | }, 502 | "nbformat": 4, 503 | "nbformat_minor": 2 504 | } 505 | -------------------------------------------------------------------------------- /spam-filtering-with-gensim.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 5, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "Populating the interactive namespace from numpy and matplotlib\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "%pylab inline\n", 18 | "import numpy as np \n", 19 | "import pandas as pd\n", 20 | "import seaborn as sns\n", 21 | "import string\n", 22 | "\n", 23 | "from nltk.corpus import stopwords\n", 24 | "from sklearn.model_selection import train_test_split\n", 25 | "from sklearn.pipeline import Pipeline\n", 26 | "\n", 27 | "from sklearn.naive_bayes import MultinomialNB\n", 28 | "from sklearn.linear_model import LogisticRegression\n", 29 | "from sklearn.metrics import classification_report,confusion_matrix\n", 30 | "\n", 31 | "from gensim.parsing.preprocessing import remove_stopwords\n", 32 | "from gensim.parsing.preprocessing import strip_punctuation\n", 33 | "from gensim.parsing.preprocessing import preprocess_string\n", 34 | "from gensim.corpora import Dictionary\n", 35 | "from gensim.models import TfidfModel\n", 36 | "from gensim.matutils import corpus2csc, corpus2dense" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "# Data Preprocessing\n", 44 | "\n", 45 | 
"- 약 13.5% 만이 spam으로 분류되어 있다. 즉 class imbalance 문제가 있다.\n", 46 | "- unique 메세지를 보면 duplicate text가 존재함을 알 수 있다." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 2, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "data": { 56 | "text/html": [ 57 | "
" 108 | ], 109 | "text/plain": [ 110 | " class text\n", 111 | "0 ham Go until jurong point, crazy.. Available only ...\n", 112 | "1 ham Ok lar... Joking wif u oni...\n", 113 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 114 | "3 ham U dun say so early hor... U c already then say...\n", 115 | "4 ham Nah I don't think he goes to usf, he lives aro..." 116 | ] 117 | }, 118 | "metadata": {}, 119 | "output_type": "display_data" 120 | }, 121 | { 122 | "data": { 123 | "text/html": [ 124 | "
" 181 | ], 182 | "text/plain": [ 183 | " text \n", 184 | " count unique top freq\n", 185 | "class \n", 186 | "ham 4825 4516 Sorry, I'll call later 30\n", 187 | "spam 747 653 Please call our customer service representativ... 4" 188 | ] 189 | }, 190 | "metadata": {}, 191 | "output_type": "display_data" 192 | } 193 | ], 194 | "source": [ 195 | "data = pd.read_csv('spam.csv', encoding='latin-1', usecols=(0, 1), names=('class', 'text'), skiprows=1)\n", 196 | "display(data.head())\n", 197 | "display(data.groupby('class').describe())" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "As the histograms below show, most ham messages are under 100 characters long, while spam messages cluster around 150 characters.
\n", 205 | "In other words, the two classes differ in length, and spam messages tend to be longer." 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 3, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "image/png": "\n", 216 | "text/plain": [ 217 | "" 218 | ] 219 | }, 220 | "metadata": {}, 221 | "output_type": "display_data" 222 | } 223 | ], 224 | "source": [ 225 | "data['length'] = data['text'].apply(len)\n", 226 | "ax = data.hist('length', by='class', bins=50, figsize=(15, 6))" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "# Preprocessing" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 6, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "0 [jurong, point, crazi, avail, bugi, great, wor...\n", 245 | "1 [lar, joke, wif, oni]\n", 246 | "2 [free, entri, wkly, comp, win, cup, final, tkt...\n", 247 | "3 [dun, earli, hor]\n", 248 | "4 [nah, think, goe, usf, live]\n", 249 | "Name: text, dtype: object" 250 | ] 251 | }, 252 | "metadata": {}, 253 | "output_type": "display_data" 254 | } 255 | ], 256 | "source": [ 257 | "def process_text(text):\n", 258 | "    text = remove_stopwords(text)\n", 259 | "    text = preprocess_string(text)\n", 260 | "    return text\n", 261 | "\n", 262 | "data['text'] = data['text'].apply(process_text)\n", 263 | "display(data['text'].head(5))" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 7, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "texts = data['text'].tolist()\n", 273 | "train_x, test_x, train_y, test_y = train_test_split(texts, data['class'], \n", 274 | "                                                    test_size=0.2)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## Dictionary" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 8, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/plain": [ 292 | "{'friend': 
0,\n", 293 | " 'afternoon': 1,\n", 294 | " 'babe': 2,\n", 295 | " 'fine': 3,\n", 296 | " 'funni': 4,\n", 297 | " 'movi': 5,\n", 298 | " 'saw': 6,\n", 299 | " 'tho': 7,\n", 300 | " 'town': 8,\n", 301 | " 'want': 9,\n", 302 | " 'yest': 10,\n", 303 | " 'chat': 11,\n", 304 | " 'cheap': 12,\n", 305 | " 'choos': 13,\n", 306 | " 'connect': 14,\n", 307 | " 'girl': 15,\n", 308 | " 'hard': 16,\n", 309 | " 'live': 17,\n", 310 | " 'min': 18,\n", 311 | " 'servic': 19,\n", 312 | " 'lesson': 20,\n", 313 | " 'practic': 21,\n", 314 | " 'start': 22,\n", 315 | " 'take': 23,\n", 316 | " 'fri': 24,\n", 317 | " 'leh': 25,\n", 318 | " 'lor': 26,\n", 319 | " 'said': 27,\n", 320 | " 'wait': 28,\n", 321 | " 'bath': 29,\n", 322 | " 'cup': 30,\n", 323 | " 'drink': 31,\n", 324 | " 'know': 32,\n", 325 | " 'need': 33,\n", 326 | " 'night': 34,\n", 327 | " 'warm': 35,\n", 328 | " 'work': 36,\n", 329 | " 'boytoi': 37,\n", 330 | " 'get': 38,\n", 331 | " 'hello': 39,\n", 332 | " 'job': 40,\n", 333 | " 'lazi': 41,\n", 334 | " 'miss': 42,\n", 335 | " 'net': 43,\n", 336 | " 'nice': 44,\n", 337 | " 'readi': 45,\n", 338 | " 'lover': 46,\n", 339 | " 'noon': 47,\n", 340 | " 'sure': 48,\n", 341 | " 'gonna': 49,\n", 342 | " 'normal': 50,\n", 343 | " 'see': 51,\n", 344 | " 'sorri': 52,\n", 345 | " 'time': 53,\n", 346 | " 'tuesdai': 54,\n", 347 | " 'believ': 55,\n", 348 | " 'dun': 56,\n", 349 | " 'muz': 57,\n", 350 | " 'thk': 58,\n", 351 | " 'told': 59,\n", 352 | " 'tot': 60,\n", 353 | " 'true': 61,\n", 354 | " 'didnt': 62,\n", 355 | " 'fix': 63,\n", 356 | " 'good': 64,\n", 357 | " 'goodnight': 65,\n", 358 | " 'sleep': 66,\n", 359 | " 'wake': 67,\n", 360 | " 'plai': 68,\n", 361 | " 'credit': 69,\n", 362 | " 'detail': 70,\n", 363 | " 'free': 71,\n", 364 | " 'great': 72,\n", 365 | " 'hous': 73,\n", 366 | " 'offer': 74,\n", 367 | " 'pl': 75,\n", 368 | " 'pound': 76,\n", 369 | " 'remind': 77,\n", 370 | " 'repli': 78,\n", 371 | " 'text': 79,\n", 372 | " 'valid': 80,\n", 373 | " 'adult': 81,\n", 374 | " 'content': 
82,\n", 375 | " 'video': 83,\n", 376 | " 'black': 84,\n", 377 | " 'blue': 85,\n", 378 | " 'contact': 86,\n", 379 | " 'di': 87,\n", 380 | " 'face': 88,\n", 381 | " 'green': 89,\n", 382 | " 'hot': 90,\n", 383 | " 'luv': 91,\n", 384 | " 'orang': 92,\n", 385 | " 'pass': 93,\n", 386 | " 'plz': 94,\n", 387 | " 'reali': 95,\n", 388 | " 'red': 96,\n", 389 | " 'smile': 97,\n", 390 | " 'swt': 98,\n", 391 | " 'thnk': 99,\n", 392 | " 'wana': 100,\n", 393 | " 'wat': 101,\n", 394 | " 'wid': 102,\n", 395 | " 'wnt': 103,\n", 396 | " 'car': 104,\n", 397 | " 'collect': 105,\n", 398 | " 'oredi': 106,\n", 399 | " 'bold': 107,\n", 400 | " 'love': 108,\n", 401 | " 'thank': 109,\n", 402 | " 'amt': 110,\n", 403 | " 'bui': 111,\n", 404 | " 'prob': 112,\n", 405 | " 'that': 113,\n", 406 | " 'urgent': 114,\n", 407 | " 'vry': 115,\n", 408 | " 'join': 116,\n", 409 | " 'todai': 117,\n", 410 | " 'train': 118,\n", 411 | " 'awesom': 119,\n", 412 | " 'lemm': 120,\n", 413 | " 'doctor': 121,\n", 414 | " 'hope': 122,\n", 415 | " 'left': 123,\n", 416 | " 'littl': 124,\n", 417 | " 'thing': 125,\n", 418 | " 'went': 126,\n", 419 | " 'haven': 127,\n", 420 | " 'past': 128,\n", 421 | " 'slept': 129,\n", 422 | " 'tire': 130,\n", 423 | " 'crave': 131,\n", 424 | " 'morn': 132,\n", 425 | " 'sexi': 133,\n", 426 | " 'think': 134,\n", 427 | " 'like': 135,\n", 428 | " 'lol': 136,\n", 429 | " 'cash': 137,\n", 430 | " 'chanc': 138,\n", 431 | " 'custcar': 139,\n", 432 | " 'txt': 140,\n", 433 | " 'win': 141,\n", 434 | " 'www': 142,\n", 435 | " 'howz': 143,\n", 436 | " 'person': 144,\n", 437 | " 'stori': 145,\n", 438 | " 'bit': 146,\n", 439 | " 'care': 147,\n", 440 | " 'drop': 148,\n", 441 | " 'heart': 149,\n", 442 | " 'life': 150,\n", 443 | " 'tear': 151,\n", 444 | " 'wil': 152,\n", 445 | " 'answer': 153,\n", 446 | " 'fight': 154,\n", 447 | " 'plu': 155,\n", 448 | " 'singl': 156,\n", 449 | " 'bu': 157,\n", 450 | " 'leav': 158,\n", 451 | " 'coz': 159,\n", 452 | " 'anymor': 160,\n", 453 | " 'home': 161,\n", 454 | " 'soon': 
162,\n", 455 | " 'stuff': 163,\n", 456 | " 'talk': 164,\n", 457 | " 'tonight': 165,\n", 458 | " 'got': 166,\n", 459 | " 'monei': 167,\n", 460 | " 'ask': 168,\n", 461 | " 'avail': 169,\n", 462 | " 'number': 170,\n", 463 | " 'co': 171,\n", 464 | " 'drive': 172,\n", 465 | " 'dunno': 173,\n", 466 | " 'lei': 174,\n", 467 | " 'sch': 175,\n", 468 | " 'address': 176,\n", 469 | " 'com': 177,\n", 470 | " 'complet': 178,\n", 471 | " 'locat': 179,\n", 472 | " 'post': 180,\n", 473 | " 'receiv': 181,\n", 474 | " 'week': 182,\n", 475 | " 'messag': 183,\n", 476 | " 'phone': 184,\n", 477 | " 'pick': 185,\n", 478 | " 'right': 186,\n", 479 | " 'send': 187,\n", 480 | " 'centr': 188,\n", 481 | " 'road': 189,\n", 482 | " 'childish': 190,\n", 483 | " 'express': 191,\n", 484 | " 'irrit': 192,\n", 485 | " 'lovabl': 193,\n", 486 | " 'naughti': 194,\n", 487 | " 'romant': 195,\n", 488 | " 'speak': 196,\n", 489 | " 'voic': 197,\n", 490 | " 'wast': 198,\n", 491 | " 'select': 199,\n", 492 | " 'await': 200,\n", 493 | " 'costa': 201,\n", 494 | " 'costå£': 202,\n", 495 | " 'del': 203,\n", 496 | " 'holidai': 204,\n", 497 | " 'maxmin': 205,\n", 498 | " 'pobox': 206,\n", 499 | " 'sae': 207,\n", 500 | " 'skxh': 208,\n", 501 | " 'sol': 209,\n", 502 | " 'stockport': 210,\n", 503 | " 'toclaim': 211,\n", 504 | " 'gui': 212,\n", 505 | " 'invit': 213,\n", 506 | " 'actual': 214,\n", 507 | " 'lot': 215,\n", 508 | " 'check': 216,\n", 509 | " 'hurt': 217,\n", 510 | " 'appli': 218,\n", 511 | " 'cost': 219,\n", 512 | " 'joke': 220,\n", 513 | " 'on': 221,\n", 514 | " 'school': 222,\n", 515 | " 'score': 223,\n", 516 | " 'sent': 224,\n", 517 | " 'come': 225,\n", 518 | " 'plan': 226,\n", 519 | " 'dinner': 227,\n", 520 | " 'list': 228,\n", 521 | " 'reason': 229,\n", 522 | " 'tell': 230,\n", 523 | " 'call': 231,\n", 524 | " 'darren': 232,\n", 525 | " 'final': 233,\n", 526 | " 'guess': 234,\n", 527 | " 'ju': 235,\n", 528 | " 'present': 236,\n", 529 | " 'shop': 237,\n", 530 | " 'wan': 238,\n", 531 | " 'wif': 239,\n", 532 
| " 'go': 240,\n", 533 | " 'hospit': 241,\n", 534 | " 'offic': 242,\n", 535 | " 'charg': 243,\n", 536 | " 'enjoi': 244,\n", 537 | " 'rington': 245,\n", 538 | " 'tone': 246,\n", 539 | " 'abiola': 247,\n", 540 | " 'bring': 248,\n", 541 | " 'india': 249,\n", 542 | " 'light': 250,\n", 543 | " 'lucki': 251,\n", 544 | " 'project': 252,\n", 545 | " 'trip': 253,\n", 546 | " 'activ': 254,\n", 547 | " 'game': 255,\n", 548 | " 'kei': 256,\n", 549 | " 'press': 257,\n", 550 | " 'save': 258,\n", 551 | " 'set': 259,\n", 552 | " 'forgot': 260,\n", 553 | " 'award': 261,\n", 554 | " 'bonu': 262,\n", 555 | " 'caller': 263,\n", 556 | " 'landlin': 264,\n", 557 | " 'mobil': 265,\n", 558 | " 'ppm': 266,\n", 559 | " 'prize': 267,\n", 560 | " 'try': 268,\n", 561 | " 'mean': 269,\n", 562 | " 'deliveri': 270,\n", 563 | " 'new': 271,\n", 564 | " 'tomorrow': 272,\n", 565 | " 'tri': 273,\n", 566 | " 'dint': 274,\n", 567 | " 'dont': 275,\n", 568 | " 'possibl': 276,\n", 569 | " 'ye': 277,\n", 570 | " 'aight': 278,\n", 571 | " 'fuck': 279,\n", 572 | " 'later': 280,\n", 573 | " 'lar': 281,\n", 574 | " 'watch': 282,\n", 575 | " 'jai': 283,\n", 576 | " 'cum': 284,\n", 577 | " 'end': 285,\n", 578 | " 'gettin': 286,\n", 579 | " 'heard': 287,\n", 580 | " 'line': 288,\n", 581 | " 'pic': 289,\n", 582 | " 'pix': 290,\n", 583 | " 'privat': 291,\n", 584 | " 'sam': 292,\n", 585 | " 'stop': 293,\n", 586 | " 'xxx': 294,\n", 587 | " 'cuz': 295,\n", 588 | " 'help': 296,\n", 589 | " 'problem': 297,\n", 590 | " 'anytim': 298,\n", 591 | " 'download': 299,\n", 592 | " 'inclus': 300,\n", 593 | " 'network': 301,\n", 594 | " 'knw': 302,\n", 595 | " 'fun': 303,\n", 596 | " 'comuk': 304,\n", 597 | " 'extra': 305,\n", 598 | " 'goto': 306,\n", 599 | " 'login': 307,\n", 600 | " 'sm': 308,\n", 601 | " 'unsubscrib': 309,\n", 602 | " 'aathi': 310,\n", 603 | " 'amp': 311,\n", 604 | " 'dai': 312,\n", 605 | " 'dear': 313,\n", 606 | " 'success': 314,\n", 607 | " 'let': 315,\n", 608 | " 'price': 316,\n", 609 | " 'question': 317,\n", 
610 | " 'solv': 318,\n", 611 | " 'student': 319,\n", 612 | " 'cheer': 320,\n", 613 | " 'alright': 321,\n", 614 | " 'man': 322,\n", 615 | " 'daddi': 323,\n", 616 | " 'look': 324,\n", 617 | " 'god': 325,\n", 618 | " 'happi': 326,\n", 619 | " 'nite': 327,\n", 620 | " 'rememb': 328,\n", 621 | " 'til': 329,\n", 622 | " 'stupid': 330,\n", 623 | " 'camera': 331,\n", 624 | " 'msg': 332,\n", 625 | " 'neva': 333,\n", 626 | " 'shuhui': 334,\n", 627 | " 'wun': 335,\n", 628 | " 'yup': 336,\n", 629 | " 'hear': 337,\n", 630 | " 'sat': 338,\n", 631 | " 'earlier': 339,\n", 632 | " 'easi': 340,\n", 633 | " 'make': 341,\n", 634 | " 'pai': 342,\n", 635 | " 'yr': 343,\n", 636 | " 'half': 344,\n", 637 | " 'run': 345,\n", 638 | " 'second': 346,\n", 639 | " 'usual': 347,\n", 640 | " 'info': 348,\n", 641 | " 'real': 349,\n", 642 | " 'special': 350,\n", 643 | " 'voucher': 351,\n", 644 | " 'won': 352,\n", 645 | " 'din': 353,\n", 646 | " 'door': 354,\n", 647 | " 'creat': 355,\n", 648 | " 'fill': 356,\n", 649 | " 'finger': 357,\n", 650 | " 'gap': 358,\n", 651 | " 'hand': 359,\n", 652 | " 'hold': 360,\n", 653 | " 'sun': 361,\n", 654 | " 'cool': 362,\n", 655 | " 'wai': 363,\n", 656 | " 'roommat': 364,\n", 657 | " 'ad': 365,\n", 658 | " 'support': 366,\n", 659 | " 'men': 367,\n", 660 | " 'mode': 368,\n", 661 | " 'yesterdai': 369,\n", 662 | " 'mistak': 370,\n", 663 | " 'shall': 371,\n", 664 | " 'food': 372,\n", 665 | " 'type': 373,\n", 666 | " 'came': 374,\n", 667 | " 'forget': 375,\n", 668 | " 'gbp': 376,\n", 669 | " 'point': 377,\n", 670 | " 'yeah': 378,\n", 671 | " 'oop': 379,\n", 672 | " 'box': 380,\n", 673 | " 'complimentari': 381,\n", 674 | " 'sell': 382,\n", 675 | " 'copi': 383,\n", 676 | " 'request': 384,\n", 677 | " 'follow': 385,\n", 678 | " 'link': 386,\n", 679 | " 'us': 387,\n", 680 | " 'date': 388,\n", 681 | " 'fall': 389,\n", 682 | " 'meet': 390,\n", 683 | " 'world': 391,\n", 684 | " 'guid': 392,\n", 685 | " 'honei': 393,\n", 686 | " 'moon': 394,\n", 687 | " 'onlin': 395,\n", 688 | " 
'slow': 396,\n", 689 | " 'uncl': 397,\n", 690 | " 'account': 398,\n", 691 | " 'custom': 399,\n", 692 | " 'mail': 400,\n", 693 | " 'argument': 401,\n", 694 | " 'correct': 402,\n", 695 | " 'kick': 403,\n", 696 | " 'lose': 404,\n", 697 | " 'situat': 405,\n", 698 | " 'clean': 406,\n", 699 | " 'finish': 407,\n", 700 | " 'hee': 408,\n", 701 | " 'lunch': 409,\n", 702 | " 'mum': 410,\n", 703 | " 'room': 411,\n", 704 | " 'stai': 412,\n", 705 | " 'yar': 413,\n", 706 | " 'coffe': 414,\n", 707 | " 'slave': 415,\n", 708 | " 'word': 416,\n", 709 | " 'ass': 417,\n", 710 | " 'pleasur': 418,\n", 711 | " 'scream': 419,\n", 712 | " 'best': 420,\n", 713 | " 'england': 421,\n", 714 | " 'oper': 422,\n", 715 | " 'origin': 423,\n", 716 | " 'poli': 424,\n", 717 | " 'rate': 425,\n", 718 | " 'darl': 426,\n", 719 | " 'freemsg': 427,\n", 720 | " 'hei': 428,\n", 721 | " 'std': 429,\n", 722 | " 'insid': 430,\n", 723 | " 'file': 431,\n", 724 | " 'import': 432,\n", 725 | " 'pizza': 433,\n", 726 | " 'gone': 434,\n", 727 | " 'cancer': 435,\n", 728 | " 'mondai': 436,\n", 729 | " 'sai': 437,\n", 730 | " 'fool': 438,\n", 731 | " 'grin': 439,\n", 732 | " 'hungri': 440,\n", 733 | " 'chang': 441,\n", 734 | " 'fanci': 442,\n", 735 | " 'arm': 443,\n", 736 | " 'bless': 444,\n", 737 | " 'open': 445,\n", 738 | " 'forward': 446,\n", 739 | " 'children': 447,\n", 740 | " 'mob': 448,\n", 741 | " 'nokia': 449,\n", 742 | " 'song': 450,\n", 743 | " 'zed': 451,\n", 744 | " 'brother': 452,\n", 745 | " 'advic': 453,\n", 746 | " 'wish': 454,\n", 747 | " 'year': 455,\n", 748 | " 'reach': 456,\n", 749 | " 'what': 457,\n", 750 | " 'gai': 458,\n", 751 | " 'updat': 459,\n", 752 | " 'wow': 460,\n", 753 | " 'babi': 461,\n", 754 | " 'cancel': 462,\n", 755 | " 'darlin': 463,\n", 756 | " 'fone': 464,\n", 757 | " 'kate': 465,\n", 758 | " 'ring': 466,\n", 759 | " 'sound': 467,\n", 760 | " 'understand': 468,\n", 761 | " 'boi': 469,\n", 762 | " 'caus': 470,\n", 763 | " 'dream': 471,\n", 764 | " 'kid': 472,\n", 765 | " 'mark': 473,\n", 
766 | " 'omg': 474,\n", 767 | " 'piss': 475,\n", 768 | " 'sort': 476,\n", 769 | " 'huh': 477,\n", 770 | " 'midnight': 478,\n", 771 | " 'paper': 479,\n", 772 | " 'write': 480,\n", 773 | " 'batteri': 481,\n", 774 | " 'dead': 482,\n", 775 | " 'wonder': 483,\n", 776 | " 'aha': 484,\n", 777 | " 'wed': 485,\n", 778 | " 'ill': 486,\n", 779 | " 'park': 487,\n", 780 | " 'claim': 488,\n", 781 | " 'congratul': 489,\n", 782 | " 'draw': 490,\n", 783 | " 'guarante': 491,\n", 784 | " 'repres': 492,\n", 785 | " 'balanc': 493,\n", 786 | " 'current': 494,\n", 787 | " 'land': 495,\n", 788 | " 'maxim': 496,\n", 789 | " 'row': 497,\n", 790 | " 'suit': 498,\n", 791 | " 'wjhl': 499,\n", 792 | " 'at': 500,\n", 793 | " 'sweet': 501,\n", 794 | " 'peopl': 502,\n", 795 | " 'shirt': 503,\n", 796 | " 'wear': 504,\n", 797 | " 'wot': 505,\n", 798 | " 'close': 506,\n", 799 | " 'includ': 507,\n", 800 | " 'januari': 508,\n", 801 | " 'maid': 509,\n", 802 | " 'murder': 510,\n", 803 | " 'loan': 511,\n", 804 | " 'welcom': 512,\n", 805 | " 'earli': 513,\n", 806 | " 'have': 514,\n", 807 | " 'pain': 515,\n", 808 | " 'prai': 516,\n", 809 | " 'remov': 517,\n", 810 | " 'dress': 518,\n", 811 | " 'pretti': 519,\n", 812 | " 'better': 520,\n", 813 | " 'feel': 521,\n", 814 | " 'round': 522,\n", 815 | " 'sick': 523,\n", 816 | " 'ladi': 524,\n", 817 | " 'dad': 525,\n", 818 | " 'parent': 526,\n", 819 | " 'sad': 527,\n", 820 | " 'wife': 528,\n", 821 | " 'player': 529,\n", 822 | " 'unsold': 530,\n", 823 | " 'ago': 531,\n", 824 | " 'liao': 532,\n", 825 | " 'long': 533,\n", 826 | " 'probabl': 534,\n", 827 | " 'swing': 535,\n", 828 | " 'okai': 536,\n", 829 | " 'small': 537,\n", 830 | " 'christma': 538,\n", 831 | " 'decid': 539,\n", 832 | " 'doesnt': 540,\n", 833 | " 'yep': 541,\n", 834 | " 'abt': 542,\n", 835 | " 'spl': 543,\n", 836 | " 'bak': 544,\n", 837 | " 'bore': 545,\n", 838 | " 'colleg': 546,\n", 839 | " 'havent': 547,\n", 840 | " 'isnt': 548,\n", 841 | " 'princess': 549,\n", 842 | " 'opinion': 550,\n", 843 | " 
'page': 551,\n", 844 | " 'read': 552,\n", 845 | " 'moment': 553,\n", 846 | " 'chikku': 554,\n", 847 | " 'carlo': 555,\n", 848 | " 'usf': 556,\n", 849 | " 'possess': 557,\n", 850 | " 'shit': 558,\n", 851 | " 'nope': 559,\n", 852 | " 'photo': 560,\n", 853 | " 'hair': 561,\n", 854 | " 'purchas': 562,\n", 855 | " 'opt': 563,\n", 856 | " 'kinda': 564,\n", 857 | " 'tomo': 565,\n", 858 | " 'contract': 566,\n", 859 | " 'doubl': 567,\n", 860 | " 'latest': 568,\n", 861 | " 'mnth': 569,\n", 862 | " 'motorola': 570,\n", 863 | " 'record': 571,\n", 864 | " 'dog': 572,\n", 865 | " 'gotta': 573,\n", 866 | " 'juz': 574,\n", 867 | " 'txtauction': 575,\n", 868 | " 'add': 576,\n", 869 | " 'pussi': 577,\n", 870 | " 'gud': 578,\n", 871 | " 'relat': 579,\n", 872 | " 'wit': 580,\n", 873 | " 'haf': 581,\n", 874 | " 'countri': 582,\n", 875 | " 'dvd': 583,\n", 876 | " 'quiz': 584,\n", 877 | " 'soni': 585,\n", 878 | " 'sunshin': 586,\n", 879 | " 'wkly': 587,\n", 880 | " 'hiya': 588,\n", 881 | " 'statement': 589,\n", 882 | " 'surpris': 590,\n", 883 | " 'alex': 591,\n", 884 | " 'anybodi': 592,\n", 885 | " 'thought': 593,\n", 886 | " 'bed': 594,\n", 887 | " 'book': 595,\n", 888 | " 'suppos': 596,\n", 889 | " 'thanx': 597,\n", 890 | " 'issu': 598,\n", 891 | " 'rent': 599,\n", 892 | " 'far': 600,\n", 893 | " 'how': 601,\n", 894 | " 'met': 602,\n", 895 | " 'si': 603,\n", 896 | " 'simpl': 604,\n", 897 | " 'rest': 605,\n", 898 | " 'an': 606,\n", 899 | " 'enter': 607,\n", 900 | " 'gift': 608,\n", 901 | " 'receipt': 609,\n", 902 | " 'regist': 610,\n", 903 | " 'subscrib': 611,\n", 904 | " 'cheaper': 612,\n", 905 | " 'nation': 613,\n", 906 | " 'sale': 614,\n", 907 | " 'biz': 615,\n", 908 | " 'optout': 616,\n", 909 | " 'fyi': 617,\n", 910 | " 'frnd': 618,\n", 911 | " 'wont': 619,\n", 912 | " 'ag': 620,\n", 913 | " 'difficult': 621,\n", 914 | " 'place': 622,\n", 915 | " 'boo': 623,\n", 916 | " 'laugh': 624,\n", 917 | " 'flower': 625,\n", 918 | " 'iv': 626,\n", 919 | " 'cut': 627,\n", 920 | " 'short': 
628,\n", 921 | " 'local': 629,\n", 922 | " 'marri': 630,\n", 923 | " 'match': 631,\n", 924 | " 'dat': 632,\n", 925 | " 'bugi': 633,\n", 926 | " 'late': 634,\n", 927 | " 'oso': 635,\n", 928 | " 'walk': 636,\n", 929 | " 'fridai': 637,\n", 930 | " 'pongal': 638,\n", 931 | " 'eat': 639,\n", 932 | " 'bad': 640,\n", 933 | " 'head': 641,\n", 934 | " 'big': 642,\n", 935 | " 'troubl': 643,\n", 936 | " 'apart': 644,\n", 937 | " 'laptop': 645,\n", 938 | " 'touch': 646,\n", 939 | " 'took': 647,\n", 940 | " 'appreci': 648,\n", 941 | " 'safe': 649,\n", 942 | " 'goodmorn': 650,\n", 943 | " 'king': 651,\n", 944 | " 'sed': 652,\n", 945 | " 'camcord': 653,\n", 946 | " 'unlimit': 654,\n", 947 | " 'bank': 655,\n", 948 | " 'fantasi': 656,\n", 949 | " 'horni': 657,\n", 950 | " 'turn': 658,\n", 951 | " 'comin': 659,\n", 952 | " 'abl': 660,\n", 953 | " 'ahead': 661,\n", 954 | " 'askin': 662,\n", 955 | " 'even': 663,\n", 956 | " 'felt': 664,\n", 957 | " 'month': 665,\n", 958 | " 'notic': 666,\n", 959 | " 'sir': 667,\n", 960 | " 'fact': 668,\n", 961 | " 'goe': 669,\n", 962 | " 'noe': 670,\n", 963 | " 'tmr': 671,\n", 964 | " 'differ': 672,\n", 965 | " 'weekend': 673,\n", 966 | " 'hmmm': 674,\n", 967 | " 'awai': 675,\n", 968 | " 'hr': 676,\n", 969 | " 'move': 677,\n", 970 | " 'build': 678,\n", 971 | " 'gym': 679,\n", 972 | " 'hurri': 680,\n", 973 | " 'worri': 681,\n", 974 | " 'spend': 682,\n", 975 | " 'happen': 683,\n", 976 | " 'code': 684,\n", 977 | " 'show': 685,\n", 978 | " 'birthdai': 686,\n", 979 | " 'return': 687,\n", 980 | " 'den': 688,\n", 981 | " 'hug': 689,\n", 982 | " 'rose': 690,\n", 983 | " 'hook': 691,\n", 984 | " 'kind': 692,\n", 985 | " 'wanna': 693,\n", 986 | " 'cover': 694,\n", 987 | " 'alrit': 695,\n", 988 | " 'doin': 696,\n", 989 | " 'expir': 697,\n", 990 | " 'identifi': 698,\n", 991 | " 'redeem': 699,\n", 992 | " 'mon': 700,\n", 993 | " 'mad': 701,\n", 994 | " 'tht': 702,\n", 995 | " 'askd': 703,\n", 996 | " 'silent': 704,\n", 997 | " 'especi': 705,\n", 998 | " 'lift': 
706,\n", 999 | " 'studi': 707,\n", 1000 | " 'minut': 708,\n", 1001 | " 'store': 709,\n", 1002 | " 'mom': 710,\n", 1003 | " 'gal': 711,\n", 1004 | " 'station': 712,\n", 1005 | " 'ev': 713,\n", 1006 | " 'xma': 714,\n", 1007 | " 'class': 715,\n", 1008 | " 'busi': 716,\n", 1009 | " 'sens': 717,\n", 1010 | " 'street': 718,\n", 1011 | " 'tonit': 719,\n", 1012 | " 'space': 720,\n", 1013 | " 'hit': 721,\n", 1014 | " 'obvious': 722,\n", 1015 | " 'famili': 723,\n", 1016 | " 'deal': 724,\n", 1017 | " 'eatin': 725,\n", 1018 | " 'fren': 726,\n", 1019 | " 'put': 727,\n", 1020 | " 'weekli': 728,\n", 1021 | " 'pub': 729,\n", 1022 | " 'valu': 730,\n", 1023 | " 'vodafon': 731,\n", 1024 | " 'disturb': 732,\n", 1025 | " 'mate': 733,\n", 1026 | " 'txting': 734,\n", 1027 | " 'qualiti': 735,\n", 1028 | " 'ttyl': 736,\n", 1029 | " 'shower': 737,\n", 1030 | " 'attempt': 738,\n", 1031 | " 'belli': 739,\n", 1032 | " 'loverboi': 740,\n", 1033 | " 'sigh': 741,\n", 1034 | " 'result': 742,\n", 1035 | " 'water': 743,\n", 1036 | " 'break': 744,\n", 1037 | " 'awak': 745,\n", 1038 | " 'feb': 746,\n", 1039 | " 'mid': 747,\n", 1040 | " 'cours': 748,\n", 1041 | " 'cinema': 749,\n", 1042 | " 'near': 750,\n", 1043 | " 'decim': 751,\n", 1044 | " 'meant': 752,\n", 1045 | " 'hmm': 753,\n", 1046 | " 'semest': 754,\n", 1047 | " 'discuss': 755,\n", 1048 | " 'http': 756,\n", 1049 | " 'chennai': 757,\n", 1050 | " 'ga': 758,\n", 1051 | " 'card': 759,\n", 1052 | " 'sim': 760,\n", 1053 | " 'choic': 761,\n", 1054 | " 'mayb': 762,\n", 1055 | " 'inform': 763,\n", 1056 | " 'pleas': 764,\n", 1057 | " 'process': 765,\n", 1058 | " 'earth': 766,\n", 1059 | " 'test': 767,\n", 1060 | " 'workin': 768,\n", 1061 | " 'attend': 769,\n", 1062 | " 'old': 770,\n", 1063 | " 'realiz': 771,\n", 1064 | " 'load': 772,\n", 1065 | " 'excel': 773,\n", 1066 | " 'expect': 774,\n", 1067 | " 'longer': 775,\n", 1068 | " 'print': 776,\n", 1069 | " 'kiss': 777,\n", 1070 | " 'till': 778,\n", 1071 | " 'visit': 779,\n", 1072 | " 'kalli': 780,\n", 
1073 | " 'drug': 781,\n", 1074 | " 'dude': 782,\n", 1075 | " 'parti': 783,\n", 1076 | " 'wors': 784,\n", 1077 | " 'hour': 785,\n", 1078 | " 'immedi': 786,\n", 1079 | " 'begin': 787,\n", 1080 | " 'bird': 788,\n", 1081 | " 'random': 789,\n", 1082 | " 'transfer': 790,\n", 1083 | " 'ard': 791,\n", 1084 | " 'mrt': 792,\n", 1085 | " 'sign': 793,\n", 1086 | " 'moral': 794,\n", 1087 | " 'wen': 795,\n", 1088 | " 'prepar': 796,\n", 1089 | " 'orchard': 797,\n", 1090 | " 'nyt': 798,\n", 1091 | " 'quot': 799,\n", 1092 | " 'lost': 800,\n", 1093 | " 'meh': 801,\n", 1094 | " 'mind': 802,\n", 1095 | " 'seen': 803,\n", 1096 | " 'wasn': 804,\n", 1097 | " 'nigeria': 805,\n", 1098 | " 'suck': 806,\n", 1099 | " 'stock': 807,\n", 1100 | " 'facebook': 808,\n", 1101 | " 'croydon': 809,\n", 1102 | " 'ntt': 810,\n", 1103 | " 'internet': 811,\n", 1104 | " 'librari': 812,\n", 1105 | " 'buck': 813,\n", 1106 | " 'stand': 814,\n", 1107 | " 'recent': 815,\n", 1108 | " 'review': 816,\n", 1109 | " 'haha': 817,\n", 1110 | " 'email': 818,\n", 1111 | " 'aft': 819,\n", 1112 | " 'ish': 820,\n", 1113 | " 'march': 821,\n", 1114 | " 'social': 822,\n", 1115 | " 'advanc': 823,\n", 1116 | " 'merri': 824,\n", 1117 | " 'ticket': 825,\n", 1118 | " 'discount': 826,\n", 1119 | " 'savamob': 827,\n", 1120 | " 'sub': 828,\n", 1121 | " 'worth': 829,\n", 1122 | " 'congrat': 830,\n", 1123 | " 'mother': 831,\n", 1124 | " 'respect': 832,\n", 1125 | " 'treat': 833,\n", 1126 | " 'promis': 834,\n", 1127 | " 'search': 835,\n", 1128 | " 'rakhesh': 836,\n", 1129 | " 'fetch': 837,\n", 1130 | " 'marriag': 838,\n", 1131 | " 'smth': 839,\n", 1132 | " 'ipod': 840,\n", 1133 | " 'yiju': 841,\n", 1134 | " 'cute': 842,\n", 1135 | " 'imma': 843,\n", 1136 | " 'airport': 844,\n", 1137 | " 'sister': 845,\n", 1138 | " 'ldn': 846,\n", 1139 | " 'rcvd': 847,\n", 1140 | " 'area': 848,\n", 1141 | " 'arrang': 849,\n", 1142 | " 'crazi': 850,\n", 1143 | " 'learn': 851,\n", 1144 | " 'film': 852,\n", 1145 | " 'avoid': 853,\n", 1146 | " 'idea': 854,\n", 
1147 | " 'smoke': 855,\n", 1148 | " 'track': 856,\n", 1149 | " 'boss': 857,\n", 1150 | " 'rain': 858,\n", 1151 | " 'sum': 859,\n", 1152 | " 'beauti': 860,\n", 1153 | " 'secret': 861,\n", 1154 | " 'log': 862,\n", 1155 | " 'fast': 863,\n", 1156 | " 'gave': 864,\n", 1157 | " 'interest': 865,\n", 1158 | " 'exactli': 866,\n", 1159 | " 'hotel': 867,\n", 1160 | " 'mth': 868,\n", 1161 | " 'greet': 869,\n", 1162 | " 'friendship': 870,\n", 1163 | " 'share': 871,\n", 1164 | " 'hang': 872,\n", 1165 | " 'goin': 873,\n", 1166 | " 'ti': 874,\n", 1167 | " 'atm': 875,\n", 1168 | " 'password': 876,\n", 1169 | " 'pin': 877,\n", 1170 | " 'sex': 878,\n", 1171 | " 'figur': 879,\n", 1172 | " 'evng': 880,\n", 1173 | " 'ahmad': 881,\n", 1174 | " 'pete': 882,\n", 1175 | " 'auction': 883,\n", 1176 | " 'rply': 884,\n", 1177 | " 'high': 885,\n", 1178 | " 'case': 886,\n", 1179 | " 'cook': 887,\n", 1180 | " 'teas': 888,\n", 1181 | " 'direct': 889,\n", 1182 | " 'eca': 890,\n", 1183 | " 'flirt': 891,\n", 1184 | " 'career': 892,\n", 1185 | " 'tuition': 893,\n", 1186 | " 'directli': 894,\n", 1187 | " 'kill': 895,\n", 1188 | " 'music': 896,\n", 1189 | " 'hill': 897,\n", 1190 | " 'exhaust': 898,\n", 1191 | " 'order': 899,\n", 1192 | " 'basic': 900,\n", 1193 | " 'manag': 901,\n", 1194 | " 'term': 902,\n", 1195 | " 'sofa': 903,\n", 1196 | " 'wine': 904,\n", 1197 | " 'decis': 905,\n", 1198 | " 'blood': 906,\n", 1199 | " 'tel': 907,\n", 1200 | " 'cal': 908,\n", 1201 | " 'reward': 909,\n", 1202 | " 'winner': 910,\n", 1203 | " 'moan': 911,\n", 1204 | " 'bslvyl': 912,\n", 1205 | " 'matur': 913,\n", 1206 | " 'horribl': 914,\n", 1207 | " 'knew': 915,\n", 1208 | " 'energi': 916,\n", 1209 | " 'buzz': 917,\n", 1210 | " 'exam': 918,\n", 1211 | " 'forev': 919,\n", 1212 | " 'btw': 920,\n", 1213 | " 'quick': 921,\n", 1214 | " 'cell': 922,\n", 1215 | " 'pictur': 923,\n", 1216 | " 'access': 924,\n", 1217 | " 'sport': 925,\n", 1218 | " 'izzit': 926,\n", 1219 | " 'spree': 927,\n", 1220 | " 'pop': 928,\n", 1221 | " 
'somebodi': 929,\n", 1222 | " 'name': 930,\n", 1223 | " 'aiyah': 931,\n", 1224 | " 'mrng': 932,\n", 1225 | " 'websit': 933,\n", 1226 | " 'clear': 934,\n", 1227 | " 'aftr': 935,\n", 1228 | " 'sit': 936,\n", 1229 | " 'tour': 937,\n", 1230 | " 'excus': 938,\n", 1231 | " 'rite': 939,\n", 1232 | " 'asap': 940,\n", 1233 | " 'respond': 941,\n", 1234 | " 'depend': 942,\n", 1235 | " 'father': 943,\n", 1236 | " 'isn': 944,\n", 1237 | " 'yun': 945,\n", 1238 | " 'convei': 946,\n", 1239 | " 'colour': 947,\n", 1240 | " 'offici': 948,\n", 1241 | " 'nah': 949,\n", 1242 | " 'trust': 950,\n", 1243 | " 'consid': 951,\n", 1244 | " 'bcoz': 952,\n", 1245 | " 'saturdai': 953,\n", 1246 | " 'sundai': 954,\n", 1247 | " 'asleep': 955,\n", 1248 | " 'woke': 956,\n", 1249 | " 'heavi': 957,\n", 1250 | " 'truth': 958,\n", 1251 | " 'wrong': 959,\n", 1252 | " 'deliv': 960,\n", 1253 | " 'weather': 961,\n", 1254 | " 'weed': 962,\n", 1255 | " 'imagin': 963,\n", 1256 | " 'club': 964,\n", 1257 | " 'season': 965,\n", 1258 | " 'menu': 966,\n", 1259 | " 'confirm': 967,\n", 1260 | " 'wiv': 968,\n", 1261 | " 'bout': 969,\n", 1262 | " 'hav': 970,\n", 1263 | " 'indian': 971,\n", 1264 | " 'cafe': 972,\n", 1265 | " 'deep': 973,\n", 1266 | " 'bedroom': 974,\n", 1267 | " 'cake': 975,\n", 1268 | " 'mood': 976,\n", 1269 | " 'island': 977,\n", 1270 | " 'outsid': 978,\n", 1271 | " 'sing': 979,\n", 1272 | " 'catch': 980,\n", 1273 | " 'vote': 981,\n", 1274 | " 'south': 982,\n", 1275 | " 'fantast': 983,\n", 1276 | " 'urawinn': 984,\n", 1277 | " 'user': 985,\n", 1278 | " 'john': 986,\n", 1279 | " 'joi': 987,\n", 1280 | " 'handset': 988,\n", 1281 | " 'kerala': 989,\n", 1282 | " 'refer': 990,\n", 1283 | " 'daili': 991,\n", 1284 | " 'futur': 992,\n", 1285 | " 'give': 993,\n", 1286 | " 'style': 994,\n", 1287 | " 'armand': 995,\n", 1288 | " 'jesu': 996,\n", 1289 | " 'ldew': 997,\n", 1290 | " 'valentin': 998,\n", 1291 | " 'tampa': 999,\n", 1292 | " ...}" 1293 | ] 1294 | }, 1295 | "execution_count": 8, 1296 | "metadata": {}, 
1297 | "output_type": "execute_result" 1298 | } 1299 | ], 1300 | "source": [ 1301 | "dictionary = Dictionary()\n", 1302 | "dictionary.add_documents(train_x)  # train_x is an iterable of token lists\n", 1303 | "dictionary.filter_extremes(keep_n=20000)  # defaults also drop tokens in fewer than 5 documents or in more than 50% of them\n", 1304 | "dictionary.token2id" 1305 | ] 1306 | }, 1307 | { 1308 | "cell_type": "code", 1309 | "execution_count": 9, 1310 | "metadata": {}, 1311 | "outputs": [ 1312 | { 1313 | "name": "stdout", 1314 | "output_type": "stream", 1315 | "text": [ 1316 | "[(0, 2)]\n", 1317 | "[(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]\n", 1318 | "[(20, 1), (21, 1), (22, 1), (23, 1)]\n" 1319 | ] 1320 | } 1321 | ], 1322 | "source": [ 1323 | "train_corpus = [dictionary.doc2bow(text) for text in train_x]\n", 1324 | "test_corpus = [dictionary.doc2bow(text) for text in test_x]\n", 1325 | "print(train_corpus[0])\n", 1326 | "print(train_corpus[1])\n", 1327 | "print(train_corpus[3])" 1328 | ] 1329 | }, 1330 | { 1331 | "cell_type": "markdown", 1332 | "metadata": {}, 1333 | "source": [ 1334 | "## TF-IDF" 1335 | ] 1336 | }, 1337 | { 1338 | "cell_type": "code", 1339 | "execution_count": 26, 1340 | "metadata": {}, 1341 | "outputs": [ 1342 | { 1343 | "name": "stdout", 1344 | "output_type": "stream", 1345 | "text": [ 1346 | "train_sparse_matrix: (4457, 1114)\n", 1347 | "test_sparse_matrix: (1115, 1114)\n" 1348 | ] 1349 | } 1350 | ], 1351 | "source": [ 1352 | "tfidf = TfidfModel(train_corpus)  # IDF weights are fitted on the training corpus only\n", 1353 | "train_sparse_matrix = corpus2csc(tfidf[train_corpus]).T  # transpose to (documents, terms)\n", 1354 | "test_sparse_matrix = corpus2csc(tfidf[test_corpus], num_terms=train_sparse_matrix.shape[1]).T  # same number of columns as the training matrix\n", 1355 | "\n", 1356 | "print('train_sparse_matrix:', train_sparse_matrix.shape)\n", 1357 | "print('test_sparse_matrix:', test_sparse_matrix.shape)" 1358 | ] 1359 | }, 1360 | { 1361 | "cell_type": "markdown", 1362 | "metadata": {}, 1363 | "source": [ 1364 | "# Naive Bayes Model" 1365 | ] 1366 | }, 1367 | { 1368 | "cell_type": "code", 1369 | "execution_count": 35, 1370 | "metadata": {}, 
1371 | "outputs": [ 1372 | { 1373 | "name": "stdout", 1374 | "output_type": "stream", 1375 | "text": [ 1376 | " precision recall f1-score support\n", 1377 | "\n", 1378 | " ham 0.97 0.99 0.98 947\n", 1379 | " spam 0.97 0.85 0.90 168\n", 1380 | "\n", 1381 | "avg / total 0.97 0.97 0.97 1115\n", 1382 | "\n" 1383 | ] 1384 | }, 1385 | { 1386 | "data": { 1387 | "text/plain": [ 1388 | "" 1389 | ] 1390 | }, 1391 | "execution_count": 35, 1392 | "metadata": {}, 1393 | "output_type": "execute_result" 1394 | }, 1395 | { 1396 | "data": { 1397 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAAD8CAYAAABJsn7AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFN5JREFUeJzt3XmUVtWZqPHnldKIA+CAyKSi0A7pJGrQqzHKjUNEWkWvIR2bGIzY5TVmkCRGk6yYTmJubK9LbeMUIhpQGxSJgoiI4thGUFS0HeINMSZQUIIDIIlZWlX7/lFHUkpRQ6iqXd/h+bn28px99jnf/tbCl9f37POdSCkhSep6W+SegCRtrgzAkpSJAViSMjEAS1ImBmBJysQALEmZGIAlKRMDsCRlYgCWpEyqOvsD3nv9FR+10wZ6Djg89xTUDdW9WxObeo32xJwtd95zkz9vU5gBS1ImnZ4BS1KXaqjPPYM2MwBLKpf6utwzaDMDsKRSSakh9xTazAAsqVwaDMCSlIcZsCRl4k04ScrEDFiS8kiugpCkTLwJJ0mZWIKQpEy8CSdJmZgBS1Im3oSTpEy8CSdJeaRkDViS8rAGLEmZWIKQpEzMgCUpk/r3cs+gzQzAksrFEoQkZWIJQpIyMQOWpEwMwJKUR/ImnCRlYg1YkjKxBCFJmZgBS1ImFZQBb5F7ApLUoVJD21srImJCRLwQEc9HxNSI2DoihkTEwohYEhG3RsRWxdiPFPtLiuN7tHZ9A7Ckcqmra3trQUQMBL4ODE8p/SPQA/gC8O/A5SmlocBbwPjilPHAW0X/5cW4FhmAJZVLB2bANJZpe0ZEFbANsAI4Eri9OD4ZOKnYHl3sUxw/KiKipYsbgCWVS0NDm1tEVEfEoiat+v3LpJRqgEuBP9EYeNcATwGrU0rvp8/LgIHF9kBgaXFuXTF+p5am6k04SeXSjlUQKaWJwMTmjkXEDjRmtUOA1cB0YGQHzHA9A7Ckcum4VRBHA39IKa0CiIhfA4cBfSKiqshyBwE1xfgaYDCwrChZ9AbeaOkDLEFIKpeOqwH/CTgkIrYparlHAS8CDwKfK8aMA2YW27OKfYrjD6SUUksfYAYsqVxaWd3QVimlhRFxO/A0UAc8Q2O54m5gWkRcVPRNKk6ZBNwUEUuAN2lcMdEiA7Ckcmk56WznpdIPgR9+qPsV4OBmxv4VGNOe6xuAJZVLBT0JZwCWVC4GYEnKxB/jkaRM6utzz6DNDMCSysUShCRlYgCWpEysAUtSHqmh49YBdzYDsKRysQQhSZm4CkKSMjED3jzcdNudzJg1l5QSnztxJKf988nrj/1q6gwuvep6Hr17Gjv06c3sex9g0i3TIcE22/TkB9/+KvsM23ODay5bXst5P7yY1WvWst/ew7j4wm+z5ZZbduXXUidZ8v8W8Pa6ddTXN
1BXV8chh47aYMzll/2Y40YeyV/eeYfx4yfwzOLnM8y0wlVQAPbnKP9Ov3vlVWbMmsvU669gxuRrePg3T/CnZcsBWPHaKn7zxNP077fL+vEDB+zKr666hDtuupb/ffqp/OiSK5u97uXX3sBp/3wS99x2A722344Zs+/tku+jrnH0MWMYftBnmw2+x408kmFDh7DPfp/m7LPP5+qrfpZhhiWQUttbZq0G4IjYJyLOj4gri3Z+ROzbFZPrzl55dSkf++je9Nx6a6qqejB8/49x/8OPAXDJlb/gm18ZT9O3QR3wsf3o3Wt7AD7+0X14beXrG1wzpcTCp57ls//zcABGjzqaBx55vPO/jLqFE044lptuaXzV2MInnqZ3n97suusurZylDbTjlUS5tRiAI+J8YBoQwBNFC2BqRFzQ+dPrvobuuTtPP/sCq9es5Z2//pVHH3+S2tdW8cCjj7NL352bLS+879ez7+XThwzfoH/1mrVsv922VFX1AKBf351ZuarFH9RXBUkpcc+cqSxccA9njh+7wfGBA3Zl2dLl6/drlq1g4IBdu3KK5dCQ2t4ya60GPB74aErpvaadEXEZ8AJwcWdNrLvba4/dOGPsGKonfJ+eW2/N3sP25N333uOXU25l4uU/3eh5Tzz1LL+ePY+brr20C2er7mDEZ05m+fJa+vbdibn3TOPll5fw6H8tzD2t8qmgVRCtlSAagAHN9PcvjjWr6ZtGr58ydVPm162dcsKx3HbDz5l8zf+l1/bbs9eQ3alZXssp477CZ08Zx2urXmfMGV/j9TfeBODlJX/gwouv4OcXX0if3r02uF6f3r14e92fqatr/AP02qrX2aVviy9VVQVZvrwWgFWr3mDmzHs46KD9P3C8Znktgwb/7T+3gYP6U1Oco7ZLDQ1tbrm1FoDPBeZHxD0RMbFoc4H5wDc2dlJKaWJKaXhKafiZXzq1I+fbrbzx1moAVtSuZP7DjzH6uKN55O5pzJsxmXkzJtOv785Mv+Hn7LzTjqyoXcm53/sJP7vwPPbYbVCz14sIDj7w48x76FEAZs65nyMPP7TLvo86zzbb9GS77bZdv33M0SN44YWXPzBm9ux5nDa28VVj/+PgA1m7Zi21tSu7fK4VrywliJTS3Ij4BxpfvzGw6K4BnkwpVU6e30kmfO8iVq9dS1VVFd//1lfotf12Gx177Y3/yZq1b3PRpVcD0KNHD267oXElxNnf+gE/uuBcdum7ExPOPoPzfngxP584hX3/YS/+1/Gf7ZLvos7Vr19fbp/e+OqwqqoeTJt2J/fOe4jqfz0NgIm/vIk598xn5Mgjefmlx/jLO+9w5pnfzDnlylVBvwURrby0c5O99/or+f+aUbfTc8Dhuaegbqju3ZpofVTL/vzjsW2OOdteeMsmf96m8EEMSeVSVzn/c24AllQuFVSCMABLKpducHOtrQzAkkqlOywvaysDsKRyMQOWpEwMwJKUSQU9imwAllQqvhNOknIxAEtSJq6CkKRMzIAlKRMDsCTlkeotQUhSHmbAkpSHy9AkKRcDsCRlUjkl4FbfCSdJFSXVNbS5tSYi+kTE7RHx24h4KSIOjYgdI+K+iPhd8e8dirEREVdGxJKIeC4iDmzt+gZgSeXS0I7Wuv8A5qaU9gE+AbwEXADMTykNo/EFxRcUY48DhhWtGri2tYsbgCWVSmpIbW4tiYjewBHAJICU0rsppdXAaGByMWwycFKxPRqYkhotAPpERP+WPsMALKlcOi4DHgKsAm6MiGci4vqI2Bbol1JaUYypBfoV2wOBpU3OX8bf3ibfLAOwpFJpTwYcEdURsahJq25yqSrgQODalNIBwJ/5W7mh8bMaXyv/dy+7cBWEpHJpxyqIlNJEYOJGDi8DlqWUFhb7t9MYgF+LiP4ppRVFiWFlcbwGGNzk/EFF30aZAUsqlVTX9tbidVKqBZZGxN5F11HAi8AsYFzRNw6YWWzPAr5UrIY4BFjTpFTRLDNgSaXSw
W+l/xpwS0RsBbwCfJnGxPW2iBgP/BH4fDF2DjAKWAL8pRjbIgOwpHLpwACcUloMDG/m0FHNjE3AOe25vgFYUql0cAbcqQzAkkrFACxJmaT6yD2FNjMASyoVM2BJyiQ1mAFLUhZmwJKUSUpmwJKUhRmwJGXS4CoIScrDm3CSlIkBWJIySZXzUmQDsKRyMQOWpExchiZJmdS7CkKS8jADlqRMrAFLUiaugpCkTMyAJSmT+obKedm7AVhSqViCkKRMGlwFIUl5uAxNkjKxBNFEr8Gf6eyPUAUavvOw3FNQSVmCkKRMXAUhSZlUUAXCACypXCxBSFImroKQpEwq6KXIBmBJ5ZIwA5akLOosQUhSHmbAkpSJNWBJysQMWJIyMQOWpEzqKygDrpyHpiWpDRqi7a0tIqJHRDwTEbOL/SERsTAilkTErRGxVdH/kWJ/SXF8j9aubQCWVCoNRJtbG30DeKnJ/r8Dl6eUhgJvAeOL/vHAW0X/5cW4FhmAJZVKakdrTUQMAv4JuL7YD+BI4PZiyGTgpGJ7dLFPcfyoYvxGGYAllUpDO1obXAF8p8nwnYDVKaW6Yn8ZMLDYHggsBSiOrynGb5QBWFKpNES0uUVEdUQsatKq379ORBwPrEwpPdVZc3UVhKRSqW/H2JTSRGDiRg4fBpwYEaOArYFewH8AfSKiqshyBwE1xfgaYDCwLCKqgN7AGy19vhmwpFLpqFUQKaXvppQGpZT2AL4APJBSGgs8CHyuGDYOmFlszyr2KY4/kFLLb6gzAEsqlU5YBfFh5wPfjIglNNZ4JxX9k4Cdiv5vAhe0diFLEJJKpTNeSZRSegh4qNh+BTi4mTF/Bca057oGYEml0tYHLLoDA7CkUvG3ICQpk3ozYEnKwwxYkjIxAEtSJhX0SjgDsKRyMQOWpEza8yhybgZgSaXiOmBJysQShCRlYgCWpEw647cgOosBWFKpWAOWpExcBSFJmTRUUBHCACypVLwJJ0mZVE7+awCWVDJmwJKUSV1UTg5sAJZUKpUTfg3AkkrGEoQkZeIyNEnKpHLCrwFYUslYgpCkTOorKAc2AEsqFTNgScokmQFLUh6VlAFvkXsCZTBoUH/mzp3G00/fz1NP3cc553x5/bGzzz6dxYvn89RT9/HTn3632fOPOWYEzz77AM8//zDf/vbZXTVtdYLvX/Yd5jx3B7c8cOMGx/7lrM+zYPlD9N6xNwDHnnw0N98/iZvn38DEWVcxdL+9mr1m/8G7Mmn2NUx/7BYuuu5CqrY0b2pJA6nNLTcDcAeoq6vnggsu4sADj2bEiJM466wvsc8+wzjiiEM5/vhjOPjg4/jkJ4/hiismbnDuFltswRVX/ITRo8dxwAFHM2bMieyzz7AM30Id4e5b5zJh7Hc26N9lQF8OHjGcFctq1/ctX7qCs0/5Bl886gxuvHwK373kW81e85zvn8XUX97OmMPGsnb1Ok48dVSnzb8MUjtabgbgDlBbu5LFi58HYN26P/Pb3y5hwIB+VFd/kUsvvYZ3330XgFWr3tjg3IMO2p/f//5VXn11Ke+99x7Tp9/F8ccf06XzV8dZvPA51r719gb95/7bV7nqol984L/6/170Am+vWQfA80+/SN/+fZu95vBPH8iDsx8GYM70uRwx8tMdP/ESqSO1ueX2dwfgiPhy66M2P7vtNoj99/8oTz65mKFDh3DYYQfzyCN3Mm/erXzykx/fYPyAAbuybNmK9fs1NSsYOHDXrpyyOtnhxx7GqtpVLHnx9xsdc8Kp/8SCB5/YoL/3jr15e8066usb3/OwcsUq+u7afKBWo9SOf3LblGLSj4ANC11ARFQD1QBVVTtSVbXdJnxM5dh2222YOvU6zjvvx7z99jqqqqrYccc+HHHESQwf/gluvvka9t3X7GVz8pGeH+H0r43l66eet9ExB35qf048dRTVJ32tC2dWXpV0E67FABwRz23sENBvY+ellCYCEwF69tw9/18zXaCqqoqpU6/j1
lvvZObMuUBjNnvnnY3bixY9S0NDAzvvvCOvv/7m+vOWL69l0KD+6/cHDuxPTU0tKodBuw+g/279ufn+SQD07d+XyfdO5IxRZ/PmqjcZuu+efO/S85jwxfNZ+9baDc5f8+Yatu+9HT169KC+vp5d+vdlVe2qrv4aFaU7ZLZt1VoG3A84FnjrQ/0B/KZTZlShrrvuEl5+eQlXXnn9+r677prHiBGH8sgjjzN06BC22mrLDwRfaAzMQ4cOYffdB7N8eS1jxpzA6ad/vaunr07y+9/+gVEfP3n9/h0Lp3H6cWex5s019Bu4Cz+7/if86Ov/h6WvLNvoNZ567Bk+c/wI7p/5AKPGjOTRex/riqlXrErKgFurAc8Gtksp/fFD7VXgoU6fXYX41KeGM3bsKYwY8SkWLJjDggVzOPbYzzB58m0MGbIbixbNY8qUqzjzzMa73P3778Idd/wKgPr6eiZMuJC77prC4sXzmTHjbl566XcZv402xY+v+QG/vOtqdt9rMLMWTeeEFlYsjJ8wjt479OK8n01gyn3Xc+M9v1h/7LKbLmbnfjsBcPVPf8Gp1WOY/tgt9N6hF7Omzun071HJ6lNqc8stUidPYnMpQah9PrHDkNxTUDe0YPlDsanX+JfdT25zzPnPP96xyZ+3KVzRLalUKqkG7DpgSaXS0I7WkogYHBEPRsSLEfFCRHyj6N8xIu6LiN8V/96h6I+IuDIilkTEcxFxYGtzNQBLKpUOfBS5DvhWSmk/4BDgnIjYD7gAmJ9SGgbML/YBjgOGFa0auLa1DzAASyqVjnoQI6W0IqX0dLH9NvASMBAYDUwuhk0GTiq2RwNTUqMFQJ+I6E8LrAFLKpXOWN0QEXsABwALgX4ppfcfX63lb89EDASWNjltWdG3go0wA5ZUKu0pQUREdUQsatKqP3y9iNgOmAGcm1L6wNMyqXEZ2d8d8c2AJZVKex7EaPrUbnMiYksag+8tKaVfF92vRUT/lNKKosSwsuivAQY3OX1Q0bdRZsCSSqWjasAREcAk4KWU0mVNDs0CxhXb44CZTfq/VKyGOARY06RU0SwzYEml0oE/tH4YcBrw3xGxuOj7HnAxcFtEjAf+CHy+ODYHGAUsAf4CtPqLkQZgSaXSUU/3ppT+i8bfvWnOUc2MT8A57fkMA7CkUvG19JKUSXd411tbGYAllUpn/8BYRzIASyoVM2BJyqSSfg3NACypVLrDD623lQFYUqlYgpCkTAzAkpSJqyAkKRMzYEnKxFUQkpRJfWrPD1LmZQCWVCrWgCUpE2vAkpSJNWBJyqTBEoQk5WEGLEmZuApCkjKxBCFJmViCkKRMzIAlKRMzYEnKpD7V555CmxmAJZWKjyJLUiY+iixJmZgBS1ImroKQpExcBSFJmfgosiRlYg1YkjKxBixJmZgBS1ImrgOWpEzMgCUpE1dBSFIm3oSTpEwsQUhSJj4JJ0mZmAFLUiaVVAOOSvrbotJFRHVKaWLueah78c/F5muL3BPYzFTnnoC6Jf9cbKYMwJKUiQFYkjIxAHct63xqjn8uNlPehJOkTMyAJSkTA3AXiYiREfFyRCyJiAtyz0f5RcQNEbEyIp7PPRflYQDuAhHRA7gaOA7YDzg1IvbLOyt1A78CRuaehPIxAHeNg4ElKaVXUkrvAtOA0ZnnpMxSSo8Ab+aeh/IxAHeNgcDSJvvLij5JmzEDsCRlYgDuGjXA4Cb7g4o+SZsxA3DXeBIYFhFDImIr4AvArMxzkpSZAbgLpJTqgK8C9wIvAbellF7IOyvlFhFTgceBvSNiWUSMzz0ndS2fhJOkTMyAJSkTA7AkZWIAlqRMDMCSlIkBWJIyMQBLUiYGYEnKxAAsSZn8fxlI5JLJAHE0AAAAAElFTkSuQmCC\n", 1398 | "text/plain": [ 1399 | "" 1400 | ] 1401 | }, 1402 | "metadata": {}, 1403 | "output_type": 
"display_data" 1404 | } 1405 | ], 1406 | "source": [ 1407 | "nb = MultinomialNB()\n", 1408 | "nb.fit(train_sparse_matrix, train_y)\n", 1409 | "\n", 1410 | "pred_y = nb.predict(test_sparse_matrix)\n", 1411 | "print(classification_report(test_y, pred_y))\n", 1412 | "sns.heatmap(confusion_matrix(test_y, pred_y), fmt='.1f', annot=True)" 1413 | ] 1414 | }, 1415 | { 1416 | "cell_type": "markdown", 1417 | "metadata": {}, 1418 | "source": [ 1419 | "# Logistic Regression Model" 1420 | ] 1421 | }, 1422 | { 1423 | "cell_type": "code", 1424 | "execution_count": 33, 1425 | "metadata": {}, 1426 | "outputs": [ 1427 | { 1428 | "name": "stdout", 1429 | "output_type": "stream", 1430 | "text": [ 1431 | " precision recall f1-score support\n", 1432 | "\n", 1433 | " ham 0.96 1.00 0.98 947\n", 1434 | " spam 0.97 0.75 0.85 168\n", 1435 | "\n", 1436 | "avg / total 0.96 0.96 0.96 1115\n", 1437 | "\n" 1438 | ] 1439 | }, 1440 | { 1441 | "data": { 1442 | "text/plain": [ 1443 | "" 1444 | ] 1445 | }, 1446 | "execution_count": 33, 1447 | "metadata": {}, 1448 | "output_type": "execute_result" 1449 | }, 1450 | { 1451 | "data": { 1452 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWAAAAD8CAYAAABJsn7AAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFLdJREFUeJzt3Xu4V1Wd+PH3B06IplwEJQTGSzqaWVmamYqKOCaaocPIKIWoNNR4N+cZvDWWlWmRpONlIjHRYRSUSnS0VBACBBFJMy8lw68ERG5y6alMzjnr98fZ0DEO5yLnnMV3+375rIe917581/fx8OFzPnvtvSOlhCSp/XXIPQBJeq8yAEtSJgZgScrEACxJmRiAJSkTA7AkZWIAlqRMDMCSlIkBWJIyqWrrD9i4erG32mkLO+7RP/cQtB2qfntZbOs5WhJz3tdzn23+vG1hBixJmbR5BixJ7aq2JvcIms0ALKlcaqpzj6DZDMCSSiWl2txDaDYDsKRyqTUAS1IeZsCSlIkX4SQpEzNgScojOQtCkjLxIpwkZWIJQpIy8SKcJGViBixJmXgRTpIy8SKcJOWRkjVgScrDGrAkZWIJQpIyMQOWpExqNuYeQbMZgCWViyUIScrEEoQkZWIGLEmZGIAlKY/kRThJysQasCRlYglCkjIxA5akTCooA+6QewCS1KpSbfNbEyLi0oh4MSJ+HRH3RkTniNg7Ip6OiEURMSkiOhX77lCsLyq279XU+Q3Aksqlurr5rRER0Qe4CDg0pXQQ0BE4A7gBGJtS2hdYC4wsDhkJrC36xxb7NcoALKlcWjEDpq5Mu2NEVAE7AcuB44AHiu0TgFOL5cHFOsX2gRERjZ3cACypXGprm90iYlRELKjXRm06TUppGTAGeI26wLseeBZYl1LalD4vBfoUy32AJcWx1cX+PRobqhfhJJVLC2ZBpJTGAeMa2hYR3anLavcG1gH3Aye2wgg3MwBLKpfWmwVxPPD/UkqrACLix8CRQLeIqCqy3L7AsmL/ZUA/YGlRsugKrGnsAyxBSCqX1qsBvwYcHhE7FbXcgcBLwJPAPxX7jAAeLJanFusU26enlFJjH2AGLKlcmpjd0Fwppacj4gFgIVAN/JK6csX/AvdFxDeLvvHFIeOBeyJiEfAmdTMmGmUAllQujSedLTxVuga45m+6FwOHNbDvW8DpLTm/AVhSuVTQnXAGYEnlYgCWpEx8GI8kZVJTk3sEzWYAllQuliAkKRMDsCRlYg1YkvJIta03D7itGYAllYslCEnKxFkQkpRJBWXAPg1tG9wz+aec+oUvM/jzX+KeST95x7a77p3CQUcOYu269QBMnzWX0876V4aMOJ+h517Ewud/3eA5X3zlVU4b/q8MGnou1429nSYepqQK06FDB56Z/3Me/MmELbZ16tSJ/5l4O6+8NJunZj/Ennv2zTDCEmjBA9lzMwC/S68u/h1Tpv6Me+/4PlMm3MbMp+bz2tLXAVi+YhVPzV9I7167b97/8EMO5scTbmPKhFv5xpWXcs31NzV43m+MuYWvjb6IRyaN57WlrzN73oJ2+T5qHxdd+EVeeeXVBrede86ZrF27ngMOPIrv3/xDvn3dVe08upJIqfktsyYDcEQcEBGjI+Lmoo2OiA+1x+C2Z4t/t4SPfHh/duzcmaqqjhx68Ed4YuYcAL5z8w/4ynkjqf82qJ122pFNr4f681tvQQOvilq1+k3++Mc/8bGDPkRE8LkTBzJ91tx2+T5qe3369OakQQO58857G9z+uVNO4J577gdgypT/5bgBR7Xn8MqjLBlwRIwG7gMCmF+0AO6NiMvbfnjbr3332ZOFz7/IuvUb+PNbbzFr7jO8sWIV02fNZffdenLAfvtsccwTM+dwypn/wnn/9h9848pLt9i+YtVqeu3ec/N6r916smJVow/UVwW58Xtf5/IrvkntVv7i79HnAywpfouqqalh/foN9OjRvT2HWA6
1qfkts6Yuwo0EPpxS2li/MyJuBF4Erm+rgW3vPrjX33Hu509n1KVXsWPnzuy/3z68vXEjP7x7EuPGfqvBY44/5kiOP+ZIFjz3Arf88G7uuOnb7Txq5XLyScezcuVqFv7yBY45+tO5h1NuFTQLoqkSRC2wRwP9vYttDar/ptE77m74160yGHLKZ5h8538y4bbv0mWXXfjg3nuy7PU3GDLiPE4YMoIVq1Zz+rkXsnrNm+847tCDP8LS19/YfIFuk1679WTFytWb11esWk2v3Rp9qaoqxBFHHMopnz2BRb+dx8T/vo0BA45kwl03v2Of15e9Qb++dX/dOnbsSNeuXVizZm2O4Va0VFvb7JZbUxnwJcC0iHiV4nXLwN8B+wIXbO2g+m8a3bh6cf48v42sWbuOHt27sfyNlUybOYeJ48YyfOipm7efMGQEk8bfTPduXXlt6ev069ObiOCl3yzi7bc30q1rl3ecb7eeu/L+9+/E879+mY9++ACm/mwaw4ac0t5fS23gqquv56qr635hPOboT/OVS7/MiLMvesc+Dz38GMOHn868p59lyJCTeXLGnBxDrXzbQWmhuRoNwCmln0XE31P3+o0+Rfcy4JmUUuXk+W3k0iu/yboNG6iqquKqy86jyy47b3Xfx2fMZuqj06iqqqLzDp0Yc+3lmy/KDRlxPlMm3ArA1Zedz9XfupG3/vIX+h/+Sfp/+pPt8l2Ux9eu+TcWPPs8Dz/8OHf+6D4m3HUzr7w0m7Vr1zHsC+flHl5lqqBnQURbzzMtcwasd2/HPfrnHoK2Q9VvL9tyelAL/fHazzc75rz/PyZu8+dtC++Ek1Qu1ZXzy7kBWFK5VFAJwgAsqVzKchFOkirN9jC9rLkMwJLKxQxYkjIxAEtSJhV0K7IBWFKp+E44ScrFACxJmTgLQpIyMQOWpEwMwJKUR6qxBCFJeZgBS1IeTkOTpFwMwJKUSeWUgJt8KackVZRUXdvs1pSI6BYRD0TEKxHxckR8OiJ2jYjHI+LV4s/uxb4RETdHxKKI+FVEfKKp8xuAJZVLbQta024CfpZSOgD4GPAycDkwLaW0HzCtWAcYBOxXtFHA7U2d3AAsqVRSbWp2a0xEdAWOBsYDpJTeTimtAwYDE4rdJgCbXoU+GLg71ZkHdIuI3o19hgFYUrm0Xga8N7AK+FFE/DIi7oiI9wO9UkrLi33eAHoVy32AJfWOX8pf3ybfIAOwpFJpSQYcEaMiYkG9NqreqaqATwC3p5Q+DvyRv5Yb6j6r7rXy73rahbMgJJVLC2ZBpJTGAeO2snkpsDSl9HSx/gB1AXhFRPROKS0vSgwri+3LgH71ju9b9G2VGbCkUknVzW+NnielN4AlEbF/0TUQeAmYCowo+kYADxbLU4GzitkQhwPr65UqGmQGLKlUWvmt9BcCEyOiE7AYOIe6xHVyRIwEfg8MLfZ9BDgJWAT8qdi3UQZgSeXSigE4pfQccGgDmwY2sG8Czm/J+Q3AkkqllTPgNmUAllQqBmBJyiTVRO4hNJsBWFKpmAFLUiap1gxYkrIwA5akTFIyA5akLMyAJSmTWmdBSFIeXoSTpEwMwJKUSaqclyIbgCWVixmwJGXiNDRJyqTGWRCSlIcZsCRlYg1YkjJxFoQkZWIGLEmZ1NRWzsveDcCSSsUShCRlUussCEnKw2lokpSJJYh6dt/rhLb+CFWgj/bYO/cQVFKWICQpE2dBSFImFVSBMABLKhdLEJKUibMgJCmTCnopsgFYUrkkzIAlKYtqSxCSlIcZsCRlYg1YkjIxA5akTMyAJSmTmgrKgCvnpmlJaobaaH5rjojoGBG/jIiHi/W9I+LpiFgUEZMiolPRv0OxvqjYvldT5zYASyqVWqLZrZkuBl6ut34DMDaltC+wFhhZ9I8E1hb9Y4v9GmUAllQqqQWtKRHRFzgZuKNYD+A44IFilwnAqcXy4GKdYvvAYv+tMgBLKpXaFrRm+D7w7/V27wGsSylVF+tLgT7
Fch9gCUCxfX2x/1YZgCWVSm1Es1tEjIqIBfXaqE3niYjPAitTSs+21VidBSGpVGpasG9KaRwwbiubjwQ+FxEnAZ2BLsBNQLeIqCqy3L7AsmL/ZUA/YGlEVAFdgTWNfb4ZsKRSaa1ZECmlK1JKfVNKewFnANNTSp8HngT+qdhtBPBgsTy1WKfYPj2lxt9QZwCWVCptMAvib40GvhIRi6ir8Y4v+scDPYr+rwCXN3UiSxCSSqUtXkmUUpoBzCiWFwOHNbDPW8DpLTmvAVhSqTT3BovtgQFYUqn4LAhJyqTGDFiS8jADlqRMDMCSlEkFvRLOACypXMyAJSmTltyKnJsBWFKpOA9YkjKxBCFJmRiAJSmTtngWRFsxAEsqFWvAkpSJsyAkKZPaCipCGIAllYoX4SQpk8rJfw3AkkrGDFiSMqmOysmBDcCSSqVywq8BWFLJWIKQpEychiZJmVRO+DUASyoZSxCSlElNBeXABmBJpWIGLEmZJDNgScqjkjLgDrkHUCYdOnRg5pyp3Hf/OADGjf8e8xc+xlPzH+E/b/s2VVUN/3t3xrDTWPDcEyx47gnOGHZaew5ZreyaG6/giRceYvKTd2/uu+Sr5zFl1kQmTbuLMXdex85ddt68bb8PfZC7Hvov7p9xD5OmT6DTDp22OGeXbrtw231j+emce7ntvrHs0nWXdvkulaqW1OyWmwG4FX35vLP57W8WbV6/f9JUDvvECRxx2EnsuGNnzjp76BbHdOveldFXXMjxA4Yw8Nh/ZPQVF9K1W5f2HLZa0UOTH+GCYZe9o2/eL55h6LFn8c8Dz+a1/1vCuRcOB6Bjx45885av8q3RYzj92OGMGnIh1RurtzjnORd8gfmzn+XUI89k/uxnOeeCL7THV6lYqQUtNwNwK9ljjw9wwonHcveEyZv7Hn9s5ublZxf8ij36fGCL4wYe358ZT85h3dr1rF+3gRlPzuH4fzi6Xcas1rdw3vOsX7vhHX3zZj5DTU3dY8JfWPgiu++xGwCHH/NJXn35/3j1pbp/tNev3UBt7Za/QB/zmf48PPlRAB6e/CjHnti/Lb9CxasmNbvl9q4DcESc05oDqXTXfedqrrn6Bmprt/yfWlVVxT+feSrTHv/FFtt69+7F0qXLN68vW/YGvXv3atOxKp/BZ5zMU9PnAbDnB/uRUuLWe7/HxMfGM+K8YQ0e02O37qxeuQaA1SvX0GO37u023kqUWvBfbtuSAX99axsiYlRELIiIBX/ZuGFru5XGZ04cwOpVa3j+uRcb3D5m7Nd5as585j61oJ1Hpu3JyIvPorqmhkemPAZAx45VHHzYR7nq/GsZOfg8Bgw6msOOOqTJ86T8cWO7VtuCllujsyAi4ldb2wRsNU1LKY0DxgF033nf0v+4fOrwQzjxpIH8wwnHsEPnHdhll535wR3f40tfvIx/v+JCevbcleHDrm7w2OXLV3BU/09tXu/T5wPMnvV0ew1d7eSUoYPof/wRfHnoxZv7VixfycJ5z7PuzfUAzJ4+lwM+8vfMn/3sO45ds2otPXfvweqVa+i5ew/eXL22XcdeabaHzLa5msqAewFnAac00Na07dAqx7VfG8NB+x/Fxz58LCPPvoRZM+fypS9exvARQxk4sD9fPOcS0lbSlmlPzGLAcUfRtVsXunbrwoDjjmLaE7Pa+RuoLR0x4FOMOH8Yl5x9OW/9+S+b++fOmM++H9qHzjvuQMeOHTnk8I+z+Le/2+L4Xzw2m88OHQTAZ4cOYubP/floTGkyYOBhYOeU0nN/uyEiZrTJiErkxpuuZclrr/PY9PsBeGjqY3z3+ls4+OMHcc7IYVx8wZWsW7ue795wK9Nn/gSA71x/C+vWrs85bG2D6277GocccTDddu3Go8/+mP8aM55zLxzO+zq9j9vvGwvUXYi7bvQY/rD+D0z8wSTuefQOUkrMmTaX2dPmAvDVMaN54J6f8vLzv+FHt/w3N/zgWk4982SWL13B6C99NedX3O7
VVFCNJraWmbWW90IJQi239y5bzgiRFi6fHdt6jmF7ntbsmPM/v//JNn/etvBOOEmlUqYasCRVlNaqAUdEv4h4MiJeiogXI+Lion/XiHg8Il4t/uxe9EdE3BwRiyLiVxHxiabGagCWVCqteCtyNXBZSulA4HDg/Ig4ELgcmJZS2g+YVqwDDAL2K9oo4PamPsAALKlUWutGjJTS8pTSwmL5D8DLQB9gMDCh2G0CcGqxPBi4O9WZB3SLiN6NfYYBWFKp1KTU7Fb/prGijWronBGxF/Bx4GmgV0pp0+2rb/DXeyL6AEvqHba06NsqL8JJKpWWPOWs/k1jWxMROwNTgEtSShsi/jpxIqWUIuJdX/UzA5ZUKq15I0ZEvI+64DsxpfTjonvFptJC8efKon8Z0K/e4X2Lvq0yAEsqldaqAUddqjseeDmldGO9TVOBEcXyCODBev1nFbMhDgfW1ytVNMgShKRSacUHrR8JDAdeiIhNdwNfCVwPTI6IkcDvgU0P+n4EOAlYBPwJaPKJkQZgSaXSWnf3ppRmU/fgsYYMbGD/BJzfks8wAEsqFV9LL0mZbA/vemsuA7CkUmnrB4y1JgOwpFIxA5akTCrpaWgGYEmlUkkPZDcASyoVSxCSlIkBWJIycRaEJGViBixJmTgLQpIyqUnNedDk9sEALKlUrAFLUibWgCUpE2vAkpRJrSUIScrDDFiSMnEWhCRlYglCkjKxBCFJmZgBS1ImZsCSlElNqsk9hGYzAEsqFW9FlqRMvBVZkjIxA5akTJwFIUmZOAtCkjLxVmRJysQasCRlYg1YkjIxA5akTJwHLEmZmAFLUibOgpCkTLwIJ0mZWIKQpEy8E06SMjEDlqRMKqkGHJX0r0Wli4hRKaVxuceh7Ys/F+9dHXIP4D1mVO4BaLvkz8V7lAFYkjIxAEtSJgbg9mWdTw3x5+I9yotwkpSJGbAkZWIAbicRcWJE/CYiFkXE5bnHo/wi4s6IWBkRv849FuVhAG4HEdERuBUYBBwInBkRB+YdlbYDdwEn5h6E8jEAt4/DgEUppcUppbeB+4DBmcekzFJKvwDezD0O5WMAbh99gCX11pcWfZLewwzAkpSJAbh9LAP61VvvW/RJeg8zALePZ4D9ImLviOgEnAFMzTwmSZkZgNtBSqkauAD4OfAyMDml9GLeUSm3iLgXmAvsHxFLI2Jk7jGpfXknnCRlYgYsSZkYgCUpEwOwJGViAJakTAzAkpSJAViSMjEAS1ImBmBJyuT/A8Jo1Mk5zIoBAAAAAElFTkSuQmCC\n", 1453 | "text/plain": [ 1454 | "" 1455 | ] 1456 | }, 1457 | "metadata": {}, 1458 | "output_type": "display_data" 1459 | } 1460 | ], 1461 | "source": [ 1462 | "lr = LogisticRegression()\n", 1463 | "lr.fit(train_sparse_matrix, train_y)\n", 1464 | "\n", 1465 | "pred_y = lr.predict(test_sparse_matrix)\n", 1466 | "print(classification_report(test_y, pred_y))\n", 1467 | "sns.heatmap(confusion_matrix(test_y, pred_y), fmt='.1f', annot=True)" 1468 | ] 1469 | } 1470 | ], 1471 | "metadata": { 1472 | "kernelspec": { 1473 | "display_name": "Python 3", 1474 | "language": "python", 1475 | "name": "python3" 1476 | }, 1477 | "language_info": { 1478 | "codemirror_mode": { 1479 | "name": "ipython", 1480 | "version": 
3 1481 | }, 1482 | "file_extension": ".py", 1483 | "mimetype": "text/x-python", 1484 | "name": "python", 1485 | "nbconvert_exporter": "python", 1486 | "pygments_lexer": "ipython3", 1487 | "version": "3.6.4" 1488 | } 1489 | }, 1490 | "nbformat": 4, 1491 | "nbformat_minor": 2 1492 | } 1493 | --------------------------------------------------------------------------------