├── AnaliseTexto
│   ├── Mineração de Textos.pdf
│   ├── .ipynb_checkpoints
│   │   ├── TextVector-checkpoint.ipynb
│   │   └── tutorial-checkpoint.ipynb
│   └── AnaliseDeSentimento.ipynb
├── .idea
│   └── vcs.xml
└── textAnalisis
    └── .ipynb_checkpoints
        ├── TextVector-checkpoint.ipynb
        ├── TextAnalise-Plin-checkpoint.ipynb
        └── tutorial-checkpoint.ipynb
/AnaliseTexto/Mineração de Textos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sandeco/01-AulasDataScience/HEAD/AnaliseTexto/Mineração de Textos.pdf
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/AnaliseTexto/.ipynb_checkpoints/TextVector-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 73,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np\n",
64 | "\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {
70 | "nbpresent": {
71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
72 | }
73 | },
74 | "source": [
75 | "## 1. Definindo um vetor de textos \n",
76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
77 | "pdf's, doc's, twitter's... etc.\n",
78 | "\n",
79 | "Esses textos serão a base de treinamento\n",
80 | "para a classificação do sentimento de um novo texto."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 88,
86 | "metadata": {
87 | "collapsed": false,
88 | "nbpresent": {
89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
90 | }
91 | },
92 | "outputs": [],
93 | "source": [
94 | "train = [\n",
95 | " 'Eu te amo e não existe nada melhor que você',\n",
96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
97 | " 'Eu te odeio muito, você não presta!',\n",
98 | " 'Não gosto de você'\n",
99 | " \n",
100 | " ]\n",
101 | "\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "nbpresent": {
108 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
109 | }
110 | },
111 | "source": [
112 | "## 2. Definindo um vetor de sentimentos\n",
113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
114 | "\n",
115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
116 | "\n",
117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
118 | "\n",
119 | "> 'Eu te amo e não existe nada melhor que você'\n",
120 | "\n",
121 | "Foi classificada como sendo um texto **BOM**:\n",
122 | "\n",
123 | "> 1"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 89,
129 | "metadata": {
130 | "collapsed": true,
131 | "nbpresent": {
132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
133 | }
134 | },
135 | "outputs": [],
136 | "source": [
137 | "felling = [1,1,0,0]"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {
143 | "nbpresent": {
144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
145 | }
146 | },
147 | "source": [
148 | "## 3. Análise de texto com _scikit-learn_.\n",
149 | "\n",
150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
151 | "\n",
152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
153 | "\n",
154 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
155 | "\n",
156 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
157 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
158 | "\n",
159 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 90,
165 | "metadata": {
166 | "collapsed": false,
167 | "nbpresent": {
168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
169 | }
170 | },
171 | "outputs": [],
172 | "source": [
173 | "from sklearn.feature_extraction.text import CountVectorizer\n",
174 | "vect = CountVectorizer()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {
180 | "nbpresent": {
181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
182 | }
183 | },
184 | "source": [
185 | "## 4. Treinamento criando o dicionário.\n",
186 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 96,
192 | "metadata": {
193 | "collapsed": false,
194 | "nbpresent": {
195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
196 | }
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
203 | " dtype=, encoding='utf-8', input='content',\n",
204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
207 | " tokenizer=None, vocabulary=None)"
208 | ]
209 | },
210 | "execution_count": 96,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "vect.fit(train)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
224 | ]
225 | },
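{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens such as 'e'.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te amo e não existe nada melhor que você')"
]
},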
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "nbpresent": {
230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
231 | }
232 | },
233 | "source": [
234 | "## 5. Nosso dicionário\n",
235 | "Aqui vamos listar de forma única\n",
236 | "quais palavras forma utilizadas no texto, formando assim um dicionário de palavras."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 95,
242 | "metadata": {
243 | "collapsed": false,
244 | "nbpresent": {
245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "['algo',\n",
253 | " 'amo',\n",
254 | " 'amor',\n",
255 | " 'ao',\n",
256 | " 'assim',\n",
257 | " 'de',\n",
258 | " 'eu',\n",
259 | " 'existe',\n",
260 | " 'gosto',\n",
261 | " 'melhor',\n",
262 | " 'meu',\n",
263 | " 'mim',\n",
264 | " 'muito',\n",
265 | " 'nada',\n",
266 | " 'não',\n",
267 | " 'odeio',\n",
268 | " 'pra',\n",
269 | " 'presta',\n",
270 | " 'que',\n",
271 | " 'te',\n",
272 | " 'tudo',\n",
273 | " 'você']"
274 | ]
275 | },
276 | "execution_count": 95,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "## examinando o dicionário criado em ordem alfabética.\n",
283 | "vect.get_feature_names()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "## 6. Transformação em matriz esparsa em relação as frases\n",
291 | "Essa transformação é importante porque cria uma matriz onde:\n",
292 | "\n",
293 | "1. Cada linha representa um texto do vetor **_train_** \n",
294 | "2. Cada coluna uma palavra do dicionário aprendido.\n",
295 | "3. Se a palavra ocorrer no texto o valor será 1 caso contrário 0.\n",
296 | "\n",
297 | "\n"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 93,
303 | "metadata": {
304 | "collapsed": false,
305 | "nbpresent": {
306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
307 | }
308 | },
309 | "outputs": [],
310 | "source": [
311 | "simple_train_dtm = vect.transform(text)\n",
312 | "ocorrencias = simple_train_dtm.toarray()"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 94,
318 | "metadata": {
319 | "collapsed": false,
320 | "nbpresent": {
321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d"
322 | }
323 | },
324 | "outputs": [
325 | {
326 | "data": {
327 | "text/plain": [
328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n",
329 | " [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n",
330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n",
331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])"
332 | ]
333 | },
334 | "execution_count": 94,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "ocorrencias"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 56,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " assim | \n",
364 | " eu | \n",
365 | " existe | \n",
366 | " melhor | \n",
367 | " mim | \n",
368 | " muito | \n",
369 | " nada | \n",
370 | " não | \n",
371 | " odeio | \n",
372 | " pra | \n",
373 | " presta | \n",
374 | " que | \n",
375 | " te | \n",
376 | " tudo | \n",
377 | " você | \n",
378 | "
\n",
379 | " \n",
380 | " \n",
381 | " \n",
382 | " | 0 | \n",
383 | " 0 | \n",
384 | " 1 | \n",
385 | " 0 | \n",
386 | " 1 | \n",
387 | " 1 | \n",
388 | " 1 | \n",
389 | " 0 | \n",
390 | " 0 | \n",
391 | " 1 | \n",
392 | " 1 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 1 | \n",
397 | " 1 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | "
\n",
401 | " \n",
402 | " | 1 | \n",
403 | " 1 | \n",
404 | " 0 | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 0 | \n",
408 | " 0 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 1 | \n",
420 | "
\n",
421 | " \n",
422 | " | 2 | \n",
423 | " 0 | \n",
424 | " 0 | \n",
425 | " 0 | \n",
426 | " 1 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 1 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 1 | \n",
434 | " 0 | \n",
435 | " 1 | \n",
436 | " 0 | \n",
437 | " 1 | \n",
438 | " 0 | \n",
439 | " 1 | \n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n",
447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n",
448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n",
449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n",
450 | "\n",
451 | " presta que te tudo você \n",
452 | "0 0 1 1 0 1 \n",
453 | "1 0 0 0 1 1 \n",
454 | "2 1 0 1 0 1 "
455 | ]
456 | },
457 | "execution_count": 56,
458 | "metadata": {},
459 | "output_type": "execute_result"
460 | }
461 | ],
462 | "source": [
463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n",
464 | "df"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 57,
470 | "metadata": {
471 | "collapsed": false,
472 | "nbpresent": {
473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29"
474 | }
475 | },
476 | "outputs": [
477 | {
478 | "data": {
479 | "text/plain": [
480 | "scipy.sparse.csr.csr_matrix"
481 | ]
482 | },
483 | "execution_count": 57,
484 | "metadata": {},
485 | "output_type": "execute_result"
486 | }
487 | ],
488 | "source": [
489 | "type(simple_train_dtm)"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 60,
495 | "metadata": {
496 | "collapsed": false,
497 | "nbpresent": {
498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
499 | }
500 | },
501 | "outputs": [
502 | {
503 | "name": "stdout",
504 | "output_type": "stream",
505 | "text": [
506 | " (0, 1)\t1\n",
507 | " (0, 3)\t1\n",
508 | " (0, 4)\t1\n",
509 | " (0, 5)\t1\n",
510 | " (0, 8)\t1\n",
511 | " (0, 9)\t1\n",
512 | " (0, 13)\t1\n",
513 | " (0, 14)\t1\n",
514 | " (0, 16)\t1\n",
515 | " (1, 0)\t1\n",
516 | " (1, 2)\t1\n",
517 | " (1, 6)\t1\n",
518 | " (1, 11)\t1\n",
519 | " (1, 15)\t1\n",
520 | " (1, 16)\t1\n",
521 | " (2, 3)\t1\n",
522 | " (2, 7)\t1\n",
523 | " (2, 9)\t1\n",
524 | " (2, 10)\t1\n",
525 | " (2, 12)\t1\n",
526 | " (2, 14)\t1\n",
527 | " (2, 16)\t1\n"
528 | ]
529 | }
530 | ],
531 | "source": [
532 | "print(simple_train_dtm)"
533 | ]
534 | },
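{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Sketch: building a model on the document-term matrix\n",
"The original checkpoint stops here, so agenda items 4 and 5 are not covered. As a minimal sketch of item 4, the cell below fits a Naive Bayes classifier on the matrix above together with the **_felling_** labels and then predicts the sentiment of a new sentence. *MultinomialNB* is an assumption on our part (a common choice for count features), not necessarily the model used in the rest of the course material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch (assumption: MultinomialNB; it accepts the sparse count matrix directly).\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"nb = MultinomialNB()\n",
"nb.fit(simple_train_dtm, felling)\n",
"\n",
"# Vectorize a new, hypothetical sentence with the same vocabulary and predict: 1 = GOOD, 0 = BAD.\n",
"novo = ['Eu amo muito você']\n",
"nb.predict(vect.transform(novo))"
]
},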
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "collapsed": true,
540 | "nbpresent": {
541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e"
542 | }
543 | },
544 | "outputs": [],
545 | "source": []
546 | }
547 | ],
548 | "metadata": {
549 | "kernelspec": {
550 | "display_name": "Python [conda root]",
551 | "language": "python",
552 | "name": "conda-root-py"
553 | },
554 | "language_info": {
555 | "codemirror_mode": {
556 | "name": "ipython",
557 | "version": 3
558 | },
559 | "file_extension": ".py",
560 | "mimetype": "text/x-python",
561 | "name": "python",
562 | "nbconvert_exporter": "python",
563 | "pygments_lexer": "ipython3",
564 | "version": "3.5.2"
565 | }
566 | },
567 | "nbformat": 4,
568 | "nbformat_minor": 1
569 | }
570 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/TextVector-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 73,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np\n",
64 | "\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {
70 | "nbpresent": {
71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
72 | }
73 | },
74 | "source": [
75 | "## 1. Definindo um vetor de textos \n",
76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
77 | "pdf's, doc's, twitter's... etc.\n",
78 | "\n",
79 | "Esses textos serão a base de treinamento\n",
80 | "para a classificação do sentimento de um novo texto."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 88,
86 | "metadata": {
87 | "collapsed": false,
88 | "nbpresent": {
89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
90 | }
91 | },
92 | "outputs": [],
93 | "source": [
94 | "train = [\n",
95 | " 'Eu te amo e não existe nada melhor que você',\n",
96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
97 | " 'Eu te odeio muito, você não presta!',\n",
98 | " 'Não gosto de você'\n",
99 | " \n",
100 | " ]\n",
101 | "\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "nbpresent": {
108 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
109 | }
110 | },
111 | "source": [
112 | "## 2. Definindo um vetor de sentimentos\n",
113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
114 | "\n",
115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
116 | "\n",
117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
118 | "\n",
119 | "> 'Eu te amo e não existe nada melhor que você'\n",
120 | "\n",
121 | "Foi classificada como sendo um texto **BOM**:\n",
122 | "\n",
123 | "> 1"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 89,
129 | "metadata": {
130 | "collapsed": true,
131 | "nbpresent": {
132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
133 | }
134 | },
135 | "outputs": [],
136 | "source": [
137 | "felling = [1,1,0,0]"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {
143 | "nbpresent": {
144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
145 | }
146 | },
147 | "source": [
148 | "## 3. Análise de texto com _scikit-learn_.\n",
149 | "\n",
150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
151 | "\n",
152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
153 | "\n",
154 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
155 | "\n",
156 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
157 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
158 | "\n",
159 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 90,
165 | "metadata": {
166 | "collapsed": false,
167 | "nbpresent": {
168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
169 | }
170 | },
171 | "outputs": [],
172 | "source": [
173 | "from sklearn.feature_extraction.text import CountVectorizer\n",
174 | "vect = CountVectorizer()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {
180 | "nbpresent": {
181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
182 | }
183 | },
184 | "source": [
185 | "## 4. Treinamento criando o dicionário.\n",
186 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 96,
192 | "metadata": {
193 | "collapsed": false,
194 | "nbpresent": {
195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
196 | }
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
203 | " dtype=, encoding='utf-8', input='content',\n",
204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
207 | " tokenizer=None, vocabulary=None)"
208 | ]
209 | },
210 | "execution_count": 96,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "vect.fit(train)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
224 | ]
225 | },
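{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens such as 'e'.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te amo e não existe nada melhor que você')"
]
},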
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "nbpresent": {
230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
231 | }
232 | },
233 | "source": [
234 | "## 5. Nosso dicionário\n",
235 | "Aqui vamos listar de forma única\n",
236 | "quais palavras forma utilizadas no texto, formando assim um dicionário de palavras."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 95,
242 | "metadata": {
243 | "collapsed": false,
244 | "nbpresent": {
245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "['algo',\n",
253 | " 'amo',\n",
254 | " 'amor',\n",
255 | " 'ao',\n",
256 | " 'assim',\n",
257 | " 'de',\n",
258 | " 'eu',\n",
259 | " 'existe',\n",
260 | " 'gosto',\n",
261 | " 'melhor',\n",
262 | " 'meu',\n",
263 | " 'mim',\n",
264 | " 'muito',\n",
265 | " 'nada',\n",
266 | " 'não',\n",
267 | " 'odeio',\n",
268 | " 'pra',\n",
269 | " 'presta',\n",
270 | " 'que',\n",
271 | " 'te',\n",
272 | " 'tudo',\n",
273 | " 'você']"
274 | ]
275 | },
276 | "execution_count": 95,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "## examinando o dicionário criado em ordem alfabética.\n",
283 | "vect.get_feature_names()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "## 6. Transformação em matriz esparsa em relação as frases\n",
291 | "Essa transformação é importante porque cria uma matriz onde:\n",
292 | "\n",
293 | "1. Cada linha representa um texto do vetor **_train_** \n",
294 | "2. Cada coluna uma palavra do dicionário aprendido.\n",
295 | "3. Se a palavra ocorrer no texto o valor será 1 caso contrário 0.\n",
296 | "\n",
297 | "\n"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 93,
303 | "metadata": {
304 | "collapsed": false,
305 | "nbpresent": {
306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
307 | }
308 | },
309 | "outputs": [],
310 | "source": [
311 | "simple_train_dtm = vect.transform(text)\n",
312 | "ocorrencias = simple_train_dtm.toarray()"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 94,
318 | "metadata": {
319 | "collapsed": false,
320 | "nbpresent": {
321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d"
322 | }
323 | },
324 | "outputs": [
325 | {
326 | "data": {
327 | "text/plain": [
328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n",
329 | " [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n",
330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n",
331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])"
332 | ]
333 | },
334 | "execution_count": 94,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "ocorrencias"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 56,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " assim | \n",
364 | " eu | \n",
365 | " existe | \n",
366 | " melhor | \n",
367 | " mim | \n",
368 | " muito | \n",
369 | " nada | \n",
370 | " não | \n",
371 | " odeio | \n",
372 | " pra | \n",
373 | " presta | \n",
374 | " que | \n",
375 | " te | \n",
376 | " tudo | \n",
377 | " você | \n",
378 | "
\n",
379 | " \n",
380 | " \n",
381 | " \n",
382 | " | 0 | \n",
383 | " 0 | \n",
384 | " 1 | \n",
385 | " 0 | \n",
386 | " 1 | \n",
387 | " 1 | \n",
388 | " 1 | \n",
389 | " 0 | \n",
390 | " 0 | \n",
391 | " 1 | \n",
392 | " 1 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 1 | \n",
397 | " 1 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | "
\n",
401 | " \n",
402 | " | 1 | \n",
403 | " 1 | \n",
404 | " 0 | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 0 | \n",
408 | " 0 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 1 | \n",
420 | "
\n",
421 | " \n",
422 | " | 2 | \n",
423 | " 0 | \n",
424 | " 0 | \n",
425 | " 0 | \n",
426 | " 1 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 1 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 1 | \n",
434 | " 0 | \n",
435 | " 1 | \n",
436 | " 0 | \n",
437 | " 1 | \n",
438 | " 0 | \n",
439 | " 1 | \n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n",
447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n",
448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n",
449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n",
450 | "\n",
451 | " presta que te tudo você \n",
452 | "0 0 1 1 0 1 \n",
453 | "1 0 0 0 1 1 \n",
454 | "2 1 0 1 0 1 "
455 | ]
456 | },
457 | "execution_count": 56,
458 | "metadata": {},
459 | "output_type": "execute_result"
460 | }
461 | ],
462 | "source": [
463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n",
464 | "df"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 57,
470 | "metadata": {
471 | "collapsed": false,
472 | "nbpresent": {
473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29"
474 | }
475 | },
476 | "outputs": [
477 | {
478 | "data": {
479 | "text/plain": [
480 | "scipy.sparse.csr.csr_matrix"
481 | ]
482 | },
483 | "execution_count": 57,
484 | "metadata": {},
485 | "output_type": "execute_result"
486 | }
487 | ],
488 | "source": [
489 | "type(simple_train_dtm)"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 60,
495 | "metadata": {
496 | "collapsed": false,
497 | "nbpresent": {
498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
499 | }
500 | },
501 | "outputs": [
502 | {
503 | "name": "stdout",
504 | "output_type": "stream",
505 | "text": [
506 | " (0, 1)\t1\n",
507 | " (0, 3)\t1\n",
508 | " (0, 4)\t1\n",
509 | " (0, 5)\t1\n",
510 | " (0, 8)\t1\n",
511 | " (0, 9)\t1\n",
512 | " (0, 13)\t1\n",
513 | " (0, 14)\t1\n",
514 | " (0, 16)\t1\n",
515 | " (1, 0)\t1\n",
516 | " (1, 2)\t1\n",
517 | " (1, 6)\t1\n",
518 | " (1, 11)\t1\n",
519 | " (1, 15)\t1\n",
520 | " (1, 16)\t1\n",
521 | " (2, 3)\t1\n",
522 | " (2, 7)\t1\n",
523 | " (2, 9)\t1\n",
524 | " (2, 10)\t1\n",
525 | " (2, 12)\t1\n",
526 | " (2, 14)\t1\n",
527 | " (2, 16)\t1\n"
528 | ]
529 | }
530 | ],
531 | "source": [
532 | "print(simple_train_dtm)"
533 | ]
534 | },
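{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Sketch: building a model on the document-term matrix\n",
"The original checkpoint stops here, so agenda items 4 and 5 are not covered. As a minimal sketch of item 4, the cell below fits a Naive Bayes classifier on the matrix above together with the **_felling_** labels and then predicts the sentiment of a new sentence. *MultinomialNB* is an assumption on our part (a common choice for count features), not necessarily the model used in the rest of the course material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch (assumption: MultinomialNB; it accepts the sparse count matrix directly).\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"nb = MultinomialNB()\n",
"nb.fit(simple_train_dtm, felling)\n",
"\n",
"# Vectorize a new, hypothetical sentence with the same vocabulary and predict: 1 = GOOD, 0 = BAD.\n",
"novo = ['Eu amo muito você']\n",
"nb.predict(vect.transform(novo))"
]
},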
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "collapsed": true,
540 | "nbpresent": {
541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e"
542 | }
543 | },
544 | "outputs": [],
545 | "source": []
546 | }
547 | ],
548 | "metadata": {
549 | "kernelspec": {
550 | "display_name": "Python [conda root]",
551 | "language": "python",
552 | "name": "conda-root-py"
553 | },
554 | "language_info": {
555 | "codemirror_mode": {
556 | "name": "ipython",
557 | "version": 3
558 | },
559 | "file_extension": ".py",
560 | "mimetype": "text/x-python",
561 | "name": "python",
562 | "nbconvert_exporter": "python",
563 | "pygments_lexer": "ipython3",
564 | "version": "3.5.2"
565 | }
566 | },
567 | "nbformat": 4,
568 | "nbformat_minor": 1
569 | }
570 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/TextAnalise-Plin-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 19,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import CountVectorizer\n",
12 | "from collections import Counter\n",
13 | "\n",
14 | "vect = CountVectorizer()"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 20,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "text = [\n",
26 | " '#MachineLearning with Text in scikit learn http://buff.ly/2dJINuD #DataScience #IoT #BigData #AI',\n",
27 | " 'How The Internet Of Things Will Impact Your Everyday Life http://buff.ly/2dIUyMO #IoT #DataScience #BigData #MachineLearning',\n",
28 | " 'The best Brazilian Captain passed away this day. Captain of Brazil Team in 1970. #RIPCapita Carlos Alberto Torres.',\n",
29 | " '10 Videos Featuring Data Science Topics. By Vincent http://buff.ly/2eCWIkA #DataScience #BigData #IoT #MachineLearning',\n",
30 | " 'Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders http://buff.ly/2dDSJ3E #DataScience #BigData #IoT #MachineLearning',\n",
31 | " 'Deep Learning with Neural Networks and TensorFlow Introduction - Youtube http://buff.ly/2efTvdQ #DataScience #MachineLearning #IoT #BigData',\n",
32 | " 'Matplotlib Tutorial - a youtube course http://buff.ly/2eBK4AQ #DadaScience #MachineLearning #IoT #BigData',\n",
33 | " 'Kaggle Releases Data Sets About Global Warming: Make your own Predictions – Data Science Central http://buff.ly/2dUFLQf #DataScience #IoT',\n",
34 | " '#MachineLearning as a Service http://buff.ly/2ep1Jjk #BigData #IoT #DataScience',\n",
35 | " '50 Predictions for the Internet of Things in 2016 https://goo.gl/5Zv28z #IoT #BigData #DataScience #MachineLearning',\n",
36 | " 'IoT Programming Languages http://flip.it/wtVufo #IoT #BigData #DataScience',\n",
37 | " 'An Introduction to Variable and Feature Selection #dataScience #IoT #BigData http://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection …',\n",
38 | " 'Use the simulated device to experience the IBM Watson IoT Platform http://buff.ly/2ekeKGi #IoT #BigData #DataScience #DataViz',\n",
39 | " 'Top 10 Data Science and Machine Learning Podcasts http://buff.ly/2erx7cI #MachineLearning',\n",
40 | " 'Adorei esse copão de café. SVM é fantástico algoritmo de #MachineLearning',\n",
41 | " 'IBM Watsons latest gig: Improving cancer treatment with genomic sequencing http://buff.ly/2dZ5lVP #DataScience #MachineLearning #BigData',\n",
42 | " 'An Introduction to Implementing Neural Networks using TensorFlow http://buff.ly/2ervn3s #DataScience #MachineLearning #IoT #BigData',\n",
43 | " 'Oi testa serviço de monitoramento baseado na Internet das Coisas http://buff.ly/2e3gg21 #DataScience #MachineLearning #IoT #BigData',\n",
44 | " 'Moving from R to Python: The Libraries You Need to Know http://buff.ly/2eeUHuE #DataScience #MachineLearning #IoT #BigData',\n",
45 | " 'Internet of Things Articles : IoT startup and smart cam-maker Smartfrog raises further $20M http://buff.ly/2ei1Kky #MachineLearning #IoT',\n",
46 | " 'An overview of gradient descent optimization algorithms http://buff.ly/2dldKVO #DataScience #MachineLearning #IoT #BigData',\n",
47 | " 'Datafloq - 8 Easy Steps to Become a Data Scientist http://buff.ly/2en6TbA #DataScience #IoT #BigData #MachineLearning',\n",
48 | " 'Time to educate teachers about #datascience'\n",
49 | "]"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 21,
55 | "metadata": {
56 | "collapsed": false
57 | },
58 | "outputs": [
59 | {
60 | "data": {
61 | "text/plain": [
62 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
63 | " dtype=, encoding='utf-8', input='content',\n",
64 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
65 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
66 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
67 | " tokenizer=None, vocabulary=None)"
68 | ]
69 | },
70 | "execution_count": 21,
71 | "metadata": {},
72 | "output_type": "execute_result"
73 | }
74 | ],
75 | "source": [
76 | "vect.fit(text)"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 22,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "['10',\n",
90 | " '1970',\n",
91 | " '2016',\n",
92 | " '20m',\n",
93 | " '2ddsj3e',\n",
94 | " '2diuymo',\n",
95 | " '2djinud',\n",
96 | " '2dldkvo',\n",
97 | " '2duflqf',\n",
98 | " '2dz5lvp',\n",
99 | " '2e3gg21',\n",
100 | " '2ebk4aq',\n",
101 | " '2ecwika',\n",
102 | " '2eeuhue',\n",
103 | " '2eftvdq',\n",
104 | " '2ei1kky',\n",
105 | " '2ekekgi',\n",
106 | " '2en6tba',\n",
107 | " '2ep1jjk',\n",
108 | " '2ervn3s',\n",
109 | " '2erx7ci',\n",
110 | " '50',\n",
111 | " '5zv28z',\n",
112 | " 'about',\n",
113 | " 'adorei',\n",
114 | " 'ai',\n",
115 | " 'alberto',\n",
116 | " 'algorithms',\n",
117 | " 'algoritmo',\n",
118 | " 'an',\n",
119 | " 'and',\n",
120 | " 'articles',\n",
121 | " 'as',\n",
122 | " 'away',\n",
123 | " 'baseado',\n",
124 | " 'become',\n",
125 | " 'best',\n",
126 | " 'bigdata',\n",
127 | " 'blogs',\n",
128 | " 'brazil',\n",
129 | " 'brazilian',\n",
130 | " 'buff',\n",
131 | " 'by',\n",
132 | " 'café',\n",
133 | " 'cam',\n",
134 | " 'cancer',\n",
135 | " 'captain',\n",
136 | " 'carlos',\n",
137 | " 'central',\n",
138 | " 'coisas',\n",
139 | " 'com',\n",
140 | " 'copão',\n",
141 | " 'course',\n",
142 | " 'dadascience',\n",
143 | " 'das',\n",
144 | " 'data',\n",
145 | " 'datafloq',\n",
146 | " 'datascience',\n",
147 | " 'datasciencecentral',\n",
148 | " 'dataviz',\n",
149 | " 'day',\n",
150 | " 'de',\n",
151 | " 'deep',\n",
152 | " 'descent',\n",
153 | " 'device',\n",
154 | " 'easy',\n",
155 | " 'educate',\n",
156 | " 'esse',\n",
157 | " 'everyday',\n",
158 | " 'experience',\n",
159 | " 'fantástico',\n",
160 | " 'feature',\n",
161 | " 'featuring',\n",
162 | " 'flip',\n",
163 | " 'for',\n",
164 | " 'from',\n",
165 | " 'further',\n",
166 | " 'genomic',\n",
167 | " 'gig',\n",
168 | " 'gl',\n",
169 | " 'global',\n",
170 | " 'goo',\n",
171 | " 'gradient',\n",
172 | " 'how',\n",
173 | " 'http',\n",
174 | " 'https',\n",
175 | " 'ibm',\n",
176 | " 'impact',\n",
177 | " 'implementing',\n",
178 | " 'improving',\n",
179 | " 'in',\n",
180 | " 'insiders',\n",
181 | " 'internet',\n",
182 | " 'interview',\n",
183 | " 'introduction',\n",
184 | " 'iot',\n",
185 | " 'it',\n",
186 | " 'kaggle',\n",
187 | " 'know',\n",
188 | " 'languages',\n",
189 | " 'latest',\n",
190 | " 'learn',\n",
191 | " 'learning',\n",
192 | " 'libraries',\n",
193 | " 'life',\n",
194 | " 'ly',\n",
195 | " 'machine',\n",
196 | " 'machinelearning',\n",
197 | " 'make',\n",
198 | " 'maker',\n",
199 | " 'matplotlib',\n",
200 | " 'monitoramento',\n",
201 | " 'moving',\n",
202 | " 'na',\n",
203 | " 'need',\n",
204 | " 'networks',\n",
205 | " 'neural',\n",
206 | " 'of',\n",
207 | " 'oi',\n",
208 | " 'optimization',\n",
209 | " 'overview',\n",
210 | " 'own',\n",
211 | " 'passed',\n",
212 | " 'platform',\n",
213 | " 'podcasts',\n",
214 | " 'predictions',\n",
215 | " 'preparation',\n",
216 | " 'profiles',\n",
217 | " 'programming',\n",
218 | " 'python',\n",
219 | " 'raises',\n",
220 | " 'releases',\n",
221 | " 'ripcapita',\n",
222 | " 'science',\n",
223 | " 'scientist',\n",
224 | " 'scikit',\n",
225 | " 'selection',\n",
226 | " 'sequencing',\n",
227 | " 'service',\n",
228 | " 'serviço',\n",
229 | " 'sets',\n",
230 | " 'simulated',\n",
231 | " 'smart',\n",
232 | " 'smartfrog',\n",
233 | " 'startup',\n",
234 | " 'steps',\n",
235 | " 'svm',\n",
236 | " 'teachers',\n",
237 | " 'team',\n",
238 | " 'tensorflow',\n",
239 | " 'testa',\n",
240 | " 'text',\n",
241 | " 'the',\n",
242 | " 'things',\n",
243 | " 'this',\n",
244 | " 'time',\n",
245 | " 'tips',\n",
246 | " 'to',\n",
247 | " 'tools',\n",
248 | " 'top',\n",
249 | " 'topics',\n",
250 | " 'torres',\n",
251 | " 'treatment',\n",
252 | " 'tricks',\n",
253 | " 'tutorial',\n",
254 | " 'use',\n",
255 | " 'using',\n",
256 | " 'variable',\n",
257 | " 'videos',\n",
258 | " 'vincent',\n",
259 | " 'warming',\n",
260 | " 'watson',\n",
261 | " 'watsons',\n",
262 | " 'will',\n",
263 | " 'with',\n",
264 | " 'wtvufo',\n",
265 | " 'www',\n",
266 | " 'you',\n",
267 | " 'your',\n",
268 | " 'youtube']"
269 | ]
270 | },
271 | "execution_count": 22,
272 | "metadata": {},
273 | "output_type": "execute_result"
274 | }
275 | ],
276 | "source": [
277 | "vect.get_feature_names()"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 23,
283 | "metadata": {
284 | "collapsed": false
285 | },
286 | "outputs": [
287 | {
288 | "data": {
289 | "text/plain": [
290 | "<23x180 sparse matrix of type ''\n",
291 | "\twith 346 stored elements in Compressed Sparse Row format>"
292 | ]
293 | },
294 | "execution_count": 23,
295 | "metadata": {},
296 | "output_type": "execute_result"
297 | }
298 | ],
299 | "source": [
300 | "simple_train_dtm = vect.transform(text)\n",
301 | "simple_train_dtm"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 24,
307 | "metadata": {
308 | "collapsed": false
309 | },
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "array([[0, 0, 0, ..., 0, 0, 0],\n",
315 | " [0, 0, 0, ..., 0, 1, 0],\n",
316 | " [0, 1, 0, ..., 0, 0, 0],\n",
317 | " ..., \n",
318 | " [0, 0, 0, ..., 0, 0, 0],\n",
319 | " [0, 0, 0, ..., 0, 0, 0],\n",
320 | " [0, 0, 0, ..., 0, 0, 0]])"
321 | ]
322 | },
323 | "execution_count": 24,
324 | "metadata": {},
325 | "output_type": "execute_result"
326 | }
327 | ],
328 | "source": [
329 | "simple_train_dtm.toarray()"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 25,
335 | "metadata": {
336 | "collapsed": false
337 | },
338 | "outputs": [
339 | {
340 | "data": {
341 | "text/plain": [
342 | "['10',\n",
343 | " '1970',\n",
344 | " '2016',\n",
345 | " '20m',\n",
346 | " '2ddsj3e',\n",
347 | " '2diuymo',\n",
348 | " '2djinud',\n",
349 | " '2dldkvo',\n",
350 | " '2duflqf',\n",
351 | " '2dz5lvp',\n",
352 | " '2e3gg21',\n",
353 | " '2ebk4aq',\n",
354 | " '2ecwika',\n",
355 | " '2eeuhue',\n",
356 | " '2eftvdq',\n",
357 | " '2ei1kky',\n",
358 | " '2ekekgi',\n",
359 | " '2en6tba',\n",
360 | " '2ep1jjk',\n",
361 | " '2ervn3s',\n",
362 | " '2erx7ci',\n",
363 | " '50',\n",
364 | " '5zv28z',\n",
365 | " 'about',\n",
366 | " 'adorei',\n",
367 | " 'ai',\n",
368 | " 'alberto',\n",
369 | " 'algorithms',\n",
370 | " 'algoritmo',\n",
371 | " 'an',\n",
372 | " 'and',\n",
373 | " 'articles',\n",
374 | " 'as',\n",
375 | " 'away',\n",
376 | " 'baseado',\n",
377 | " 'become',\n",
378 | " 'best',\n",
379 | " 'bigdata',\n",
380 | " 'blogs',\n",
381 | " 'brazil',\n",
382 | " 'brazilian',\n",
383 | " 'buff',\n",
384 | " 'by',\n",
385 | " 'café',\n",
386 | " 'cam',\n",
387 | " 'cancer',\n",
388 | " 'captain',\n",
389 | " 'carlos',\n",
390 | " 'central',\n",
391 | " 'coisas',\n",
392 | " 'com',\n",
393 | " 'copão',\n",
394 | " 'course',\n",
395 | " 'dadascience',\n",
396 | " 'das',\n",
397 | " 'data',\n",
398 | " 'datafloq',\n",
399 | " 'datascience',\n",
400 | " 'datasciencecentral',\n",
401 | " 'dataviz',\n",
402 | " 'day',\n",
403 | " 'de',\n",
404 | " 'deep',\n",
405 | " 'descent',\n",
406 | " 'device',\n",
407 | " 'easy',\n",
408 | " 'educate',\n",
409 | " 'esse',\n",
410 | " 'everyday',\n",
411 | " 'experience',\n",
412 | " 'fantástico',\n",
413 | " 'feature',\n",
414 | " 'featuring',\n",
415 | " 'flip',\n",
416 | " 'for',\n",
417 | " 'from',\n",
418 | " 'further',\n",
419 | " 'genomic',\n",
420 | " 'gig',\n",
421 | " 'gl',\n",
422 | " 'global',\n",
423 | " 'goo',\n",
424 | " 'gradient',\n",
425 | " 'how',\n",
426 | " 'http',\n",
427 | " 'https',\n",
428 | " 'ibm',\n",
429 | " 'impact',\n",
430 | " 'implementing',\n",
431 | " 'improving',\n",
432 | " 'in',\n",
433 | " 'insiders',\n",
434 | " 'internet',\n",
435 | " 'interview',\n",
436 | " 'introduction',\n",
437 | " 'iot',\n",
438 | " 'it',\n",
439 | " 'kaggle',\n",
440 | " 'know',\n",
441 | " 'languages',\n",
442 | " 'latest',\n",
443 | " 'learn',\n",
444 | " 'learning',\n",
445 | " 'libraries',\n",
446 | " 'life',\n",
447 | " 'ly',\n",
448 | " 'machine',\n",
449 | " 'machinelearning',\n",
450 | " 'make',\n",
451 | " 'maker',\n",
452 | " 'matplotlib',\n",
453 | " 'monitoramento',\n",
454 | " 'moving',\n",
455 | " 'na',\n",
456 | " 'need',\n",
457 | " 'networks',\n",
458 | " 'neural',\n",
459 | " 'of',\n",
460 | " 'oi',\n",
461 | " 'optimization',\n",
462 | " 'overview',\n",
463 | " 'own',\n",
464 | " 'passed',\n",
465 | " 'platform',\n",
466 | " 'podcasts',\n",
467 | " 'predictions',\n",
468 | " 'preparation',\n",
469 | " 'profiles',\n",
470 | " 'programming',\n",
471 | " 'python',\n",
472 | " 'raises',\n",
473 | " 'releases',\n",
474 | " 'ripcapita',\n",
475 | " 'science',\n",
476 | " 'scientist',\n",
477 | " 'scikit',\n",
478 | " 'selection',\n",
479 | " 'sequencing',\n",
480 | " 'service',\n",
481 | " 'serviço',\n",
482 | " 'sets',\n",
483 | " 'simulated',\n",
484 | " 'smart',\n",
485 | " 'smartfrog',\n",
486 | " 'startup',\n",
487 | " 'steps',\n",
488 | " 'svm',\n",
489 | " 'teachers',\n",
490 | " 'team',\n",
491 | " 'tensorflow',\n",
492 | " 'testa',\n",
493 | " 'text',\n",
494 | " 'the',\n",
495 | " 'things',\n",
496 | " 'this',\n",
497 | " 'time',\n",
498 | " 'tips',\n",
499 | " 'to',\n",
500 | " 'tools',\n",
501 | " 'top',\n",
502 | " 'topics',\n",
503 | " 'torres',\n",
504 | " 'treatment',\n",
505 | " 'tricks',\n",
506 | " 'tutorial',\n",
507 | " 'use',\n",
508 | " 'using',\n",
509 | " 'variable',\n",
510 | " 'videos',\n",
511 | " 'vincent',\n",
512 | " 'warming',\n",
513 | " 'watson',\n",
514 | " 'watsons',\n",
515 | " 'will',\n",
516 | " 'with',\n",
517 | " 'wtvufo',\n",
518 | " 'www',\n",
519 | " 'you',\n",
520 | " 'your',\n",
521 | " 'youtube']"
522 | ]
523 | },
524 | "execution_count": 25,
525 | "metadata": {},
526 | "output_type": "execute_result"
527 | }
528 | ],
529 | "source": [
530 | "\n",
531 | "vocab = list(vect.get_feature_names())\n",
532 | "vocab"
533 | ]
534 | },
535 | {
536 | "cell_type": "code",
537 | "execution_count": 36,
538 | "metadata": {
539 | "collapsed": false
540 | },
541 | "outputs": [
542 | {
543 | "data": {
544 | "text/plain": [
545 | "[('iot', 21),\n",
546 | " ('http', 19),\n",
547 | " ('datascience', 18),\n",
548 | " ('machinelearning', 17),\n",
549 | " ('bigdata', 17),\n",
550 | " ('buff', 17),\n",
551 | " ('ly', 17),\n",
552 | " ('to', 8),\n",
553 | " ('the', 7),\n",
554 | " ('and', 6),\n",
555 | " ('data', 6),\n",
556 | " ('an', 5),\n",
557 | " ('of', 5),\n",
558 | " ('internet', 4),\n",
559 | " ('introduction', 4),\n",
560 | " ('with', 4),\n",
561 | " ('de', 3),\n",
562 | " ('things', 3),\n",
563 | " ('in', 3),\n",
564 | " ('science', 3),\n",
565 | " ('neural', 2),\n",
566 | " ('about', 2),\n",
567 | " ('networks', 2),\n",
568 | " ('feature', 2),\n",
569 | " ('tensorflow', 2),\n",
570 | " ('ibm', 2),\n",
571 | " ('variable', 2),\n",
572 | " ('learning', 2),\n",
573 | " ('selection', 2),\n",
574 | " ('youtube', 2),\n",
575 | " ('your', 2),\n",
576 | " ('predictions', 2),\n",
577 | " ('10', 2),\n",
578 | " ('captain', 2),\n",
579 | " ('by', 1),\n",
580 | " ('podcasts', 1),\n",
581 | " ('tools', 1),\n",
582 | " ('team', 1),\n",
583 | " ('text', 1),\n",
584 | " ('genomic', 1),\n",
585 | " ('languages', 1),\n",
586 | " ('esse', 1),\n",
587 | " ('2diuymo', 1),\n",
588 | " ('maker', 1),\n",
589 | " ('libraries', 1),\n",
590 | " ('learn', 1),\n",
591 | " ('interview', 1),\n",
592 | " ('gl', 1),\n",
593 | " ('scientist', 1),\n",
594 | " ('café', 1),\n",
595 | " ('everyday', 1),\n",
596 | " ('2duflqf', 1),\n",
597 | " ('cam', 1),\n",
598 | " ('baseado', 1),\n",
599 | " ('away', 1),\n",
600 | " ('device', 1),\n",
601 | " ('watsons', 1),\n",
602 | " ('improving', 1),\n",
603 | " ('programming', 1),\n",
604 | " ('overview', 1),\n",
605 | " ('warming', 1),\n",
606 | " ('2ecwika', 1),\n",
607 | " ('how', 1),\n",
608 | " ('own', 1),\n",
609 | " ('make', 1),\n",
610 | " ('machine', 1),\n",
611 | " ('steps', 1),\n",
612 | " ('kaggle', 1),\n",
613 | " ('raises', 1),\n",
614 | " ('svm', 1),\n",
615 | " ('vincent', 1),\n",
616 | " ('time', 1),\n",
617 | " ('python', 1),\n",
618 | " ('datasciencecentral', 1),\n",
619 | " ('copão', 1),\n",
620 | " ('best', 1),\n",
621 | " ('need', 1),\n",
622 | " ('datafloq', 1),\n",
623 | " ('das', 1),\n",
624 | " ('2erx7ci', 1),\n",
625 | " ('testa', 1),\n",
626 | " ('flip', 1),\n",
627 | " ('become', 1),\n",
628 | " ('2ekekgi', 1),\n",
629 | " ('fantástico', 1),\n",
630 | " ('platform', 1),\n",
631 | " ('serviço', 1),\n",
632 | " ('smart', 1),\n",
633 | " ('scikit', 1),\n",
634 | " ('tutorial', 1),\n",
635 | " ('cancer', 1),\n",
636 | " ('ai', 1),\n",
637 | " ('top', 1),\n",
638 | " ('2ei1kky', 1),\n",
639 | " ('it', 1),\n",
640 | " ('startup', 1),\n",
641 | " ('sets', 1),\n",
642 | " ('2ep1jjk', 1),\n",
643 | " ('from', 1),\n",
644 | " ('algoritmo', 1),\n",
645 | " ('2eftvdq', 1),\n",
646 | " ('2dz5lvp', 1),\n",
647 | " ('blogs', 1),\n",
648 | " ('50', 1),\n",
649 | " ('easy', 1),\n",
650 | " ('dataviz', 1),\n",
651 | " ('further', 1),\n",
652 | " ('5zv28z', 1),\n",
653 | " ('central', 1),\n",
654 | " ('goo', 1),\n",
655 | " ('topics', 1),\n",
656 | " ('2e3gg21', 1),\n",
657 | " ('preparation', 1),\n",
658 | " ('implementing', 1),\n",
659 | " ('2eeuhue', 1),\n",
660 | " ('descent', 1),\n",
661 | " ('as', 1),\n",
662 | " ('20m', 1),\n",
663 | " ('using', 1),\n",
664 | " ('treatment', 1),\n",
665 | " ('latest', 1),\n",
666 | " ('will', 1),\n",
667 | " ('releases', 1),\n",
668 | " ('monitoramento', 1),\n",
669 | " ('https', 1),\n",
670 | " ('alberto', 1),\n",
671 | " ('watson', 1),\n",
672 | " ('ripcapita', 1),\n",
673 | " ('torres', 1),\n",
674 | " ('course', 1),\n",
675 | " ('featuring', 1),\n",
676 | " ('brazil', 1),\n",
677 | " ('wtvufo', 1),\n",
678 | " ('coisas', 1),\n",
679 | " ('use', 1),\n",
680 | " ('passed', 1),\n",
681 | " ('oi', 1),\n",
682 | " ('optimization', 1),\n",
683 | " ('moving', 1),\n",
684 | " ('com', 1),\n",
685 | " ('know', 1),\n",
686 | " ('simulated', 1),\n",
687 | " ('2ervn3s', 1),\n",
688 | " ('you', 1),\n",
689 | " ('www', 1),\n",
690 | " ('this', 1),\n",
691 | " ('dadascience', 1),\n",
692 | " ('adorei', 1),\n",
693 | " ('educate', 1),\n",
694 | " ('for', 1),\n",
695 | " ('1970', 1),\n",
696 | " ('2en6tba', 1),\n",
697 | " ('teachers', 1),\n",
698 | " ('matplotlib', 1),\n",
699 | " ('global', 1),\n",
700 | " ('sequencing', 1),\n",
701 | " ('life', 1),\n",
702 | " ('2ebk4aq', 1),\n",
703 | " ('insiders', 1),\n",
704 | " ('gig', 1),\n",
705 | " ('carlos', 1),\n",
706 | " ('2016', 1),\n",
707 | " ('impact', 1),\n",
708 | " ('day', 1),\n",
709 | " ('2ddsj3e', 1),\n",
710 | " ('profiles', 1),\n",
711 | " ('experience', 1),\n",
712 | " ('brazilian', 1),\n",
713 | " ('smartfrog', 1),\n",
714 | " ('deep', 1),\n",
715 | " ('gradient', 1),\n",
716 | " ('na', 1),\n",
717 | " ('videos', 1),\n",
718 | " ('service', 1),\n",
719 | " ('tricks', 1),\n",
720 | " ('algorithms', 1),\n",
721 | " ('tips', 1),\n",
722 | " ('2dldkvo', 1),\n",
723 | " ('2djinud', 1),\n",
724 | " ('articles', 1)]"
725 | ]
726 | },
727 | "execution_count": 36,
728 | "metadata": {},
729 | "output_type": "execute_result"
730 | }
731 | ],
732 | "source": [
733 | "counts = simple_train_dtm.sum(axis=0).A1\n",
734 | "\n",
735 | "freq_distribution = Counter(dict(zip(vocab, counts)))\n",
736 | "##print (freq_distribution.most_common(100))\n",
737 | "list(freq_distribution.most_common())\n",
738 | "\n"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {
745 | "collapsed": true
746 | },
747 | "outputs": [],
748 | "source": []
749 | }
750 | ],
751 | "metadata": {
752 | "kernelspec": {
753 | "display_name": "Python [conda root]",
754 | "language": "python",
755 | "name": "conda-root-py"
756 | },
757 | "language_info": {
758 | "codemirror_mode": {
759 | "name": "ipython",
760 | "version": 3
761 | },
762 | "file_extension": ".py",
763 | "mimetype": "text/x-python",
764 | "name": "python",
765 | "nbconvert_exporter": "python",
766 | "pygments_lexer": "ipython3",
767 | "version": "3.5.2"
768 | }
769 | },
770 | "nbformat": 4,
771 | "nbformat_minor": 1
772 | }
773 |
--------------------------------------------------------------------------------
/AnaliseTexto/AnaliseDeSentimento.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 353,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {
69 | "nbpresent": {
70 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
71 | }
72 | },
73 | "source": [
74 | "## 1. Definindo um vetor de textos \n",
75 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
76 | "pdf's, doc's, twitter's... etc.\n",
77 | "\n",
78 | "Esses textos serão a base de treinamento\n",
79 | "para a classificação do sentimento de um novo texto."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 354,
85 | "metadata": {
86 | "collapsed": false,
87 | "nbpresent": {
88 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
89 | }
90 | },
91 | "outputs": [],
92 | "source": [
93 | "train = [\n",
94 | " 'Eu te amo',\n",
95 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
96 | " 'Eu te odeio muito, você não presta!',\n",
97 | " 'Não gosto de você'\n",
98 | " ]"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {
104 | "nbpresent": {
105 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
106 | }
107 | },
108 | "source": [
109 | "## 2. Definindo um vetor de sentimentos\n",
110 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
111 | "\n",
112 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
113 | "\n",
114 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
115 | "\n",
116 | "> 'Eu te amo'\n",
117 | "\n",
118 | "Foi classificada como sendo um texto **BOM**:\n",
119 | "\n",
120 | "> 1"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 355,
126 | "metadata": {
127 | "collapsed": true,
128 | "nbpresent": {
129 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
130 | }
131 | },
132 | "outputs": [],
133 | "source": [
134 | "felling = [1,1,0,0]"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {
140 | "nbpresent": {
141 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
142 | }
143 | },
144 | "source": [
145 | "## 3. Análise de texto com _scikit-learn_.\n",
146 | "\n",
147 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
148 | "\n",
149 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
150 | "\n",
151 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
152 | "\n",
153 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
154 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
155 | "\n",
156 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 356,
162 | "metadata": {
163 | "collapsed": false,
164 | "nbpresent": {
165 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
166 | }
167 | },
168 | "outputs": [],
169 | "source": [
170 | "from sklearn.feature_extraction.text import CountVectorizer\n",
171 | "vect = CountVectorizer()"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {
177 | "nbpresent": {
178 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
179 | }
180 | },
181 | "source": [
182 | "## 4. Treinamento criando o dicionário.\n",
183 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 357,
189 | "metadata": {
190 | "collapsed": false,
191 | "nbpresent": {
192 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
193 | }
194 | },
195 | "outputs": [
196 | {
197 | "data": {
198 | "text/plain": [
199 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
200 | " dtype=, encoding='utf-8', input='content',\n",
201 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
202 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
203 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
204 | " tokenizer=None, vocabulary=None)"
205 | ]
206 | },
207 | "execution_count": 357,
208 | "metadata": {},
209 | "output_type": "execute_result"
210 | }
211 | ],
212 | "source": [
213 | "vect.fit(train)"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
221 | ]
222 | },
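{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te odeio muito, você não presta!')"
]
},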
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {
226 | "nbpresent": {
227 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
228 | }
229 | },
230 | "source": [
231 | "## 5. Nosso dicionário de palavras\n",
232 | "Aqui vamos listar quais palavras forma utilizadas nos textos de **_train_**, formando nosso dicionário de palavras. Nessa listagem as palavras não se repetem."
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 358,
238 | "metadata": {
239 | "collapsed": false,
240 | "nbpresent": {
241 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
242 | }
243 | },
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "['algo',\n",
249 | " 'amo',\n",
250 | " 'amor',\n",
251 | " 'ao',\n",
252 | " 'assim',\n",
253 | " 'de',\n",
254 | " 'eu',\n",
255 | " 'gosto',\n",
256 | " 'meu',\n",
257 | " 'mim',\n",
258 | " 'muito',\n",
259 | " 'não',\n",
260 | " 'odeio',\n",
261 | " 'pra',\n",
262 | " 'presta',\n",
263 | " 'te',\n",
264 | " 'tudo',\n",
265 | " 'você']"
266 | ]
267 | },
268 | "execution_count": 358,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "## examinando o dicionário criado em ordem alfabética.\n",
275 | "vect.get_feature_names()"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## 6. Criação de uma matriz de ocorrência\n",
283 | "\n",
284 | "\n",
285 | "\n",
286 | "A matriz de ocorrência mostra quantas vezes cada palavra aparece em cada texto passado ao algoritmo que criou o dicionário.\n",
287 | "Essa transformação cria uma matriz onde:\n",
288 | "\n",
289 | "1. Cada linha representa um texto do vetor **_train_** \n",
290 | "2. Cada coluna representa uma palavra do dicionário aprendido.\n",
291 | "3. Cada célula contém a quantidade de vezes que a palavra ocorre no texto; se a palavra não ocorrer, o valor será zero (0).\n",
292 | "\n",
293 | "Por exemplo:\n",
294 | "A primeira linha da matriz é a frase\n",
295 | "\n",
296 | "> Eu te amo\n",
297 | "\n",
298 | "Essa frase tem somente três (3) palavras, **_eu_**, **_te_** e **_amo_**, que serão marcadas na matriz com a quantidade de vezes que ocorrem no texto (nesse caso, **_1_**); as demais palavras do dicionário serão marcadas com zero (0), por não estarem no texto.\n",
299 | "\n",
300 | "A segunda frase\n",
301 | "\n",
302 | "> Você é algo assim... é tudo pra mim. Ao meu amor... Amor!\n",
303 | "\n",
304 | "Nessa frase, a palavra **_amor_** ocorre duas (2) vezes; por isso a terceira coluna tem o valor 2. "
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 359,
310 | "metadata": {
311 | "collapsed": false,
312 | "nbpresent": {
313 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
314 | }
315 | },
316 | "outputs": [
317 | {
318 | "data": {
319 | "text/plain": [
320 | "array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],\n",
321 | " [1, 0, 2, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1],\n",
322 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1],\n",
323 | " [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])"
324 | ]
325 | },
326 | "execution_count": 359,
327 | "metadata": {},
328 | "output_type": "execute_result"
329 | }
330 | ],
331 | "source": [
332 | "simple_train_dtm = vect.transform(train)\n",
333 | "simple_train_dtm.toarray()"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "#### Criando um *dataframe* pandas para visualizar melhor os dados."
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 360,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " amor | \n",
364 | " ao | \n",
365 | " assim | \n",
366 | " de | \n",
367 | " eu | \n",
368 | " gosto | \n",
369 | " meu | \n",
370 | " mim | \n",
371 | " muito | \n",
372 | " não | \n",
373 | " odeio | \n",
374 | " pra | \n",
375 | " presta | \n",
376 | " te | \n",
377 | " tudo | \n",
378 | " você | \n",
379 | "
\n",
380 | " \n",
381 | " \n",
382 | " \n",
383 | " | Eu te amo | \n",
384 | " 0 | \n",
385 | " 1 | \n",
386 | " 0 | \n",
387 | " 0 | \n",
388 | " 0 | \n",
389 | " 0 | \n",
390 | " 1 | \n",
391 | " 0 | \n",
392 | " 0 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 0 | \n",
397 | " 0 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | " 0 | \n",
401 | " 0 | \n",
402 | "
\n",
403 | " \n",
404 | " | Você é algo assim... é tudo pra mim. Ao meu amor... Amor! | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 2 | \n",
408 | " 1 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 1 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 0 | \n",
420 | " 0 | \n",
421 | " 1 | \n",
422 | " 1 | \n",
423 | "
\n",
424 | " \n",
425 | " | Eu te odeio muito, você não presta! | \n",
426 | " 0 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 0 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 0 | \n",
434 | " 0 | \n",
435 | " 0 | \n",
436 | " 1 | \n",
437 | " 1 | \n",
438 | " 1 | \n",
439 | " 0 | \n",
440 | " 1 | \n",
441 | " 1 | \n",
442 | " 0 | \n",
443 | " 1 | \n",
444 | "
\n",
445 | " \n",
446 | " | Não gosto de você | \n",
447 | " 0 | \n",
448 | " 0 | \n",
449 | " 0 | \n",
450 | " 0 | \n",
451 | " 0 | \n",
452 | " 1 | \n",
453 | " 0 | \n",
454 | " 1 | \n",
455 | " 0 | \n",
456 | " 0 | \n",
457 | " 0 | \n",
458 | " 1 | \n",
459 | " 0 | \n",
460 | " 0 | \n",
461 | " 0 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 1 | \n",
465 | "
\n",
466 | " \n",
467 | "
\n",
468 | "
"
469 | ],
470 | "text/plain": [
471 | " algo amo amor ao \\\n",
472 | "Eu te amo 0 1 0 0 \n",
473 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 2 1 \n",
474 | "Eu te odeio muito, você não presta! 0 0 0 0 \n",
475 | "Não gosto de você 0 0 0 0 \n",
476 | "\n",
477 | " assim de eu gosto meu \\\n",
478 | "Eu te amo 0 0 1 0 0 \n",
479 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 1 \n",
480 | "Eu te odeio muito, você não presta! 0 0 1 0 0 \n",
481 | "Não gosto de você 0 1 0 1 0 \n",
482 | "\n",
483 | " mim muito não odeio \\\n",
484 | "Eu te amo 0 0 0 0 \n",
485 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 \n",
486 | "Eu te odeio muito, você não presta! 0 1 1 1 \n",
487 | "Não gosto de você 0 0 1 0 \n",
488 | "\n",
489 | " pra presta te tudo \\\n",
490 | "Eu te amo 0 0 1 0 \n",
491 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 1 \n",
492 | "Eu te odeio muito, você não presta! 0 1 1 0 \n",
493 | "Não gosto de você 0 0 0 0 \n",
494 | "\n",
495 | " você \n",
496 | "Eu te amo 0 \n",
497 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 \n",
498 | "Eu te odeio muito, você não presta! 1 \n",
499 | "Não gosto de você 1 "
500 | ]
501 | },
502 | "execution_count": 360,
503 | "metadata": {},
504 | "output_type": "execute_result"
505 | }
506 | ],
507 | "source": [
508 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names(), index=train)\n",
509 | "df"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "## 7. Esparsidade\n",
517 | "A matriz de ocorrência normalmente é muito esparsa, ou seja, tem muitos valores zero. Essa quantidade de zeros aumenta substancialmente o armazenamento e o processamento das informações para a classificação de um novo texto. Portanto, a matriz fica melhor representada registrando apenas as ocorrências diferentes de zero.\n",
518 | "A linha abaixo mostra que a matriz é do tipo esparsa.\n"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 361,
524 | "metadata": {
525 | "collapsed": false
526 | },
527 | "outputs": [
528 | {
529 | "data": {
530 | "text/plain": [
531 | "scipy.sparse.csr.csr_matrix"
532 | ]
533 | },
534 | "execution_count": 361,
535 | "metadata": {},
536 | "output_type": "execute_result"
537 | }
538 | ],
539 | "source": [
540 | "type(simple_train_dtm)"
541 | ]
542 | },
543 | {
544 | "cell_type": "markdown",
545 | "metadata": {},
546 | "source": [
547 | "O comando abaixo mostra os mesmos valores da matriz de ocorrências de palavras, só que omitindo as não ocorrências (os zeros).\n",
548 | "\n",
549 | "Por exemplo:\n",
550 | "As três (3) primeiras linhas da impressão do comando se referem à frase:\n",
551 | "\n",
552 | "> Eu te amo\n",
553 | "\n",
554 | "(0, 1)\t1 <br>\n",
555 | "(0, 6)\t1 <br>\n",
556 | "(0, 15)\t1 <br>\n",
557 | "\n",
558 | "Essa é a frase zero (0), ou seja, a primeira frase. Os valores 1, 6 e 15 são as posições (colunas) da matriz onde ocorrem as palavras [amo, eu, te] (em ordem alfabética), e os valores 1 são as quantidades de ocorrências de cada palavra nessa frase."
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": 362,
564 | "metadata": {
565 | "collapsed": false,
566 | "nbpresent": {
567 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
568 | }
569 | },
570 | "outputs": [
571 | {
572 | "name": "stdout",
573 | "output_type": "stream",
574 | "text": [
575 | " (0, 1)\t1\n",
576 | " (0, 6)\t1\n",
577 | " (0, 15)\t1\n",
578 | " (1, 0)\t1\n",
579 | " (1, 2)\t2\n",
580 | " (1, 3)\t1\n",
581 | " (1, 4)\t1\n",
582 | " (1, 8)\t1\n",
583 | " (1, 9)\t1\n",
584 | " (1, 13)\t1\n",
585 | " (1, 16)\t1\n",
586 | " (1, 17)\t1\n",
587 | " (2, 6)\t1\n",
588 | " (2, 10)\t1\n",
589 | " (2, 11)\t1\n",
590 | " (2, 12)\t1\n",
591 | " (2, 14)\t1\n",
592 | " (2, 15)\t1\n",
593 | " (2, 17)\t1\n",
594 | " (3, 5)\t1\n",
595 | " (3, 7)\t1\n",
596 | " (3, 11)\t1\n",
597 | " (3, 17)\t1\n"
598 | ]
599 | }
600 | ],
601 | "source": [
602 | "print(simple_train_dtm)"
603 | ]
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "metadata": {},
608 | "source": [
609 | "Normalmente muitos documentos usarão somente um pequeno subconjunto das palavras do nosso *dicionário*, por isso a matriz resultante terá muitos valores zero (tipicamente mais de 99% deles).\n",
610 | "\n",
611 | "Por exemplo, um conjunto de **dez mil (10.000)** pequenos textos (tais como e-mails) terá um vocabulário da ordem de **cem mil (100.000)** palavras únicas; porém, cada texto normalmente usará entre **cem (100)** e **mil (1.000)** palavras únicas individualmente.\n",
612 | "\n",
613 | "Visando o armazenamento dessa matriz em memória e a aceleração das operações, as implementações normalmente usam uma representação esparsa, como a disponível no pacote **_scipy.sparse_**. O esboço abaixo mede essa esparsidade na nossa matriz **_simple_train_dtm_**."
614 | ]
615 | },
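  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Esboço ilustrativo: contamos quantas células da matriz **_simple_train_dtm_** são diferentes de zero em relação ao total (os nomes de variáveis abaixo são apenas ilustrativos)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## esboço: medindo a esparsidade (densidade) da matriz de ocorrência\n",
    "n_celulas = simple_train_dtm.shape[0] * simple_train_dtm.shape[1]  # total de células\n",
    "n_nao_zeros = simple_train_dtm.nnz  # valores armazenados (diferentes de zero)\n",
    "\n",
    "print('células no total:', n_celulas)\n",
    "print('valores não nulos:', n_nao_zeros)\n",
    "print('densidade: {:.2%}'.format(n_nao_zeros / float(n_celulas)))"
   ]
  },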
616 | {
617 | "cell_type": "markdown",
618 | "metadata": {},
619 | "source": [
620 | "## 8. Classificações"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "### 8.1 Classificando um novo texto\n",
628 | "\n",
629 | "Nosso objetivo é inferir se um novo texto é **BOM** ou **RUIM**\n",
630 | "tendo como base os textos anteriormente classificados.\n",
631 | "O vetor ***novo_texto*** contém um novo texto que será classificado pelo nosso algoritmo de aprendizagem de máquina.\n",
632 | "\n",
633 | "Basicamente classificaremos o texto com o algoritmo ***KNN***."
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": 372,
639 | "metadata": {
640 | "collapsed": false
641 | },
642 | "outputs": [],
643 | "source": [
644 | "novo_texto = ['te odeio']"
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {},
650 | "source": [
651 | "#### Criando a matriz de ocorrência para o novo texto\n",
652 | "A matriz ***simple_test_dtm*** é a que será usada para a nova classificação."
653 | ]
654 | },
655 | {
656 | "cell_type": "code",
657 | "execution_count": 373,
658 | "metadata": {
659 | "collapsed": false
660 | },
661 | "outputs": [
662 | {
663 | "data": {
664 | "text/html": [
665 | "\n",
666 | "
\n",
667 | " \n",
668 | " \n",
669 | " | \n",
670 | " algo | \n",
671 | " amo | \n",
672 | " amor | \n",
673 | " ao | \n",
674 | " assim | \n",
675 | " de | \n",
676 | " eu | \n",
677 | " gosto | \n",
678 | " meu | \n",
679 | " mim | \n",
680 | " muito | \n",
681 | " não | \n",
682 | " odeio | \n",
683 | " pra | \n",
684 | " presta | \n",
685 | " te | \n",
686 | " tudo | \n",
687 | " você | \n",
688 | "
\n",
689 | " \n",
690 | " \n",
691 | " \n",
692 | " | te odeio | \n",
693 | " 0 | \n",
694 | " 0 | \n",
695 | " 0 | \n",
696 | " 0 | \n",
697 | " 0 | \n",
698 | " 0 | \n",
699 | " 0 | \n",
700 | " 0 | \n",
701 | " 0 | \n",
702 | " 0 | \n",
703 | " 0 | \n",
704 | " 0 | \n",
705 | " 1 | \n",
706 | " 0 | \n",
707 | " 0 | \n",
708 | " 1 | \n",
709 | " 0 | \n",
710 | " 0 | \n",
711 | "
\n",
712 | " \n",
713 | "
\n",
714 | "
"
715 | ],
716 | "text/plain": [
717 | " algo amo amor ao assim de eu gosto meu mim muito não \\\n",
718 | "te odeio 0 0 0 0 0 0 0 0 0 0 0 0 \n",
719 | "\n",
720 | " odeio pra presta te tudo você \n",
721 | "te odeio 1 0 0 1 0 0 "
722 | ]
723 | },
724 | "execution_count": 373,
725 | "metadata": {},
726 | "output_type": "execute_result"
727 | }
728 | ],
729 | "source": [
730 | "simple_test_dtm = vect.transform(novo_texto)\n",
731 | "\n",
732 | "##criando a visualização da matriz de ocorrência\n",
733 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names(), index=novo_texto)"
734 | ]
735 | },
736 | {
737 | "cell_type": "markdown",
738 | "metadata": {},
739 | "source": [
740 | "### 8.2 Classificador KNN\n",
741 | "\n",
742 | "Importando o classificador KNN do scikit-learn\n",
743 | "\n",
744 | "Para referência sobre o classificador KNN, você pode acessar o [Wikipedia-KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) e a documentação do [KNN no scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). "
745 | ]
746 | },
747 | {
748 | "cell_type": "code",
749 | "execution_count": 374,
750 | "metadata": {
751 | "collapsed": true
752 | },
753 | "outputs": [],
754 | "source": [
755 | "## importando o classificador\n",
756 | "from sklearn.neighbors import KNeighborsClassifier"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "Treinando o classificador KNN"
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": 375,
769 | "metadata": {
770 | "collapsed": false
771 | },
772 | "outputs": [
773 | {
774 | "data": {
775 | "text/plain": [
776 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
777 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
778 | " weights='uniform')"
779 | ]
780 | },
781 | "execution_count": 375,
782 | "metadata": {},
783 | "output_type": "execute_result"
784 | }
785 | ],
786 | "source": [
787 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
788 | "knn.fit(simple_train_dtm, felling)"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "### 8.3 Gerando uma classificação\n",
796 | "Para isso utiliza-se o método ***predict()*** do classificador"
797 | ]
798 | },
799 | {
800 | "cell_type": "code",
801 | "execution_count": 376,
802 | "metadata": {
803 | "collapsed": false
804 | },
805 | "outputs": [
806 | {
807 | "data": {
808 | "text/plain": [
809 | "1"
810 | ]
811 | },
812 | "execution_count": 376,
813 | "metadata": {},
814 | "output_type": "execute_result"
815 | }
816 | ],
817 | "source": [
818 | "fell = knn.predict(simple_test_dtm)[0]\n",
819 | "fell"
820 | ]
821 | },
822 | {
823 | "cell_type": "code",
824 | "execution_count": 377,
825 | "metadata": {
826 | "collapsed": false
827 | },
828 | "outputs": [
829 | {
830 | "name": "stdout",
831 | "output_type": "stream",
832 | "text": [
833 | "Bom sentimento\n"
834 | ]
835 | }
836 | ],
837 | "source": [
838 | "if fell==1:\n",
839 | " print(\"Bom sentimento\")\n",
840 | "else:\n",
841 | " print(\"Mal sentimento\")"
842 | ]
843 | },
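  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Como esboço adicional, podemos classificar vários textos novos de uma só vez, reutilizando o vocabulário já aprendido por **_vect_** e o classificador **_knn_** treinado acima (as frases abaixo são hipotéticas, apenas para demonstração)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## esboço: classificando vários textos novos de uma só vez (frases hipotéticas)\n",
    "novos_textos = ['você é tudo pra mim', 'não gosto de você']\n",
    "\n",
    "novos_dtm = vect.transform(novos_textos)  # usa o vocabulário já aprendido\n",
    "predicoes = knn.predict(novos_dtm)\n",
    "\n",
    "for texto, p in zip(novos_textos, predicoes):\n",
    "    print(texto, '->', 'BOM' if p == 1 else 'RUIM')"
   ]
  },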
844 | {
845 | "cell_type": "code",
846 | "execution_count": null,
847 | "metadata": {
848 | "collapsed": true
849 | },
850 | "outputs": [],
851 | "source": [
852 | ""
853 | ]
854 | },
855 | {
856 | "cell_type": "code",
857 | "execution_count": null,
858 | "metadata": {
859 | "collapsed": true
860 | },
861 | "outputs": [],
862 | "source": [
863 | ""
864 | ]
865 | },
866 | {
867 | "cell_type": "code",
868 | "execution_count": 369,
869 | "metadata": {
870 | "collapsed": true
871 | },
872 | "outputs": [],
873 | "source": [
874 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
875 | ]
876 | },
877 | {
878 | "cell_type": "code",
879 | "execution_count": 370,
880 | "metadata": {
881 | "collapsed": false
882 | },
883 | "outputs": [
884 | {
885 | "data": {
886 | "text/html": [
887 | "\n",
888 | "
\n",
889 | " \n",
890 | " \n",
891 | " | \n",
892 | " label | \n",
893 | " message | \n",
894 | "
\n",
895 | " \n",
896 | " \n",
897 | " \n",
898 | " | 0 | \n",
899 | " ham | \n",
900 | " Go until jurong point, crazy.. Available only ... | \n",
901 | "
\n",
902 | " \n",
903 | " | 1 | \n",
904 | " ham | \n",
905 | " Ok lar... Joking wif u oni... | \n",
906 | "
\n",
907 | " \n",
908 | " | 2 | \n",
909 | " spam | \n",
910 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
911 | "
\n",
912 | " \n",
913 | " | 3 | \n",
914 | " ham | \n",
915 | " U dun say so early hor... U c already then say... | \n",
916 | "
\n",
917 | " \n",
918 | " | 4 | \n",
919 | " ham | \n",
920 | " Nah I don't think he goes to usf, he lives aro... | \n",
921 | "
\n",
922 | " \n",
923 | " | 5 | \n",
924 | " spam | \n",
925 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
926 | "
\n",
927 | " \n",
928 | " | 6 | \n",
929 | " ham | \n",
930 | " Even my brother is not like to speak with me. ... | \n",
931 | "
\n",
932 | " \n",
933 | " | 7 | \n",
934 | " ham | \n",
935 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
936 | "
\n",
937 | " \n",
938 | " | 8 | \n",
939 | " spam | \n",
940 | " WINNER!! As a valued network customer you have... | \n",
941 | "
\n",
942 | " \n",
943 | " | 9 | \n",
944 | " spam | \n",
945 | " Had your mobile 11 months or more? U R entitle... | \n",
946 | "
\n",
947 | " \n",
948 | "
\n",
949 | "
"
950 | ],
951 | "text/plain": [
952 | " label message\n",
953 | "0 ham Go until jurong point, crazy.. Available only ...\n",
954 | "1 ham Ok lar... Joking wif u oni...\n",
955 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
956 | "3 ham U dun say so early hor... U c already then say...\n",
957 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
958 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
959 | "6 ham Even my brother is not like to speak with me. ...\n",
960 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
961 | "8 spam WINNER!! As a valued network customer you have...\n",
962 | "9 spam Had your mobile 11 months or more? U R entitle..."
963 | ]
964 | },
965 | "execution_count": 370,
966 | "metadata": {},
967 | "output_type": "execute_result"
968 | }
969 | ],
970 | "source": [
971 | "sms.head(10)"
972 | ]
973 | },
974 | {
975 | "cell_type": "code",
976 | "execution_count": 371,
977 | "metadata": {
978 | "collapsed": false
979 | },
980 | "outputs": [
981 | {
982 | "data": {
983 | "text/plain": [
984 | "ham 4825\n",
985 | "spam 747\n",
986 | "Name: label, dtype: int64"
987 | ]
988 | },
989 | "execution_count": 371,
990 | "metadata": {},
991 | "output_type": "execute_result"
992 | }
993 | ],
994 | "source": [
995 | "sms.label.value_counts()"
996 | ]
997 | },
998 | {
999 | "cell_type": "code",
1000 | "execution_count": null,
1001 | "metadata": {
1002 | "collapsed": true
1003 | },
1004 | "outputs": [],
1005 | "source": [
1006 | ""
1007 | ]
1008 | }
1009 | ],
1010 | "metadata": {
1011 | "kernelspec": {
1012 | "display_name": "Python [conda root]",
1013 | "language": "python",
1014 | "name": "conda-root-py"
1015 | },
1016 | "language_info": {
1017 | "codemirror_mode": {
1018 | "name": "ipython",
1019 | "version": 3.0
1020 | },
1021 | "file_extension": ".py",
1022 | "mimetype": "text/x-python",
1023 | "name": "python",
1024 | "nbconvert_exporter": "python",
1025 | "pygments_lexer": "ipython3",
1026 | "version": "3.5.2"
1027 | }
1028 | },
1029 | "nbformat": 4,
1030 | "nbformat_minor": 0
1031 | }
--------------------------------------------------------------------------------
/AnaliseTexto/.ipynb_checkpoints/tutorial-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 3,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "(150, 4)\n",
88 | "(150,)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "# check the shapes of X and y\n",
94 | "print(X.shape)\n",
95 | "print(y.shape)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "**\"Observations\"** are also known as samples, instances, or records."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "data": {
114 | "text/html": [
115 | "\n",
116 | "
\n",
117 | " \n",
118 | " \n",
119 | " | \n",
120 | " sepal length (cm) | \n",
121 | " sepal width (cm) | \n",
122 | " petal length (cm) | \n",
123 | " petal width (cm) | \n",
124 | "
\n",
125 | " \n",
126 | " \n",
127 | " \n",
128 | " | 0 | \n",
129 | " 5.1 | \n",
130 | " 3.5 | \n",
131 | " 1.4 | \n",
132 | " 0.2 | \n",
133 | "
\n",
134 | " \n",
135 | " | 1 | \n",
136 | " 4.9 | \n",
137 | " 3.0 | \n",
138 | " 1.4 | \n",
139 | " 0.2 | \n",
140 | "
\n",
141 | " \n",
142 | " | 2 | \n",
143 | " 4.7 | \n",
144 | " 3.2 | \n",
145 | " 1.3 | \n",
146 | " 0.2 | \n",
147 | "
\n",
148 | " \n",
149 | " | 3 | \n",
150 | " 4.6 | \n",
151 | " 3.1 | \n",
152 | " 1.5 | \n",
153 | " 0.2 | \n",
154 | "
\n",
155 | " \n",
156 | " | 4 | \n",
157 | " 5.0 | \n",
158 | " 3.6 | \n",
159 | " 1.4 | \n",
160 | " 0.2 | \n",
161 | "
\n",
162 | " \n",
163 | "
\n",
164 | "
"
165 | ],
166 | "text/plain": [
167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
168 | "0 5.1 3.5 1.4 0.2\n",
169 | "1 4.9 3.0 1.4 0.2\n",
170 | "2 4.7 3.2 1.3 0.2\n",
171 | "3 4.6 3.1 1.5 0.2\n",
172 | "4 5.0 3.6 1.4 0.2"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
182 | "import pandas as pd\n",
183 | "pd.DataFrame(X, columns=iris.feature_names).head()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
201 | " 2 2]\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# examine the response vector\n",
207 | "print(y)"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
229 | " weights='uniform')"
230 | ]
231 | },
232 | "execution_count": 7,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "# import the class\n",
239 | "from sklearn.neighbors import KNeighborsClassifier\n",
240 | "\n",
241 | "# instantiate the model (with the default parameters)\n",
242 | "knn = KNeighborsClassifier()\n",
243 | "\n",
244 | "# fit the model with data (occurs in-place)\n",
245 | "knn.fit(X, y)"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 8,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "data": {
264 | "text/plain": [
265 | "array([1])"
266 | ]
267 | },
268 | "execution_count": 8,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "# predict the response for a new observation\n",
275 | "knn.predict([[3, 5, 4, 2]])"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## Part 2: Representing text as numerical data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 9,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# example text for model training (SMS messages)\n",
294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# example response vector\n",
306 | "is_desperate = [0, 0, 1]"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
314 | "\n",
315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
316 | "\n",
317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 11,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "# import and instantiate CountVectorizer (with the default parameters)\n",
329 | "from sklearn.feature_extraction.text import CountVectorizer\n",
330 | "vect = CountVectorizer()"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 12,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
344 | "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
348 | " tokenizer=None, vocabulary=None)"
349 | ]
350 | },
351 | "execution_count": 12,
352 | "metadata": {},
353 | "output_type": "execute_result"
354 | }
355 | ],
356 | "source": [
357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
358 | "vect.fit(simple_train)"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 13,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']"
372 | ]
373 | },
374 | "execution_count": 13,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "# examine the fitted vocabulary\n",
381 | "vect.get_feature_names()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 14,
387 | "metadata": {
388 | "collapsed": false
389 | },
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
395 | "\twith 9 stored elements in Compressed Sparse Row format>"
396 | ]
397 | },
398 | "execution_count": 14,
399 | "metadata": {},
400 | "output_type": "execute_result"
401 | }
402 | ],
403 | "source": [
404 | "# transform training data into a 'document-term matrix'\n",
405 | "simple_train_dtm = vect.transform(simple_train)\n",
406 | "simple_train_dtm"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 15,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "array([[0, 1, 0, 0, 1, 1],\n",
420 | " [1, 1, 1, 0, 0, 0],\n",
421 | " [0, 1, 1, 2, 0, 0]])"
422 | ]
423 | },
424 | "execution_count": 15,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# convert sparse matrix to a dense matrix\n",
431 | "simple_train_dtm.toarray()"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/html": [
444 | "\n",
445 | "
\n",
446 | " \n",
447 | " \n",
448 | " | \n",
449 | " cab | \n",
450 | " call | \n",
451 | " me | \n",
452 | " please | \n",
453 | " tonight | \n",
454 | " you | \n",
455 | "
\n",
456 | " \n",
457 | " \n",
458 | " \n",
459 | " | 0 | \n",
460 | " 0 | \n",
461 | " 1 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 1 | \n",
465 | " 1 | \n",
466 | "
\n",
467 | " \n",
468 | " | 1 | \n",
469 | " 1 | \n",
470 | " 1 | \n",
471 | " 1 | \n",
472 | " 0 | \n",
473 | " 0 | \n",
474 | " 0 | \n",
475 | "
\n",
476 | " \n",
477 | " | 2 | \n",
478 | " 0 | \n",
479 | " 1 | \n",
480 | " 1 | \n",
481 | " 2 | \n",
482 | " 0 | \n",
483 | " 0 | \n",
484 | "
\n",
485 | " \n",
486 | "
\n",
487 | "
"
488 | ],
489 | "text/plain": [
490 | " cab call me please tonight you\n",
491 | "0 0 1 0 0 1 1\n",
492 | "1 1 1 1 0 0 0\n",
493 | "2 0 1 1 2 0 0"
494 | ]
495 | },
496 | "execution_count": 16,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# examine the vocabulary and document-term matrix together\n",
503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
511 | "\n",
512 | "> In this scheme, features and samples are defined as follows:\n",
513 | "\n",
514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
516 | "\n",
517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
518 | "\n",
519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": 17,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "scipy.sparse.csr.csr_matrix"
533 | ]
534 | },
535 | "execution_count": 17,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "# check the type of the document-term matrix\n",
542 | "type(simple_train_dtm)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 18,
548 | "metadata": {
549 | "collapsed": false,
550 | "scrolled": true
551 | },
552 | "outputs": [
553 | {
554 | "name": "stdout",
555 | "output_type": "stream",
556 | "text": [
557 | " (0, 1)\t1\n",
558 | " (0, 4)\t1\n",
559 | " (0, 5)\t1\n",
560 | " (1, 0)\t1\n",
561 | " (1, 1)\t1\n",
562 | " (1, 2)\t1\n",
563 | " (2, 1)\t1\n",
564 | " (2, 2)\t1\n",
565 | " (2, 3)\t2\n"
566 | ]
567 | }
568 | ],
569 | "source": [
570 | "# examine the sparse matrix contents\n",
571 | "print(simple_train_dtm)"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
579 | "\n",
580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
581 | "\n",
582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
583 | "\n",
584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": 19,
590 | "metadata": {
591 | "collapsed": false
592 | },
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
599 | " weights='uniform')"
600 | ]
601 | },
602 | "execution_count": 19,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "# build a model to predict desperation\n",
609 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
610 | "knn.fit(simple_train_dtm, is_desperate)"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": 20,
616 | "metadata": {
617 | "collapsed": true
618 | },
619 | "outputs": [],
620 | "source": [
621 | "# example text for model testing\n",
622 | "simple_test = [\"please don't call me\"]"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 21,
635 | "metadata": {
636 | "collapsed": false
637 | },
638 | "outputs": [
639 | {
640 | "data": {
641 | "text/plain": [
642 | "array([[0, 1, 1, 1, 0, 0]])"
643 | ]
644 | },
645 | "execution_count": 21,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
652 | "simple_test_dtm = vect.transform(simple_test)\n",
653 | "simple_test_dtm.toarray()"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 22,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
665 | "text/html": [
666 | "\n",
667 | "
\n",
668 | " \n",
669 | " \n",
670 | " | \n",
671 | " cab | \n",
672 | " call | \n",
673 | " me | \n",
674 | " please | \n",
675 | " tonight | \n",
676 | " you | \n",
677 | "
\n",
678 | " \n",
679 | " \n",
680 | " \n",
681 | " | 0 | \n",
682 | " 0 | \n",
683 | " 1 | \n",
684 | " 1 | \n",
685 | " 1 | \n",
686 | " 0 | \n",
687 | " 0 | \n",
688 | "
\n",
689 | " \n",
690 | "
\n",
691 | "
"
692 | ],
693 | "text/plain": [
694 | " cab call me please tonight you\n",
695 | "0 0 1 1 1 0 0"
696 | ]
697 | },
698 | "execution_count": 22,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# examine the vocabulary and document-term matrix together\n",
705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": 23,
711 | "metadata": {
712 | "collapsed": false
713 | },
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/plain": [
718 | "array([1])"
719 | ]
720 | },
721 | "execution_count": 23,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# predict whether simple_test is desperate\n",
728 | "knn.predict(simple_test_dtm)"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "**Summary:**\n",
736 | "\n",
737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
740 | ]
741 | },
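  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch of the last point (using a made-up test message), transforming text that contains words outside the fitted vocabulary simply drops those unseen words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# sketch: tokens not seen during fit are ignored by transform\n",
    "unseen_test = ['please call me about the meeting tonight']  # 'about', 'the', 'meeting' are not in the vocabulary\n",
    "pd.DataFrame(vect.transform(unseen_test).toarray(), columns=vect.get_feature_names())"
   ]
  },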
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "## Part 3: Reading a text-based dataset into pandas"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 24,
752 | "metadata": {
753 | "collapsed": true
754 | },
755 | "outputs": [],
756 | "source": [
757 | "# read file into pandas from the working directory\n",
758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
759 | ]
760 | },
761 | {
762 | "cell_type": "code",
763 | "execution_count": 25,
764 | "metadata": {
765 | "collapsed": false
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# alternative: read file into pandas from a URL\n",
770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 26,
777 | "metadata": {
778 | "collapsed": false
779 | },
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(5572, 2)"
785 | ]
786 | },
787 | "execution_count": 26,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# examine the shape\n",
794 | "sms.shape"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 27,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
806 | "text/html": [
807 | "\n",
808 | "
\n",
809 | " \n",
810 | " \n",
811 | " | \n",
812 | " label | \n",
813 | " message | \n",
814 | "
\n",
815 | " \n",
816 | " \n",
817 | " \n",
818 | " | 0 | \n",
819 | " ham | \n",
820 | " Go until jurong point, crazy.. Available only ... | \n",
821 | "
\n",
822 | " \n",
823 | " | 1 | \n",
824 | " ham | \n",
825 | " Ok lar... Joking wif u oni... | \n",
826 | "
\n",
827 | " \n",
828 | " | 2 | \n",
829 | " spam | \n",
830 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
831 | "
\n",
832 | " \n",
833 | " | 3 | \n",
834 | " ham | \n",
835 | " U dun say so early hor... U c already then say... | \n",
836 | "
\n",
837 | " \n",
838 | " | 4 | \n",
839 | " ham | \n",
840 | " Nah I don't think he goes to usf, he lives aro... | \n",
841 | "
\n",
842 | " \n",
843 | " | 5 | \n",
844 | " spam | \n",
845 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
846 | "
\n",
847 | " \n",
848 | " | 6 | \n",
849 | " ham | \n",
850 | " Even my brother is not like to speak with me. ... | \n",
851 | "
\n",
852 | " \n",
853 | " | 7 | \n",
854 | " ham | \n",
855 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
856 | "
\n",
857 | " \n",
858 | " | 8 | \n",
859 | " spam | \n",
860 | " WINNER!! As a valued network customer you have... | \n",
861 | "
\n",
862 | " \n",
863 | " | 9 | \n",
864 | " spam | \n",
865 | " Had your mobile 11 months or more? U R entitle... | \n",
866 | "
\n",
867 | " \n",
868 | "
\n",
869 | "
"
870 | ],
871 | "text/plain": [
872 | " label message\n",
873 | "0 ham Go until jurong point, crazy.. Available only ...\n",
874 | "1 ham Ok lar... Joking wif u oni...\n",
875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
876 | "3 ham U dun say so early hor... U c already then say...\n",
877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
879 | "6 ham Even my brother is not like to speak with me. ...\n",
880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
881 | "8 spam WINNER!! As a valued network customer you have...\n",
882 | "9 spam Had your mobile 11 months or more? U R entitle..."
883 | ]
884 | },
885 | "execution_count": 27,
886 | "metadata": {},
887 | "output_type": "execute_result"
888 | }
889 | ],
890 | "source": [
891 | "# examine the first 10 rows\n",
892 | "sms.head(10)"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 28,
898 | "metadata": {
899 | "collapsed": false
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "ham 4825\n",
906 | "spam 747\n",
907 | "Name: label, dtype: int64"
908 | ]
909 | },
910 | "execution_count": 28,
911 | "metadata": {},
912 | "output_type": "execute_result"
913 | }
914 | ],
915 | "source": [
916 | "# examine the class distribution\n",
917 | "sms.label.value_counts()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 29,
923 | "metadata": {
924 | "collapsed": true
925 | },
926 | "outputs": [],
927 | "source": [
928 | "# convert label to a numerical variable\n",
929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": 30,
935 | "metadata": {
936 | "collapsed": false
937 | },
938 | "outputs": [
939 | {
940 | "data": {
941 | "text/html": [
942 | "\n",
943 | "
\n",
944 | " \n",
945 | " \n",
946 | " | \n",
947 | " label | \n",
948 | " message | \n",
949 | " label_num | \n",
950 | "
\n",
951 | " \n",
952 | " \n",
953 | " \n",
954 | " | 0 | \n",
955 | " ham | \n",
956 | " Go until jurong point, crazy.. Available only ... | \n",
957 | " 0 | \n",
958 | "
\n",
959 | " \n",
960 | " | 1 | \n",
961 | " ham | \n",
962 | " Ok lar... Joking wif u oni... | \n",
963 | " 0 | \n",
964 | "
\n",
965 | " \n",
966 | " | 2 | \n",
967 | " spam | \n",
968 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
969 | " 1 | \n",
970 | "
\n",
971 | " \n",
972 | " | 3 | \n",
973 | " ham | \n",
974 | " U dun say so early hor... U c already then say... | \n",
975 | " 0 | \n",
976 | "
\n",
977 | " \n",
978 | " | 4 | \n",
979 | " ham | \n",
980 | " Nah I don't think he goes to usf, he lives aro... | \n",
981 | " 0 | \n",
982 | "
\n",
983 | " \n",
984 | " | 5 | \n",
985 | " spam | \n",
986 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
987 | " 1 | \n",
988 | "
\n",
989 | " \n",
990 | " | 6 | \n",
991 | " ham | \n",
992 | " Even my brother is not like to speak with me. ... | \n",
993 | " 0 | \n",
994 | "
\n",
995 | " \n",
996 | " | 7 | \n",
997 | " ham | \n",
998 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
999 | " 0 | \n",
1000 | "
\n",
1001 | " \n",
1002 | " | 8 | \n",
1003 | " spam | \n",
1004 | " WINNER!! As a valued network customer you have... | \n",
1005 | " 1 | \n",
1006 | "
\n",
1007 | " \n",
1008 | " | 9 | \n",
1009 | " spam | \n",
1010 | " Had your mobile 11 months or more? U R entitle... | \n",
1011 | " 1 | \n",
1012 | "
\n",
1013 | " \n",
1014 | "
\n",
1015 | "
"
1016 | ],
1017 | "text/plain": [
1018 | " label message label_num\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
1020 | "1 ham Ok lar... Joking wif u oni... 0\n",
1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
1022 | "3 ham U dun say so early hor... U c already then say... 0\n",
1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
1025 | "6 ham Even my brother is not like to speak with me. ... 0\n",
1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
1027 | "8 spam WINNER!! As a valued network customer you have... 1\n",
1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
1029 | ]
1030 | },
1031 | "execution_count": 30,
1032 | "metadata": {},
1033 | "output_type": "execute_result"
1034 | }
1035 | ],
1036 | "source": [
1037 | "# check that the conversion worked\n",
1038 | "sms.head(10)"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 31,
1044 | "metadata": {
1045 | "collapsed": false
1046 | },
1047 | "outputs": [
1048 | {
1049 | "name": "stdout",
1050 | "output_type": "stream",
1051 | "text": [
1052 | "(150, 4)\n",
1053 | "(150,)\n"
1054 | ]
1055 | }
1056 | ],
1057 | "source": [
1058 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1059 | "X = iris.data\n",
1060 | "y = iris.target\n",
1061 | "print(X.shape)\n",
1062 | "print(y.shape)"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 32,
1068 | "metadata": {
1069 | "collapsed": false
1070 | },
1071 | "outputs": [
1072 | {
1073 | "name": "stdout",
1074 | "output_type": "stream",
1075 | "text": [
1076 | "(5572,)\n",
1077 | "(5572,)\n"
1078 | ]
1079 | }
1080 | ],
1081 | "source": [
1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1083 | "X = sms.message\n",
1084 | "y = sms.label_num\n",
1085 | "print(X.shape)\n",
1086 | "print(y.shape)"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 33,
1092 | "metadata": {
1093 | "collapsed": false
1094 | },
1095 | "outputs": [
1096 | {
1097 | "name": "stdout",
1098 | "output_type": "stream",
1099 | "text": [
1100 | "(4179,)\n",
1101 | "(1393,)\n",
1102 | "(4179,)\n",
1103 | "(1393,)\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "# split X and y into training and testing sets\n",
1109 | "from sklearn.cross_validation import train_test_split\n",
1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1111 | "print(X_train.shape)\n",
1112 | "print(X_test.shape)\n",
1113 | "print(y_train.shape)\n",
1114 | "print(y_test.shape)"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "## Part 4: Vectorizing our dataset"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 34,
1127 | "metadata": {
1128 | "collapsed": true
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "# instantiate the vectorizer\n",
1133 | "vect = CountVectorizer()"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 35,
1139 | "metadata": {
1140 | "collapsed": true
1141 | },
1142 | "outputs": [],
1143 | "source": [
1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1145 | "vect.fit(X_train)\n",
1146 | "X_train_dtm = vect.transform(X_train)"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": 36,
1152 | "metadata": {
1153 | "collapsed": true
1154 | },
1155 | "outputs": [],
1156 | "source": [
1157 | "# equivalently: combine fit and transform into a single step\n",
1158 | "X_train_dtm = vect.fit_transform(X_train)"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": 37,
1164 | "metadata": {
1165 | "collapsed": false
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1172 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1173 | ]
1174 | },
1175 | "execution_count": 37,
1176 | "metadata": {},
1177 | "output_type": "execute_result"
1178 | }
1179 | ],
1180 | "source": [
1181 | "# examine the document-term matrix\n",
1182 | "X_train_dtm"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {
1189 | "collapsed": false
1190 | },
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/plain": [
1195 | "<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1196 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1197 | ]
1198 | },
1199 | "execution_count": 38,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1206 | "X_test_dtm = vect.transform(X_test)\n",
1207 | "X_test_dtm"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "## Part 5: Building and evaluating a model\n",
1215 | "\n",
1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1217 | "\n",
1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1219 | ]
1220 | },
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 39,
1224 | "metadata": {
1225 | "collapsed": true
1226 | },
1227 | "outputs": [],
1228 | "source": [
1229 | "# import and instantiate a Multinomial Naive Bayes model\n",
1230 | "from sklearn.naive_bayes import MultinomialNB\n",
1231 | "nb = MultinomialNB()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 40,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "name": "stdout",
1243 | "output_type": "stream",
1244 | "text": [
1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
1246 | "Wall time: 2.78 ms\n"
1247 | ]
1248 | },
1249 | {
1250 | "data": {
1251 | "text/plain": [
1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1253 | ]
1254 | },
1255 | "execution_count": 40,
1256 | "metadata": {},
1257 | "output_type": "execute_result"
1258 | }
1259 | ],
1260 | "source": [
1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1262 | "%time nb.fit(X_train_dtm, y_train)"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 41,
1268 | "metadata": {
1269 | "collapsed": true
1270 | },
1271 | "outputs": [],
1272 | "source": [
1273 | "# make class predictions for X_test_dtm\n",
1274 | "y_pred_class = nb.predict(X_test_dtm)"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 42,
1280 | "metadata": {
1281 | "collapsed": false
1282 | },
1283 | "outputs": [
1284 | {
1285 | "data": {
1286 | "text/plain": [
1287 | "0.98851399856424982"
1288 | ]
1289 | },
1290 | "execution_count": 42,
1291 | "metadata": {},
1292 | "output_type": "execute_result"
1293 | }
1294 | ],
1295 | "source": [
1296 | "# calculate accuracy of class predictions\n",
1297 | "from sklearn import metrics\n",
1298 | "metrics.accuracy_score(y_test, y_pred_class)"
1299 | ]
1300 | },
1301 | {
1302 | "cell_type": "code",
1303 | "execution_count": 43,
1304 | "metadata": {
1305 | "collapsed": false
1306 | },
1307 | "outputs": [
1308 | {
1309 | "data": {
1310 | "text/plain": [
1311 | "array([[1203, 5],\n",
1312 | " [ 11, 174]])"
1313 | ]
1314 | },
1315 | "execution_count": 43,
1316 | "metadata": {},
1317 | "output_type": "execute_result"
1318 | }
1319 | ],
1320 | "source": [
1321 | "# print the confusion matrix\n",
1322 | "metrics.confusion_matrix(y_test, y_pred_class)"
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 44,
1328 | "metadata": {
1329 | "collapsed": false
1330 | },
1331 | "outputs": [],
1332 | "source": [
1333 | "# print message text for the false positives (ham incorrectly classified as spam)\nX_test[y_pred_class > y_test]"
1334 | ]
1335 | },
1336 | {
1337 | "cell_type": "code",
1338 | "execution_count": 45,
1339 | "metadata": {
1340 | "collapsed": false,
1341 | "scrolled": true
1342 | },
1343 | "outputs": [],
1344 | "source": [
1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\nX_test[y_pred_class < y_test]"
1346 | ]
1347 | },
1348 | {
1349 | "cell_type": "code",
1350 | "execution_count": 46,
1351 | "metadata": {
1352 | "collapsed": false,
1353 | "scrolled": true
1354 | },
1355 | "outputs": [
1356 | {
1357 | "data": {
1358 | "text/plain": [
1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1360 | ]
1361 | },
1362 | "execution_count": 46,
1363 | "metadata": {},
1364 | "output_type": "execute_result"
1365 | }
1366 | ],
1367 | "source": [
1368 | "# example false negative\n",
1369 | "X_test[3132]"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 47,
1375 | "metadata": {
1376 | "collapsed": false
1377 | },
1378 | "outputs": [
1379 | {
1380 | "data": {
1381 | "text/plain": [
1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1384 | ]
1385 | },
1386 | "execution_count": 47,
1387 | "metadata": {},
1388 | "output_type": "execute_result"
1389 | }
1390 | ],
1391 | "source": [
1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1394 | "y_pred_prob"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": 48,
1400 | "metadata": {
1401 | "collapsed": false
1402 | },
1403 | "outputs": [
1404 | {
1405 | "data": {
1406 | "text/plain": [
1407 | "0.98664310005369604"
1408 | ]
1409 | },
1410 | "execution_count": 48,
1411 | "metadata": {},
1412 | "output_type": "execute_result"
1413 | }
1414 | ],
1415 | "source": [
1416 | "# calculate AUC\n",
1417 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "markdown",
1422 | "metadata": {},
1423 | "source": [
1424 | "## Part 6: Comparing models\n",
1425 | "\n",
1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1427 | "\n",
1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1429 | ]
1430 | },
1431 | {
1432 | "cell_type": "code",
1433 | "execution_count": 49,
1434 | "metadata": {
1435 | "collapsed": true
1436 | },
1437 | "outputs": [],
1438 | "source": [
1439 | "# import and instantiate a logistic regression model\n",
1440 | "from sklearn.linear_model import LogisticRegression\n",
1441 | "logreg = LogisticRegression()"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 50,
1447 | "metadata": {
1448 | "collapsed": false
1449 | },
1450 | "outputs": [
1451 | {
1452 | "name": "stdout",
1453 | "output_type": "stream",
1454 | "text": [
1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n",
1456 | "Wall time: 273 ms\n"
1457 | ]
1458 | },
1459 | {
1460 | "data": {
1461 | "text/plain": [
1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1465 | " verbose=0, warm_start=False)"
1466 | ]
1467 | },
1468 | "execution_count": 50,
1469 | "metadata": {},
1470 | "output_type": "execute_result"
1471 | }
1472 | ],
1473 | "source": [
1474 | "# train the model using X_train_dtm\n",
1475 | "%time logreg.fit(X_train_dtm, y_train)"
1476 | ]
1477 | },
1478 | {
1479 | "cell_type": "code",
1480 | "execution_count": 51,
1481 | "metadata": {
1482 | "collapsed": true
1483 | },
1484 | "outputs": [],
1485 | "source": [
1486 | "# make class predictions for X_test_dtm\n",
1487 | "y_pred_class = logreg.predict(X_test_dtm)"
1488 | ]
1489 | },
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": 52,
1493 | "metadata": {
1494 | "collapsed": false
1495 | },
1496 | "outputs": [
1497 | {
1498 | "data": {
1499 | "text/plain": [
1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1501 | " 0.99725053, 0.00157706])"
1502 | ]
1503 | },
1504 | "execution_count": 52,
1505 | "metadata": {},
1506 | "output_type": "execute_result"
1507 | }
1508 | ],
1509 | "source": [
1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1512 | "y_pred_prob"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "code",
1517 | "execution_count": 53,
1518 | "metadata": {
1519 | "collapsed": false
1520 | },
1521 | "outputs": [
1522 | {
1523 | "data": {
1524 | "text/plain": [
1525 | "0.9877961234745154"
1526 | ]
1527 | },
1528 | "execution_count": 53,
1529 | "metadata": {},
1530 | "output_type": "execute_result"
1531 | }
1532 | ],
1533 | "source": [
1534 | "# calculate accuracy\n",
1535 | "metrics.accuracy_score(y_test, y_pred_class)"
1536 | ]
1537 | },
1538 | {
1539 | "cell_type": "code",
1540 | "execution_count": 54,
1541 | "metadata": {
1542 | "collapsed": false
1543 | },
1544 | "outputs": [
1545 | {
1546 | "data": {
1547 | "text/plain": [
1548 | "0.99368176123143015"
1549 | ]
1550 | },
1551 | "execution_count": 54,
1552 | "metadata": {},
1553 | "output_type": "execute_result"
1554 | }
1555 | ],
1556 | "source": [
1557 | "# calculate AUC\n",
1558 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1559 | ]
1560 | }
1561 | ],
1562 | "metadata": {
1563 | "kernelspec": {
1564 | "display_name": "Python [conda root]",
1565 | "language": "python",
1566 | "name": "conda-root-py"
1567 | },
1568 | "language_info": {
1569 | "codemirror_mode": {
1570 | "name": "ipython",
1571 | "version": 3
1572 | },
1573 | "file_extension": ".py",
1574 | "mimetype": "text/x-python",
1575 | "name": "python",
1576 | "nbconvert_exporter": "python",
1577 | "pygments_lexer": "ipython3",
1578 | "version": "3.5.2"
1579 | }
1580 | },
1581 | "nbformat": 4,
1582 | "nbformat_minor": 0
1583 | }
1584 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/tutorial-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 3,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "(150, 4)\n",
88 | "(150,)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "# check the shapes of X and y\n",
94 | "print(X.shape)\n",
95 | "print(y.shape)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "**\"Observations\"** are also known as samples, instances, or records."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "data": {
166 | "text/plain": [
167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
168 | "0 5.1 3.5 1.4 0.2\n",
169 | "1 4.9 3.0 1.4 0.2\n",
170 | "2 4.7 3.2 1.3 0.2\n",
171 | "3 4.6 3.1 1.5 0.2\n",
172 | "4 5.0 3.6 1.4 0.2"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
182 | "import pandas as pd\n",
183 | "pd.DataFrame(X, columns=iris.feature_names).head()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
201 | " 2 2]\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# examine the response vector\n",
207 | "print(y)"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
229 | " weights='uniform')"
230 | ]
231 | },
232 | "execution_count": 7,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "# import the class\n",
239 | "from sklearn.neighbors import KNeighborsClassifier\n",
240 | "\n",
241 | "# instantiate the model (with the default parameters)\n",
242 | "knn = KNeighborsClassifier()\n",
243 | "\n",
244 | "# fit the model with data (occurs in-place)\n",
245 | "knn.fit(X, y)"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 8,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "data": {
264 | "text/plain": [
265 | "array([1])"
266 | ]
267 | },
268 | "execution_count": 8,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "# predict the response for a new observation\n",
275 | "knn.predict([[3, 5, 4, 2]])"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## Part 2: Representing text as numerical data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 9,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# example text for model training (SMS messages)\n",
294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# example response vector\n",
306 | "is_desperate = [0, 0, 1]"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
314 | "\n",
315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
316 | "\n",
317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 11,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "# import and instantiate CountVectorizer (with the default parameters)\n",
329 | "from sklearn.feature_extraction.text import CountVectorizer\n",
330 | "vect = CountVectorizer()"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 12,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
344 | "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
348 | " tokenizer=None, vocabulary=None)"
349 | ]
350 | },
351 | "execution_count": 12,
352 | "metadata": {},
353 | "output_type": "execute_result"
354 | }
355 | ],
356 | "source": [
357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
358 | "vect.fit(simple_train)"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 13,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']"
372 | ]
373 | },
374 | "execution_count": 13,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "# examine the fitted vocabulary\n",
381 | "vect.get_feature_names()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 14,
387 | "metadata": {
388 | "collapsed": false
389 | },
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
395 | "\twith 9 stored elements in Compressed Sparse Row format>"
396 | ]
397 | },
398 | "execution_count": 14,
399 | "metadata": {},
400 | "output_type": "execute_result"
401 | }
402 | ],
403 | "source": [
404 | "# transform training data into a 'document-term matrix'\n",
405 | "simple_train_dtm = vect.transform(simple_train)\n",
406 | "simple_train_dtm"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 15,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "array([[0, 1, 0, 0, 1, 1],\n",
420 | " [1, 1, 1, 0, 0, 0],\n",
421 | " [0, 1, 1, 2, 0, 0]])"
422 | ]
423 | },
424 | "execution_count": 15,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# convert sparse matrix to a dense matrix\n",
431 | "simple_train_dtm.toarray()"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [
441 | {
442 | "data": {
489 | "text/plain": [
490 | " cab call me please tonight you\n",
491 | "0 0 1 0 0 1 1\n",
492 | "1 1 1 1 0 0 0\n",
493 | "2 0 1 1 2 0 0"
494 | ]
495 | },
496 | "execution_count": 16,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# examine the vocabulary and document-term matrix together\n",
503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
511 | "\n",
512 | "> In this scheme, features and samples are defined as follows:\n",
513 | "\n",
514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
516 | "\n",
517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
518 | "\n",
519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
520 | ]
521 | },
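
A small illustration (not part of the original notebook) of the last point: because the bag-of-words representation only counts tokens, documents containing the same words in a different order produce identical vectors. The variable names here are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call me tonight', 'tonight call me']   # same words, different order
vect_demo = CountVectorizer()
dtm_demo = vect_demo.fit_transform(docs)

print(vect_demo.get_feature_names())            # ['call', 'me', 'tonight']
print(dtm_demo.toarray())                       # both rows are [1, 1, 1]
```
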
522 | {
523 | "cell_type": "code",
524 | "execution_count": 17,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "scipy.sparse.csr.csr_matrix"
533 | ]
534 | },
535 | "execution_count": 17,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "# check the type of the document-term matrix\n",
542 | "type(simple_train_dtm)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 18,
548 | "metadata": {
549 | "collapsed": false,
550 | "scrolled": true
551 | },
552 | "outputs": [
553 | {
554 | "name": "stdout",
555 | "output_type": "stream",
556 | "text": [
557 | " (0, 1)\t1\n",
558 | " (0, 4)\t1\n",
559 | " (0, 5)\t1\n",
560 | " (1, 0)\t1\n",
561 | " (1, 1)\t1\n",
562 | " (1, 2)\t1\n",
563 | " (2, 1)\t1\n",
564 | " (2, 2)\t1\n",
565 | " (2, 3)\t2\n"
566 | ]
567 | }
568 | ],
569 | "source": [
570 | "# examine the sparse matrix contents\n",
571 | "print(simple_train_dtm)"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
579 | "\n",
580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
581 | "\n",
582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
583 | "\n",
584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
585 | ]
586 | },
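
A rough sketch (not part of the original notebook) of the storage argument, assuming `simple_train_dtm` from the cells above. For this toy matrix the difference is negligible, but the SMS document-term matrix built later has roughly 31 million cells of which only about 55,000 are non-zero.

```python
# bytes used by the sparse CSR representation (data + index arrays)
sparse_bytes = (simple_train_dtm.data.nbytes
                + simple_train_dtm.indices.nbytes
                + simple_train_dtm.indptr.nbytes)

# bytes used by the equivalent dense array
dense_bytes = simple_train_dtm.toarray().nbytes

print(sparse_bytes, dense_bytes)   # sparse storage scales with the non-zeros only
```
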
587 | {
588 | "cell_type": "code",
589 | "execution_count": 19,
590 | "metadata": {
591 | "collapsed": false
592 | },
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
599 | " weights='uniform')"
600 | ]
601 | },
602 | "execution_count": 19,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "# build a model to predict desperation\n",
609 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
610 | "knn.fit(simple_train_dtm, is_desperate)"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": 20,
616 | "metadata": {
617 | "collapsed": true
618 | },
619 | "outputs": [],
620 | "source": [
621 | "# example text for model testing\n",
622 | "simple_test = [\"please don't call me\"]"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 21,
635 | "metadata": {
636 | "collapsed": false
637 | },
638 | "outputs": [
639 | {
640 | "data": {
641 | "text/plain": [
642 | "array([[0, 1, 1, 1, 0, 0]])"
643 | ]
644 | },
645 | "execution_count": 21,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
652 | "simple_test_dtm = vect.transform(simple_test)\n",
653 | "simple_test_dtm.toarray()"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 22,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
693 | "text/plain": [
694 | " cab call me please tonight you\n",
695 | "0 0 1 1 1 0 0"
696 | ]
697 | },
698 | "execution_count": 22,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# examine the vocabulary and document-term matrix together\n",
705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": 23,
711 | "metadata": {
712 | "collapsed": false
713 | },
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/plain": [
718 | "array([1])"
719 | ]
720 | },
721 | "execution_count": 23,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# predict whether simple_test is desperate\n",
728 | "knn.predict(simple_test_dtm)"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "**Summary:**\n",
736 | "\n",
737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
740 | ]
741 | },
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "## Part 3: Reading a text-based dataset into pandas"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 24,
752 | "metadata": {
753 | "collapsed": true
754 | },
755 | "outputs": [],
756 | "source": [
757 | "# read file into pandas from the working directory\n",
758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
759 | ]
760 | },
761 | {
762 | "cell_type": "code",
763 | "execution_count": 25,
764 | "metadata": {
765 | "collapsed": false
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# alternative: read file into pandas from a URL\n",
770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 26,
777 | "metadata": {
778 | "collapsed": false
779 | },
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(5572, 2)"
785 | ]
786 | },
787 | "execution_count": 26,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# examine the shape\n",
794 | "sms.shape"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 27,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
871 | "text/plain": [
872 | " label message\n",
873 | "0 ham Go until jurong point, crazy.. Available only ...\n",
874 | "1 ham Ok lar... Joking wif u oni...\n",
875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
876 | "3 ham U dun say so early hor... U c already then say...\n",
877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
879 | "6 ham Even my brother is not like to speak with me. ...\n",
880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
881 | "8 spam WINNER!! As a valued network customer you have...\n",
882 | "9 spam Had your mobile 11 months or more? U R entitle..."
883 | ]
884 | },
885 | "execution_count": 27,
886 | "metadata": {},
887 | "output_type": "execute_result"
888 | }
889 | ],
890 | "source": [
891 | "# examine the first 10 rows\n",
892 | "sms.head(10)"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 28,
898 | "metadata": {
899 | "collapsed": false
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "ham 4825\n",
906 | "spam 747\n",
907 | "Name: label, dtype: int64"
908 | ]
909 | },
910 | "execution_count": 28,
911 | "metadata": {},
912 | "output_type": "execute_result"
913 | }
914 | ],
915 | "source": [
916 | "# examine the class distribution\n",
917 | "sms.label.value_counts()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 29,
923 | "metadata": {
924 | "collapsed": true
925 | },
926 | "outputs": [],
927 | "source": [
928 | "# convert label to a numerical variable\n",
929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": 30,
935 | "metadata": {
936 | "collapsed": false
937 | },
938 | "outputs": [
939 | {
940 | "data": {
1017 | "text/plain": [
1018 | " label message label_num\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
1020 | "1 ham Ok lar... Joking wif u oni... 0\n",
1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
1022 | "3 ham U dun say so early hor... U c already then say... 0\n",
1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
1025 | "6 ham Even my brother is not like to speak with me. ... 0\n",
1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
1027 | "8 spam WINNER!! As a valued network customer you have... 1\n",
1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
1029 | ]
1030 | },
1031 | "execution_count": 30,
1032 | "metadata": {},
1033 | "output_type": "execute_result"
1034 | }
1035 | ],
1036 | "source": [
1037 | "# check that the conversion worked\n",
1038 | "sms.head(10)"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 31,
1044 | "metadata": {
1045 | "collapsed": false
1046 | },
1047 | "outputs": [
1048 | {
1049 | "name": "stdout",
1050 | "output_type": "stream",
1051 | "text": [
1052 | "(150, 4)\n",
1053 | "(150,)\n"
1054 | ]
1055 | }
1056 | ],
1057 | "source": [
1058 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1059 | "X = iris.data\n",
1060 | "y = iris.target\n",
1061 | "print(X.shape)\n",
1062 | "print(y.shape)"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 32,
1068 | "metadata": {
1069 | "collapsed": false
1070 | },
1071 | "outputs": [
1072 | {
1073 | "name": "stdout",
1074 | "output_type": "stream",
1075 | "text": [
1076 | "(5572,)\n",
1077 | "(5572,)\n"
1078 | ]
1079 | }
1080 | ],
1081 | "source": [
1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1083 | "X = sms.message\n",
1084 | "y = sms.label_num\n",
1085 | "print(X.shape)\n",
1086 | "print(y.shape)"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 33,
1092 | "metadata": {
1093 | "collapsed": false
1094 | },
1095 | "outputs": [
1096 | {
1097 | "name": "stdout",
1098 | "output_type": "stream",
1099 | "text": [
1100 | "(4179,)\n",
1101 | "(1393,)\n",
1102 | "(4179,)\n",
1103 | "(1393,)\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "# split X and y into training and testing sets\n",
1109 | "from sklearn.cross_validation import train_test_split\n",
1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1111 | "print(X_train.shape)\n",
1112 | "print(X_test.shape)\n",
1113 | "print(y_train.shape)\n",
1114 | "print(y_test.shape)"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "## Part 4: Vectorizing our dataset"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 34,
1127 | "metadata": {
1128 | "collapsed": true
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "# instantiate the vectorizer\n",
1133 | "vect = CountVectorizer()"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 35,
1139 | "metadata": {
1140 | "collapsed": true
1141 | },
1142 | "outputs": [],
1143 | "source": [
1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1145 | "vect.fit(X_train)\n",
1146 | "X_train_dtm = vect.transform(X_train)"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": 36,
1152 | "metadata": {
1153 | "collapsed": true
1154 | },
1155 | "outputs": [],
1156 | "source": [
1157 | "# equivalently: combine fit and transform into a single step\n",
1158 | "X_train_dtm = vect.fit_transform(X_train)"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": 37,
1164 | "metadata": {
1165 | "collapsed": false
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1172 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1173 | ]
1174 | },
1175 | "execution_count": 37,
1176 | "metadata": {},
1177 | "output_type": "execute_result"
1178 | }
1179 | ],
1180 | "source": [
1181 | "# examine the document-term matrix\n",
1182 | "X_train_dtm"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {
1189 | "collapsed": false
1190 | },
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/plain": [
1195 | "<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1196 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1197 | ]
1198 | },
1199 | "execution_count": 38,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1206 | "X_test_dtm = vect.transform(X_test)\n",
1207 | "X_test_dtm"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "## Part 5: Building and evaluating a model\n",
1215 | "\n",
1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1217 | "\n",
1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1219 | ]
1220 | },
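
The quote above mentions that fractional counts such as tf-idf may also work in practice. A minimal sketch (not part of the original notebook) of that variant, assuming `X_train`, `X_test`, `y_train` and `y_test` from Part 3; `TfidfVectorizer` simply replaces `CountVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# vectorize with tf-idf weights instead of raw counts
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# the same Naive Bayes model accepts the fractional features
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
print(metrics.accuracy_score(y_test, nb_tfidf.predict(X_test_tfidf)))
```
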
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 39,
1224 | "metadata": {
1225 | "collapsed": true
1226 | },
1227 | "outputs": [],
1228 | "source": [
1229 | "# import and instantiate a Multinomial Naive Bayes model\n",
1230 | "from sklearn.naive_bayes import MultinomialNB\n",
1231 | "nb = MultinomialNB()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 40,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "name": "stdout",
1243 | "output_type": "stream",
1244 | "text": [
1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
1246 | "Wall time: 2.78 ms\n"
1247 | ]
1248 | },
1249 | {
1250 | "data": {
1251 | "text/plain": [
1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1253 | ]
1254 | },
1255 | "execution_count": 40,
1256 | "metadata": {},
1257 | "output_type": "execute_result"
1258 | }
1259 | ],
1260 | "source": [
1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1262 | "%time nb.fit(X_train_dtm, y_train)"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 41,
1268 | "metadata": {
1269 | "collapsed": true
1270 | },
1271 | "outputs": [],
1272 | "source": [
1273 | "# make class predictions for X_test_dtm\n",
1274 | "y_pred_class = nb.predict(X_test_dtm)"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 42,
1280 | "metadata": {
1281 | "collapsed": false
1282 | },
1283 | "outputs": [
1284 | {
1285 | "data": {
1286 | "text/plain": [
1287 | "0.98851399856424982"
1288 | ]
1289 | },
1290 | "execution_count": 42,
1291 | "metadata": {},
1292 | "output_type": "execute_result"
1293 | }
1294 | ],
1295 | "source": [
1296 | "# calculate accuracy of class predictions\n",
1297 | "from sklearn import metrics\n",
1298 | "metrics.accuracy_score(y_test, y_pred_class)"
1299 | ]
1300 | },
1301 | {
1302 | "cell_type": "code",
1303 | "execution_count": 43,
1304 | "metadata": {
1305 | "collapsed": false
1306 | },
1307 | "outputs": [
1308 | {
1309 | "data": {
1310 | "text/plain": [
1311 | "array([[1203, 5],\n",
1312 | " [ 11, 174]])"
1313 | ]
1314 | },
1315 | "execution_count": 43,
1316 | "metadata": {},
1317 | "output_type": "execute_result"
1318 | }
1319 | ],
1320 | "source": [
1321 | "# print the confusion matrix\n",
1322 | "metrics.confusion_matrix(y_test, y_pred_class)"
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 44,
1328 | "metadata": {
1329 | "collapsed": false
1330 | },
1331 | "outputs": [],
1332 | "source": [
1333 | "# print message text for the false positives (ham incorrectly classified as spam)\nX_test[y_pred_class > y_test]"
1334 | ]
1335 | },
1336 | {
1337 | "cell_type": "code",
1338 | "execution_count": 45,
1339 | "metadata": {
1340 | "collapsed": false,
1341 | "scrolled": true
1342 | },
1343 | "outputs": [],
1344 | "source": [
1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\nX_test[y_pred_class < y_test]"
1346 | ]
1347 | },
1348 | {
1349 | "cell_type": "code",
1350 | "execution_count": 46,
1351 | "metadata": {
1352 | "collapsed": false,
1353 | "scrolled": true
1354 | },
1355 | "outputs": [
1356 | {
1357 | "data": {
1358 | "text/plain": [
1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1360 | ]
1361 | },
1362 | "execution_count": 46,
1363 | "metadata": {},
1364 | "output_type": "execute_result"
1365 | }
1366 | ],
1367 | "source": [
1368 | "# example false negative\n",
1369 | "X_test[3132]"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 47,
1375 | "metadata": {
1376 | "collapsed": false
1377 | },
1378 | "outputs": [
1379 | {
1380 | "data": {
1381 | "text/plain": [
1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1384 | ]
1385 | },
1386 | "execution_count": 47,
1387 | "metadata": {},
1388 | "output_type": "execute_result"
1389 | }
1390 | ],
1391 | "source": [
1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1394 | "y_pred_prob"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": 48,
1400 | "metadata": {
1401 | "collapsed": false
1402 | },
1403 | "outputs": [
1404 | {
1405 | "data": {
1406 | "text/plain": [
1407 | "0.98664310005369604"
1408 | ]
1409 | },
1410 | "execution_count": 48,
1411 | "metadata": {},
1412 | "output_type": "execute_result"
1413 | }
1414 | ],
1415 | "source": [
1416 | "# calculate AUC\n",
1417 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "markdown",
1422 | "metadata": {},
1423 | "source": [
1424 | "## Part 6: Comparing models\n",
1425 | "\n",
1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1427 | "\n",
1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1429 | ]
1430 | },
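
A self-contained sketch (not part of the original notebook) of what "linear model for classification" means for text: each token gets one learned weight, and positive weights push a message toward the positive (spam) class. The toy documents, labels, and names below are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_docs = ['win a free prize now', 'free cash prize', 'see you at lunch', 'lunch tomorrow then']
toy_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

v = CountVectorizer()
dtm = v.fit_transform(toy_docs)
clf = LogisticRegression().fit(dtm, toy_labels)

# one weight per token: the decision is a weighted sum of token counts
for token, weight in zip(v.get_feature_names(), clf.coef_[0]):
    print(token, round(weight, 3))
```
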
1431 | {
1432 | "cell_type": "code",
1433 | "execution_count": 49,
1434 | "metadata": {
1435 | "collapsed": true
1436 | },
1437 | "outputs": [],
1438 | "source": [
1439 | "# import and instantiate a logistic regression model\n",
1440 | "from sklearn.linear_model import LogisticRegression\n",
1441 | "logreg = LogisticRegression()"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 50,
1447 | "metadata": {
1448 | "collapsed": false
1449 | },
1450 | "outputs": [
1451 | {
1452 | "name": "stdout",
1453 | "output_type": "stream",
1454 | "text": [
1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n",
1456 | "Wall time: 273 ms\n"
1457 | ]
1458 | },
1459 | {
1460 | "data": {
1461 | "text/plain": [
1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1465 | " verbose=0, warm_start=False)"
1466 | ]
1467 | },
1468 | "execution_count": 50,
1469 | "metadata": {},
1470 | "output_type": "execute_result"
1471 | }
1472 | ],
1473 | "source": [
1474 | "# train the model using X_train_dtm\n",
1475 | "%time logreg.fit(X_train_dtm, y_train)"
1476 | ]
1477 | },
1478 | {
1479 | "cell_type": "code",
1480 | "execution_count": 51,
1481 | "metadata": {
1482 | "collapsed": true
1483 | },
1484 | "outputs": [],
1485 | "source": [
1486 | "# make class predictions for X_test_dtm\n",
1487 | "y_pred_class = logreg.predict(X_test_dtm)"
1488 | ]
1489 | },
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": 52,
1493 | "metadata": {
1494 | "collapsed": false
1495 | },
1496 | "outputs": [
1497 | {
1498 | "data": {
1499 | "text/plain": [
1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1501 | " 0.99725053, 0.00157706])"
1502 | ]
1503 | },
1504 | "execution_count": 52,
1505 | "metadata": {},
1506 | "output_type": "execute_result"
1507 | }
1508 | ],
1509 | "source": [
1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1512 | "y_pred_prob"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "code",
1517 | "execution_count": 53,
1518 | "metadata": {
1519 | "collapsed": false
1520 | },
1521 | "outputs": [
1522 | {
1523 | "data": {
1524 | "text/plain": [
1525 | "0.9877961234745154"
1526 | ]
1527 | },
1528 | "execution_count": 53,
1529 | "metadata": {},
1530 | "output_type": "execute_result"
1531 | }
1532 | ],
1533 | "source": [
1534 | "# calculate accuracy\n",
1535 | "metrics.accuracy_score(y_test, y_pred_class)"
1536 | ]
1537 | },
1538 | {
1539 | "cell_type": "code",
1540 | "execution_count": 54,
1541 | "metadata": {
1542 | "collapsed": false
1543 | },
1544 | "outputs": [
1545 | {
1546 | "data": {
1547 | "text/plain": [
1548 | "0.99368176123143015"
1549 | ]
1550 | },
1551 | "execution_count": 54,
1552 | "metadata": {},
1553 | "output_type": "execute_result"
1554 | }
1555 | ],
1556 | "source": [
1557 | "# calculate AUC\n",
1558 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1559 | ]
1560 | }
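
Not part of the original notebook: a short recap sketch that recomputes both models' predictions side by side, since the cells above reused `y_pred_class` and `y_pred_prob` for each model in turn. Assumes `nb`, `logreg`, `X_test_dtm` and `y_test` from the cells above.

```python
from sklearn import metrics

for name, model in [('Multinomial NB', nb), ('Logistic regression', logreg)]:
    pred_class = model.predict(X_test_dtm)
    pred_prob = model.predict_proba(X_test_dtm)[:, 1]
    print(name,
          '| accuracy:', metrics.accuracy_score(y_test, pred_class),
          '| AUC:', metrics.roc_auc_score(y_test, pred_prob))
```
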
1561 | ],
1562 | "metadata": {
1563 | "kernelspec": {
1564 | "display_name": "Python [conda root]",
1565 | "language": "python",
1566 | "name": "conda-root-py"
1567 | },
1568 | "language_info": {
1569 | "codemirror_mode": {
1570 | "name": "ipython",
1571 | "version": 3
1572 | },
1573 | "file_extension": ".py",
1574 | "mimetype": "text/x-python",
1575 | "name": "python",
1576 | "nbconvert_exporter": "python",
1577 | "pygments_lexer": "ipython3",
1578 | "version": "3.5.2"
1579 | }
1580 | },
1581 | "nbformat": 4,
1582 | "nbformat_minor": 0
1583 | }
1584 |
--------------------------------------------------------------------------------