├── AnaliseTexto
│   ├── Mineração de Textos.pdf
│   ├── .ipynb_checkpoints
│   │   ├── TextVector-checkpoint.ipynb
│   │   └── tutorial-checkpoint.ipynb
│   └── AnaliseDeSentimento.ipynb
├── .idea
│   └── vcs.xml
└── textAnalisis
    └── .ipynb_checkpoints
        ├── TextVector-checkpoint.ipynb
        ├── TextAnalise-Plin-checkpoint.ipynb
        └── tutorial-checkpoint.ipynb
/AnaliseTexto/Mineração de Textos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sandeco/01-AulasDataScience/HEAD/AnaliseTexto/Mineração de Textos.pdf
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/AnaliseTexto/.ipynb_checkpoints/TextVector-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 73,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np\n",
64 | "\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {
70 | "nbpresent": {
71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
72 | }
73 | },
74 | "source": [
75 | "## 1. Definindo um vetor de textos \n",
76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
77 | "pdf's, doc's, twitter's... etc.\n",
78 | "\n",
79 | "Esses textos serão a base de treinamento\n",
80 | "para a classificação do sentimento de um novo texto."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 88,
86 | "metadata": {
87 | "collapsed": false,
88 | "nbpresent": {
89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
90 | }
91 | },
92 | "outputs": [],
93 | "source": [
94 | "train = [\n",
95 | " 'Eu te amo e não existe nada melhor que você',\n",
96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
97 | " 'Eu te odeio muito, você não presta!',\n",
98 | " 'Não gosto de você'\n",
99 | " \n",
100 | " ]\n",
101 | "\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "nbpresent": {
108 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
109 | }
110 | },
111 | "source": [
112 | "## 2. Definindo um vetor de sentimentos\n",
113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
114 | "\n",
115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
116 | "\n",
117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
118 | "\n",
119 | "> 'Eu te amo e não existe nada melhor que você'\n",
120 | "\n",
121 | "Foi classificada como sendo um texto **BOM**:\n",
122 | "\n",
123 | "> 1"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 89,
129 | "metadata": {
130 | "collapsed": true,
131 | "nbpresent": {
132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
133 | }
134 | },
135 | "outputs": [],
136 | "source": [
137 | "felling = [1,1,0,0]"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {
143 | "nbpresent": {
144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
145 | }
146 | },
147 | "source": [
148 | "## 3. Análise de texto com _scikit-learn_.\n",
149 | "\n",
150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
151 | "\n",
152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
153 | "\n",
154 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
155 | "\n",
156 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
157 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
158 | "\n",
159 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 90,
165 | "metadata": {
166 | "collapsed": false,
167 | "nbpresent": {
168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
169 | }
170 | },
171 | "outputs": [],
172 | "source": [
173 | "from sklearn.feature_extraction.text import CountVectorizer\n",
174 | "vect = CountVectorizer()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {
180 | "nbpresent": {
181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
182 | }
183 | },
184 | "source": [
185 | "## 4. Treinamento criando o dicionário.\n",
186 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 96,
192 | "metadata": {
193 | "collapsed": false,
194 | "nbpresent": {
195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
196 | }
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
203 | " dtype=, encoding='utf-8', input='content',\n",
204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
207 | " tokenizer=None, vocabulary=None)"
208 | ]
209 | },
210 | "execution_count": 96,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "vect.fit(train)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
224 | ]
225 | },
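{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens such as 'e'.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te amo e não existe nada melhor que você')"
]
},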
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "nbpresent": {
230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
231 | }
232 | },
233 | "source": [
234 | "## 5. Nosso dicionário\n",
235 | "Aqui vamos listar de forma única\n",
236 | "quais palavras forma utilizadas no texto, formando assim um dicionário de palavras."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 95,
242 | "metadata": {
243 | "collapsed": false,
244 | "nbpresent": {
245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "['algo',\n",
253 | " 'amo',\n",
254 | " 'amor',\n",
255 | " 'ao',\n",
256 | " 'assim',\n",
257 | " 'de',\n",
258 | " 'eu',\n",
259 | " 'existe',\n",
260 | " 'gosto',\n",
261 | " 'melhor',\n",
262 | " 'meu',\n",
263 | " 'mim',\n",
264 | " 'muito',\n",
265 | " 'nada',\n",
266 | " 'não',\n",
267 | " 'odeio',\n",
268 | " 'pra',\n",
269 | " 'presta',\n",
270 | " 'que',\n",
271 | " 'te',\n",
272 | " 'tudo',\n",
273 | " 'você']"
274 | ]
275 | },
276 | "execution_count": 95,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "## examinando o dicionário criado em ordem alfabética.\n",
283 | "vect.get_feature_names()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "## 6. Transformação em matriz esparsa em relação as frases\n",
291 | "Essa transformação é importante porque cria uma matriz onde:\n",
292 | "\n",
293 | "1. Cada linha representa um texto do vetor **_train_** \n",
294 | "2. Cada coluna uma palavra do dicionário aprendido.\n",
295 | "3. Se a palavra ocorrer no texto o valor será 1 caso contrário 0.\n",
296 | "\n",
297 | "\n"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 93,
303 | "metadata": {
304 | "collapsed": false,
305 | "nbpresent": {
306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
307 | }
308 | },
309 | "outputs": [],
310 | "source": [
311 | "simple_train_dtm = vect.transform(text)\n",
312 | "ocorrencias = simple_train_dtm.toarray()"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 94,
318 | "metadata": {
319 | "collapsed": false,
320 | "nbpresent": {
321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d"
322 | }
323 | },
324 | "outputs": [
325 | {
326 | "data": {
327 | "text/plain": [
328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n",
329 | " [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n",
330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n",
331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])"
332 | ]
333 | },
334 | "execution_count": 94,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "ocorrencias"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 56,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " assim | \n",
364 | " eu | \n",
365 | " existe | \n",
366 | " melhor | \n",
367 | " mim | \n",
368 | " muito | \n",
369 | " nada | \n",
370 | " não | \n",
371 | " odeio | \n",
372 | " pra | \n",
373 | " presta | \n",
374 | " que | \n",
375 | " te | \n",
376 | " tudo | \n",
377 | " você | \n",
378 | "
\n",
379 | " \n",
380 | " \n",
381 | " \n",
382 | " | 0 | \n",
383 | " 0 | \n",
384 | " 1 | \n",
385 | " 0 | \n",
386 | " 1 | \n",
387 | " 1 | \n",
388 | " 1 | \n",
389 | " 0 | \n",
390 | " 0 | \n",
391 | " 1 | \n",
392 | " 1 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 1 | \n",
397 | " 1 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | "
\n",
401 | " \n",
402 | " | 1 | \n",
403 | " 1 | \n",
404 | " 0 | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 0 | \n",
408 | " 0 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 1 | \n",
420 | "
\n",
421 | " \n",
422 | " | 2 | \n",
423 | " 0 | \n",
424 | " 0 | \n",
425 | " 0 | \n",
426 | " 1 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 1 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 1 | \n",
434 | " 0 | \n",
435 | " 1 | \n",
436 | " 0 | \n",
437 | " 1 | \n",
438 | " 0 | \n",
439 | " 1 | \n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n",
447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n",
448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n",
449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n",
450 | "\n",
451 | " presta que te tudo você \n",
452 | "0 0 1 1 0 1 \n",
453 | "1 0 0 0 1 1 \n",
454 | "2 1 0 1 0 1 "
455 | ]
456 | },
457 | "execution_count": 56,
458 | "metadata": {},
459 | "output_type": "execute_result"
460 | }
461 | ],
462 | "source": [
463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n",
464 | "df"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 57,
470 | "metadata": {
471 | "collapsed": false,
472 | "nbpresent": {
473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29"
474 | }
475 | },
476 | "outputs": [
477 | {
478 | "data": {
479 | "text/plain": [
480 | "scipy.sparse.csr.csr_matrix"
481 | ]
482 | },
483 | "execution_count": 57,
484 | "metadata": {},
485 | "output_type": "execute_result"
486 | }
487 | ],
488 | "source": [
489 | "type(simple_train_dtm)"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 60,
495 | "metadata": {
496 | "collapsed": false,
497 | "nbpresent": {
498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
499 | }
500 | },
501 | "outputs": [
502 | {
503 | "name": "stdout",
504 | "output_type": "stream",
505 | "text": [
506 | " (0, 1)\t1\n",
507 | " (0, 3)\t1\n",
508 | " (0, 4)\t1\n",
509 | " (0, 5)\t1\n",
510 | " (0, 8)\t1\n",
511 | " (0, 9)\t1\n",
512 | " (0, 13)\t1\n",
513 | " (0, 14)\t1\n",
514 | " (0, 16)\t1\n",
515 | " (1, 0)\t1\n",
516 | " (1, 2)\t1\n",
517 | " (1, 6)\t1\n",
518 | " (1, 11)\t1\n",
519 | " (1, 15)\t1\n",
520 | " (1, 16)\t1\n",
521 | " (2, 3)\t1\n",
522 | " (2, 7)\t1\n",
523 | " (2, 9)\t1\n",
524 | " (2, 10)\t1\n",
525 | " (2, 12)\t1\n",
526 | " (2, 14)\t1\n",
527 | " (2, 16)\t1\n"
528 | ]
529 | }
530 | ],
531 | "source": [
532 | "print(simple_train_dtm)"
533 | ]
534 | },
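{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Sketch: building a model on the document-term matrix\n",
"The original checkpoint stops here, so agenda items 4 and 5 are not covered. As a minimal sketch of item 4, the cell below fits a Naive Bayes classifier on the matrix above together with the **_felling_** labels and then predicts the sentiment of a new sentence. *MultinomialNB* is an assumption on our part (a common choice for count features), not necessarily the model used in the rest of the course material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch (assumption: MultinomialNB; it accepts the sparse count matrix directly).\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"nb = MultinomialNB()\n",
"nb.fit(simple_train_dtm, felling)\n",
"\n",
"# Vectorize a new, hypothetical sentence with the same vocabulary and predict: 1 = GOOD, 0 = BAD.\n",
"novo = ['Eu amo muito você']\n",
"nb.predict(vect.transform(novo))"
]
},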
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "collapsed": true,
540 | "nbpresent": {
541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e"
542 | }
543 | },
544 | "outputs": [],
545 | "source": []
546 | }
547 | ],
548 | "metadata": {
549 | "kernelspec": {
550 | "display_name": "Python [conda root]",
551 | "language": "python",
552 | "name": "conda-root-py"
553 | },
554 | "language_info": {
555 | "codemirror_mode": {
556 | "name": "ipython",
557 | "version": 3
558 | },
559 | "file_extension": ".py",
560 | "mimetype": "text/x-python",
561 | "name": "python",
562 | "nbconvert_exporter": "python",
563 | "pygments_lexer": "ipython3",
564 | "version": "3.5.2"
565 | }
566 | },
567 | "nbformat": 4,
568 | "nbformat_minor": 1
569 | }
570 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/TextVector-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 73,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np\n",
64 | "\n"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {
70 | "nbpresent": {
71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
72 | }
73 | },
74 | "source": [
75 | "## 1. Definindo um vetor de textos \n",
76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
77 | "pdf's, doc's, twitter's... etc.\n",
78 | "\n",
79 | "Esses textos serão a base de treinamento\n",
80 | "para a classificação do sentimento de um novo texto."
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 88,
86 | "metadata": {
87 | "collapsed": false,
88 | "nbpresent": {
89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
90 | }
91 | },
92 | "outputs": [],
93 | "source": [
94 | "train = [\n",
95 | " 'Eu te amo e não existe nada melhor que você',\n",
96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
97 | " 'Eu te odeio muito, você não presta!',\n",
98 | " 'Não gosto de você'\n",
99 | " \n",
100 | " ]\n",
101 | "\n"
102 | ]
103 | },
104 | {
105 | "cell_type": "markdown",
106 | "metadata": {
107 | "nbpresent": {
108 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
109 | }
110 | },
111 | "source": [
112 | "## 2. Definindo um vetor de sentimentos\n",
113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
114 | "\n",
115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
116 | "\n",
117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
118 | "\n",
119 | "> 'Eu te amo e não existe nada melhor que você'\n",
120 | "\n",
121 | "Foi classificada como sendo um texto **BOM**:\n",
122 | "\n",
123 | "> 1"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 89,
129 | "metadata": {
130 | "collapsed": true,
131 | "nbpresent": {
132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
133 | }
134 | },
135 | "outputs": [],
136 | "source": [
137 | "felling = [1,1,0,0]"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {
143 | "nbpresent": {
144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
145 | }
146 | },
147 | "source": [
148 | "## 3. Análise de texto com _scikit-learn_.\n",
149 | "\n",
150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
151 | "\n",
152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
153 | "\n",
154 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
155 | "\n",
156 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
157 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
158 | "\n",
159 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 90,
165 | "metadata": {
166 | "collapsed": false,
167 | "nbpresent": {
168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
169 | }
170 | },
171 | "outputs": [],
172 | "source": [
173 | "from sklearn.feature_extraction.text import CountVectorizer\n",
174 | "vect = CountVectorizer()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {
180 | "nbpresent": {
181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
182 | }
183 | },
184 | "source": [
185 | "## 4. Treinamento criando o dicionário.\n",
186 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 96,
192 | "metadata": {
193 | "collapsed": false,
194 | "nbpresent": {
195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
196 | }
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
203 | " dtype=, encoding='utf-8', input='content',\n",
204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
207 | " tokenizer=None, vocabulary=None)"
208 | ]
209 | },
210 | "execution_count": 96,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "vect.fit(train)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
224 | ]
225 | },
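{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens such as 'e'.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te amo e não existe nada melhor que você')"
]
},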
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "nbpresent": {
230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
231 | }
232 | },
233 | "source": [
234 | "## 5. Nosso dicionário\n",
235 | "Aqui vamos listar de forma única\n",
236 | "quais palavras forma utilizadas no texto, formando assim um dicionário de palavras."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 95,
242 | "metadata": {
243 | "collapsed": false,
244 | "nbpresent": {
245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "['algo',\n",
253 | " 'amo',\n",
254 | " 'amor',\n",
255 | " 'ao',\n",
256 | " 'assim',\n",
257 | " 'de',\n",
258 | " 'eu',\n",
259 | " 'existe',\n",
260 | " 'gosto',\n",
261 | " 'melhor',\n",
262 | " 'meu',\n",
263 | " 'mim',\n",
264 | " 'muito',\n",
265 | " 'nada',\n",
266 | " 'não',\n",
267 | " 'odeio',\n",
268 | " 'pra',\n",
269 | " 'presta',\n",
270 | " 'que',\n",
271 | " 'te',\n",
272 | " 'tudo',\n",
273 | " 'você']"
274 | ]
275 | },
276 | "execution_count": 95,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "## examinando o dicionário criado em ordem alfabética.\n",
283 | "vect.get_feature_names()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "## 6. Transformação em matriz esparsa em relação as frases\n",
291 | "Essa transformação é importante porque cria uma matriz onde:\n",
292 | "\n",
293 | "1. Cada linha representa um texto do vetor **_train_** \n",
294 | "2. Cada coluna uma palavra do dicionário aprendido.\n",
295 | "3. Se a palavra ocorrer no texto o valor será 1 caso contrário 0.\n",
296 | "\n",
297 | "\n"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 93,
303 | "metadata": {
304 | "collapsed": false,
305 | "nbpresent": {
306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
307 | }
308 | },
309 | "outputs": [],
310 | "source": [
311 | "simple_train_dtm = vect.transform(text)\n",
312 | "ocorrencias = simple_train_dtm.toarray()"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 94,
318 | "metadata": {
319 | "collapsed": false,
320 | "nbpresent": {
321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d"
322 | }
323 | },
324 | "outputs": [
325 | {
326 | "data": {
327 | "text/plain": [
328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n",
329 | " [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n",
330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n",
331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])"
332 | ]
333 | },
334 | "execution_count": 94,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "ocorrencias"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 56,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " assim | \n",
364 | " eu | \n",
365 | " existe | \n",
366 | " melhor | \n",
367 | " mim | \n",
368 | " muito | \n",
369 | " nada | \n",
370 | " não | \n",
371 | " odeio | \n",
372 | " pra | \n",
373 | " presta | \n",
374 | " que | \n",
375 | " te | \n",
376 | " tudo | \n",
377 | " você | \n",
378 | "
\n",
379 | " \n",
380 | " \n",
381 | " \n",
382 | " | 0 | \n",
383 | " 0 | \n",
384 | " 1 | \n",
385 | " 0 | \n",
386 | " 1 | \n",
387 | " 1 | \n",
388 | " 1 | \n",
389 | " 0 | \n",
390 | " 0 | \n",
391 | " 1 | \n",
392 | " 1 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 1 | \n",
397 | " 1 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | "
\n",
401 | " \n",
402 | " | 1 | \n",
403 | " 1 | \n",
404 | " 0 | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 0 | \n",
408 | " 0 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 0 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 1 | \n",
420 | "
\n",
421 | " \n",
422 | " | 2 | \n",
423 | " 0 | \n",
424 | " 0 | \n",
425 | " 0 | \n",
426 | " 1 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 1 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 1 | \n",
434 | " 0 | \n",
435 | " 1 | \n",
436 | " 0 | \n",
437 | " 1 | \n",
438 | " 0 | \n",
439 | " 1 | \n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n",
447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n",
448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n",
449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n",
450 | "\n",
451 | " presta que te tudo você \n",
452 | "0 0 1 1 0 1 \n",
453 | "1 0 0 0 1 1 \n",
454 | "2 1 0 1 0 1 "
455 | ]
456 | },
457 | "execution_count": 56,
458 | "metadata": {},
459 | "output_type": "execute_result"
460 | }
461 | ],
462 | "source": [
463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n",
464 | "df"
465 | ]
466 | },
467 | {
468 | "cell_type": "code",
469 | "execution_count": 57,
470 | "metadata": {
471 | "collapsed": false,
472 | "nbpresent": {
473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29"
474 | }
475 | },
476 | "outputs": [
477 | {
478 | "data": {
479 | "text/plain": [
480 | "scipy.sparse.csr.csr_matrix"
481 | ]
482 | },
483 | "execution_count": 57,
484 | "metadata": {},
485 | "output_type": "execute_result"
486 | }
487 | ],
488 | "source": [
489 | "type(simple_train_dtm)"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 60,
495 | "metadata": {
496 | "collapsed": false,
497 | "nbpresent": {
498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
499 | }
500 | },
501 | "outputs": [
502 | {
503 | "name": "stdout",
504 | "output_type": "stream",
505 | "text": [
506 | " (0, 1)\t1\n",
507 | " (0, 3)\t1\n",
508 | " (0, 4)\t1\n",
509 | " (0, 5)\t1\n",
510 | " (0, 8)\t1\n",
511 | " (0, 9)\t1\n",
512 | " (0, 13)\t1\n",
513 | " (0, 14)\t1\n",
514 | " (0, 16)\t1\n",
515 | " (1, 0)\t1\n",
516 | " (1, 2)\t1\n",
517 | " (1, 6)\t1\n",
518 | " (1, 11)\t1\n",
519 | " (1, 15)\t1\n",
520 | " (1, 16)\t1\n",
521 | " (2, 3)\t1\n",
522 | " (2, 7)\t1\n",
523 | " (2, 9)\t1\n",
524 | " (2, 10)\t1\n",
525 | " (2, 12)\t1\n",
526 | " (2, 14)\t1\n",
527 | " (2, 16)\t1\n"
528 | ]
529 | }
530 | ],
531 | "source": [
532 | "print(simple_train_dtm)"
533 | ]
534 | },
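{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Sketch: building a model on the document-term matrix\n",
"The original checkpoint stops here, so agenda items 4 and 5 are not covered. As a minimal sketch of item 4, the cell below fits a Naive Bayes classifier on the matrix above together with the **_felling_** labels and then predicts the sentiment of a new sentence. *MultinomialNB* is an assumption on our part (a common choice for count features), not necessarily the model used in the rest of the course material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch (assumption: MultinomialNB; it accepts the sparse count matrix directly).\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"nb = MultinomialNB()\n",
"nb.fit(simple_train_dtm, felling)\n",
"\n",
"# Vectorize a new, hypothetical sentence with the same vocabulary and predict: 1 = GOOD, 0 = BAD.\n",
"novo = ['Eu amo muito você']\n",
"nb.predict(vect.transform(novo))"
]
},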
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "collapsed": true,
540 | "nbpresent": {
541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e"
542 | }
543 | },
544 | "outputs": [],
545 | "source": []
546 | }
547 | ],
548 | "metadata": {
549 | "kernelspec": {
550 | "display_name": "Python [conda root]",
551 | "language": "python",
552 | "name": "conda-root-py"
553 | },
554 | "language_info": {
555 | "codemirror_mode": {
556 | "name": "ipython",
557 | "version": 3
558 | },
559 | "file_extension": ".py",
560 | "mimetype": "text/x-python",
561 | "name": "python",
562 | "nbconvert_exporter": "python",
563 | "pygments_lexer": "ipython3",
564 | "version": "3.5.2"
565 | }
566 | },
567 | "nbformat": 4,
568 | "nbformat_minor": 1
569 | }
570 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/TextAnalise-Plin-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 19,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import CountVectorizer\n",
12 | "from collections import Counter\n",
13 | "\n",
14 | "vect = CountVectorizer()"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 20,
20 | "metadata": {
21 | "collapsed": true
22 | },
23 | "outputs": [],
24 | "source": [
25 | "text = [\n",
26 | " '#MachineLearning with Text in scikit learn http://buff.ly/2dJINuD #DataScience #IoT #BigData #AI',\n",
27 | " 'How The Internet Of Things Will Impact Your Everyday Life http://buff.ly/2dIUyMO #IoT #DataScience #BigData #MachineLearning',\n",
28 | " 'The best Brazilian Captain passed away this day. Captain of Brazil Team in 1970. #RIPCapita Carlos Alberto Torres.',\n",
29 | " '10 Videos Featuring Data Science Topics. By Vincent http://buff.ly/2eCWIkA #DataScience #BigData #IoT #MachineLearning',\n",
30 | " 'Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders http://buff.ly/2dDSJ3E #DataScience #BigData #IoT #MachineLearning',\n",
31 | " 'Deep Learning with Neural Networks and TensorFlow Introduction - Youtube http://buff.ly/2efTvdQ #DataScience #MachineLearning #IoT #BigData',\n",
32 | " 'Matplotlib Tutorial - a youtube course http://buff.ly/2eBK4AQ #DadaScience #MachineLearning #IoT #BigData',\n",
33 | " 'Kaggle Releases Data Sets About Global Warming: Make your own Predictions – Data Science Central http://buff.ly/2dUFLQf #DataScience #IoT',\n",
34 | " '#MachineLearning as a Service http://buff.ly/2ep1Jjk #BigData #IoT #DataScience',\n",
35 | " '50 Predictions for the Internet of Things in 2016 https://goo.gl/5Zv28z #IoT #BigData #DataScience #MachineLearning',\n",
36 | " 'IoT Programming Languages http://flip.it/wtVufo #IoT #BigData #DataScience',\n",
37 | " 'An Introduction to Variable and Feature Selection #dataScience #IoT #BigData http://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection …',\n",
38 | " 'Use the simulated device to experience the IBM Watson IoT Platform http://buff.ly/2ekeKGi #IoT #BigData #DataScience #DataViz',\n",
39 | " 'Top 10 Data Science and Machine Learning Podcasts http://buff.ly/2erx7cI #MachineLearning',\n",
40 | " 'Adorei esse copão de café. SVM é fantástico algoritmo de #MachineLearning',\n",
41 | " 'IBM Watsons latest gig: Improving cancer treatment with genomic sequencing http://buff.ly/2dZ5lVP #DataScience #MachineLearning #BigData',\n",
42 | " 'An Introduction to Implementing Neural Networks using TensorFlow http://buff.ly/2ervn3s #DataScience #MachineLearning #IoT #BigData',\n",
43 | " 'Oi testa serviço de monitoramento baseado na Internet das Coisas http://buff.ly/2e3gg21 #DataScience #MachineLearning #IoT #BigData',\n",
44 | " 'Moving from R to Python: The Libraries You Need to Know http://buff.ly/2eeUHuE #DataScience #MachineLearning #IoT #BigData',\n",
45 | " 'Internet of Things Articles : IoT startup and smart cam-maker Smartfrog raises further $20M http://buff.ly/2ei1Kky #MachineLearning #IoT',\n",
46 | " 'An overview of gradient descent optimization algorithms http://buff.ly/2dldKVO #DataScience #MachineLearning #IoT #BigData',\n",
47 | " 'Datafloq - 8 Easy Steps to Become a Data Scientist http://buff.ly/2en6TbA #DataScience #IoT #BigData #MachineLearning',\n",
48 | " 'Time to educate teachers about #datascience'\n",
49 | "]"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 21,
55 | "metadata": {
56 | "collapsed": false
57 | },
58 | "outputs": [
59 | {
60 | "data": {
61 | "text/plain": [
62 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
63 | " dtype=, encoding='utf-8', input='content',\n",
64 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
65 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
66 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
67 | " tokenizer=None, vocabulary=None)"
68 | ]
69 | },
70 | "execution_count": 21,
71 | "metadata": {},
72 | "output_type": "execute_result"
73 | }
74 | ],
75 | "source": [
76 | "vect.fit(text)"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 22,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [
86 | {
87 | "data": {
88 | "text/plain": [
89 | "['10',\n",
90 | " '1970',\n",
91 | " '2016',\n",
92 | " '20m',\n",
93 | " '2ddsj3e',\n",
94 | " '2diuymo',\n",
95 | " '2djinud',\n",
96 | " '2dldkvo',\n",
97 | " '2duflqf',\n",
98 | " '2dz5lvp',\n",
99 | " '2e3gg21',\n",
100 | " '2ebk4aq',\n",
101 | " '2ecwika',\n",
102 | " '2eeuhue',\n",
103 | " '2eftvdq',\n",
104 | " '2ei1kky',\n",
105 | " '2ekekgi',\n",
106 | " '2en6tba',\n",
107 | " '2ep1jjk',\n",
108 | " '2ervn3s',\n",
109 | " '2erx7ci',\n",
110 | " '50',\n",
111 | " '5zv28z',\n",
112 | " 'about',\n",
113 | " 'adorei',\n",
114 | " 'ai',\n",
115 | " 'alberto',\n",
116 | " 'algorithms',\n",
117 | " 'algoritmo',\n",
118 | " 'an',\n",
119 | " 'and',\n",
120 | " 'articles',\n",
121 | " 'as',\n",
122 | " 'away',\n",
123 | " 'baseado',\n",
124 | " 'become',\n",
125 | " 'best',\n",
126 | " 'bigdata',\n",
127 | " 'blogs',\n",
128 | " 'brazil',\n",
129 | " 'brazilian',\n",
130 | " 'buff',\n",
131 | " 'by',\n",
132 | " 'café',\n",
133 | " 'cam',\n",
134 | " 'cancer',\n",
135 | " 'captain',\n",
136 | " 'carlos',\n",
137 | " 'central',\n",
138 | " 'coisas',\n",
139 | " 'com',\n",
140 | " 'copão',\n",
141 | " 'course',\n",
142 | " 'dadascience',\n",
143 | " 'das',\n",
144 | " 'data',\n",
145 | " 'datafloq',\n",
146 | " 'datascience',\n",
147 | " 'datasciencecentral',\n",
148 | " 'dataviz',\n",
149 | " 'day',\n",
150 | " 'de',\n",
151 | " 'deep',\n",
152 | " 'descent',\n",
153 | " 'device',\n",
154 | " 'easy',\n",
155 | " 'educate',\n",
156 | " 'esse',\n",
157 | " 'everyday',\n",
158 | " 'experience',\n",
159 | " 'fantástico',\n",
160 | " 'feature',\n",
161 | " 'featuring',\n",
162 | " 'flip',\n",
163 | " 'for',\n",
164 | " 'from',\n",
165 | " 'further',\n",
166 | " 'genomic',\n",
167 | " 'gig',\n",
168 | " 'gl',\n",
169 | " 'global',\n",
170 | " 'goo',\n",
171 | " 'gradient',\n",
172 | " 'how',\n",
173 | " 'http',\n",
174 | " 'https',\n",
175 | " 'ibm',\n",
176 | " 'impact',\n",
177 | " 'implementing',\n",
178 | " 'improving',\n",
179 | " 'in',\n",
180 | " 'insiders',\n",
181 | " 'internet',\n",
182 | " 'interview',\n",
183 | " 'introduction',\n",
184 | " 'iot',\n",
185 | " 'it',\n",
186 | " 'kaggle',\n",
187 | " 'know',\n",
188 | " 'languages',\n",
189 | " 'latest',\n",
190 | " 'learn',\n",
191 | " 'learning',\n",
192 | " 'libraries',\n",
193 | " 'life',\n",
194 | " 'ly',\n",
195 | " 'machine',\n",
196 | " 'machinelearning',\n",
197 | " 'make',\n",
198 | " 'maker',\n",
199 | " 'matplotlib',\n",
200 | " 'monitoramento',\n",
201 | " 'moving',\n",
202 | " 'na',\n",
203 | " 'need',\n",
204 | " 'networks',\n",
205 | " 'neural',\n",
206 | " 'of',\n",
207 | " 'oi',\n",
208 | " 'optimization',\n",
209 | " 'overview',\n",
210 | " 'own',\n",
211 | " 'passed',\n",
212 | " 'platform',\n",
213 | " 'podcasts',\n",
214 | " 'predictions',\n",
215 | " 'preparation',\n",
216 | " 'profiles',\n",
217 | " 'programming',\n",
218 | " 'python',\n",
219 | " 'raises',\n",
220 | " 'releases',\n",
221 | " 'ripcapita',\n",
222 | " 'science',\n",
223 | " 'scientist',\n",
224 | " 'scikit',\n",
225 | " 'selection',\n",
226 | " 'sequencing',\n",
227 | " 'service',\n",
228 | " 'serviço',\n",
229 | " 'sets',\n",
230 | " 'simulated',\n",
231 | " 'smart',\n",
232 | " 'smartfrog',\n",
233 | " 'startup',\n",
234 | " 'steps',\n",
235 | " 'svm',\n",
236 | " 'teachers',\n",
237 | " 'team',\n",
238 | " 'tensorflow',\n",
239 | " 'testa',\n",
240 | " 'text',\n",
241 | " 'the',\n",
242 | " 'things',\n",
243 | " 'this',\n",
244 | " 'time',\n",
245 | " 'tips',\n",
246 | " 'to',\n",
247 | " 'tools',\n",
248 | " 'top',\n",
249 | " 'topics',\n",
250 | " 'torres',\n",
251 | " 'treatment',\n",
252 | " 'tricks',\n",
253 | " 'tutorial',\n",
254 | " 'use',\n",
255 | " 'using',\n",
256 | " 'variable',\n",
257 | " 'videos',\n",
258 | " 'vincent',\n",
259 | " 'warming',\n",
260 | " 'watson',\n",
261 | " 'watsons',\n",
262 | " 'will',\n",
263 | " 'with',\n",
264 | " 'wtvufo',\n",
265 | " 'www',\n",
266 | " 'you',\n",
267 | " 'your',\n",
268 | " 'youtube']"
269 | ]
270 | },
271 | "execution_count": 22,
272 | "metadata": {},
273 | "output_type": "execute_result"
274 | }
275 | ],
276 | "source": [
277 | "vect.get_feature_names()"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 23,
283 | "metadata": {
284 | "collapsed": false
285 | },
286 | "outputs": [
287 | {
288 | "data": {
289 | "text/plain": [
290 | "<23x180 sparse matrix of type ''\n",
291 | "\twith 346 stored elements in Compressed Sparse Row format>"
292 | ]
293 | },
294 | "execution_count": 23,
295 | "metadata": {},
296 | "output_type": "execute_result"
297 | }
298 | ],
299 | "source": [
300 | "simple_train_dtm = vect.transform(text)\n",
301 | "simple_train_dtm"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 24,
307 | "metadata": {
308 | "collapsed": false
309 | },
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "array([[0, 0, 0, ..., 0, 0, 0],\n",
315 | " [0, 0, 0, ..., 0, 1, 0],\n",
316 | " [0, 1, 0, ..., 0, 0, 0],\n",
317 | " ..., \n",
318 | " [0, 0, 0, ..., 0, 0, 0],\n",
319 | " [0, 0, 0, ..., 0, 0, 0],\n",
320 | " [0, 0, 0, ..., 0, 0, 0]])"
321 | ]
322 | },
323 | "execution_count": 24,
324 | "metadata": {},
325 | "output_type": "execute_result"
326 | }
327 | ],
328 | "source": [
329 | "simple_train_dtm.toarray()"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 25,
335 | "metadata": {
336 | "collapsed": false
337 | },
338 | "outputs": [
339 | {
340 | "data": {
341 | "text/plain": [
342 | "['10',\n",
343 | " '1970',\n",
344 | " '2016',\n",
345 | " '20m',\n",
346 | " '2ddsj3e',\n",
347 | " '2diuymo',\n",
348 | " '2djinud',\n",
349 | " '2dldkvo',\n",
350 | " '2duflqf',\n",
351 | " '2dz5lvp',\n",
352 | " '2e3gg21',\n",
353 | " '2ebk4aq',\n",
354 | " '2ecwika',\n",
355 | " '2eeuhue',\n",
356 | " '2eftvdq',\n",
357 | " '2ei1kky',\n",
358 | " '2ekekgi',\n",
359 | " '2en6tba',\n",
360 | " '2ep1jjk',\n",
361 | " '2ervn3s',\n",
362 | " '2erx7ci',\n",
363 | " '50',\n",
364 | " '5zv28z',\n",
365 | " 'about',\n",
366 | " 'adorei',\n",
367 | " 'ai',\n",
368 | " 'alberto',\n",
369 | " 'algorithms',\n",
370 | " 'algoritmo',\n",
371 | " 'an',\n",
372 | " 'and',\n",
373 | " 'articles',\n",
374 | " 'as',\n",
375 | " 'away',\n",
376 | " 'baseado',\n",
377 | " 'become',\n",
378 | " 'best',\n",
379 | " 'bigdata',\n",
380 | " 'blogs',\n",
381 | " 'brazil',\n",
382 | " 'brazilian',\n",
383 | " 'buff',\n",
384 | " 'by',\n",
385 | " 'café',\n",
386 | " 'cam',\n",
387 | " 'cancer',\n",
388 | " 'captain',\n",
389 | " 'carlos',\n",
390 | " 'central',\n",
391 | " 'coisas',\n",
392 | " 'com',\n",
393 | " 'copão',\n",
394 | " 'course',\n",
395 | " 'dadascience',\n",
396 | " 'das',\n",
397 | " 'data',\n",
398 | " 'datafloq',\n",
399 | " 'datascience',\n",
400 | " 'datasciencecentral',\n",
401 | " 'dataviz',\n",
402 | " 'day',\n",
403 | " 'de',\n",
404 | " 'deep',\n",
405 | " 'descent',\n",
406 | " 'device',\n",
407 | " 'easy',\n",
408 | " 'educate',\n",
409 | " 'esse',\n",
410 | " 'everyday',\n",
411 | " 'experience',\n",
412 | " 'fantástico',\n",
413 | " 'feature',\n",
414 | " 'featuring',\n",
415 | " 'flip',\n",
416 | " 'for',\n",
417 | " 'from',\n",
418 | " 'further',\n",
419 | " 'genomic',\n",
420 | " 'gig',\n",
421 | " 'gl',\n",
422 | " 'global',\n",
423 | " 'goo',\n",
424 | " 'gradient',\n",
425 | " 'how',\n",
426 | " 'http',\n",
427 | " 'https',\n",
428 | " 'ibm',\n",
429 | " 'impact',\n",
430 | " 'implementing',\n",
431 | " 'improving',\n",
432 | " 'in',\n",
433 | " 'insiders',\n",
434 | " 'internet',\n",
435 | " 'interview',\n",
436 | " 'introduction',\n",
437 | " 'iot',\n",
438 | " 'it',\n",
439 | " 'kaggle',\n",
440 | " 'know',\n",
441 | " 'languages',\n",
442 | " 'latest',\n",
443 | " 'learn',\n",
444 | " 'learning',\n",
445 | " 'libraries',\n",
446 | " 'life',\n",
447 | " 'ly',\n",
448 | " 'machine',\n",
449 | " 'machinelearning',\n",
450 | " 'make',\n",
451 | " 'maker',\n",
452 | " 'matplotlib',\n",
453 | " 'monitoramento',\n",
454 | " 'moving',\n",
455 | " 'na',\n",
456 | " 'need',\n",
457 | " 'networks',\n",
458 | " 'neural',\n",
459 | " 'of',\n",
460 | " 'oi',\n",
461 | " 'optimization',\n",
462 | " 'overview',\n",
463 | " 'own',\n",
464 | " 'passed',\n",
465 | " 'platform',\n",
466 | " 'podcasts',\n",
467 | " 'predictions',\n",
468 | " 'preparation',\n",
469 | " 'profiles',\n",
470 | " 'programming',\n",
471 | " 'python',\n",
472 | " 'raises',\n",
473 | " 'releases',\n",
474 | " 'ripcapita',\n",
475 | " 'science',\n",
476 | " 'scientist',\n",
477 | " 'scikit',\n",
478 | " 'selection',\n",
479 | " 'sequencing',\n",
480 | " 'service',\n",
481 | " 'serviço',\n",
482 | " 'sets',\n",
483 | " 'simulated',\n",
484 | " 'smart',\n",
485 | " 'smartfrog',\n",
486 | " 'startup',\n",
487 | " 'steps',\n",
488 | " 'svm',\n",
489 | " 'teachers',\n",
490 | " 'team',\n",
491 | " 'tensorflow',\n",
492 | " 'testa',\n",
493 | " 'text',\n",
494 | " 'the',\n",
495 | " 'things',\n",
496 | " 'this',\n",
497 | " 'time',\n",
498 | " 'tips',\n",
499 | " 'to',\n",
500 | " 'tools',\n",
501 | " 'top',\n",
502 | " 'topics',\n",
503 | " 'torres',\n",
504 | " 'treatment',\n",
505 | " 'tricks',\n",
506 | " 'tutorial',\n",
507 | " 'use',\n",
508 | " 'using',\n",
509 | " 'variable',\n",
510 | " 'videos',\n",
511 | " 'vincent',\n",
512 | " 'warming',\n",
513 | " 'watson',\n",
514 | " 'watsons',\n",
515 | " 'will',\n",
516 | " 'with',\n",
517 | " 'wtvufo',\n",
518 | " 'www',\n",
519 | " 'you',\n",
520 | " 'your',\n",
521 | " 'youtube']"
522 | ]
523 | },
524 | "execution_count": 25,
525 | "metadata": {},
526 | "output_type": "execute_result"
527 | }
528 | ],
529 | "source": [
530 | "\n",
531 | "vocab = list(vect.get_feature_names())\n",
532 | "vocab"
533 | ]
534 | },
535 | {
536 | "cell_type": "code",
537 | "execution_count": 36,
538 | "metadata": {
539 | "collapsed": false
540 | },
541 | "outputs": [
542 | {
543 | "data": {
544 | "text/plain": [
545 | "[('iot', 21),\n",
546 | " ('http', 19),\n",
547 | " ('datascience', 18),\n",
548 | " ('machinelearning', 17),\n",
549 | " ('bigdata', 17),\n",
550 | " ('buff', 17),\n",
551 | " ('ly', 17),\n",
552 | " ('to', 8),\n",
553 | " ('the', 7),\n",
554 | " ('and', 6),\n",
555 | " ('data', 6),\n",
556 | " ('an', 5),\n",
557 | " ('of', 5),\n",
558 | " ('internet', 4),\n",
559 | " ('introduction', 4),\n",
560 | " ('with', 4),\n",
561 | " ('de', 3),\n",
562 | " ('things', 3),\n",
563 | " ('in', 3),\n",
564 | " ('science', 3),\n",
565 | " ('neural', 2),\n",
566 | " ('about', 2),\n",
567 | " ('networks', 2),\n",
568 | " ('feature', 2),\n",
569 | " ('tensorflow', 2),\n",
570 | " ('ibm', 2),\n",
571 | " ('variable', 2),\n",
572 | " ('learning', 2),\n",
573 | " ('selection', 2),\n",
574 | " ('youtube', 2),\n",
575 | " ('your', 2),\n",
576 | " ('predictions', 2),\n",
577 | " ('10', 2),\n",
578 | " ('captain', 2),\n",
579 | " ('by', 1),\n",
580 | " ('podcasts', 1),\n",
581 | " ('tools', 1),\n",
582 | " ('team', 1),\n",
583 | " ('text', 1),\n",
584 | " ('genomic', 1),\n",
585 | " ('languages', 1),\n",
586 | " ('esse', 1),\n",
587 | " ('2diuymo', 1),\n",
588 | " ('maker', 1),\n",
589 | " ('libraries', 1),\n",
590 | " ('learn', 1),\n",
591 | " ('interview', 1),\n",
592 | " ('gl', 1),\n",
593 | " ('scientist', 1),\n",
594 | " ('café', 1),\n",
595 | " ('everyday', 1),\n",
596 | " ('2duflqf', 1),\n",
597 | " ('cam', 1),\n",
598 | " ('baseado', 1),\n",
599 | " ('away', 1),\n",
600 | " ('device', 1),\n",
601 | " ('watsons', 1),\n",
602 | " ('improving', 1),\n",
603 | " ('programming', 1),\n",
604 | " ('overview', 1),\n",
605 | " ('warming', 1),\n",
606 | " ('2ecwika', 1),\n",
607 | " ('how', 1),\n",
608 | " ('own', 1),\n",
609 | " ('make', 1),\n",
610 | " ('machine', 1),\n",
611 | " ('steps', 1),\n",
612 | " ('kaggle', 1),\n",
613 | " ('raises', 1),\n",
614 | " ('svm', 1),\n",
615 | " ('vincent', 1),\n",
616 | " ('time', 1),\n",
617 | " ('python', 1),\n",
618 | " ('datasciencecentral', 1),\n",
619 | " ('copão', 1),\n",
620 | " ('best', 1),\n",
621 | " ('need', 1),\n",
622 | " ('datafloq', 1),\n",
623 | " ('das', 1),\n",
624 | " ('2erx7ci', 1),\n",
625 | " ('testa', 1),\n",
626 | " ('flip', 1),\n",
627 | " ('become', 1),\n",
628 | " ('2ekekgi', 1),\n",
629 | " ('fantástico', 1),\n",
630 | " ('platform', 1),\n",
631 | " ('serviço', 1),\n",
632 | " ('smart', 1),\n",
633 | " ('scikit', 1),\n",
634 | " ('tutorial', 1),\n",
635 | " ('cancer', 1),\n",
636 | " ('ai', 1),\n",
637 | " ('top', 1),\n",
638 | " ('2ei1kky', 1),\n",
639 | " ('it', 1),\n",
640 | " ('startup', 1),\n",
641 | " ('sets', 1),\n",
642 | " ('2ep1jjk', 1),\n",
643 | " ('from', 1),\n",
644 | " ('algoritmo', 1),\n",
645 | " ('2eftvdq', 1),\n",
646 | " ('2dz5lvp', 1),\n",
647 | " ('blogs', 1),\n",
648 | " ('50', 1),\n",
649 | " ('easy', 1),\n",
650 | " ('dataviz', 1),\n",
651 | " ('further', 1),\n",
652 | " ('5zv28z', 1),\n",
653 | " ('central', 1),\n",
654 | " ('goo', 1),\n",
655 | " ('topics', 1),\n",
656 | " ('2e3gg21', 1),\n",
657 | " ('preparation', 1),\n",
658 | " ('implementing', 1),\n",
659 | " ('2eeuhue', 1),\n",
660 | " ('descent', 1),\n",
661 | " ('as', 1),\n",
662 | " ('20m', 1),\n",
663 | " ('using', 1),\n",
664 | " ('treatment', 1),\n",
665 | " ('latest', 1),\n",
666 | " ('will', 1),\n",
667 | " ('releases', 1),\n",
668 | " ('monitoramento', 1),\n",
669 | " ('https', 1),\n",
670 | " ('alberto', 1),\n",
671 | " ('watson', 1),\n",
672 | " ('ripcapita', 1),\n",
673 | " ('torres', 1),\n",
674 | " ('course', 1),\n",
675 | " ('featuring', 1),\n",
676 | " ('brazil', 1),\n",
677 | " ('wtvufo', 1),\n",
678 | " ('coisas', 1),\n",
679 | " ('use', 1),\n",
680 | " ('passed', 1),\n",
681 | " ('oi', 1),\n",
682 | " ('optimization', 1),\n",
683 | " ('moving', 1),\n",
684 | " ('com', 1),\n",
685 | " ('know', 1),\n",
686 | " ('simulated', 1),\n",
687 | " ('2ervn3s', 1),\n",
688 | " ('you', 1),\n",
689 | " ('www', 1),\n",
690 | " ('this', 1),\n",
691 | " ('dadascience', 1),\n",
692 | " ('adorei', 1),\n",
693 | " ('educate', 1),\n",
694 | " ('for', 1),\n",
695 | " ('1970', 1),\n",
696 | " ('2en6tba', 1),\n",
697 | " ('teachers', 1),\n",
698 | " ('matplotlib', 1),\n",
699 | " ('global', 1),\n",
700 | " ('sequencing', 1),\n",
701 | " ('life', 1),\n",
702 | " ('2ebk4aq', 1),\n",
703 | " ('insiders', 1),\n",
704 | " ('gig', 1),\n",
705 | " ('carlos', 1),\n",
706 | " ('2016', 1),\n",
707 | " ('impact', 1),\n",
708 | " ('day', 1),\n",
709 | " ('2ddsj3e', 1),\n",
710 | " ('profiles', 1),\n",
711 | " ('experience', 1),\n",
712 | " ('brazilian', 1),\n",
713 | " ('smartfrog', 1),\n",
714 | " ('deep', 1),\n",
715 | " ('gradient', 1),\n",
716 | " ('na', 1),\n",
717 | " ('videos', 1),\n",
718 | " ('service', 1),\n",
719 | " ('tricks', 1),\n",
720 | " ('algorithms', 1),\n",
721 | " ('tips', 1),\n",
722 | " ('2dldkvo', 1),\n",
723 | " ('2djinud', 1),\n",
724 | " ('articles', 1)]"
725 | ]
726 | },
727 | "execution_count": 36,
728 | "metadata": {},
729 | "output_type": "execute_result"
730 | }
731 | ],
732 | "source": [
733 | "counts = simple_train_dtm.sum(axis=0).A1\n",
734 | "\n",
735 | "freq_distribution = Counter(dict(zip(vocab, counts)))\n",
736 | "##print (freq_distribution.most_common(100))\n",
737 | "list(freq_distribution.most_common())\n",
738 | "\n"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {
745 | "collapsed": true
746 | },
747 | "outputs": [],
748 | "source": []
749 | }
750 | ],
751 | "metadata": {
752 | "kernelspec": {
753 | "display_name": "Python [conda root]",
754 | "language": "python",
755 | "name": "conda-root-py"
756 | },
757 | "language_info": {
758 | "codemirror_mode": {
759 | "name": "ipython",
760 | "version": 3
761 | },
762 | "file_extension": ".py",
763 | "mimetype": "text/x-python",
764 | "name": "python",
765 | "nbconvert_exporter": "python",
766 | "pygments_lexer": "ipython3",
767 | "version": "3.5.2"
768 | }
769 | },
770 | "nbformat": 4,
771 | "nbformat_minor": 1
772 | }
773 |
--------------------------------------------------------------------------------
/AnaliseTexto/AnaliseDeSentimento.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Classificação de textos com *scikit-learn*\n",
15 | "por Prof. Sanderson Macedo"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "nbpresent": {
22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3"
23 | },
24 | "slideshow": {
25 | "slide_type": "-"
26 | }
27 | },
28 | "source": [
29 | "
"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {
35 | "nbpresent": {
36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2"
37 | }
38 | },
39 | "source": [
40 | "## Agenda\n",
41 | "\n",
42 | "\n",
43 | "1. Representar um texto como dados numéricos\n",
44 | "2. Ler o *dataset* de texto no Pandas\n",
45 | "2. Vetorizar nossso *dataset*\n",
46 | "4. Construir e avaliar um modelo\n",
47 | "5. Comparar modelos\n"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 353,
53 | "metadata": {
54 | "collapsed": true,
55 | "nbpresent": {
56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c"
57 | }
58 | },
59 | "outputs": [],
60 | "source": [
61 | "##Importando pandas e numpy\n",
62 | "import pandas as pd\n",
63 | "import numpy as np"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {
69 | "nbpresent": {
70 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af"
71 | }
72 | },
73 | "source": [
74 | "## 1. Definindo um vetor de textos \n",
75 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n",
76 | "pdf's, doc's, twitter's... etc.\n",
77 | "\n",
78 | "Esses textos serão a base de treinamento\n",
79 | "para a classificação do sentimento de um novo texto."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 354,
85 | "metadata": {
86 | "collapsed": false,
87 | "nbpresent": {
88 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3"
89 | }
90 | },
91 | "outputs": [],
92 | "source": [
93 | "train = [\n",
94 | " 'Eu te amo',\n",
95 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n",
96 | " 'Eu te odeio muito, você não presta!',\n",
97 | " 'Não gosto de você'\n",
98 | " ]"
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {
104 | "nbpresent": {
105 | "id": "fc1fc669-a603-412e-8855-837d750718ff"
106 | }
107 | },
108 | "source": [
109 | "## 2. Definindo um vetor de sentimentos\n",
110 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n",
111 | "\n",
112 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n",
113 | "\n",
114 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n",
115 | "\n",
116 | "> 'Eu te amo'\n",
117 | "\n",
118 | "Foi classificada como sendo um texto **BOM**:\n",
119 | "\n",
120 | "> 1"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 355,
126 | "metadata": {
127 | "collapsed": true,
128 | "nbpresent": {
129 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42"
130 | }
131 | },
132 | "outputs": [],
133 | "source": [
134 | "felling = [1,1,0,0]"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {
140 | "nbpresent": {
141 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca"
142 | }
143 | },
144 | "source": [
145 | "## 3. Análise de texto com _scikit-learn_.\n",
146 | "\n",
147 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
148 | "\n",
149 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n",
150 | "\n",
151 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n",
152 | "\n",
153 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
154 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n",
155 | "\n",
156 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 356,
162 | "metadata": {
163 | "collapsed": false,
164 | "nbpresent": {
165 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
166 | }
167 | },
168 | "outputs": [],
169 | "source": [
170 | "from sklearn.feature_extraction.text import CountVectorizer\n",
171 | "vect = CountVectorizer()"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {
177 | "nbpresent": {
178 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
179 | }
180 | },
181 | "source": [
182 | "## 4. Treinamento criando o dicionário.\n",
183 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 357,
189 | "metadata": {
190 | "collapsed": false,
191 | "nbpresent": {
192 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
193 | }
194 | },
195 | "outputs": [
196 | {
197 | "data": {
198 | "text/plain": [
199 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
200 | " dtype=, encoding='utf-8', input='content',\n",
201 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
202 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
203 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
204 | " tokenizer=None, vocabulary=None)"
205 | ]
206 | },
207 | "execution_count": 357,
208 | "metadata": {},
209 | "output_type": "execute_result"
210 | }
211 | ],
212 | "source": [
213 | "vect.fit(train)"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. "
221 | ]
222 | },
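{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (an added sketch, not part of the original notebook), `build_analyzer()` returns the tokenizer callable that *CountVectorizer* applies internally, so we can see which tokens survive the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: inspect the default 'word' analyzer.\n",
"# It lowercases the text, strips punctuation and drops single-character tokens.\n",
"analyzer = vect.build_analyzer()\n",
"analyzer('Eu te odeio muito, você não presta!')"
]
},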
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {
226 | "nbpresent": {
227 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
228 | }
229 | },
230 | "source": [
231 | "## 5. Nosso dicionário de palavras\n",
232 | "Aqui vamos listar quais palavras forma utilizadas nos textos de **_train_**, formando nosso dicionário de palavras. Nessa listagem as palavras não se repetem."
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 358,
238 | "metadata": {
239 | "collapsed": false,
240 | "nbpresent": {
241 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
242 | }
243 | },
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "['algo',\n",
249 | " 'amo',\n",
250 | " 'amor',\n",
251 | " 'ao',\n",
252 | " 'assim',\n",
253 | " 'de',\n",
254 | " 'eu',\n",
255 | " 'gosto',\n",
256 | " 'meu',\n",
257 | " 'mim',\n",
258 | " 'muito',\n",
259 | " 'não',\n",
260 | " 'odeio',\n",
261 | " 'pra',\n",
262 | " 'presta',\n",
263 | " 'te',\n",
264 | " 'tudo',\n",
265 | " 'você']"
266 | ]
267 | },
268 | "execution_count": 358,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "## examinando o dicionário criado em ordem alfabética.\n",
275 | "vect.get_feature_names()"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## 6. Criação de uma matriz de ocorrência\n",
283 | "\n",
284 | "\n",
285 | "\n",
286 | "A matriz de ocorrência mostra quantas vezes cada palavra aparece em cada texto passado ao algoritmo que criou o dicionário.\n",
287 | "Essa transformação cria uma matriz onde:\n",
288 | "\n",
289 | "1. Cada linha representa um texto do vetor **_train_** \n",
290 | "2. Cada coluna representa uma palavra do dicionário aprendido.\n",
291 | "3. Cada célula contém a quantidade de vezes que a palavra ocorre no texto; se a palavra não ocorrer, o valor será zero (0).\n",
292 | "\n",
293 | "Por exemplo:\n",
294 | "A primeira linha da matriz é a frase\n",
295 | "\n",
296 | "> Eu te amo\n",
297 | "\n",
298 | "Essa frase tem somente três (3) palavras, **_eu_**, **_te_** e **_amo_**, que serão marcadas na matriz com a quantidade de vezes que ocorrem no texto (nesse caso, **_1_**); as demais palavras do dicionário serão marcadas com zero (0), por não estarem no texto.\n",
299 | "\n",
300 | "A segunda frase\n",
301 | "\n",
302 | "> Você é algo assim... é tudo pra mim. Ao meu amor... Amor!\n",
303 | "\n",
304 | "Nessa frase, a palavra **_amor_** ocorre duas (2) vezes; por isso a terceira coluna tem o valor 2. "
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 359,
310 | "metadata": {
311 | "collapsed": false,
312 | "nbpresent": {
313 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
314 | }
315 | },
316 | "outputs": [
317 | {
318 | "data": {
319 | "text/plain": [
320 | "array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],\n",
321 | " [1, 0, 2, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1],\n",
322 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1],\n",
323 | " [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])"
324 | ]
325 | },
326 | "execution_count": 359,
327 | "metadata": {},
328 | "output_type": "execute_result"
329 | }
330 | ],
331 | "source": [
332 | "simple_train_dtm = vect.transform(train)\n",
333 | "simple_train_dtm.toarray()"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "#### Criando um *dataframe* pandas para visualizar melhor os dados."
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 360,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/html": [
356 | "\n",
357 | "
\n",
358 | " \n",
359 | " \n",
360 | " | \n",
361 | " algo | \n",
362 | " amo | \n",
363 | " amor | \n",
364 | " ao | \n",
365 | " assim | \n",
366 | " de | \n",
367 | " eu | \n",
368 | " gosto | \n",
369 | " meu | \n",
370 | " mim | \n",
371 | " muito | \n",
372 | " não | \n",
373 | " odeio | \n",
374 | " pra | \n",
375 | " presta | \n",
376 | " te | \n",
377 | " tudo | \n",
378 | " você | \n",
379 | "
\n",
380 | " \n",
381 | " \n",
382 | " \n",
383 | " | Eu te amo | \n",
384 | " 0 | \n",
385 | " 1 | \n",
386 | " 0 | \n",
387 | " 0 | \n",
388 | " 0 | \n",
389 | " 0 | \n",
390 | " 1 | \n",
391 | " 0 | \n",
392 | " 0 | \n",
393 | " 0 | \n",
394 | " 0 | \n",
395 | " 0 | \n",
396 | " 0 | \n",
397 | " 0 | \n",
398 | " 0 | \n",
399 | " 1 | \n",
400 | " 0 | \n",
401 | " 0 | \n",
402 | "
\n",
403 | " \n",
404 | " | Você é algo assim... é tudo pra mim. Ao meu amor... Amor! | \n",
405 | " 1 | \n",
406 | " 0 | \n",
407 | " 2 | \n",
408 | " 1 | \n",
409 | " 1 | \n",
410 | " 0 | \n",
411 | " 0 | \n",
412 | " 0 | \n",
413 | " 1 | \n",
414 | " 1 | \n",
415 | " 0 | \n",
416 | " 0 | \n",
417 | " 0 | \n",
418 | " 1 | \n",
419 | " 0 | \n",
420 | " 0 | \n",
421 | " 1 | \n",
422 | " 1 | \n",
423 | "
\n",
424 | " \n",
425 | " | Eu te odeio muito, você não presta! | \n",
426 | " 0 | \n",
427 | " 0 | \n",
428 | " 0 | \n",
429 | " 0 | \n",
430 | " 0 | \n",
431 | " 0 | \n",
432 | " 1 | \n",
433 | " 0 | \n",
434 | " 0 | \n",
435 | " 0 | \n",
436 | " 1 | \n",
437 | " 1 | \n",
438 | " 1 | \n",
439 | " 0 | \n",
440 | " 1 | \n",
441 | " 1 | \n",
442 | " 0 | \n",
443 | " 1 | \n",
444 | "
\n",
445 | " \n",
446 | " | Não gosto de você | \n",
447 | " 0 | \n",
448 | " 0 | \n",
449 | " 0 | \n",
450 | " 0 | \n",
451 | " 0 | \n",
452 | " 1 | \n",
453 | " 0 | \n",
454 | " 1 | \n",
455 | " 0 | \n",
456 | " 0 | \n",
457 | " 0 | \n",
458 | " 1 | \n",
459 | " 0 | \n",
460 | " 0 | \n",
461 | " 0 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 1 | \n",
465 | "
\n",
466 | " \n",
467 | "
\n",
468 | "
"
469 | ],
470 | "text/plain": [
471 | " algo amo amor ao \\\n",
472 | "Eu te amo 0 1 0 0 \n",
473 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 2 1 \n",
474 | "Eu te odeio muito, você não presta! 0 0 0 0 \n",
475 | "Não gosto de você 0 0 0 0 \n",
476 | "\n",
477 | " assim de eu gosto meu \\\n",
478 | "Eu te amo 0 0 1 0 0 \n",
479 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 1 \n",
480 | "Eu te odeio muito, você não presta! 0 0 1 0 0 \n",
481 | "Não gosto de você 0 1 0 1 0 \n",
482 | "\n",
483 | " mim muito não odeio \\\n",
484 | "Eu te amo 0 0 0 0 \n",
485 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 \n",
486 | "Eu te odeio muito, você não presta! 0 1 1 1 \n",
487 | "Não gosto de você 0 0 1 0 \n",
488 | "\n",
489 | " pra presta te tudo \\\n",
490 | "Eu te amo 0 0 1 0 \n",
491 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 1 \n",
492 | "Eu te odeio muito, você não presta! 0 1 1 0 \n",
493 | "Não gosto de você 0 0 0 0 \n",
494 | "\n",
495 | " você \n",
496 | "Eu te amo 0 \n",
497 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 \n",
498 | "Eu te odeio muito, você não presta! 1 \n",
499 | "Não gosto de você 1 "
500 | ]
501 | },
502 | "execution_count": 360,
503 | "metadata": {},
504 | "output_type": "execute_result"
505 | }
506 | ],
507 | "source": [
508 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names(), index=train)\n",
509 | "df"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "## 7. Esparsidade\n",
517 | "A matriz de ocorrência normalmente é muito esparsa, ou seja, tem muitos valores zero. Essa quantidade de zeros aumenta substancialmente o armazenamento e o processamento das informações para a classificação de um novo texto. Portanto, a matriz fica melhor representada registrando apenas as ocorrências diferentes de zero.\n",
518 | "A linha abaixo mostra que a matriz é do tipo esparsa.\n"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": 361,
524 | "metadata": {
525 | "collapsed": false
526 | },
527 | "outputs": [
528 | {
529 | "data": {
530 | "text/plain": [
531 | "scipy.sparse.csr.csr_matrix"
532 | ]
533 | },
534 | "execution_count": 361,
535 | "metadata": {},
536 | "output_type": "execute_result"
537 | }
538 | ],
539 | "source": [
540 | "type(simple_train_dtm)"
541 | ]
542 | },
543 | {
544 | "cell_type": "markdown",
545 | "metadata": {},
546 | "source": [
547 | "O comando abaixo mostra os mesmos valores da matriz de ocorrências de palavras, só que omitindo as não ocorrências (os zeros).\n",
548 | "\n",
549 | "Por exemplo:\n",
550 | "As três (3) primeiras linhas da impressão do comando se referem à frase:\n",
551 | "\n",
552 | "> Eu te amo\n",
553 | "\n",
554 | "(0, 1)\t1 <br>\n",
555 | "(0, 6)\t1 <br>\n",
556 | "(0, 15)\t1 <br>\n",
557 | "\n",
558 | "Essa é a frase zero (0), ou seja, a primeira frase. Os valores 1, 6 e 15 são as posições (colunas) da matriz onde ocorrem as palavras [amo, eu, te] (em ordem alfabética), e os valores 1 são as quantidades de ocorrências de cada palavra nessa frase."
559 | ]
560 | },
561 | {
562 | "cell_type": "code",
563 | "execution_count": 362,
564 | "metadata": {
565 | "collapsed": false,
566 | "nbpresent": {
567 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db"
568 | }
569 | },
570 | "outputs": [
571 | {
572 | "name": "stdout",
573 | "output_type": "stream",
574 | "text": [
575 | " (0, 1)\t1\n",
576 | " (0, 6)\t1\n",
577 | " (0, 15)\t1\n",
578 | " (1, 0)\t1\n",
579 | " (1, 2)\t2\n",
580 | " (1, 3)\t1\n",
581 | " (1, 4)\t1\n",
582 | " (1, 8)\t1\n",
583 | " (1, 9)\t1\n",
584 | " (1, 13)\t1\n",
585 | " (1, 16)\t1\n",
586 | " (1, 17)\t1\n",
587 | " (2, 6)\t1\n",
588 | " (2, 10)\t1\n",
589 | " (2, 11)\t1\n",
590 | " (2, 12)\t1\n",
591 | " (2, 14)\t1\n",
592 | " (2, 15)\t1\n",
593 | " (2, 17)\t1\n",
594 | " (3, 5)\t1\n",
595 | " (3, 7)\t1\n",
596 | " (3, 11)\t1\n",
597 | " (3, 17)\t1\n"
598 | ]
599 | }
600 | ],
601 | "source": [
602 | "print(simple_train_dtm)"
603 | ]
604 | },
605 | {
606 | "cell_type": "markdown",
607 | "metadata": {},
608 | "source": [
609 | "Normalmente muitos documentos usarão somente um pequeno subconjunto das palavras do nosso *dicionário*, por isso a matriz resultante terá muitos valores zero (tipicamente mais de 99% deles).\n",
610 | "\n",
611 | "Por exemplo, um conjunto de **dez mil (10.000)** pequenos textos (tais como e-mails) terá um vocabulário da ordem de **cem mil (100.000)** palavras únicas; porém, cada texto normalmente usará entre **cem (100)** e **mil (1.000)** palavras únicas individualmente.\n",
612 | "\n",
613 | "Visando o armazenamento dessa matriz em memória e a aceleração das operações, as implementações normalmente usam uma representação esparsa, como a disponível no pacote **_scipy.sparse_**. O esboço abaixo mede essa esparsidade na nossa matriz **_simple_train_dtm_**."
614 | ]
615 | },
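  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Esboço ilustrativo: contamos quantas células da matriz **_simple_train_dtm_** são diferentes de zero em relação ao total (os nomes de variáveis abaixo são apenas ilustrativos)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## esboço: medindo a esparsidade (densidade) da matriz de ocorrência\n",
    "n_celulas = simple_train_dtm.shape[0] * simple_train_dtm.shape[1]  # total de células\n",
    "n_nao_zeros = simple_train_dtm.nnz  # valores armazenados (diferentes de zero)\n",
    "\n",
    "print('células no total:', n_celulas)\n",
    "print('valores não nulos:', n_nao_zeros)\n",
    "print('densidade: {:.2%}'.format(n_nao_zeros / float(n_celulas)))"
   ]
  },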
616 | {
617 | "cell_type": "markdown",
618 | "metadata": {},
619 | "source": [
620 | "## 8. Classificações"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "### 8.1 Classificando um novo texto\n",
628 | "\n",
629 | "Nosso objetivo é inferir se um novo texto é **BOM** ou **RUIM**\n",
630 | "tendo como base os textos anteriormente classificados.\n",
631 | "O vetor ***novo_texto*** contém um novo texto que será classificado pelo nosso algoritmo de aprendizagem de máquina.\n",
632 | "\n",
633 | "Basicamente classificaremos o texto com o algoritmo ***KNN***."
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": 372,
639 | "metadata": {
640 | "collapsed": false
641 | },
642 | "outputs": [],
643 | "source": [
644 | "novo_texto = ['te odeio']"
645 | ]
646 | },
647 | {
648 | "cell_type": "markdown",
649 | "metadata": {},
650 | "source": [
651 | "#### Criando a matriz de ocorrência para o novo texto\n",
652 | "A matriz ***simple_test_dtm*** é a que será usada para a nova classificação."
653 | ]
654 | },
655 | {
656 | "cell_type": "code",
657 | "execution_count": 373,
658 | "metadata": {
659 | "collapsed": false
660 | },
661 | "outputs": [
662 | {
663 | "data": {
664 | "text/html": [
665 | "\n",
666 | "
\n",
667 | " \n",
668 | " \n",
669 | " | \n",
670 | " algo | \n",
671 | " amo | \n",
672 | " amor | \n",
673 | " ao | \n",
674 | " assim | \n",
675 | " de | \n",
676 | " eu | \n",
677 | " gosto | \n",
678 | " meu | \n",
679 | " mim | \n",
680 | " muito | \n",
681 | " não | \n",
682 | " odeio | \n",
683 | " pra | \n",
684 | " presta | \n",
685 | " te | \n",
686 | " tudo | \n",
687 | " você | \n",
688 | "
\n",
689 | " \n",
690 | " \n",
691 | " \n",
692 | " | te odeio | \n",
693 | " 0 | \n",
694 | " 0 | \n",
695 | " 0 | \n",
696 | " 0 | \n",
697 | " 0 | \n",
698 | " 0 | \n",
699 | " 0 | \n",
700 | " 0 | \n",
701 | " 0 | \n",
702 | " 0 | \n",
703 | " 0 | \n",
704 | " 0 | \n",
705 | " 1 | \n",
706 | " 0 | \n",
707 | " 0 | \n",
708 | " 1 | \n",
709 | " 0 | \n",
710 | " 0 | \n",
711 | "
\n",
712 | " \n",
713 | "
\n",
714 | "
"
715 | ],
716 | "text/plain": [
717 | " algo amo amor ao assim de eu gosto meu mim muito não \\\n",
718 | "te odeio 0 0 0 0 0 0 0 0 0 0 0 0 \n",
719 | "\n",
720 | " odeio pra presta te tudo você \n",
721 | "te odeio 1 0 0 1 0 0 "
722 | ]
723 | },
724 | "execution_count": 373,
725 | "metadata": {},
726 | "output_type": "execute_result"
727 | }
728 | ],
729 | "source": [
730 | "simple_test_dtm = vect.transform(novo_texto)\n",
731 | "\n",
732 | "##criando a visualização da matriz de ocorrência\n",
733 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names(), index=novo_texto)"
734 | ]
735 | },
736 | {
737 | "cell_type": "markdown",
738 | "metadata": {},
739 | "source": [
740 | "### 8.2 Classificador KNN\n",
741 | "\n",
742 | "Importando o classificador KNN do scikit-learn\n",
743 | "\n",
744 | "Para referência sobre o classificador KNN, você pode acessar o [Wikipedia-KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) e a documentação do [KNN no scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). "
745 | ]
746 | },
747 | {
748 | "cell_type": "code",
749 | "execution_count": 374,
750 | "metadata": {
751 | "collapsed": true
752 | },
753 | "outputs": [],
754 | "source": [
755 | "## importando o classificador\n",
756 | "from sklearn.neighbors import KNeighborsClassifier"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "Treinando o classificador KNN"
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": 375,
769 | "metadata": {
770 | "collapsed": false
771 | },
772 | "outputs": [
773 | {
774 | "data": {
775 | "text/plain": [
776 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
777 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
778 | " weights='uniform')"
779 | ]
780 | },
781 | "execution_count": 375,
782 | "metadata": {},
783 | "output_type": "execute_result"
784 | }
785 | ],
786 | "source": [
787 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
788 | "knn.fit(simple_train_dtm, felling)"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "### 8.3 Gerando uma classificação\n",
796 | "Para isso utiliza-se o método ***predict()*** do classificador"
797 | ]
798 | },
799 | {
800 | "cell_type": "code",
801 | "execution_count": 376,
802 | "metadata": {
803 | "collapsed": false
804 | },
805 | "outputs": [
806 | {
807 | "data": {
808 | "text/plain": [
809 | "1"
810 | ]
811 | },
812 | "execution_count": 376,
813 | "metadata": {},
814 | "output_type": "execute_result"
815 | }
816 | ],
817 | "source": [
818 | "fell = knn.predict(simple_test_dtm)[0]\n",
819 | "fell"
820 | ]
821 | },
822 | {
823 | "cell_type": "code",
824 | "execution_count": 377,
825 | "metadata": {
826 | "collapsed": false
827 | },
828 | "outputs": [
829 | {
830 | "name": "stdout",
831 | "output_type": "stream",
832 | "text": [
833 | "Bom sentimento\n"
834 | ]
835 | }
836 | ],
837 | "source": [
838 | "if fell==1:\n",
839 | " print(\"Bom sentimento\")\n",
840 | "else:\n",
841 | " print(\"Mal sentimento\")"
842 | ]
843 | },
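  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Como esboço adicional, podemos classificar vários textos novos de uma só vez, reutilizando o vocabulário já aprendido por **_vect_** e o classificador **_knn_** treinado acima (as frases abaixo são hipotéticas, apenas para demonstração)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "## esboço: classificando vários textos novos de uma só vez (frases hipotéticas)\n",
    "novos_textos = ['você é tudo pra mim', 'não gosto de você']\n",
    "\n",
    "novos_dtm = vect.transform(novos_textos)  # usa o vocabulário já aprendido\n",
    "predicoes = knn.predict(novos_dtm)\n",
    "\n",
    "for texto, p in zip(novos_textos, predicoes):\n",
    "    print(texto, '->', 'BOM' if p == 1 else 'RUIM')"
   ]
  },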
844 | {
845 | "cell_type": "code",
846 | "execution_count": null,
847 | "metadata": {
848 | "collapsed": true
849 | },
850 | "outputs": [],
851 | "source": [
852 | ""
853 | ]
854 | },
855 | {
856 | "cell_type": "code",
857 | "execution_count": null,
858 | "metadata": {
859 | "collapsed": true
860 | },
861 | "outputs": [],
862 | "source": [
863 | ""
864 | ]
865 | },
866 | {
867 | "cell_type": "code",
868 | "execution_count": 369,
869 | "metadata": {
870 | "collapsed": true
871 | },
872 | "outputs": [],
873 | "source": [
874 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
875 | ]
876 | },
877 | {
878 | "cell_type": "code",
879 | "execution_count": 370,
880 | "metadata": {
881 | "collapsed": false
882 | },
883 | "outputs": [
884 | {
885 | "data": {
886 | "text/html": [
887 | "\n",
888 | "
\n",
889 | " \n",
890 | " \n",
891 | " | \n",
892 | " label | \n",
893 | " message | \n",
894 | "
\n",
895 | " \n",
896 | " \n",
897 | " \n",
898 | " | 0 | \n",
899 | " ham | \n",
900 | " Go until jurong point, crazy.. Available only ... | \n",
901 | "
\n",
902 | " \n",
903 | " | 1 | \n",
904 | " ham | \n",
905 | " Ok lar... Joking wif u oni... | \n",
906 | "
\n",
907 | " \n",
908 | " | 2 | \n",
909 | " spam | \n",
910 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
911 | "
\n",
912 | " \n",
913 | " | 3 | \n",
914 | " ham | \n",
915 | " U dun say so early hor... U c already then say... | \n",
916 | "
\n",
917 | " \n",
918 | " | 4 | \n",
919 | " ham | \n",
920 | " Nah I don't think he goes to usf, he lives aro... | \n",
921 | "
\n",
922 | " \n",
923 | " | 5 | \n",
924 | " spam | \n",
925 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
926 | "
\n",
927 | " \n",
928 | " | 6 | \n",
929 | " ham | \n",
930 | " Even my brother is not like to speak with me. ... | \n",
931 | "
\n",
932 | " \n",
933 | " | 7 | \n",
934 | " ham | \n",
935 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
936 | "
\n",
937 | " \n",
938 | " | 8 | \n",
939 | " spam | \n",
940 | " WINNER!! As a valued network customer you have... | \n",
941 | "
\n",
942 | " \n",
943 | " | 9 | \n",
944 | " spam | \n",
945 | " Had your mobile 11 months or more? U R entitle... | \n",
946 | "
\n",
947 | " \n",
948 | "
\n",
949 | "
"
950 | ],
951 | "text/plain": [
952 | " label message\n",
953 | "0 ham Go until jurong point, crazy.. Available only ...\n",
954 | "1 ham Ok lar... Joking wif u oni...\n",
955 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
956 | "3 ham U dun say so early hor... U c already then say...\n",
957 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
958 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
959 | "6 ham Even my brother is not like to speak with me. ...\n",
960 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
961 | "8 spam WINNER!! As a valued network customer you have...\n",
962 | "9 spam Had your mobile 11 months or more? U R entitle..."
963 | ]
964 | },
965 | "execution_count": 370,
966 | "metadata": {},
967 | "output_type": "execute_result"
968 | }
969 | ],
970 | "source": [
971 | "sms.head(10)"
972 | ]
973 | },
974 | {
975 | "cell_type": "code",
976 | "execution_count": 371,
977 | "metadata": {
978 | "collapsed": false
979 | },
980 | "outputs": [
981 | {
982 | "data": {
983 | "text/plain": [
984 | "ham 4825\n",
985 | "spam 747\n",
986 | "Name: label, dtype: int64"
987 | ]
988 | },
989 | "execution_count": 371,
990 | "metadata": {},
991 | "output_type": "execute_result"
992 | }
993 | ],
994 | "source": [
995 | "sms.label.value_counts()"
996 | ]
997 | },
998 | {
999 | "cell_type": "code",
1000 | "execution_count": null,
1001 | "metadata": {
1002 | "collapsed": true
1003 | },
1004 | "outputs": [],
1005 | "source": [
1006 | ""
1007 | ]
1008 | }
1009 | ],
1010 | "metadata": {
1011 | "kernelspec": {
1012 | "display_name": "Python [conda root]",
1013 | "language": "python",
1014 | "name": "conda-root-py"
1015 | },
1016 | "language_info": {
1017 | "codemirror_mode": {
1018 | "name": "ipython",
1019 | "version": 3.0
1020 | },
1021 | "file_extension": ".py",
1022 | "mimetype": "text/x-python",
1023 | "name": "python",
1024 | "nbconvert_exporter": "python",
1025 | "pygments_lexer": "ipython3",
1026 | "version": "3.5.2"
1027 | }
1028 | },
1029 | "nbformat": 4,
1030 | "nbformat_minor": 0
1031 | }
--------------------------------------------------------------------------------
/AnaliseTexto/.ipynb_checkpoints/tutorial-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 3,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "(150, 4)\n",
88 | "(150,)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "# check the shapes of X and y\n",
94 | "print(X.shape)\n",
95 | "print(y.shape)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "**\"Observations\"** are also known as samples, instances, or records."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "data": {
114 | "text/html": [
115 | "\n",
116 | "
\n",
117 | " \n",
118 | " \n",
119 | " | \n",
120 | " sepal length (cm) | \n",
121 | " sepal width (cm) | \n",
122 | " petal length (cm) | \n",
123 | " petal width (cm) | \n",
124 | "
\n",
125 | " \n",
126 | " \n",
127 | " \n",
128 | " | 0 | \n",
129 | " 5.1 | \n",
130 | " 3.5 | \n",
131 | " 1.4 | \n",
132 | " 0.2 | \n",
133 | "
\n",
134 | " \n",
135 | " | 1 | \n",
136 | " 4.9 | \n",
137 | " 3.0 | \n",
138 | " 1.4 | \n",
139 | " 0.2 | \n",
140 | "
\n",
141 | " \n",
142 | " | 2 | \n",
143 | " 4.7 | \n",
144 | " 3.2 | \n",
145 | " 1.3 | \n",
146 | " 0.2 | \n",
147 | "
\n",
148 | " \n",
149 | " | 3 | \n",
150 | " 4.6 | \n",
151 | " 3.1 | \n",
152 | " 1.5 | \n",
153 | " 0.2 | \n",
154 | "
\n",
155 | " \n",
156 | " | 4 | \n",
157 | " 5.0 | \n",
158 | " 3.6 | \n",
159 | " 1.4 | \n",
160 | " 0.2 | \n",
161 | "
\n",
162 | " \n",
163 | "
\n",
164 | "
"
165 | ],
166 | "text/plain": [
167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
168 | "0 5.1 3.5 1.4 0.2\n",
169 | "1 4.9 3.0 1.4 0.2\n",
170 | "2 4.7 3.2 1.3 0.2\n",
171 | "3 4.6 3.1 1.5 0.2\n",
172 | "4 5.0 3.6 1.4 0.2"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
182 | "import pandas as pd\n",
183 | "pd.DataFrame(X, columns=iris.feature_names).head()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
201 | " 2 2]\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# examine the response vector\n",
207 | "print(y)"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
229 | " weights='uniform')"
230 | ]
231 | },
232 | "execution_count": 7,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "# import the class\n",
239 | "from sklearn.neighbors import KNeighborsClassifier\n",
240 | "\n",
241 | "# instantiate the model (with the default parameters)\n",
242 | "knn = KNeighborsClassifier()\n",
243 | "\n",
244 | "# fit the model with data (occurs in-place)\n",
245 | "knn.fit(X, y)"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 8,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "data": {
264 | "text/plain": [
265 | "array([1])"
266 | ]
267 | },
268 | "execution_count": 8,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "# predict the response for a new observation\n",
275 | "knn.predict([[3, 5, 4, 2]])"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## Part 2: Representing text as numerical data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 9,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# example text for model training (SMS messages)\n",
294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# example response vector\n",
306 | "is_desperate = [0, 0, 1]"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
314 | "\n",
315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
316 | "\n",
317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 11,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "# import and instantiate CountVectorizer (with the default parameters)\n",
329 | "from sklearn.feature_extraction.text import CountVectorizer\n",
330 | "vect = CountVectorizer()"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 12,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
344 | "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
348 | " tokenizer=None, vocabulary=None)"
349 | ]
350 | },
351 | "execution_count": 12,
352 | "metadata": {},
353 | "output_type": "execute_result"
354 | }
355 | ],
356 | "source": [
357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
358 | "vect.fit(simple_train)"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 13,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']"
372 | ]
373 | },
374 | "execution_count": 13,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "# examine the fitted vocabulary\n",
381 | "vect.get_feature_names()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 14,
387 | "metadata": {
388 | "collapsed": false
389 | },
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
395 | "\twith 9 stored elements in Compressed Sparse Row format>"
396 | ]
397 | },
398 | "execution_count": 14,
399 | "metadata": {},
400 | "output_type": "execute_result"
401 | }
402 | ],
403 | "source": [
404 | "# transform training data into a 'document-term matrix'\n",
405 | "simple_train_dtm = vect.transform(simple_train)\n",
406 | "simple_train_dtm"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 15,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "array([[0, 1, 0, 0, 1, 1],\n",
420 | " [1, 1, 1, 0, 0, 0],\n",
421 | " [0, 1, 1, 2, 0, 0]])"
422 | ]
423 | },
424 | "execution_count": 15,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# convert sparse matrix to a dense matrix\n",
431 | "simple_train_dtm.toarray()"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/html": [
444 | "\n",
445 | "
\n",
446 | " \n",
447 | " \n",
448 | " | \n",
449 | " cab | \n",
450 | " call | \n",
451 | " me | \n",
452 | " please | \n",
453 | " tonight | \n",
454 | " you | \n",
455 | "
\n",
456 | " \n",
457 | " \n",
458 | " \n",
459 | " | 0 | \n",
460 | " 0 | \n",
461 | " 1 | \n",
462 | " 0 | \n",
463 | " 0 | \n",
464 | " 1 | \n",
465 | " 1 | \n",
466 | "
\n",
467 | " \n",
468 | " | 1 | \n",
469 | " 1 | \n",
470 | " 1 | \n",
471 | " 1 | \n",
472 | " 0 | \n",
473 | " 0 | \n",
474 | " 0 | \n",
475 | "
\n",
476 | " \n",
477 | " | 2 | \n",
478 | " 0 | \n",
479 | " 1 | \n",
480 | " 1 | \n",
481 | " 2 | \n",
482 | " 0 | \n",
483 | " 0 | \n",
484 | "
\n",
485 | " \n",
486 | "
\n",
487 | "
"
488 | ],
489 | "text/plain": [
490 | " cab call me please tonight you\n",
491 | "0 0 1 0 0 1 1\n",
492 | "1 1 1 1 0 0 0\n",
493 | "2 0 1 1 2 0 0"
494 | ]
495 | },
496 | "execution_count": 16,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# examine the vocabulary and document-term matrix together\n",
503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
511 | "\n",
512 | "> In this scheme, features and samples are defined as follows:\n",
513 | "\n",
514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
516 | "\n",
517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
518 | "\n",
519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": 17,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "scipy.sparse.csr.csr_matrix"
533 | ]
534 | },
535 | "execution_count": 17,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "# check the type of the document-term matrix\n",
542 | "type(simple_train_dtm)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 18,
548 | "metadata": {
549 | "collapsed": false,
550 | "scrolled": true
551 | },
552 | "outputs": [
553 | {
554 | "name": "stdout",
555 | "output_type": "stream",
556 | "text": [
557 | " (0, 1)\t1\n",
558 | " (0, 4)\t1\n",
559 | " (0, 5)\t1\n",
560 | " (1, 0)\t1\n",
561 | " (1, 1)\t1\n",
562 | " (1, 2)\t1\n",
563 | " (2, 1)\t1\n",
564 | " (2, 2)\t1\n",
565 | " (2, 3)\t2\n"
566 | ]
567 | }
568 | ],
569 | "source": [
570 | "# examine the sparse matrix contents\n",
571 | "print(simple_train_dtm)"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
579 | "\n",
580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
581 | "\n",
582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
583 | "\n",
584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": 19,
590 | "metadata": {
591 | "collapsed": false
592 | },
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
599 | " weights='uniform')"
600 | ]
601 | },
602 | "execution_count": 19,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "# build a model to predict desperation\n",
609 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
610 | "knn.fit(simple_train_dtm, is_desperate)"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": 20,
616 | "metadata": {
617 | "collapsed": true
618 | },
619 | "outputs": [],
620 | "source": [
621 | "# example text for model testing\n",
622 | "simple_test = [\"please don't call me\"]"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 21,
635 | "metadata": {
636 | "collapsed": false
637 | },
638 | "outputs": [
639 | {
640 | "data": {
641 | "text/plain": [
642 | "array([[0, 1, 1, 1, 0, 0]])"
643 | ]
644 | },
645 | "execution_count": 21,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
652 | "simple_test_dtm = vect.transform(simple_test)\n",
653 | "simple_test_dtm.toarray()"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 22,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
665 | "text/html": [
666 | "\n",
667 | "
\n",
668 | " \n",
669 | " \n",
670 | " | \n",
671 | " cab | \n",
672 | " call | \n",
673 | " me | \n",
674 | " please | \n",
675 | " tonight | \n",
676 | " you | \n",
677 | "
\n",
678 | " \n",
679 | " \n",
680 | " \n",
681 | " | 0 | \n",
682 | " 0 | \n",
683 | " 1 | \n",
684 | " 1 | \n",
685 | " 1 | \n",
686 | " 0 | \n",
687 | " 0 | \n",
688 | "
\n",
689 | " \n",
690 | "
\n",
691 | "
"
692 | ],
693 | "text/plain": [
694 | " cab call me please tonight you\n",
695 | "0 0 1 1 1 0 0"
696 | ]
697 | },
698 | "execution_count": 22,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# examine the vocabulary and document-term matrix together\n",
705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": 23,
711 | "metadata": {
712 | "collapsed": false
713 | },
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/plain": [
718 | "array([1])"
719 | ]
720 | },
721 | "execution_count": 23,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# predict whether simple_test is desperate\n",
728 | "knn.predict(simple_test_dtm)"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "**Summary:**\n",
736 | "\n",
737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
740 | ]
741 | },
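  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch of the last point (using a made-up test message), transforming text that contains words outside the fitted vocabulary simply drops those unseen words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# sketch: tokens not seen during fit are ignored by transform\n",
    "unseen_test = ['please call me about the meeting tonight']  # 'about', 'the', 'meeting' are not in the vocabulary\n",
    "pd.DataFrame(vect.transform(unseen_test).toarray(), columns=vect.get_feature_names())"
   ]
  },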
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "## Part 3: Reading a text-based dataset into pandas"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 24,
752 | "metadata": {
753 | "collapsed": true
754 | },
755 | "outputs": [],
756 | "source": [
757 | "# read file into pandas from the working directory\n",
758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
759 | ]
760 | },
761 | {
762 | "cell_type": "code",
763 | "execution_count": 25,
764 | "metadata": {
765 | "collapsed": false
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# alternative: read file into pandas from a URL\n",
770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 26,
777 | "metadata": {
778 | "collapsed": false
779 | },
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(5572, 2)"
785 | ]
786 | },
787 | "execution_count": 26,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# examine the shape\n",
794 | "sms.shape"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 27,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
806 | "text/html": [
807 | "\n",
808 | "
\n",
809 | " \n",
810 | " \n",
811 | " | \n",
812 | " label | \n",
813 | " message | \n",
814 | "
\n",
815 | " \n",
816 | " \n",
817 | " \n",
818 | " | 0 | \n",
819 | " ham | \n",
820 | " Go until jurong point, crazy.. Available only ... | \n",
821 | "
\n",
822 | " \n",
823 | " | 1 | \n",
824 | " ham | \n",
825 | " Ok lar... Joking wif u oni... | \n",
826 | "
\n",
827 | " \n",
828 | " | 2 | \n",
829 | " spam | \n",
830 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
831 | "
\n",
832 | " \n",
833 | " | 3 | \n",
834 | " ham | \n",
835 | " U dun say so early hor... U c already then say... | \n",
836 | "
\n",
837 | " \n",
838 | " | 4 | \n",
839 | " ham | \n",
840 | " Nah I don't think he goes to usf, he lives aro... | \n",
841 | "
\n",
842 | " \n",
843 | " | 5 | \n",
844 | " spam | \n",
845 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
846 | "
\n",
847 | " \n",
848 | " | 6 | \n",
849 | " ham | \n",
850 | " Even my brother is not like to speak with me. ... | \n",
851 | "
\n",
852 | " \n",
853 | " | 7 | \n",
854 | " ham | \n",
855 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
856 | "
\n",
857 | " \n",
858 | " | 8 | \n",
859 | " spam | \n",
860 | " WINNER!! As a valued network customer you have... | \n",
861 | "
\n",
862 | " \n",
863 | " | 9 | \n",
864 | " spam | \n",
865 | " Had your mobile 11 months or more? U R entitle... | \n",
866 | "
\n",
867 | " \n",
868 | "
\n",
869 | "
"
870 | ],
871 | "text/plain": [
872 | " label message\n",
873 | "0 ham Go until jurong point, crazy.. Available only ...\n",
874 | "1 ham Ok lar... Joking wif u oni...\n",
875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
876 | "3 ham U dun say so early hor... U c already then say...\n",
877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
879 | "6 ham Even my brother is not like to speak with me. ...\n",
880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
881 | "8 spam WINNER!! As a valued network customer you have...\n",
882 | "9 spam Had your mobile 11 months or more? U R entitle..."
883 | ]
884 | },
885 | "execution_count": 27,
886 | "metadata": {},
887 | "output_type": "execute_result"
888 | }
889 | ],
890 | "source": [
891 | "# examine the first 10 rows\n",
892 | "sms.head(10)"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 28,
898 | "metadata": {
899 | "collapsed": false
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "ham 4825\n",
906 | "spam 747\n",
907 | "Name: label, dtype: int64"
908 | ]
909 | },
910 | "execution_count": 28,
911 | "metadata": {},
912 | "output_type": "execute_result"
913 | }
914 | ],
915 | "source": [
916 | "# examine the class distribution\n",
917 | "sms.label.value_counts()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 29,
923 | "metadata": {
924 | "collapsed": true
925 | },
926 | "outputs": [],
927 | "source": [
928 | "# convert label to a numerical variable\n",
929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": 30,
935 | "metadata": {
936 | "collapsed": false
937 | },
938 | "outputs": [
939 | {
940 | "data": {
941 | "text/html": [
942 | "\n",
943 | "
\n",
944 | " \n",
945 | " \n",
946 | " | \n",
947 | " label | \n",
948 | " message | \n",
949 | " label_num | \n",
950 | "
\n",
951 | " \n",
952 | " \n",
953 | " \n",
954 | " | 0 | \n",
955 | " ham | \n",
956 | " Go until jurong point, crazy.. Available only ... | \n",
957 | " 0 | \n",
958 | "
\n",
959 | " \n",
960 | " | 1 | \n",
961 | " ham | \n",
962 | " Ok lar... Joking wif u oni... | \n",
963 | " 0 | \n",
964 | "
\n",
965 | " \n",
966 | " | 2 | \n",
967 | " spam | \n",
968 | " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
969 | " 1 | \n",
970 | "
\n",
971 | " \n",
972 | " | 3 | \n",
973 | " ham | \n",
974 | " U dun say so early hor... U c already then say... | \n",
975 | " 0 | \n",
976 | "
\n",
977 | " \n",
978 | " | 4 | \n",
979 | " ham | \n",
980 | " Nah I don't think he goes to usf, he lives aro... | \n",
981 | " 0 | \n",
982 | "
\n",
983 | " \n",
984 | " | 5 | \n",
985 | " spam | \n",
986 | " FreeMsg Hey there darling it's been 3 week's n... | \n",
987 | " 1 | \n",
988 | "
\n",
989 | " \n",
990 | " | 6 | \n",
991 | " ham | \n",
992 | " Even my brother is not like to speak with me. ... | \n",
993 | " 0 | \n",
994 | "
\n",
995 | " \n",
996 | " | 7 | \n",
997 | " ham | \n",
998 | " As per your request 'Melle Melle (Oru Minnamin... | \n",
999 | " 0 | \n",
1000 | "
\n",
1001 | " \n",
1002 | " | 8 | \n",
1003 | " spam | \n",
1004 | " WINNER!! As a valued network customer you have... | \n",
1005 | " 1 | \n",
1006 | "
\n",
1007 | " \n",
1008 | " | 9 | \n",
1009 | " spam | \n",
1010 | " Had your mobile 11 months or more? U R entitle... | \n",
1011 | " 1 | \n",
1012 | "
\n",
1013 | " \n",
1014 | "
\n",
1015 | "
"
1016 | ],
1017 | "text/plain": [
1018 | " label message label_num\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
1020 | "1 ham Ok lar... Joking wif u oni... 0\n",
1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
1022 | "3 ham U dun say so early hor... U c already then say... 0\n",
1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
1025 | "6 ham Even my brother is not like to speak with me. ... 0\n",
1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
1027 | "8 spam WINNER!! As a valued network customer you have... 1\n",
1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
1029 | ]
1030 | },
1031 | "execution_count": 30,
1032 | "metadata": {},
1033 | "output_type": "execute_result"
1034 | }
1035 | ],
1036 | "source": [
1037 | "# check that the conversion worked\n",
1038 | "sms.head(10)"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 31,
1044 | "metadata": {
1045 | "collapsed": false
1046 | },
1047 | "outputs": [
1048 | {
1049 | "name": "stdout",
1050 | "output_type": "stream",
1051 | "text": [
1052 | "(150, 4)\n",
1053 | "(150,)\n"
1054 | ]
1055 | }
1056 | ],
1057 | "source": [
1058 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1059 | "X = iris.data\n",
1060 | "y = iris.target\n",
1061 | "print(X.shape)\n",
1062 | "print(y.shape)"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 32,
1068 | "metadata": {
1069 | "collapsed": false
1070 | },
1071 | "outputs": [
1072 | {
1073 | "name": "stdout",
1074 | "output_type": "stream",
1075 | "text": [
1076 | "(5572,)\n",
1077 | "(5572,)\n"
1078 | ]
1079 | }
1080 | ],
1081 | "source": [
1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1083 | "X = sms.message\n",
1084 | "y = sms.label_num\n",
1085 | "print(X.shape)\n",
1086 | "print(y.shape)"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 33,
1092 | "metadata": {
1093 | "collapsed": false
1094 | },
1095 | "outputs": [
1096 | {
1097 | "name": "stdout",
1098 | "output_type": "stream",
1099 | "text": [
1100 | "(4179,)\n",
1101 | "(1393,)\n",
1102 | "(4179,)\n",
1103 | "(1393,)\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "# split X and y into training and testing sets\n",
1109 | "from sklearn.cross_validation import train_test_split\n",
1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1111 | "print(X_train.shape)\n",
1112 | "print(X_test.shape)\n",
1113 | "print(y_train.shape)\n",
1114 | "print(y_test.shape)"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "## Part 4: Vectorizing our dataset"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 34,
1127 | "metadata": {
1128 | "collapsed": true
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "# instantiate the vectorizer\n",
1133 | "vect = CountVectorizer()"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 35,
1139 | "metadata": {
1140 | "collapsed": true
1141 | },
1142 | "outputs": [],
1143 | "source": [
1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1145 | "vect.fit(X_train)\n",
1146 | "X_train_dtm = vect.transform(X_train)"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": 36,
1152 | "metadata": {
1153 | "collapsed": true
1154 | },
1155 | "outputs": [],
1156 | "source": [
1157 | "# equivalently: combine fit and transform into a single step\n",
1158 | "X_train_dtm = vect.fit_transform(X_train)"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": 37,
1164 | "metadata": {
1165 | "collapsed": false
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1172 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1173 | ]
1174 | },
1175 | "execution_count": 37,
1176 | "metadata": {},
1177 | "output_type": "execute_result"
1178 | }
1179 | ],
1180 | "source": [
1181 | "# examine the document-term matrix\n",
1182 | "X_train_dtm"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {
1189 | "collapsed": false
1190 | },
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/plain": [
1195 | "<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1196 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1197 | ]
1198 | },
1199 | "execution_count": 38,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1206 | "X_test_dtm = vect.transform(X_test)\n",
1207 | "X_test_dtm"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "## Part 5: Building and evaluating a model\n",
1215 | "\n",
1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1217 | "\n",
1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1219 | ]
1220 | },
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 39,
1224 | "metadata": {
1225 | "collapsed": true
1226 | },
1227 | "outputs": [],
1228 | "source": [
1229 | "# import and instantiate a Multinomial Naive Bayes model\n",
1230 | "from sklearn.naive_bayes import MultinomialNB\n",
1231 | "nb = MultinomialNB()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 40,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "name": "stdout",
1243 | "output_type": "stream",
1244 | "text": [
1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
1246 | "Wall time: 2.78 ms\n"
1247 | ]
1248 | },
1249 | {
1250 | "data": {
1251 | "text/plain": [
1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1253 | ]
1254 | },
1255 | "execution_count": 40,
1256 | "metadata": {},
1257 | "output_type": "execute_result"
1258 | }
1259 | ],
1260 | "source": [
1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1262 | "%time nb.fit(X_train_dtm, y_train)"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 41,
1268 | "metadata": {
1269 | "collapsed": true
1270 | },
1271 | "outputs": [],
1272 | "source": [
1273 | "# make class predictions for X_test_dtm\n",
1274 | "y_pred_class = nb.predict(X_test_dtm)"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 42,
1280 | "metadata": {
1281 | "collapsed": false
1282 | },
1283 | "outputs": [
1284 | {
1285 | "data": {
1286 | "text/plain": [
1287 | "0.98851399856424982"
1288 | ]
1289 | },
1290 | "execution_count": 42,
1291 | "metadata": {},
1292 | "output_type": "execute_result"
1293 | }
1294 | ],
1295 | "source": [
1296 | "# calculate accuracy of class predictions\n",
1297 | "from sklearn import metrics\n",
1298 | "metrics.accuracy_score(y_test, y_pred_class)"
1299 | ]
1300 | },
1301 | {
1302 | "cell_type": "code",
1303 | "execution_count": 43,
1304 | "metadata": {
1305 | "collapsed": false
1306 | },
1307 | "outputs": [
1308 | {
1309 | "data": {
1310 | "text/plain": [
1311 | "array([[1203, 5],\n",
1312 | " [ 11, 174]])"
1313 | ]
1314 | },
1315 | "execution_count": 43,
1316 | "metadata": {},
1317 | "output_type": "execute_result"
1318 | }
1319 | ],
1320 | "source": [
1321 | "# print the confusion matrix\n",
1322 | "metrics.confusion_matrix(y_test, y_pred_class)"
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 44,
1328 | "metadata": {
1329 | "collapsed": false
1330 | },
1331 | "outputs": [],
1332 | "source": [
1333 | "# print message text for the false positives (ham incorrectly classified as spam)\nX_test[y_pred_class > y_test]"
1334 | ]
1335 | },
1336 | {
1337 | "cell_type": "code",
1338 | "execution_count": 45,
1339 | "metadata": {
1340 | "collapsed": false,
1341 | "scrolled": true
1342 | },
1343 | "outputs": [],
1344 | "source": [
1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\nX_test[y_pred_class < y_test]"
1346 | ]
1347 | },
1348 | {
1349 | "cell_type": "code",
1350 | "execution_count": 46,
1351 | "metadata": {
1352 | "collapsed": false,
1353 | "scrolled": true
1354 | },
1355 | "outputs": [
1356 | {
1357 | "data": {
1358 | "text/plain": [
1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1360 | ]
1361 | },
1362 | "execution_count": 46,
1363 | "metadata": {},
1364 | "output_type": "execute_result"
1365 | }
1366 | ],
1367 | "source": [
1368 | "# example false negative\n",
1369 | "X_test[3132]"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 47,
1375 | "metadata": {
1376 | "collapsed": false
1377 | },
1378 | "outputs": [
1379 | {
1380 | "data": {
1381 | "text/plain": [
1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1384 | ]
1385 | },
1386 | "execution_count": 47,
1387 | "metadata": {},
1388 | "output_type": "execute_result"
1389 | }
1390 | ],
1391 | "source": [
1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1394 | "y_pred_prob"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": 48,
1400 | "metadata": {
1401 | "collapsed": false
1402 | },
1403 | "outputs": [
1404 | {
1405 | "data": {
1406 | "text/plain": [
1407 | "0.98664310005369604"
1408 | ]
1409 | },
1410 | "execution_count": 48,
1411 | "metadata": {},
1412 | "output_type": "execute_result"
1413 | }
1414 | ],
1415 | "source": [
1416 | "# calculate AUC\n",
1417 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "markdown",
1422 | "metadata": {},
1423 | "source": [
1424 | "## Part 6: Comparing models\n",
1425 | "\n",
1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1427 | "\n",
1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1429 | ]
1430 | },
1431 | {
1432 | "cell_type": "code",
1433 | "execution_count": 49,
1434 | "metadata": {
1435 | "collapsed": true
1436 | },
1437 | "outputs": [],
1438 | "source": [
1439 | "# import and instantiate a logistic regression model\n",
1440 | "from sklearn.linear_model import LogisticRegression\n",
1441 | "logreg = LogisticRegression()"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 50,
1447 | "metadata": {
1448 | "collapsed": false
1449 | },
1450 | "outputs": [
1451 | {
1452 | "name": "stdout",
1453 | "output_type": "stream",
1454 | "text": [
1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n",
1456 | "Wall time: 273 ms\n"
1457 | ]
1458 | },
1459 | {
1460 | "data": {
1461 | "text/plain": [
1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1465 | " verbose=0, warm_start=False)"
1466 | ]
1467 | },
1468 | "execution_count": 50,
1469 | "metadata": {},
1470 | "output_type": "execute_result"
1471 | }
1472 | ],
1473 | "source": [
1474 | "# train the model using X_train_dtm\n",
1475 | "%time logreg.fit(X_train_dtm, y_train)"
1476 | ]
1477 | },
1478 | {
1479 | "cell_type": "code",
1480 | "execution_count": 51,
1481 | "metadata": {
1482 | "collapsed": true
1483 | },
1484 | "outputs": [],
1485 | "source": [
1486 | "# make class predictions for X_test_dtm\n",
1487 | "y_pred_class = logreg.predict(X_test_dtm)"
1488 | ]
1489 | },
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": 52,
1493 | "metadata": {
1494 | "collapsed": false
1495 | },
1496 | "outputs": [
1497 | {
1498 | "data": {
1499 | "text/plain": [
1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1501 | " 0.99725053, 0.00157706])"
1502 | ]
1503 | },
1504 | "execution_count": 52,
1505 | "metadata": {},
1506 | "output_type": "execute_result"
1507 | }
1508 | ],
1509 | "source": [
1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1512 | "y_pred_prob"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "code",
1517 | "execution_count": 53,
1518 | "metadata": {
1519 | "collapsed": false
1520 | },
1521 | "outputs": [
1522 | {
1523 | "data": {
1524 | "text/plain": [
1525 | "0.9877961234745154"
1526 | ]
1527 | },
1528 | "execution_count": 53,
1529 | "metadata": {},
1530 | "output_type": "execute_result"
1531 | }
1532 | ],
1533 | "source": [
1534 | "# calculate accuracy\n",
1535 | "metrics.accuracy_score(y_test, y_pred_class)"
1536 | ]
1537 | },
1538 | {
1539 | "cell_type": "code",
1540 | "execution_count": 54,
1541 | "metadata": {
1542 | "collapsed": false
1543 | },
1544 | "outputs": [
1545 | {
1546 | "data": {
1547 | "text/plain": [
1548 | "0.99368176123143015"
1549 | ]
1550 | },
1551 | "execution_count": 54,
1552 | "metadata": {},
1553 | "output_type": "execute_result"
1554 | }
1555 | ],
1556 | "source": [
1557 | "# calculate AUC\n",
1558 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1559 | ]
1560 | }
1561 | ],
1562 | "metadata": {
1563 | "kernelspec": {
1564 | "display_name": "Python [conda root]",
1565 | "language": "python",
1566 | "name": "conda-root-py"
1567 | },
1568 | "language_info": {
1569 | "codemirror_mode": {
1570 | "name": "ipython",
1571 | "version": 3
1572 | },
1573 | "file_extension": ".py",
1574 | "mimetype": "text/x-python",
1575 | "name": "python",
1576 | "nbconvert_exporter": "python",
1577 | "pygments_lexer": "ipython3",
1578 | "version": "3.5.2"
1579 | }
1580 | },
1581 | "nbformat": 4,
1582 | "nbformat_minor": 0
1583 | }
1584 |
--------------------------------------------------------------------------------
/textAnalisis/.ipynb_checkpoints/tutorial-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial: Machine Learning with Text in scikit-learn"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Agenda\n",
15 | "\n",
16 | "1. Model building in scikit-learn (refresher)\n",
17 | "2. Representing text as numerical data\n",
18 | "3. Reading a text-based dataset into pandas\n",
19 | "4. Vectorizing our dataset\n",
20 | "5. Building and evaluating a model\n",
21 | "6. Comparing models"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 1,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [],
31 | "source": [
32 | "# for Python 2: use print only as a function\n",
33 | "from __future__ import print_function"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "## Part 1: Model building in scikit-learn (refresher)"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true
48 | },
49 | "outputs": [],
50 | "source": [
51 | "# load the iris dataset as an example\n",
52 | "from sklearn.datasets import load_iris\n",
53 | "iris = load_iris()"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 3,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "# store the feature matrix (X) and response vector (y)\n",
65 | "X = iris.data\n",
66 | "y = iris.target"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {
80 | "collapsed": false
81 | },
82 | "outputs": [
83 | {
84 | "name": "stdout",
85 | "output_type": "stream",
86 | "text": [
87 | "(150, 4)\n",
88 | "(150,)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "# check the shapes of X and y\n",
94 | "print(X.shape)\n",
95 | "print(y.shape)"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "**\"Observations\"** are also known as samples, instances, or records."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "data": {
166 | "text/plain": [
167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
168 | "0 5.1 3.5 1.4 0.2\n",
169 | "1 4.9 3.0 1.4 0.2\n",
170 | "2 4.7 3.2 1.3 0.2\n",
171 | "3 4.6 3.1 1.5 0.2\n",
172 | "4 5.0 3.6 1.4 0.2"
173 | ]
174 | },
175 | "execution_count": 5,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n",
182 | "import pandas as pd\n",
183 | "pd.DataFrame(X, columns=iris.feature_names).head()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
201 | " 2 2]\n"
202 | ]
203 | }
204 | ],
205 | "source": [
206 | "# examine the response vector\n",
207 | "print(y)"
208 | ]
209 | },
210 | {
211 | "cell_type": "markdown",
212 | "metadata": {},
213 | "source": [
214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
215 | ]
216 | },
217 | {
218 | "cell_type": "code",
219 | "execution_count": 7,
220 | "metadata": {
221 | "collapsed": false
222 | },
223 | "outputs": [
224 | {
225 | "data": {
226 | "text/plain": [
227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
229 | " weights='uniform')"
230 | ]
231 | },
232 | "execution_count": 7,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "# import the class\n",
239 | "from sklearn.neighbors import KNeighborsClassifier\n",
240 | "\n",
241 | "# instantiate the model (with the default parameters)\n",
242 | "knn = KNeighborsClassifier()\n",
243 | "\n",
244 | "# fit the model with data (occurs in-place)\n",
245 | "knn.fit(X, y)"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 8,
258 | "metadata": {
259 | "collapsed": false
260 | },
261 | "outputs": [
262 | {
263 | "data": {
264 | "text/plain": [
265 | "array([1])"
266 | ]
267 | },
268 | "execution_count": 8,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "# predict the response for a new observation\n",
275 | "knn.predict([[3, 5, 4, 2]])"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## Part 2: Representing text as numerical data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 9,
288 | "metadata": {
289 | "collapsed": true
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# example text for model training (SMS messages)\n",
294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 10,
300 | "metadata": {
301 | "collapsed": true
302 | },
303 | "outputs": [],
304 | "source": [
305 | "# example response vector\n",
306 | "is_desperate = [0, 0, 1]"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
314 | "\n",
315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
316 | "\n",
317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": 11,
323 | "metadata": {
324 | "collapsed": true
325 | },
326 | "outputs": [],
327 | "source": [
328 | "# import and instantiate CountVectorizer (with the default parameters)\n",
329 | "from sklearn.feature_extraction.text import CountVectorizer\n",
330 | "vect = CountVectorizer()"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": 12,
336 | "metadata": {
337 | "collapsed": false
338 | },
339 | "outputs": [
340 | {
341 | "data": {
342 | "text/plain": [
343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
344 | "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
348 | " tokenizer=None, vocabulary=None)"
349 | ]
350 | },
351 | "execution_count": 12,
352 | "metadata": {},
353 | "output_type": "execute_result"
354 | }
355 | ],
356 | "source": [
357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n",
358 | "vect.fit(simple_train)"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 13,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']"
372 | ]
373 | },
374 | "execution_count": 13,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "# examine the fitted vocabulary\n",
381 | "vect.get_feature_names()"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 14,
387 | "metadata": {
388 | "collapsed": false
389 | },
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
395 | "\twith 9 stored elements in Compressed Sparse Row format>"
396 | ]
397 | },
398 | "execution_count": 14,
399 | "metadata": {},
400 | "output_type": "execute_result"
401 | }
402 | ],
403 | "source": [
404 | "# transform training data into a 'document-term matrix'\n",
405 | "simple_train_dtm = vect.transform(simple_train)\n",
406 | "simple_train_dtm"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 15,
412 | "metadata": {
413 | "collapsed": false
414 | },
415 | "outputs": [
416 | {
417 | "data": {
418 | "text/plain": [
419 | "array([[0, 1, 0, 0, 1, 1],\n",
420 | " [1, 1, 1, 0, 0, 0],\n",
421 | " [0, 1, 1, 2, 0, 0]])"
422 | ]
423 | },
424 | "execution_count": 15,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# convert sparse matrix to a dense matrix\n",
431 | "simple_train_dtm.toarray()"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {
438 | "collapsed": false
439 | },
440 | "outputs": [
441 | {
442 | "data": {
489 | "text/plain": [
490 | " cab call me please tonight you\n",
491 | "0 0 1 0 0 1 1\n",
492 | "1 1 1 1 0 0 0\n",
493 | "2 0 1 1 2 0 0"
494 | ]
495 | },
496 | "execution_count": 16,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# examine the vocabulary and document-term matrix together\n",
503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
504 | ]
505 | },
506 | {
507 | "cell_type": "markdown",
508 | "metadata": {},
509 | "source": [
510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
511 | "\n",
512 | "> In this scheme, features and samples are defined as follows:\n",
513 | "\n",
514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
516 | "\n",
517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
518 | "\n",
519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
520 | ]
521 | },
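
A small illustration (not part of the original notebook) of the last point: because the bag-of-words representation only counts tokens, documents containing the same words in a different order produce identical vectors. The variable names here are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call me tonight', 'tonight call me']   # same words, different order
vect_demo = CountVectorizer()
dtm_demo = vect_demo.fit_transform(docs)

print(vect_demo.get_feature_names())            # ['call', 'me', 'tonight']
print(dtm_demo.toarray())                       # both rows are [1, 1, 1]
```
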
522 | {
523 | "cell_type": "code",
524 | "execution_count": 17,
525 | "metadata": {
526 | "collapsed": false
527 | },
528 | "outputs": [
529 | {
530 | "data": {
531 | "text/plain": [
532 | "scipy.sparse.csr.csr_matrix"
533 | ]
534 | },
535 | "execution_count": 17,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "# check the type of the document-term matrix\n",
542 | "type(simple_train_dtm)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 18,
548 | "metadata": {
549 | "collapsed": false,
550 | "scrolled": true
551 | },
552 | "outputs": [
553 | {
554 | "name": "stdout",
555 | "output_type": "stream",
556 | "text": [
557 | " (0, 1)\t1\n",
558 | " (0, 4)\t1\n",
559 | " (0, 5)\t1\n",
560 | " (1, 0)\t1\n",
561 | " (1, 1)\t1\n",
562 | " (1, 2)\t1\n",
563 | " (2, 1)\t1\n",
564 | " (2, 2)\t1\n",
565 | " (2, 3)\t2\n"
566 | ]
567 | }
568 | ],
569 | "source": [
570 | "# examine the sparse matrix contents\n",
571 | "print(simple_train_dtm)"
572 | ]
573 | },
574 | {
575 | "cell_type": "markdown",
576 | "metadata": {},
577 | "source": [
578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
579 | "\n",
580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
581 | "\n",
582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
583 | "\n",
584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
585 | ]
586 | },
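
A rough sketch (not part of the original notebook) of the storage argument, assuming `simple_train_dtm` from the cells above. For this toy matrix the difference is negligible, but the SMS document-term matrix built later has roughly 31 million cells of which only about 55,000 are non-zero.

```python
# bytes used by the sparse CSR representation (data + index arrays)
sparse_bytes = (simple_train_dtm.data.nbytes
                + simple_train_dtm.indices.nbytes
                + simple_train_dtm.indptr.nbytes)

# bytes used by the equivalent dense array
dense_bytes = simple_train_dtm.toarray().nbytes

print(sparse_bytes, dense_bytes)   # sparse storage scales with the non-zeros only
```
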
587 | {
588 | "cell_type": "code",
589 | "execution_count": 19,
590 | "metadata": {
591 | "collapsed": false
592 | },
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
599 | " weights='uniform')"
600 | ]
601 | },
602 | "execution_count": 19,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "# build a model to predict desperation\n",
609 | "knn = KNeighborsClassifier(n_neighbors=1)\n",
610 | "knn.fit(simple_train_dtm, is_desperate)"
611 | ]
612 | },
613 | {
614 | "cell_type": "code",
615 | "execution_count": 20,
616 | "metadata": {
617 | "collapsed": true
618 | },
619 | "outputs": [],
620 | "source": [
621 | "# example text for model testing\n",
622 | "simple_test = [\"please don't call me\"]"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": 21,
635 | "metadata": {
636 | "collapsed": false
637 | },
638 | "outputs": [
639 | {
640 | "data": {
641 | "text/plain": [
642 | "array([[0, 1, 1, 1, 0, 0]])"
643 | ]
644 | },
645 | "execution_count": 21,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n",
652 | "simple_test_dtm = vect.transform(simple_test)\n",
653 | "simple_test_dtm.toarray()"
654 | ]
655 | },
656 | {
657 | "cell_type": "code",
658 | "execution_count": 22,
659 | "metadata": {
660 | "collapsed": false
661 | },
662 | "outputs": [
663 | {
664 | "data": {
693 | "text/plain": [
694 | " cab call me please tonight you\n",
695 | "0 0 1 1 1 0 0"
696 | ]
697 | },
698 | "execution_count": 22,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "# examine the vocabulary and document-term matrix together\n",
705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": 23,
711 | "metadata": {
712 | "collapsed": false
713 | },
714 | "outputs": [
715 | {
716 | "data": {
717 | "text/plain": [
718 | "array([1])"
719 | ]
720 | },
721 | "execution_count": 23,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "# predict whether simple_test is desperate\n",
728 | "knn.predict(simple_test_dtm)"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "**Summary:**\n",
736 | "\n",
737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n",
738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
740 | ]
741 | },
742 | {
743 | "cell_type": "markdown",
744 | "metadata": {},
745 | "source": [
746 | "## Part 3: Reading a text-based dataset into pandas"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 24,
752 | "metadata": {
753 | "collapsed": true
754 | },
755 | "outputs": [],
756 | "source": [
757 | "# read file into pandas from the working directory\n",
758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])"
759 | ]
760 | },
761 | {
762 | "cell_type": "code",
763 | "execution_count": 25,
764 | "metadata": {
765 | "collapsed": false
766 | },
767 | "outputs": [],
768 | "source": [
769 | "# alternative: read file into pandas from a URL\n",
770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n",
771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 26,
777 | "metadata": {
778 | "collapsed": false
779 | },
780 | "outputs": [
781 | {
782 | "data": {
783 | "text/plain": [
784 | "(5572, 2)"
785 | ]
786 | },
787 | "execution_count": 26,
788 | "metadata": {},
789 | "output_type": "execute_result"
790 | }
791 | ],
792 | "source": [
793 | "# examine the shape\n",
794 | "sms.shape"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": 27,
800 | "metadata": {
801 | "collapsed": false
802 | },
803 | "outputs": [
804 | {
805 | "data": {
871 | "text/plain": [
872 | " label message\n",
873 | "0 ham Go until jurong point, crazy.. Available only ...\n",
874 | "1 ham Ok lar... Joking wif u oni...\n",
875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
876 | "3 ham U dun say so early hor... U c already then say...\n",
877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n",
878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
879 | "6 ham Even my brother is not like to speak with me. ...\n",
880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
881 | "8 spam WINNER!! As a valued network customer you have...\n",
882 | "9 spam Had your mobile 11 months or more? U R entitle..."
883 | ]
884 | },
885 | "execution_count": 27,
886 | "metadata": {},
887 | "output_type": "execute_result"
888 | }
889 | ],
890 | "source": [
891 | "# examine the first 10 rows\n",
892 | "sms.head(10)"
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 28,
898 | "metadata": {
899 | "collapsed": false
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "ham 4825\n",
906 | "spam 747\n",
907 | "Name: label, dtype: int64"
908 | ]
909 | },
910 | "execution_count": 28,
911 | "metadata": {},
912 | "output_type": "execute_result"
913 | }
914 | ],
915 | "source": [
916 | "# examine the class distribution\n",
917 | "sms.label.value_counts()"
918 | ]
919 | },
920 | {
921 | "cell_type": "code",
922 | "execution_count": 29,
923 | "metadata": {
924 | "collapsed": true
925 | },
926 | "outputs": [],
927 | "source": [
928 | "# convert label to a numerical variable\n",
929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": 30,
935 | "metadata": {
936 | "collapsed": false
937 | },
938 | "outputs": [
939 | {
940 | "data": {
1017 | "text/plain": [
1018 | " label message label_num\n",
1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n",
1020 | "1 ham Ok lar... Joking wif u oni... 0\n",
1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
1022 | "3 ham U dun say so early hor... U c already then say... 0\n",
1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
1025 | "6 ham Even my brother is not like to speak with me. ... 0\n",
1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
1027 | "8 spam WINNER!! As a valued network customer you have... 1\n",
1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1"
1029 | ]
1030 | },
1031 | "execution_count": 30,
1032 | "metadata": {},
1033 | "output_type": "execute_result"
1034 | }
1035 | ],
1036 | "source": [
1037 | "# check that the conversion worked\n",
1038 | "sms.head(10)"
1039 | ]
1040 | },
1041 | {
1042 | "cell_type": "code",
1043 | "execution_count": 31,
1044 | "metadata": {
1045 | "collapsed": false
1046 | },
1047 | "outputs": [
1048 | {
1049 | "name": "stdout",
1050 | "output_type": "stream",
1051 | "text": [
1052 | "(150, 4)\n",
1053 | "(150,)\n"
1054 | ]
1055 | }
1056 | ],
1057 | "source": [
1058 | "# how to define X and y (from the iris data) for use with a MODEL\n",
1059 | "X = iris.data\n",
1060 | "y = iris.target\n",
1061 | "print(X.shape)\n",
1062 | "print(y.shape)"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": 32,
1068 | "metadata": {
1069 | "collapsed": false
1070 | },
1071 | "outputs": [
1072 | {
1073 | "name": "stdout",
1074 | "output_type": "stream",
1075 | "text": [
1076 | "(5572,)\n",
1077 | "(5572,)\n"
1078 | ]
1079 | }
1080 | ],
1081 | "source": [
1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
1083 | "X = sms.message\n",
1084 | "y = sms.label_num\n",
1085 | "print(X.shape)\n",
1086 | "print(y.shape)"
1087 | ]
1088 | },
1089 | {
1090 | "cell_type": "code",
1091 | "execution_count": 33,
1092 | "metadata": {
1093 | "collapsed": false
1094 | },
1095 | "outputs": [
1096 | {
1097 | "name": "stdout",
1098 | "output_type": "stream",
1099 | "text": [
1100 | "(4179,)\n",
1101 | "(1393,)\n",
1102 | "(4179,)\n",
1103 | "(1393,)\n"
1104 | ]
1105 | }
1106 | ],
1107 | "source": [
1108 | "# split X and y into training and testing sets\n",
1109 | "from sklearn.cross_validation import train_test_split\n",
1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
1111 | "print(X_train.shape)\n",
1112 | "print(X_test.shape)\n",
1113 | "print(y_train.shape)\n",
1114 | "print(y_test.shape)"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {},
1120 | "source": [
1121 | "## Part 4: Vectorizing our dataset"
1122 | ]
1123 | },
1124 | {
1125 | "cell_type": "code",
1126 | "execution_count": 34,
1127 | "metadata": {
1128 | "collapsed": true
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "# instantiate the vectorizer\n",
1133 | "vect = CountVectorizer()"
1134 | ]
1135 | },
1136 | {
1137 | "cell_type": "code",
1138 | "execution_count": 35,
1139 | "metadata": {
1140 | "collapsed": true
1141 | },
1142 | "outputs": [],
1143 | "source": [
1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n",
1145 | "vect.fit(X_train)\n",
1146 | "X_train_dtm = vect.transform(X_train)"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": 36,
1152 | "metadata": {
1153 | "collapsed": true
1154 | },
1155 | "outputs": [],
1156 | "source": [
1157 | "# equivalently: combine fit and transform into a single step\n",
1158 | "X_train_dtm = vect.fit_transform(X_train)"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": 37,
1164 | "metadata": {
1165 | "collapsed": false
1166 | },
1167 | "outputs": [
1168 | {
1169 | "data": {
1170 | "text/plain": [
1171 | "<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1172 | "\twith 55209 stored elements in Compressed Sparse Row format>"
1173 | ]
1174 | },
1175 | "execution_count": 37,
1176 | "metadata": {},
1177 | "output_type": "execute_result"
1178 | }
1179 | ],
1180 | "source": [
1181 | "# examine the document-term matrix\n",
1182 | "X_train_dtm"
1183 | ]
1184 | },
1185 | {
1186 | "cell_type": "code",
1187 | "execution_count": 38,
1188 | "metadata": {
1189 | "collapsed": false
1190 | },
1191 | "outputs": [
1192 | {
1193 | "data": {
1194 | "text/plain": [
1195 | "<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
1196 | "\twith 17604 stored elements in Compressed Sparse Row format>"
1197 | ]
1198 | },
1199 | "execution_count": 38,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
1206 | "X_test_dtm = vect.transform(X_test)\n",
1207 | "X_test_dtm"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "## Part 5: Building and evaluating a model\n",
1215 | "\n",
1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
1217 | "\n",
1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
1219 | ]
1220 | },
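
The quote above mentions that fractional counts such as tf-idf may also work in practice. A minimal sketch (not part of the original notebook) of that variant, assuming `X_train`, `X_test`, `y_train` and `y_test` from Part 3; `TfidfVectorizer` simply replaces `CountVectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# vectorize with tf-idf weights instead of raw counts
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# the same Naive Bayes model accepts the fractional features
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
print(metrics.accuracy_score(y_test, nb_tfidf.predict(X_test_tfidf)))
```
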
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 39,
1224 | "metadata": {
1225 | "collapsed": true
1226 | },
1227 | "outputs": [],
1228 | "source": [
1229 | "# import and instantiate a Multinomial Naive Bayes model\n",
1230 | "from sklearn.naive_bayes import MultinomialNB\n",
1231 | "nb = MultinomialNB()"
1232 | ]
1233 | },
1234 | {
1235 | "cell_type": "code",
1236 | "execution_count": 40,
1237 | "metadata": {
1238 | "collapsed": false
1239 | },
1240 | "outputs": [
1241 | {
1242 | "name": "stdout",
1243 | "output_type": "stream",
1244 | "text": [
1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
1246 | "Wall time: 2.78 ms\n"
1247 | ]
1248 | },
1249 | {
1250 | "data": {
1251 | "text/plain": [
1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
1253 | ]
1254 | },
1255 | "execution_count": 40,
1256 | "metadata": {},
1257 | "output_type": "execute_result"
1258 | }
1259 | ],
1260 | "source": [
1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
1262 | "%time nb.fit(X_train_dtm, y_train)"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": 41,
1268 | "metadata": {
1269 | "collapsed": true
1270 | },
1271 | "outputs": [],
1272 | "source": [
1273 | "# make class predictions for X_test_dtm\n",
1274 | "y_pred_class = nb.predict(X_test_dtm)"
1275 | ]
1276 | },
1277 | {
1278 | "cell_type": "code",
1279 | "execution_count": 42,
1280 | "metadata": {
1281 | "collapsed": false
1282 | },
1283 | "outputs": [
1284 | {
1285 | "data": {
1286 | "text/plain": [
1287 | "0.98851399856424982"
1288 | ]
1289 | },
1290 | "execution_count": 42,
1291 | "metadata": {},
1292 | "output_type": "execute_result"
1293 | }
1294 | ],
1295 | "source": [
1296 | "# calculate accuracy of class predictions\n",
1297 | "from sklearn import metrics\n",
1298 | "metrics.accuracy_score(y_test, y_pred_class)"
1299 | ]
1300 | },
1301 | {
1302 | "cell_type": "code",
1303 | "execution_count": 43,
1304 | "metadata": {
1305 | "collapsed": false
1306 | },
1307 | "outputs": [
1308 | {
1309 | "data": {
1310 | "text/plain": [
1311 | "array([[1203, 5],\n",
1312 | " [ 11, 174]])"
1313 | ]
1314 | },
1315 | "execution_count": 43,
1316 | "metadata": {},
1317 | "output_type": "execute_result"
1318 | }
1319 | ],
1320 | "source": [
1321 | "# print the confusion matrix\n",
1322 | "metrics.confusion_matrix(y_test, y_pred_class)"
1323 | ]
1324 | },
1325 | {
1326 | "cell_type": "code",
1327 | "execution_count": 44,
1328 | "metadata": {
1329 | "collapsed": false
1330 | },
1331 | "outputs": [],
1332 | "source": [
1333 | "# print message text for the false positives (ham incorrectly classified as spam)\nX_test[y_pred_class > y_test]"
1334 | ]
1335 | },
1336 | {
1337 | "cell_type": "code",
1338 | "execution_count": 45,
1339 | "metadata": {
1340 | "collapsed": false,
1341 | "scrolled": true
1342 | },
1343 | "outputs": [],
1344 | "source": [
1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\nX_test[y_pred_class < y_test]"
1346 | ]
1347 | },
1348 | {
1349 | "cell_type": "code",
1350 | "execution_count": 46,
1351 | "metadata": {
1352 | "collapsed": false,
1353 | "scrolled": true
1354 | },
1355 | "outputs": [
1356 | {
1357 | "data": {
1358 | "text/plain": [
1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
1360 | ]
1361 | },
1362 | "execution_count": 46,
1363 | "metadata": {},
1364 | "output_type": "execute_result"
1365 | }
1366 | ],
1367 | "source": [
1368 | "# example false negative\n",
1369 | "X_test[3132]"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 47,
1375 | "metadata": {
1376 | "collapsed": false
1377 | },
1378 | "outputs": [
1379 | {
1380 | "data": {
1381 | "text/plain": [
1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
1384 | ]
1385 | },
1386 | "execution_count": 47,
1387 | "metadata": {},
1388 | "output_type": "execute_result"
1389 | }
1390 | ],
1391 | "source": [
1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
1394 | "y_pred_prob"
1395 | ]
1396 | },
1397 | {
1398 | "cell_type": "code",
1399 | "execution_count": 48,
1400 | "metadata": {
1401 | "collapsed": false
1402 | },
1403 | "outputs": [
1404 | {
1405 | "data": {
1406 | "text/plain": [
1407 | "0.98664310005369604"
1408 | ]
1409 | },
1410 | "execution_count": 48,
1411 | "metadata": {},
1412 | "output_type": "execute_result"
1413 | }
1414 | ],
1415 | "source": [
1416 | "# calculate AUC\n",
1417 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "markdown",
1422 | "metadata": {},
1423 | "source": [
1424 | "## Part 6: Comparing models\n",
1425 | "\n",
1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
1427 | "\n",
1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
1429 | ]
1430 | },
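
A self-contained sketch (not part of the original notebook) of what "linear model for classification" means for text: each token gets one learned weight, and positive weights push a message toward the positive (spam) class. The toy documents, labels, and names below are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

toy_docs = ['win a free prize now', 'free cash prize', 'see you at lunch', 'lunch tomorrow then']
toy_labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

v = CountVectorizer()
dtm = v.fit_transform(toy_docs)
clf = LogisticRegression().fit(dtm, toy_labels)

# one weight per token: the decision is a weighted sum of token counts
for token, weight in zip(v.get_feature_names(), clf.coef_[0]):
    print(token, round(weight, 3))
```
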
1431 | {
1432 | "cell_type": "code",
1433 | "execution_count": 49,
1434 | "metadata": {
1435 | "collapsed": true
1436 | },
1437 | "outputs": [],
1438 | "source": [
1439 | "# import and instantiate a logistic regression model\n",
1440 | "from sklearn.linear_model import LogisticRegression\n",
1441 | "logreg = LogisticRegression()"
1442 | ]
1443 | },
1444 | {
1445 | "cell_type": "code",
1446 | "execution_count": 50,
1447 | "metadata": {
1448 | "collapsed": false
1449 | },
1450 | "outputs": [
1451 | {
1452 | "name": "stdout",
1453 | "output_type": "stream",
1454 | "text": [
1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n",
1456 | "Wall time: 273 ms\n"
1457 | ]
1458 | },
1459 | {
1460 | "data": {
1461 | "text/plain": [
1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
1465 | " verbose=0, warm_start=False)"
1466 | ]
1467 | },
1468 | "execution_count": 50,
1469 | "metadata": {},
1470 | "output_type": "execute_result"
1471 | }
1472 | ],
1473 | "source": [
1474 | "# train the model using X_train_dtm\n",
1475 | "%time logreg.fit(X_train_dtm, y_train)"
1476 | ]
1477 | },
1478 | {
1479 | "cell_type": "code",
1480 | "execution_count": 51,
1481 | "metadata": {
1482 | "collapsed": true
1483 | },
1484 | "outputs": [],
1485 | "source": [
1486 | "# make class predictions for X_test_dtm\n",
1487 | "y_pred_class = logreg.predict(X_test_dtm)"
1488 | ]
1489 | },
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": 52,
1493 | "metadata": {
1494 | "collapsed": false
1495 | },
1496 | "outputs": [
1497 | {
1498 | "data": {
1499 | "text/plain": [
1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
1501 | " 0.99725053, 0.00157706])"
1502 | ]
1503 | },
1504 | "execution_count": 52,
1505 | "metadata": {},
1506 | "output_type": "execute_result"
1507 | }
1508 | ],
1509 | "source": [
1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
1512 | "y_pred_prob"
1513 | ]
1514 | },
1515 | {
1516 | "cell_type": "code",
1517 | "execution_count": 53,
1518 | "metadata": {
1519 | "collapsed": false
1520 | },
1521 | "outputs": [
1522 | {
1523 | "data": {
1524 | "text/plain": [
1525 | "0.9877961234745154"
1526 | ]
1527 | },
1528 | "execution_count": 53,
1529 | "metadata": {},
1530 | "output_type": "execute_result"
1531 | }
1532 | ],
1533 | "source": [
1534 | "# calculate accuracy\n",
1535 | "metrics.accuracy_score(y_test, y_pred_class)"
1536 | ]
1537 | },
1538 | {
1539 | "cell_type": "code",
1540 | "execution_count": 54,
1541 | "metadata": {
1542 | "collapsed": false
1543 | },
1544 | "outputs": [
1545 | {
1546 | "data": {
1547 | "text/plain": [
1548 | "0.99368176123143015"
1549 | ]
1550 | },
1551 | "execution_count": 54,
1552 | "metadata": {},
1553 | "output_type": "execute_result"
1554 | }
1555 | ],
1556 | "source": [
1557 | "# calculate AUC\n",
1558 | "metrics.roc_auc_score(y_test, y_pred_prob)"
1559 | ]
1560 | }
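
Not part of the original notebook: a short recap sketch that recomputes both models' predictions side by side, since the cells above reused `y_pred_class` and `y_pred_prob` for each model in turn. Assumes `nb`, `logreg`, `X_test_dtm` and `y_test` from the cells above.

```python
from sklearn import metrics

for name, model in [('Multinomial NB', nb), ('Logistic regression', logreg)]:
    pred_class = model.predict(X_test_dtm)
    pred_prob = model.predict_proba(X_test_dtm)[:, 1]
    print(name,
          '| accuracy:', metrics.accuracy_score(y_test, pred_class),
          '| AUC:', metrics.roc_auc_score(y_test, pred_prob))
```
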
1561 | ],
1562 | "metadata": {
1563 | "kernelspec": {
1564 | "display_name": "Python [conda root]",
1565 | "language": "python",
1566 | "name": "conda-root-py"
1567 | },
1568 | "language_info": {
1569 | "codemirror_mode": {
1570 | "name": "ipython",
1571 | "version": 3
1572 | },
1573 | "file_extension": ".py",
1574 | "mimetype": "text/x-python",
1575 | "name": "python",
1576 | "nbconvert_exporter": "python",
1577 | "pygments_lexer": "ipython3",
1578 | "version": "3.5.2"
1579 | }
1580 | },
1581 | "nbformat": 4,
1582 | "nbformat_minor": 0
1583 | }
1584 |
--------------------------------------------------------------------------------