├── AnaliseTexto
│   ├── Mineração de Textos.pdf
│   ├── .ipynb_checkpoints
│   │   ├── TextVector-checkpoint.ipynb
│   │   └── tutorial-checkpoint.ipynb
│   └── AnaliseDeSentimento.ipynb
├── .idea
│   └── vcs.xml
└── textAnalisis
    └── .ipynb_checkpoints
        ├── TextVector-checkpoint.ipynb
        ├── TextAnalise-Plin-checkpoint.ipynb
        └── tutorial-checkpoint.ipynb

/AnaliseTexto/Mineração de Textos.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sandeco/01-AulasDataScience/HEAD/AnaliseTexto/Mineração de Textos.pdf
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
--------------------------------------------------------------------------------
/AnaliseTexto/.ipynb_checkpoints/TextVector-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "nbpresent": {
7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b"
8 | },
9 | "slideshow": {
10 | "slide_type": "-"
11 | }
12 | },
13 | "source": [
14 | "# Text classification with *scikit-learn*\n",
15 | "by Prof. Sanderson Macedo"
16 | ]
17 | },
" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": { 35 | "nbpresent": { 36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2" 37 | } 38 | }, 39 | "source": [ 40 | "## Agenda\n", 41 | "\n", 42 | "\n", 43 | "1. Representar um texto como dados numéricos\n", 44 | "2. Ler o *dataset* de texto no Pandas\n", 45 | "2. Vetorizar nossso *dataset*\n", 46 | "4. Construir e avaliar um modelo\n", 47 | "5. Comparar modelos\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 73, 53 | "metadata": { 54 | "collapsed": true, 55 | "nbpresent": { 56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c" 57 | } 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "##Importando pandas e numpy\n", 62 | "import pandas as pd\n", 63 | "import numpy as np\n", 64 | "\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "nbpresent": { 71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af" 72 | } 73 | }, 74 | "source": [ 75 | "## 1. Definindo um vetor de textos \n", 76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n", 77 | "pdf's, doc's, twitter's... etc.\n", 78 | "\n", 79 | "Esses textos serão a base de treinamento\n", 80 | "para a classificação do sentimento de um novo texto." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 88, 86 | "metadata": { 87 | "collapsed": false, 88 | "nbpresent": { 89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3" 90 | } 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "train = [\n", 95 | " 'Eu te amo e não existe nada melhor que você',\n", 96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n", 97 | " 'Eu te odeio muito, você não presta!',\n", 98 | " 'Não gosto de você'\n", 99 | " \n", 100 | " ]\n", 101 | "\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "nbpresent": { 108 | "id": "fc1fc669-a603-412e-8855-837d750718ff" 109 | } 110 | }, 111 | "source": [ 112 | "## 2. Definindo um vetor de sentimentos\n", 113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n", 114 | "\n", 115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n", 116 | "\n", 117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n", 118 | "\n", 119 | "> 'Eu te amo e não existe nada melhor que você'\n", 120 | "\n", 121 | "Foi classificada como sendo um texto **BOM**:\n", 122 | "\n", 123 | "> 1" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 89, 129 | "metadata": { 130 | "collapsed": true, 131 | "nbpresent": { 132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "felling = [1,1,0,0]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": { 143 | "nbpresent": { 144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca" 145 | } 146 | }, 147 | "source": [ 148 | "## 3. Análise de texto com _scikit-learn_.\n", 149 | "\n", 150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 151 | "\n", 152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. 
153 | "\n",
154 | "In this case, though, we can apply a few transformations so that learning algorithms can handle text.\n",
155 | "\n",
156 | "Here we will therefore use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
157 | "to convert the texts into a matrix of \"token\" counts.\n",
158 | "\n",
159 | "We import the class and create an instance called **_vect_**.\n"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 90,
165 | "metadata": {
166 | "collapsed": false,
167 | "nbpresent": {
168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
169 | }
170 | },
171 | "outputs": [],
172 | "source": [
173 | "from sklearn.feature_extraction.text import CountVectorizer\n",
174 | "vect = CountVectorizer()"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {
180 | "nbpresent": {
181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
182 | }
183 | },
184 | "source": [
185 | "## 4. Training: building the dictionary.\n",
186 | "Now we train the algorithm with the vector of texts created above, calling the **_fit()_** method with that vector."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 96,
192 | "metadata": {
193 | "collapsed": false,
194 | "nbpresent": {
195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
196 | }
197 | },
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
203 | " dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
207 | " tokenizer=None, vocabulary=None)"
208 | ]
209 | },
210 | "execution_count": 96,
211 | "metadata": {},
212 | "output_type": "execute_result"
213 | }
214 | ],
215 | "source": [
216 | "vect.fit(train)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Note that the *analyzer* parameter of *CountVectorizer* defaults to *'word'*. With the default token pattern, this means the class ignores punctuation and words with fewer than two (2) characters. "
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "nbpresent": {
230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
231 | }
232 | },
233 | "source": [
234 | "## 5. Our dictionary\n",
235 | "Here we list, without repetition,\n",
236 | "the words that were used in the texts, forming a dictionary of words."
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 95,
242 | "metadata": {
243 | "collapsed": false,
244 | "nbpresent": {
245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
246 | }
247 | },
248 | "outputs": [
249 | {
250 | "data": {
251 | "text/plain": [
252 | "['algo',\n",
253 | " 'amo',\n",
254 | " 'amor',\n",
255 | " 'ao',\n",
256 | " 'assim',\n",
257 | " 'de',\n",
258 | " 'eu',\n",
259 | " 'existe',\n",
260 | " 'gosto',\n",
261 | " 'melhor',\n",
262 | " 'meu',\n",
263 | " 'mim',\n",
264 | " 'muito',\n",
265 | " 'nada',\n",
266 | " 'não',\n",
267 | " 'odeio',\n",
268 | " 'pra',\n",
269 | " 'presta',\n",
270 | " 'que',\n",
271 | " 'te',\n",
272 | " 'tudo',\n",
273 | " 'você']"
274 | ]
275 | },
276 | "execution_count": 95,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "## examining the dictionary, in alphabetical order.\n",
283 | "vect.get_feature_names()"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "## 6. Transforming the sentences into a sparse matrix\n",
291 | "This transformation matters because it builds a matrix where:\n",
292 | "\n",
293 | "1. Each row represents one text from the **_train_** vector \n",
294 | "2. Each column represents one word from the learned dictionary.\n",
295 | "3. Each value is the number of times the word occurs in that text (0 when it does not occur).\n",
296 | "\n",
297 | "\n"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 93,
303 | "metadata": {
304 | "collapsed": false,
305 | "nbpresent": {
306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
307 | }
308 | },
309 | "outputs": [],
310 | "source": [
311 | "simple_train_dtm = vect.transform(train)\n",
312 | "ocorrencias = simple_train_dtm.toarray()"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 94,
318 | "metadata": {
319 | "collapsed": false,
320 | "nbpresent": {
321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d"
322 | }
323 | },
324 | "outputs": [
325 | {
326 | "data": {
327 | "text/plain": [
328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n",
329 | " [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n",
330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n",
331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])"
332 | ]
333 | },
334 | "execution_count": 94,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "ocorrencias"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 56,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
\n", 357 | "\n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | "
algoamoassimeuexistemelhormimmuitonadanãoodeiopraprestaquetetudovocê
001011100110001101
110100010000100011
200010001011010101
\n", 443 | "
" 444 | ], 445 | "text/plain": [ 446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n", 447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n", 448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n", 449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n", 450 | "\n", 451 | " presta que te tudo você \n", 452 | "0 0 1 1 0 1 \n", 453 | "1 0 0 0 1 1 \n", 454 | "2 1 0 1 0 1 " 455 | ] 456 | }, 457 | "execution_count": 56, 458 | "metadata": {}, 459 | "output_type": "execute_result" 460 | } 461 | ], 462 | "source": [ 463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n", 464 | "df" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 57, 470 | "metadata": { 471 | "collapsed": false, 472 | "nbpresent": { 473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29" 474 | } 475 | }, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "scipy.sparse.csr.csr_matrix" 481 | ] 482 | }, 483 | "execution_count": 57, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "type(simple_train_dtm)" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 60, 495 | "metadata": { 496 | "collapsed": false, 497 | "nbpresent": { 498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db" 499 | } 500 | }, 501 | "outputs": [ 502 | { 503 | "name": "stdout", 504 | "output_type": "stream", 505 | "text": [ 506 | " (0, 1)\t1\n", 507 | " (0, 3)\t1\n", 508 | " (0, 4)\t1\n", 509 | " (0, 5)\t1\n", 510 | " (0, 8)\t1\n", 511 | " (0, 9)\t1\n", 512 | " (0, 13)\t1\n", 513 | " (0, 14)\t1\n", 514 | " (0, 16)\t1\n", 515 | " (1, 0)\t1\n", 516 | " (1, 2)\t1\n", 517 | " (1, 6)\t1\n", 518 | " (1, 11)\t1\n", 519 | " (1, 15)\t1\n", 520 | " (1, 16)\t1\n", 521 | " (2, 3)\t1\n", 522 | " (2, 7)\t1\n", 523 | " (2, 9)\t1\n", 524 | " (2, 10)\t1\n", 525 | " (2, 12)\t1\n", 526 | " (2, 14)\t1\n", 527 | " (2, 16)\t1\n" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "print(simple_train_dtm)" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": { 539 | "collapsed": true, 540 | "nbpresent": { 541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e" 542 | } 543 | }, 544 | "outputs": [], 545 | "source": [] 546 | } 547 | ], 548 | "metadata": { 549 | "kernelspec": { 550 | "display_name": "Python [conda root]", 551 | "language": "python", 552 | "name": "conda-root-py" 553 | }, 554 | "language_info": { 555 | "codemirror_mode": { 556 | "name": "ipython", 557 | "version": 3 558 | }, 559 | "file_extension": ".py", 560 | "mimetype": "text/x-python", 561 | "name": "python", 562 | "nbconvert_exporter": "python", 563 | "pygments_lexer": "ipython3", 564 | "version": "3.5.2" 565 | } 566 | }, 567 | "nbformat": 4, 568 | "nbformat_minor": 1 569 | } 570 | -------------------------------------------------------------------------------- /textAnalisis/.ipynb_checkpoints/TextVector-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "nbpresent": { 7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b" 8 | }, 9 | "slideshow": { 10 | "slide_type": "-" 11 | } 12 | }, 13 | "source": [ 14 | "# Classificação de textos com *scikit-learn*\n", 15 | "por Prof. Sanderson Macedo" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "nbpresent": { 22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3" 23 | }, 24 | "slideshow": { 25 | "slide_type": "-" 26 | } 27 | }, 28 | "source": [ 29 | "
" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": { 35 | "nbpresent": { 36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2" 37 | } 38 | }, 39 | "source": [ 40 | "## Agenda\n", 41 | "\n", 42 | "\n", 43 | "1. Representar um texto como dados numéricos\n", 44 | "2. Ler o *dataset* de texto no Pandas\n", 45 | "2. Vetorizar nossso *dataset*\n", 46 | "4. Construir e avaliar um modelo\n", 47 | "5. Comparar modelos\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 73, 53 | "metadata": { 54 | "collapsed": true, 55 | "nbpresent": { 56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c" 57 | } 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "##Importando pandas e numpy\n", 62 | "import pandas as pd\n", 63 | "import numpy as np\n", 64 | "\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "nbpresent": { 71 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af" 72 | } 73 | }, 74 | "source": [ 75 | "## 1. Definindo um vetor de textos \n", 76 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n", 77 | "pdf's, doc's, twitter's... etc.\n", 78 | "\n", 79 | "Esses textos serão a base de treinamento\n", 80 | "para a classificação do sentimento de um novo texto." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 88, 86 | "metadata": { 87 | "collapsed": false, 88 | "nbpresent": { 89 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3" 90 | } 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "train = [\n", 95 | " 'Eu te amo e não existe nada melhor que você',\n", 96 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n", 97 | " 'Eu te odeio muito, você não presta!',\n", 98 | " 'Não gosto de você'\n", 99 | " \n", 100 | " ]\n", 101 | "\n" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": { 107 | "nbpresent": { 108 | "id": "fc1fc669-a603-412e-8855-837d750718ff" 109 | } 110 | }, 111 | "source": [ 112 | "## 2. Definindo um vetor de sentimentos\n", 113 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n", 114 | "\n", 115 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n", 116 | "\n", 117 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n", 118 | "\n", 119 | "> 'Eu te amo e não existe nada melhor que você'\n", 120 | "\n", 121 | "Foi classificada como sendo um texto **BOM**:\n", 122 | "\n", 123 | "> 1" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 89, 129 | "metadata": { 130 | "collapsed": true, 131 | "nbpresent": { 132 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "felling = [1,1,0,0]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": { 143 | "nbpresent": { 144 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca" 145 | } 146 | }, 147 | "source": [ 148 | "## 3. Análise de texto com _scikit-learn_.\n", 149 | "\n", 150 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 151 | "\n", 152 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. 
No entanto, uma sequência de símbolos não podem ser passada diretamente aos algoritmos de Machine Learning, pois a maioria deles espera vetores de características numéricas com um tamanho fixo, em vez de documentos de texto com comprimento variável.\n", 153 | "\n", 154 | "Mas nesse caso podemos realizar algumas transformações de para poder manipular textos em algoritmos de aprendizagem.\n", 155 | "\n", 156 | "Portanto, aqui utilizaremos a [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n", 157 | "para converter textos em uma matriz que expressará a quantidade \"tokens\" dos textos.\n", 158 | "\n", 159 | "Importamos a classe e criamos uma instância chamada **_vect_**.\n" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 90, 165 | "metadata": { 166 | "collapsed": false, 167 | "nbpresent": { 168 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c" 169 | } 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "from sklearn.feature_extraction.text import CountVectorizer\n", 174 | "vect = CountVectorizer()" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": { 180 | "nbpresent": { 181 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0" 182 | } 183 | }, 184 | "source": [ 185 | "## 4. Treinamento criando o dicionário.\n", 186 | "Agora treinamos o algoritmo com o vetor de textos que criamos acima. Chamamos o método **_fit()_** passando o vetor de textos." 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 96, 192 | "metadata": { 193 | "collapsed": false, 194 | "nbpresent": { 195 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624" 196 | } 197 | }, 198 | "outputs": [ 199 | { 200 | "data": { 201 | "text/plain": [ 202 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 203 | " dtype=, encoding='utf-8', input='content',\n", 204 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 205 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 206 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 207 | " tokenizer=None, vocabulary=None)" 208 | ] 209 | }, 210 | "execution_count": 96, 211 | "metadata": {}, 212 | "output_type": "execute_result" 213 | } 214 | ], 215 | "source": [ 216 | "vect.fit(train)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Veja que o parametro *analyzer* é defindo por padrão como *'word'* na classe *CountVectorizer*. Isso signicica que a classe ignora palavras com menos de dois (2) caracteres e pontuações. " 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": { 229 | "nbpresent": { 230 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51" 231 | } 232 | }, 233 | "source": [ 234 | "## 5. Nosso dicionário\n", 235 | "Aqui vamos listar de forma única\n", 236 | "quais palavras forma utilizadas no texto, formando assim um dicionário de palavras." 
237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 95, 242 | "metadata": { 243 | "collapsed": false, 244 | "nbpresent": { 245 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba" 246 | } 247 | }, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/plain": [ 252 | "['algo',\n", 253 | " 'amo',\n", 254 | " 'amor',\n", 255 | " 'ao',\n", 256 | " 'assim',\n", 257 | " 'de',\n", 258 | " 'eu',\n", 259 | " 'existe',\n", 260 | " 'gosto',\n", 261 | " 'melhor',\n", 262 | " 'meu',\n", 263 | " 'mim',\n", 264 | " 'muito',\n", 265 | " 'nada',\n", 266 | " 'não',\n", 267 | " 'odeio',\n", 268 | " 'pra',\n", 269 | " 'presta',\n", 270 | " 'que',\n", 271 | " 'te',\n", 272 | " 'tudo',\n", 273 | " 'você']" 274 | ] 275 | }, 276 | "execution_count": 95, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "## examinando o dicionário criado em ordem alfabética.\n", 283 | "vect.get_feature_names()" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "## 6. Transformação em matriz esparsa em relação as frases\n", 291 | "Essa transformação é importante porque cria uma matriz onde:\n", 292 | "\n", 293 | "1. Cada linha representa um texto do vetor **_train_** \n", 294 | "2. Cada coluna uma palavra do dicionário aprendido.\n", 295 | "3. Se a palavra ocorrer no texto o valor será 1 caso contrário 0.\n", 296 | "\n", 297 | "\n" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 93, 303 | "metadata": { 304 | "collapsed": false, 305 | "nbpresent": { 306 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba" 307 | } 308 | }, 309 | "outputs": [], 310 | "source": [ 311 | "simple_train_dtm = vect.transform(text)\n", 312 | "ocorrencias = simple_train_dtm.toarray()" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 94, 318 | "metadata": { 319 | "collapsed": false, 320 | "nbpresent": { 321 | "id": "88fe39dd-0355-4dd7-b9d6-ed668225208d" 322 | } 323 | }, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "array([[0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],\n", 329 | " [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],\n", 330 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],\n", 331 | " [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1]])" 332 | ] 333 | }, 334 | "execution_count": 94, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "ocorrencias" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 56, 346 | "metadata": { 347 | "collapsed": false, 348 | "nbpresent": { 349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507" 350 | } 351 | }, 352 | "outputs": [ 353 | { 354 | "data": { 355 | "text/html": [ 356 | "
\n", 357 | "\n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | "
algoamoassimeuexistemelhormimmuitonadanãoodeiopraprestaquetetudovocê
001011100110001101
110100010000100011
200010001011010101
\n", 443 | "
" 444 | ], 445 | "text/plain": [ 446 | " algo amo assim eu existe melhor mim muito nada não odeio pra \\\n", 447 | "0 0 1 0 1 1 1 0 0 1 1 0 0 \n", 448 | "1 1 0 1 0 0 0 1 0 0 0 0 1 \n", 449 | "2 0 0 0 1 0 0 0 1 0 1 1 0 \n", 450 | "\n", 451 | " presta que te tudo você \n", 452 | "0 0 1 1 0 1 \n", 453 | "1 0 0 0 1 1 \n", 454 | "2 1 0 1 0 1 " 455 | ] 456 | }, 457 | "execution_count": 56, 458 | "metadata": {}, 459 | "output_type": "execute_result" 460 | } 461 | ], 462 | "source": [ 463 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())\n", 464 | "df" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 57, 470 | "metadata": { 471 | "collapsed": false, 472 | "nbpresent": { 473 | "id": "d30743bf-e9b2-46ba-93bd-0615c79b1b29" 474 | } 475 | }, 476 | "outputs": [ 477 | { 478 | "data": { 479 | "text/plain": [ 480 | "scipy.sparse.csr.csr_matrix" 481 | ] 482 | }, 483 | "execution_count": 57, 484 | "metadata": {}, 485 | "output_type": "execute_result" 486 | } 487 | ], 488 | "source": [ 489 | "type(simple_train_dtm)" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "execution_count": 60, 495 | "metadata": { 496 | "collapsed": false, 497 | "nbpresent": { 498 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db" 499 | } 500 | }, 501 | "outputs": [ 502 | { 503 | "name": "stdout", 504 | "output_type": "stream", 505 | "text": [ 506 | " (0, 1)\t1\n", 507 | " (0, 3)\t1\n", 508 | " (0, 4)\t1\n", 509 | " (0, 5)\t1\n", 510 | " (0, 8)\t1\n", 511 | " (0, 9)\t1\n", 512 | " (0, 13)\t1\n", 513 | " (0, 14)\t1\n", 514 | " (0, 16)\t1\n", 515 | " (1, 0)\t1\n", 516 | " (1, 2)\t1\n", 517 | " (1, 6)\t1\n", 518 | " (1, 11)\t1\n", 519 | " (1, 15)\t1\n", 520 | " (1, 16)\t1\n", 521 | " (2, 3)\t1\n", 522 | " (2, 7)\t1\n", 523 | " (2, 9)\t1\n", 524 | " (2, 10)\t1\n", 525 | " (2, 12)\t1\n", 526 | " (2, 14)\t1\n", 527 | " (2, 16)\t1\n" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "print(simple_train_dtm)" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": { 539 | "collapsed": true, 540 | "nbpresent": { 541 | "id": "201b01cf-47f9-4a94-baf5-9270271e053e" 542 | } 543 | }, 544 | "outputs": [], 545 | "source": [] 546 | } 547 | ], 548 | "metadata": { 549 | "kernelspec": { 550 | "display_name": "Python [conda root]", 551 | "language": "python", 552 | "name": "conda-root-py" 553 | }, 554 | "language_info": { 555 | "codemirror_mode": { 556 | "name": "ipython", 557 | "version": 3 558 | }, 559 | "file_extension": ".py", 560 | "mimetype": "text/x-python", 561 | "name": "python", 562 | "nbconvert_exporter": "python", 563 | "pygments_lexer": "ipython3", 564 | "version": "3.5.2" 565 | } 566 | }, 567 | "nbformat": 4, 568 | "nbformat_minor": 1 569 | } 570 | -------------------------------------------------------------------------------- /textAnalisis/.ipynb_checkpoints/TextAnalise-Plin-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 19, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from sklearn.feature_extraction.text import CountVectorizer\n", 12 | "from collections import Counter\n", 13 | "\n", 14 | "vect = CountVectorizer()" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 20, 20 | "metadata": { 21 | "collapsed": true 22 | }, 23 | "outputs": [], 24 | "source": [ 25 | "text = [\n", 26 | " '#MachineLearning with Text in scikit learn 
http://buff.ly/2dJINuD #DataScience #IoT #BigData #AI',\n", 27 | " 'How The Internet Of Things Will Impact Your Everyday Life http://buff.ly/2dIUyMO #IoT #DataScience #BigData #MachineLearning',\n", 28 | " 'The best Brazilian Captain passed away this day. Captain of Brazil Team in 1970. #RIPCapita Carlos Alberto Torres.',\n", 29 | " '10 Videos Featuring Data Science Topics. By Vincent http://buff.ly/2eCWIkA #DataScience #BigData #IoT #MachineLearning',\n", 30 | " 'Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders http://buff.ly/2dDSJ3E #DataScience #BigData #IoT #MachineLearning',\n", 31 | " 'Deep Learning with Neural Networks and TensorFlow Introduction - Youtube http://buff.ly/2efTvdQ #DataScience #MachineLearning #IoT #BigData',\n", 32 | " 'Matplotlib Tutorial - a youtube course http://buff.ly/2eBK4AQ #DadaScience #MachineLearning #IoT #BigData',\n", 33 | " 'Kaggle Releases Data Sets About Global Warming: Make your own Predictions – Data Science Central http://buff.ly/2dUFLQf #DataScience #IoT',\n", 34 | " '#MachineLearning as a Service http://buff.ly/2ep1Jjk #BigData #IoT #DataScience',\n", 35 | " '50 Predictions for the Internet of Things in 2016 https://goo.gl/5Zv28z #IoT #BigData #DataScience #MachineLearning',\n", 36 | " 'IoT Programming Languages http://flip.it/wtVufo #IoT #BigData #DataScience',\n", 37 | " 'An Introduction to Variable and Feature Selection #dataScience #IoT #BigData http://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection …',\n", 38 | " 'Use the simulated device to experience the IBM Watson IoT Platform http://buff.ly/2ekeKGi #IoT #BigData #DataScience #DataViz',\n", 39 | " 'Top 10 Data Science and Machine Learning Podcasts http://buff.ly/2erx7cI #MachineLearning',\n", 40 | " 'Adorei esse copão de café. 
SVM é fantástico algoritmo de #MachineLearning',\n", 41 | " 'IBM Watsons latest gig: Improving cancer treatment with genomic sequencing http://buff.ly/2dZ5lVP #DataScience #MachineLearning #BigData',\n", 42 | " 'An Introduction to Implementing Neural Networks using TensorFlow http://buff.ly/2ervn3s #DataScience #MachineLearning #IoT #BigData',\n", 43 | " 'Oi testa serviço de monitoramento baseado na Internet das Coisas http://buff.ly/2e3gg21 #DataScience #MachineLearning #IoT #BigData',\n", 44 | " 'Moving from R to Python: The Libraries You Need to Know http://buff.ly/2eeUHuE #DataScience #MachineLearning #IoT #BigData',\n", 45 | " 'Internet of Things Articles : IoT startup and smart cam-maker Smartfrog raises further $20M http://buff.ly/2ei1Kky #MachineLearning #IoT',\n", 46 | " 'An overview of gradient descent optimization algorithms http://buff.ly/2dldKVO #DataScience #MachineLearning #IoT #BigData',\n", 47 | " 'Datafloq - 8 Easy Steps to Become a Data Scientist http://buff.ly/2en6TbA #DataScience #IoT #BigData #MachineLearning',\n", 48 | " 'Time to educate teachers about #datascience'\n", 49 | "]" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 21, 55 | "metadata": { 56 | "collapsed": false 57 | }, 58 | "outputs": [ 59 | { 60 | "data": { 61 | "text/plain": [ 62 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 63 | " dtype=, encoding='utf-8', input='content',\n", 64 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 65 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 66 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 67 | " tokenizer=None, vocabulary=None)" 68 | ] 69 | }, 70 | "execution_count": 21, 71 | "metadata": {}, 72 | "output_type": "execute_result" 73 | } 74 | ], 75 | "source": [ 76 | "vect.fit(text)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 22, 82 | "metadata": { 83 | "collapsed": false 84 | }, 85 | "outputs": [ 86 | { 87 | "data": { 88 | "text/plain": [ 89 | "['10',\n", 90 | " '1970',\n", 91 | " '2016',\n", 92 | " '20m',\n", 93 | " '2ddsj3e',\n", 94 | " '2diuymo',\n", 95 | " '2djinud',\n", 96 | " '2dldkvo',\n", 97 | " '2duflqf',\n", 98 | " '2dz5lvp',\n", 99 | " '2e3gg21',\n", 100 | " '2ebk4aq',\n", 101 | " '2ecwika',\n", 102 | " '2eeuhue',\n", 103 | " '2eftvdq',\n", 104 | " '2ei1kky',\n", 105 | " '2ekekgi',\n", 106 | " '2en6tba',\n", 107 | " '2ep1jjk',\n", 108 | " '2ervn3s',\n", 109 | " '2erx7ci',\n", 110 | " '50',\n", 111 | " '5zv28z',\n", 112 | " 'about',\n", 113 | " 'adorei',\n", 114 | " 'ai',\n", 115 | " 'alberto',\n", 116 | " 'algorithms',\n", 117 | " 'algoritmo',\n", 118 | " 'an',\n", 119 | " 'and',\n", 120 | " 'articles',\n", 121 | " 'as',\n", 122 | " 'away',\n", 123 | " 'baseado',\n", 124 | " 'become',\n", 125 | " 'best',\n", 126 | " 'bigdata',\n", 127 | " 'blogs',\n", 128 | " 'brazil',\n", 129 | " 'brazilian',\n", 130 | " 'buff',\n", 131 | " 'by',\n", 132 | " 'café',\n", 133 | " 'cam',\n", 134 | " 'cancer',\n", 135 | " 'captain',\n", 136 | " 'carlos',\n", 137 | " 'central',\n", 138 | " 'coisas',\n", 139 | " 'com',\n", 140 | " 'copão',\n", 141 | " 'course',\n", 142 | " 'dadascience',\n", 143 | " 'das',\n", 144 | " 'data',\n", 145 | " 'datafloq',\n", 146 | " 'datascience',\n", 147 | " 'datasciencecentral',\n", 148 | " 'dataviz',\n", 149 | " 'day',\n", 150 | " 'de',\n", 151 | " 'deep',\n", 152 | " 'descent',\n", 153 | " 'device',\n", 154 | " 'easy',\n", 155 | " 'educate',\n", 156 | " 'esse',\n", 157 | " 'everyday',\n", 158 | " 
'experience',\n", 159 | " 'fantástico',\n", 160 | " 'feature',\n", 161 | " 'featuring',\n", 162 | " 'flip',\n", 163 | " 'for',\n", 164 | " 'from',\n", 165 | " 'further',\n", 166 | " 'genomic',\n", 167 | " 'gig',\n", 168 | " 'gl',\n", 169 | " 'global',\n", 170 | " 'goo',\n", 171 | " 'gradient',\n", 172 | " 'how',\n", 173 | " 'http',\n", 174 | " 'https',\n", 175 | " 'ibm',\n", 176 | " 'impact',\n", 177 | " 'implementing',\n", 178 | " 'improving',\n", 179 | " 'in',\n", 180 | " 'insiders',\n", 181 | " 'internet',\n", 182 | " 'interview',\n", 183 | " 'introduction',\n", 184 | " 'iot',\n", 185 | " 'it',\n", 186 | " 'kaggle',\n", 187 | " 'know',\n", 188 | " 'languages',\n", 189 | " 'latest',\n", 190 | " 'learn',\n", 191 | " 'learning',\n", 192 | " 'libraries',\n", 193 | " 'life',\n", 194 | " 'ly',\n", 195 | " 'machine',\n", 196 | " 'machinelearning',\n", 197 | " 'make',\n", 198 | " 'maker',\n", 199 | " 'matplotlib',\n", 200 | " 'monitoramento',\n", 201 | " 'moving',\n", 202 | " 'na',\n", 203 | " 'need',\n", 204 | " 'networks',\n", 205 | " 'neural',\n", 206 | " 'of',\n", 207 | " 'oi',\n", 208 | " 'optimization',\n", 209 | " 'overview',\n", 210 | " 'own',\n", 211 | " 'passed',\n", 212 | " 'platform',\n", 213 | " 'podcasts',\n", 214 | " 'predictions',\n", 215 | " 'preparation',\n", 216 | " 'profiles',\n", 217 | " 'programming',\n", 218 | " 'python',\n", 219 | " 'raises',\n", 220 | " 'releases',\n", 221 | " 'ripcapita',\n", 222 | " 'science',\n", 223 | " 'scientist',\n", 224 | " 'scikit',\n", 225 | " 'selection',\n", 226 | " 'sequencing',\n", 227 | " 'service',\n", 228 | " 'serviço',\n", 229 | " 'sets',\n", 230 | " 'simulated',\n", 231 | " 'smart',\n", 232 | " 'smartfrog',\n", 233 | " 'startup',\n", 234 | " 'steps',\n", 235 | " 'svm',\n", 236 | " 'teachers',\n", 237 | " 'team',\n", 238 | " 'tensorflow',\n", 239 | " 'testa',\n", 240 | " 'text',\n", 241 | " 'the',\n", 242 | " 'things',\n", 243 | " 'this',\n", 244 | " 'time',\n", 245 | " 'tips',\n", 246 | " 'to',\n", 247 | " 'tools',\n", 248 | " 'top',\n", 249 | " 'topics',\n", 250 | " 'torres',\n", 251 | " 'treatment',\n", 252 | " 'tricks',\n", 253 | " 'tutorial',\n", 254 | " 'use',\n", 255 | " 'using',\n", 256 | " 'variable',\n", 257 | " 'videos',\n", 258 | " 'vincent',\n", 259 | " 'warming',\n", 260 | " 'watson',\n", 261 | " 'watsons',\n", 262 | " 'will',\n", 263 | " 'with',\n", 264 | " 'wtvufo',\n", 265 | " 'www',\n", 266 | " 'you',\n", 267 | " 'your',\n", 268 | " 'youtube']" 269 | ] 270 | }, 271 | "execution_count": 22, 272 | "metadata": {}, 273 | "output_type": "execute_result" 274 | } 275 | ], 276 | "source": [ 277 | "vect.get_feature_names()" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 23, 283 | "metadata": { 284 | "collapsed": false 285 | }, 286 | "outputs": [ 287 | { 288 | "data": { 289 | "text/plain": [ 290 | "<23x180 sparse matrix of type ''\n", 291 | "\twith 346 stored elements in Compressed Sparse Row format>" 292 | ] 293 | }, 294 | "execution_count": 23, 295 | "metadata": {}, 296 | "output_type": "execute_result" 297 | } 298 | ], 299 | "source": [ 300 | "simple_train_dtm = vect.transform(text)\n", 301 | "simple_train_dtm" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 24, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "array([[0, 0, 0, ..., 0, 0, 0],\n", 315 | " [0, 0, 0, ..., 0, 1, 0],\n", 316 | " [0, 1, 0, ..., 0, 0, 0],\n", 317 | " ..., \n", 318 | " [0, 0, 0, ..., 0, 0, 0],\n", 319 | " [0, 0, 0, 
..., 0, 0, 0],\n", 320 | " [0, 0, 0, ..., 0, 0, 0]])" 321 | ] 322 | }, 323 | "execution_count": 24, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "simple_train_dtm.toarray()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 25, 335 | "metadata": { 336 | "collapsed": false 337 | }, 338 | "outputs": [ 339 | { 340 | "data": { 341 | "text/plain": [ 342 | "['10',\n", 343 | " '1970',\n", 344 | " '2016',\n", 345 | " '20m',\n", 346 | " '2ddsj3e',\n", 347 | " '2diuymo',\n", 348 | " '2djinud',\n", 349 | " '2dldkvo',\n", 350 | " '2duflqf',\n", 351 | " '2dz5lvp',\n", 352 | " '2e3gg21',\n", 353 | " '2ebk4aq',\n", 354 | " '2ecwika',\n", 355 | " '2eeuhue',\n", 356 | " '2eftvdq',\n", 357 | " '2ei1kky',\n", 358 | " '2ekekgi',\n", 359 | " '2en6tba',\n", 360 | " '2ep1jjk',\n", 361 | " '2ervn3s',\n", 362 | " '2erx7ci',\n", 363 | " '50',\n", 364 | " '5zv28z',\n", 365 | " 'about',\n", 366 | " 'adorei',\n", 367 | " 'ai',\n", 368 | " 'alberto',\n", 369 | " 'algorithms',\n", 370 | " 'algoritmo',\n", 371 | " 'an',\n", 372 | " 'and',\n", 373 | " 'articles',\n", 374 | " 'as',\n", 375 | " 'away',\n", 376 | " 'baseado',\n", 377 | " 'become',\n", 378 | " 'best',\n", 379 | " 'bigdata',\n", 380 | " 'blogs',\n", 381 | " 'brazil',\n", 382 | " 'brazilian',\n", 383 | " 'buff',\n", 384 | " 'by',\n", 385 | " 'café',\n", 386 | " 'cam',\n", 387 | " 'cancer',\n", 388 | " 'captain',\n", 389 | " 'carlos',\n", 390 | " 'central',\n", 391 | " 'coisas',\n", 392 | " 'com',\n", 393 | " 'copão',\n", 394 | " 'course',\n", 395 | " 'dadascience',\n", 396 | " 'das',\n", 397 | " 'data',\n", 398 | " 'datafloq',\n", 399 | " 'datascience',\n", 400 | " 'datasciencecentral',\n", 401 | " 'dataviz',\n", 402 | " 'day',\n", 403 | " 'de',\n", 404 | " 'deep',\n", 405 | " 'descent',\n", 406 | " 'device',\n", 407 | " 'easy',\n", 408 | " 'educate',\n", 409 | " 'esse',\n", 410 | " 'everyday',\n", 411 | " 'experience',\n", 412 | " 'fantástico',\n", 413 | " 'feature',\n", 414 | " 'featuring',\n", 415 | " 'flip',\n", 416 | " 'for',\n", 417 | " 'from',\n", 418 | " 'further',\n", 419 | " 'genomic',\n", 420 | " 'gig',\n", 421 | " 'gl',\n", 422 | " 'global',\n", 423 | " 'goo',\n", 424 | " 'gradient',\n", 425 | " 'how',\n", 426 | " 'http',\n", 427 | " 'https',\n", 428 | " 'ibm',\n", 429 | " 'impact',\n", 430 | " 'implementing',\n", 431 | " 'improving',\n", 432 | " 'in',\n", 433 | " 'insiders',\n", 434 | " 'internet',\n", 435 | " 'interview',\n", 436 | " 'introduction',\n", 437 | " 'iot',\n", 438 | " 'it',\n", 439 | " 'kaggle',\n", 440 | " 'know',\n", 441 | " 'languages',\n", 442 | " 'latest',\n", 443 | " 'learn',\n", 444 | " 'learning',\n", 445 | " 'libraries',\n", 446 | " 'life',\n", 447 | " 'ly',\n", 448 | " 'machine',\n", 449 | " 'machinelearning',\n", 450 | " 'make',\n", 451 | " 'maker',\n", 452 | " 'matplotlib',\n", 453 | " 'monitoramento',\n", 454 | " 'moving',\n", 455 | " 'na',\n", 456 | " 'need',\n", 457 | " 'networks',\n", 458 | " 'neural',\n", 459 | " 'of',\n", 460 | " 'oi',\n", 461 | " 'optimization',\n", 462 | " 'overview',\n", 463 | " 'own',\n", 464 | " 'passed',\n", 465 | " 'platform',\n", 466 | " 'podcasts',\n", 467 | " 'predictions',\n", 468 | " 'preparation',\n", 469 | " 'profiles',\n", 470 | " 'programming',\n", 471 | " 'python',\n", 472 | " 'raises',\n", 473 | " 'releases',\n", 474 | " 'ripcapita',\n", 475 | " 'science',\n", 476 | " 'scientist',\n", 477 | " 'scikit',\n", 478 | " 'selection',\n", 479 | " 'sequencing',\n", 480 | " 'service',\n", 481 | " 'serviço',\n", 
482 | " 'sets',\n", 483 | " 'simulated',\n", 484 | " 'smart',\n", 485 | " 'smartfrog',\n", 486 | " 'startup',\n", 487 | " 'steps',\n", 488 | " 'svm',\n", 489 | " 'teachers',\n", 490 | " 'team',\n", 491 | " 'tensorflow',\n", 492 | " 'testa',\n", 493 | " 'text',\n", 494 | " 'the',\n", 495 | " 'things',\n", 496 | " 'this',\n", 497 | " 'time',\n", 498 | " 'tips',\n", 499 | " 'to',\n", 500 | " 'tools',\n", 501 | " 'top',\n", 502 | " 'topics',\n", 503 | " 'torres',\n", 504 | " 'treatment',\n", 505 | " 'tricks',\n", 506 | " 'tutorial',\n", 507 | " 'use',\n", 508 | " 'using',\n", 509 | " 'variable',\n", 510 | " 'videos',\n", 511 | " 'vincent',\n", 512 | " 'warming',\n", 513 | " 'watson',\n", 514 | " 'watsons',\n", 515 | " 'will',\n", 516 | " 'with',\n", 517 | " 'wtvufo',\n", 518 | " 'www',\n", 519 | " 'you',\n", 520 | " 'your',\n", 521 | " 'youtube']" 522 | ] 523 | }, 524 | "execution_count": 25, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "\n", 531 | "vocab = list(vect.get_feature_names())\n", 532 | "vocab" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 36, 538 | "metadata": { 539 | "collapsed": false 540 | }, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "[('iot', 21),\n", 546 | " ('http', 19),\n", 547 | " ('datascience', 18),\n", 548 | " ('machinelearning', 17),\n", 549 | " ('bigdata', 17),\n", 550 | " ('buff', 17),\n", 551 | " ('ly', 17),\n", 552 | " ('to', 8),\n", 553 | " ('the', 7),\n", 554 | " ('and', 6),\n", 555 | " ('data', 6),\n", 556 | " ('an', 5),\n", 557 | " ('of', 5),\n", 558 | " ('internet', 4),\n", 559 | " ('introduction', 4),\n", 560 | " ('with', 4),\n", 561 | " ('de', 3),\n", 562 | " ('things', 3),\n", 563 | " ('in', 3),\n", 564 | " ('science', 3),\n", 565 | " ('neural', 2),\n", 566 | " ('about', 2),\n", 567 | " ('networks', 2),\n", 568 | " ('feature', 2),\n", 569 | " ('tensorflow', 2),\n", 570 | " ('ibm', 2),\n", 571 | " ('variable', 2),\n", 572 | " ('learning', 2),\n", 573 | " ('selection', 2),\n", 574 | " ('youtube', 2),\n", 575 | " ('your', 2),\n", 576 | " ('predictions', 2),\n", 577 | " ('10', 2),\n", 578 | " ('captain', 2),\n", 579 | " ('by', 1),\n", 580 | " ('podcasts', 1),\n", 581 | " ('tools', 1),\n", 582 | " ('team', 1),\n", 583 | " ('text', 1),\n", 584 | " ('genomic', 1),\n", 585 | " ('languages', 1),\n", 586 | " ('esse', 1),\n", 587 | " ('2diuymo', 1),\n", 588 | " ('maker', 1),\n", 589 | " ('libraries', 1),\n", 590 | " ('learn', 1),\n", 591 | " ('interview', 1),\n", 592 | " ('gl', 1),\n", 593 | " ('scientist', 1),\n", 594 | " ('café', 1),\n", 595 | " ('everyday', 1),\n", 596 | " ('2duflqf', 1),\n", 597 | " ('cam', 1),\n", 598 | " ('baseado', 1),\n", 599 | " ('away', 1),\n", 600 | " ('device', 1),\n", 601 | " ('watsons', 1),\n", 602 | " ('improving', 1),\n", 603 | " ('programming', 1),\n", 604 | " ('overview', 1),\n", 605 | " ('warming', 1),\n", 606 | " ('2ecwika', 1),\n", 607 | " ('how', 1),\n", 608 | " ('own', 1),\n", 609 | " ('make', 1),\n", 610 | " ('machine', 1),\n", 611 | " ('steps', 1),\n", 612 | " ('kaggle', 1),\n", 613 | " ('raises', 1),\n", 614 | " ('svm', 1),\n", 615 | " ('vincent', 1),\n", 616 | " ('time', 1),\n", 617 | " ('python', 1),\n", 618 | " ('datasciencecentral', 1),\n", 619 | " ('copão', 1),\n", 620 | " ('best', 1),\n", 621 | " ('need', 1),\n", 622 | " ('datafloq', 1),\n", 623 | " ('das', 1),\n", 624 | " ('2erx7ci', 1),\n", 625 | " ('testa', 1),\n", 626 | " ('flip', 1),\n", 627 | " ('become', 1),\n", 628 | " ('2ekekgi', 1),\n", 629 | " 
('fantástico', 1),\n", 630 | " ('platform', 1),\n", 631 | " ('serviço', 1),\n", 632 | " ('smart', 1),\n", 633 | " ('scikit', 1),\n", 634 | " ('tutorial', 1),\n", 635 | " ('cancer', 1),\n", 636 | " ('ai', 1),\n", 637 | " ('top', 1),\n", 638 | " ('2ei1kky', 1),\n", 639 | " ('it', 1),\n", 640 | " ('startup', 1),\n", 641 | " ('sets', 1),\n", 642 | " ('2ep1jjk', 1),\n", 643 | " ('from', 1),\n", 644 | " ('algoritmo', 1),\n", 645 | " ('2eftvdq', 1),\n", 646 | " ('2dz5lvp', 1),\n", 647 | " ('blogs', 1),\n", 648 | " ('50', 1),\n", 649 | " ('easy', 1),\n", 650 | " ('dataviz', 1),\n", 651 | " ('further', 1),\n", 652 | " ('5zv28z', 1),\n", 653 | " ('central', 1),\n", 654 | " ('goo', 1),\n", 655 | " ('topics', 1),\n", 656 | " ('2e3gg21', 1),\n", 657 | " ('preparation', 1),\n", 658 | " ('implementing', 1),\n", 659 | " ('2eeuhue', 1),\n", 660 | " ('descent', 1),\n", 661 | " ('as', 1),\n", 662 | " ('20m', 1),\n", 663 | " ('using', 1),\n", 664 | " ('treatment', 1),\n", 665 | " ('latest', 1),\n", 666 | " ('will', 1),\n", 667 | " ('releases', 1),\n", 668 | " ('monitoramento', 1),\n", 669 | " ('https', 1),\n", 670 | " ('alberto', 1),\n", 671 | " ('watson', 1),\n", 672 | " ('ripcapita', 1),\n", 673 | " ('torres', 1),\n", 674 | " ('course', 1),\n", 675 | " ('featuring', 1),\n", 676 | " ('brazil', 1),\n", 677 | " ('wtvufo', 1),\n", 678 | " ('coisas', 1),\n", 679 | " ('use', 1),\n", 680 | " ('passed', 1),\n", 681 | " ('oi', 1),\n", 682 | " ('optimization', 1),\n", 683 | " ('moving', 1),\n", 684 | " ('com', 1),\n", 685 | " ('know', 1),\n", 686 | " ('simulated', 1),\n", 687 | " ('2ervn3s', 1),\n", 688 | " ('you', 1),\n", 689 | " ('www', 1),\n", 690 | " ('this', 1),\n", 691 | " ('dadascience', 1),\n", 692 | " ('adorei', 1),\n", 693 | " ('educate', 1),\n", 694 | " ('for', 1),\n", 695 | " ('1970', 1),\n", 696 | " ('2en6tba', 1),\n", 697 | " ('teachers', 1),\n", 698 | " ('matplotlib', 1),\n", 699 | " ('global', 1),\n", 700 | " ('sequencing', 1),\n", 701 | " ('life', 1),\n", 702 | " ('2ebk4aq', 1),\n", 703 | " ('insiders', 1),\n", 704 | " ('gig', 1),\n", 705 | " ('carlos', 1),\n", 706 | " ('2016', 1),\n", 707 | " ('impact', 1),\n", 708 | " ('day', 1),\n", 709 | " ('2ddsj3e', 1),\n", 710 | " ('profiles', 1),\n", 711 | " ('experience', 1),\n", 712 | " ('brazilian', 1),\n", 713 | " ('smartfrog', 1),\n", 714 | " ('deep', 1),\n", 715 | " ('gradient', 1),\n", 716 | " ('na', 1),\n", 717 | " ('videos', 1),\n", 718 | " ('service', 1),\n", 719 | " ('tricks', 1),\n", 720 | " ('algorithms', 1),\n", 721 | " ('tips', 1),\n", 722 | " ('2dldkvo', 1),\n", 723 | " ('2djinud', 1),\n", 724 | " ('articles', 1)]" 725 | ] 726 | }, 727 | "execution_count": 36, 728 | "metadata": {}, 729 | "output_type": "execute_result" 730 | } 731 | ], 732 | "source": [ 733 | "counts = simple_train_dtm.sum(axis=0).A1\n", 734 | "\n", 735 | "freq_distribution = Counter(dict(zip(vocab, counts)))\n", 736 | "##print (freq_distribution.most_common(100))\n", 737 | "list(freq_distribution.most_common())\n", 738 | "\n" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": { 745 | "collapsed": true 746 | }, 747 | "outputs": [], 748 | "source": [] 749 | } 750 | ], 751 | "metadata": { 752 | "kernelspec": { 753 | "display_name": "Python [conda root]", 754 | "language": "python", 755 | "name": "conda-root-py" 756 | }, 757 | "language_info": { 758 | "codemirror_mode": { 759 | "name": "ipython", 760 | "version": 3 761 | }, 762 | "file_extension": ".py", 763 | "mimetype": "text/x-python", 764 | "name": "python", 765 | 
"nbconvert_exporter": "python", 766 | "pygments_lexer": "ipython3", 767 | "version": "3.5.2" 768 | } 769 | }, 770 | "nbformat": 4, 771 | "nbformat_minor": 1 772 | } 773 | -------------------------------------------------------------------------------- /AnaliseTexto/AnaliseDeSentimento.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "nbpresent": { 7 | "id": "e4c7d791-d39c-4247-a950-8f541b2b2b2b" 8 | }, 9 | "slideshow": { 10 | "slide_type": "-" 11 | } 12 | }, 13 | "source": [ 14 | "# Classificação de textos com *scikit-learn*\n", 15 | "por Prof. Sanderson Macedo" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "nbpresent": { 22 | "id": "918ce0e7-8f69-4d3c-8106-d3c5264c94e3" 23 | }, 24 | "slideshow": { 25 | "slide_type": "-" 26 | } 27 | }, 28 | "source": [ 29 | "
" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": { 35 | "nbpresent": { 36 | "id": "ca5fe97a-0224-4915-a59d-38e6baa218a2" 37 | } 38 | }, 39 | "source": [ 40 | "## Agenda\n", 41 | "\n", 42 | "\n", 43 | "1. Representar um texto como dados numéricos\n", 44 | "2. Ler o *dataset* de texto no Pandas\n", 45 | "2. Vetorizar nossso *dataset*\n", 46 | "4. Construir e avaliar um modelo\n", 47 | "5. Comparar modelos\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 353, 53 | "metadata": { 54 | "collapsed": true, 55 | "nbpresent": { 56 | "id": "d2e20804-da18-483c-bd40-8c25e2d4699c" 57 | } 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "##Importando pandas e numpy\n", 62 | "import pandas as pd\n", 63 | "import numpy as np" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": { 69 | "nbpresent": { 70 | "id": "76e5a32a-69c4-4dc5-a66b-23d2cca623af" 71 | } 72 | }, 73 | "source": [ 74 | "## 1. Definindo um vetor de textos \n", 75 | "Os textos do vetor podem ser adquiridos por meio da leitura de \n", 76 | "pdf's, doc's, twitter's... etc.\n", 77 | "\n", 78 | "Esses textos serão a base de treinamento\n", 79 | "para a classificação do sentimento de um novo texto." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 354, 85 | "metadata": { 86 | "collapsed": false, 87 | "nbpresent": { 88 | "id": "56bab267-0993-4d7a-9436-11bc5de3d1d3" 89 | } 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "train = [\n", 94 | " 'Eu te amo',\n", 95 | " 'Você é algo assim... é tudo pra mim. Ao meu amor... Amor!',\n", 96 | " 'Eu te odeio muito, você não presta!',\n", 97 | " 'Não gosto de você'\n", 98 | " ]" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "nbpresent": { 105 | "id": "fc1fc669-a603-412e-8855-837d750718ff" 106 | } 107 | }, 108 | "source": [ 109 | "## 2. Definindo um vetor de sentimentos\n", 110 | "Criaremos um vetor de sentimentos chamado **_felling_**. \n", 111 | "\n", 112 | "Cada posição do vetor **_felling_** representa o sentimento **BOM** (1) ou **RUIM** (0) para os textos que passamos ao vetor **_train_**.\n", 113 | "\n", 114 | "Por exemplo: a frase da primeira posição do vetor **_train_**:\n", 115 | "\n", 116 | "> 'Eu te amo'\n", 117 | "\n", 118 | "Foi classificada como sendo um texto **BOM**:\n", 119 | "\n", 120 | "> 1" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 355, 126 | "metadata": { 127 | "collapsed": true, 128 | "nbpresent": { 129 | "id": "68a4277e-e38c-42ac-8528-0b90efe86e42" 130 | } 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "felling = [1,1,0,0]" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": { 140 | "nbpresent": { 141 | "id": "f43ff54a-e843-4a35-8447-66665f36ebca" 142 | } 143 | }, 144 | "source": [ 145 | "## 3. Análise de texto com _scikit-learn_.\n", 146 | "\n", 147 | "Texto de [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 148 | "\n", 149 | "> Análise de texto é um campo de aplicação importante para algoritmos de aprendizado de máquina. 
150 | "\n",
151 | "In this case, though, we can apply a few transformations so that learning algorithms can handle text.\n",
152 | "\n",
153 | "Here we will therefore use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)\n",
154 | "to convert the texts into a matrix of \"token\" counts.\n",
155 | "\n",
156 | "We import the class and create an instance called **_vect_**.\n"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 356,
162 | "metadata": {
163 | "collapsed": false,
164 | "nbpresent": {
165 | "id": "1ada59d7-f1ba-4625-8999-b8af5aaf461c"
166 | }
167 | },
168 | "outputs": [],
169 | "source": [
170 | "from sklearn.feature_extraction.text import CountVectorizer\n",
171 | "vect = CountVectorizer()"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {
177 | "nbpresent": {
178 | "id": "154ef867-0532-45ad-9910-c87f6711d1b0"
179 | }
180 | },
181 | "source": [
182 | "## 4. Training: building the dictionary.\n",
183 | "Now we train the algorithm with the vector of texts created above, calling the **_fit()_** method with that vector."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 357,
189 | "metadata": {
190 | "collapsed": false,
191 | "nbpresent": {
192 | "id": "eff3a289-8c0d-4374-9400-d988a6b36624"
193 | }
194 | },
195 | "outputs": [
196 | {
197 | "data": {
198 | "text/plain": [
199 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
200 | " dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
201 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
202 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
203 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
204 | " tokenizer=None, vocabulary=None)"
205 | ]
206 | },
207 | "execution_count": 357,
208 | "metadata": {},
209 | "output_type": "execute_result"
210 | }
211 | ],
212 | "source": [
213 | "vect.fit(train)"
214 | ]
215 | },
216 | {
217 | "cell_type": "markdown",
218 | "metadata": {},
219 | "source": [
220 | "Note that the *analyzer* parameter of *CountVectorizer* defaults to *'word'*. With the default token pattern, this means the class ignores punctuation and words with fewer than two (2) characters. "
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {
226 | "nbpresent": {
227 | "id": "d4093cdd-6b19-4fed-9a01-5ee02f41ca51"
228 | }
229 | },
230 | "source": [
231 | "## 5. Our dictionary of words\n",
232 | "Here we list the words that were used in the **_train_** texts, forming our dictionary of words. No word is repeated in this listing."
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": 358,
238 | "metadata": {
239 | "collapsed": false,
240 | "nbpresent": {
241 | "id": "3ab9a844-7f38-40c5-a57f-4a2fbf3343ba"
242 | }
243 | },
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "['algo',\n",
249 | " 'amo',\n",
250 | " 'amor',\n",
251 | " 'ao',\n",
252 | " 'assim',\n",
253 | " 'de',\n",
254 | " 'eu',\n",
255 | " 'gosto',\n",
256 | " 'meu',\n",
257 | " 'mim',\n",
258 | " 'muito',\n",
259 | " 'não',\n",
260 | " 'odeio',\n",
261 | " 'pra',\n",
262 | " 'presta',\n",
263 | " 'te',\n",
264 | " 'tudo',\n",
265 | " 'você']"
266 | ]
267 | },
268 | "execution_count": 358,
269 | "metadata": {},
270 | "output_type": "execute_result"
271 | }
272 | ],
273 | "source": [
274 | "## examining the dictionary, in alphabetical order.\n",
275 | "vect.get_feature_names()"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## 6. Building an occurrence matrix\n",
283 | "\n",
284 | "\n",
285 | "\n",
286 | "The occurrence matrix records, for each text passed to the algorithm that built the dictionary, how often each word occurs in it.\n",
287 | "This transformation creates a matrix where:\n",
288 | "\n",
289 | "1. Each row represents one text from the **_train_** vector \n",
290 | "2. Each column represents one word from the learned dictionary.\n",
291 | "3. Each value is the number of times the word occurs in that text (0 when it does not occur).\n",
292 | "\n",
293 | "For example,\n",
294 | "the first row of the matrix is the phrase\n",
295 | "\n",
296 | "> Eu te amo\n",
297 | "\n",
298 | "This phrase has only three(3) words, **_eu_**, **_te_** and **_amo_**, which are marked in the matrix with the number of times they occur in the text, in this case **_1_**; the remaining dictionary words are marked with zero(0), since they do not appear in the text.\n",
299 | "\n",
300 | "In the second phrase\n",
301 | "\n",
302 | "> Você é algo assim... é tudo pra mim. Ao meu amor... Amor!\n",
303 | "\n",
304 | "the word **_amor_** occurs twice(2), which is why the third position holds the value 2. "
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 359,
310 | "metadata": {
311 | "collapsed": false,
312 | "nbpresent": {
313 | "id": "34cfd603-24de-4379-9a69-353ba0e50fba"
314 | }
315 | },
316 | "outputs": [
317 | {
318 | "data": {
319 | "text/plain": [
320 | "array([[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],\n",
321 | " [1, 0, 2, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1],\n",
322 | " [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1],\n",
323 | " [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])"
324 | ]
325 | },
326 | "execution_count": 359,
327 | "metadata": {},
328 | "output_type": "execute_result"
329 | }
330 | ],
331 | "source": [
332 | "simple_train_dtm = vect.transform(train)\n",
333 | "simple_train_dtm.toarray()"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "#### Creating a pandas *dataframe* for a better view of the data."
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 360,
346 | "metadata": {
347 | "collapsed": false,
348 | "nbpresent": {
349 | "id": "2e563c0f-37c5-4861-85c6-9185c20e3507"
350 | }
351 | },
352 | "outputs": [
353 | {
354 | "data": {
\n", 357 | "\n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | "
algoamoamoraoassimdeeugostomeumimmuitonãoodeiopraprestatetudovocê
Eu te amo010000100000000100
Você é algo assim... é tudo pra mim. Ao meu amor... Amor!102110001100010011
Eu te odeio muito, você não presta!000000100011101101
Não gosto de você000001010001000001
\n", 468 | "
" 469 | ], 470 | "text/plain": [ 471 | " algo amo amor ao \\\n", 472 | "Eu te amo 0 1 0 0 \n", 473 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 2 1 \n", 474 | "Eu te odeio muito, você não presta! 0 0 0 0 \n", 475 | "Não gosto de você 0 0 0 0 \n", 476 | "\n", 477 | " assim de eu gosto meu \\\n", 478 | "Eu te amo 0 0 1 0 0 \n", 479 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 1 \n", 480 | "Eu te odeio muito, você não presta! 0 0 1 0 0 \n", 481 | "Não gosto de você 0 1 0 1 0 \n", 482 | "\n", 483 | " mim muito não odeio \\\n", 484 | "Eu te amo 0 0 0 0 \n", 485 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 0 \n", 486 | "Eu te odeio muito, você não presta! 0 1 1 1 \n", 487 | "Não gosto de você 0 0 1 0 \n", 488 | "\n", 489 | " pra presta te tudo \\\n", 490 | "Eu te amo 0 0 1 0 \n", 491 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 0 0 1 \n", 492 | "Eu te odeio muito, você não presta! 0 1 1 0 \n", 493 | "Não gosto de você 0 0 0 0 \n", 494 | "\n", 495 | " você \n", 496 | "Eu te amo 0 \n", 497 | "Você é algo assim... é tudo pra mim. Ao meu amo... 1 \n", 498 | "Eu te odeio muito, você não presta! 1 \n", 499 | "Não gosto de você 1 " 500 | ] 501 | }, 502 | "execution_count": 360, 503 | "metadata": {}, 504 | "output_type": "execute_result" 505 | } 506 | ], 507 | "source": [ 508 | "df = pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names(), index=train)\n", 509 | "df" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "## 7. esparsividade\n", 517 | "A matriz de ocorrência é uma matriz normalmente muito esparsa, ou seja, com muitos valores zero. Essa quantidade de zeros na matriz aumenta substâncialmente o processamento das informações para a classificação de um novo texto. Portanto, a matriz esparsa ficará melhor representada pela ocorrência sem os valores zero.\n", 518 | "A linha abaixo mostra que a matriz é do tipo esparsa.\n" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 361, 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "scipy.sparse.csr.csr_matrix" 532 | ] 533 | }, 534 | "execution_count": 361, 535 | "metadata": {}, 536 | "output_type": "execute_result" 537 | } 538 | ], 539 | "source": [ 540 | "type(simple_train_dtm)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "O comando anterior mostra os mesmos valores da matriz de ocorrências de palavras só que retirando as não ocorrências.\n", 548 | "\n", 549 | "Por exemplo:\n", 550 | "As três(3) primeiras linhas da impressão do comando se refere a frase:\n", 551 | "\n", 552 | "> Eu te amo\n", 553 | "\n", 554 | "(0, 1)\t1
\n", 555 | "(0, 6)\t1
\n", 556 | "(0, 15)\t1
\n", 557 | "\n", 558 | "Essa é a frase zero(0) ou seja a primeira frase. os valores 1, 6, 16 é posição da matriz onde ocorres as palavras [amo, eu, te] (em ordem alfabética), e os valores 1 são as quantidades de ocorrências de cada palavra nessa frase" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 362, 564 | "metadata": { 565 | "collapsed": false, 566 | "nbpresent": { 567 | "id": "95d91cb6-e3f8-4b4b-ab82-900f8719f4db" 568 | } 569 | }, 570 | "outputs": [ 571 | { 572 | "name": "stdout", 573 | "output_type": "stream", 574 | "text": [ 575 | " (0, 1)\t1\n", 576 | " (0, 6)\t1\n", 577 | " (0, 15)\t1\n", 578 | " (1, 0)\t1\n", 579 | " (1, 2)\t2\n", 580 | " (1, 3)\t1\n", 581 | " (1, 4)\t1\n", 582 | " (1, 8)\t1\n", 583 | " (1, 9)\t1\n", 584 | " (1, 13)\t1\n", 585 | " (1, 16)\t1\n", 586 | " (1, 17)\t1\n", 587 | " (2, 6)\t1\n", 588 | " (2, 10)\t1\n", 589 | " (2, 11)\t1\n", 590 | " (2, 12)\t1\n", 591 | " (2, 14)\t1\n", 592 | " (2, 15)\t1\n", 593 | " (2, 17)\t1\n", 594 | " (3, 5)\t1\n", 595 | " (3, 7)\t1\n", 596 | " (3, 11)\t1\n", 597 | " (3, 17)\t1\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "print(simple_train_dtm)" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": {}, 608 | "source": [ 609 | "Normalmente muitos documentos usarão somente um pequeno subconjuto das palavras do nosso *dicionário*, por isso a matriz resultante terá vários valores zerados nas palavras (basicamente mais de 99% delas)\n", 610 | "\n", 611 | "Por exemplo, um conjunto de **dez mil (10.000)** pequenos textos (tais como emails) terá um vocabulário da ordem de **cem mil (100.000)** palavras únicas. Porém cada texto normalmente usará somente **cem (100)** palavras únicas individualmente \n", 612 | "\n", 613 | "Visando o armazenamento dessa matri e a aceleração de operações, algoritimos normalmente usam a representação esparsa como a implementação disponível no pacote **_scipy.sparse_**" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "## 8. Classificações" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "### 8.1 Classificando um novo texto\n", 628 | "\n", 629 | "Nosso objetivo é inferir se um novo texto é **BOM** ou **RUIM**\n", 630 | "tendo como base os textos anteriormente classificados.\n", 631 | "o vetor ***novo_texto*** contém um novo texto obtido e que será classificado por nosso algoritmo de aprendizagem de máquina.\n", 632 | "\n", 633 | "Basicamente classificaremos o texto com o algoritmo ***KNN***." 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 372, 639 | "metadata": { 640 | "collapsed": false 641 | }, 642 | "outputs": [], 643 | "source": [ 644 | "novo_texto = ['te odeio']" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "#### Criando a matriz de ocorrência para o novo texto\n", 652 | "A matriz ***simple_test_dtm*** é que será usada para a nova classificação" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 373, 658 | "metadata": { 659 | "collapsed": false 660 | }, 661 | "outputs": [ 662 | { 663 | "data": { 664 | "text/html": [ 665 | "
\n", 666 | "\n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | "
algoamoamoraoassimdeeugostomeumimmuitonãoodeiopraprestatetudovocê
te odeio000000000000100100
\n", 714 | "
" 715 | ], 716 | "text/plain": [ 717 | " algo amo amor ao assim de eu gosto meu mim muito não \\\n", 718 | "te odeio 0 0 0 0 0 0 0 0 0 0 0 0 \n", 719 | "\n", 720 | " odeio pra presta te tudo você \n", 721 | "te odeio 1 0 0 1 0 0 " 722 | ] 723 | }, 724 | "execution_count": 373, 725 | "metadata": {}, 726 | "output_type": "execute_result" 727 | } 728 | ], 729 | "source": [ 730 | "simple_test_dtm = vect.transform(novo_texto)\n", 731 | "\n", 732 | "##criando a visualização da matriz de ocorrência\n", 733 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names(), index=novo_texto)" 734 | ] 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "### 8.2 Classificador KNN\n", 741 | "\n", 742 | "Importando o classificador KNN do scikit-learn\n", 743 | "\n", 744 | "Referência sobre o classificador KNN você pode acessar o [wikpedia-KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) e a referência do [KNN no scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) " 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "execution_count": 374, 750 | "metadata": { 751 | "collapsed": true 752 | }, 753 | "outputs": [], 754 | "source": [ 755 | "## importanto o classificador\n", 756 | "from sklearn.neighbors import KNeighborsClassifier" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "Treinando o classificador KNN" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 375, 769 | "metadata": { 770 | "collapsed": false 771 | }, 772 | "outputs": [ 773 | { 774 | "data": { 775 | "text/plain": [ 776 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 777 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 778 | " weights='uniform')" 779 | ] 780 | }, 781 | "execution_count": 375, 782 | "metadata": {}, 783 | "output_type": "execute_result" 784 | } 785 | ], 786 | "source": [ 787 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 788 | "knn.fit(simple_train_dtm, felling)" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "### 8.3 Gerando uma classificação\n", 796 | "Para isso utiliza-se o método ***predict()*** do classificador" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 376, 802 | "metadata": { 803 | "collapsed": false 804 | }, 805 | "outputs": [ 806 | { 807 | "data": { 808 | "text/plain": [ 809 | "1" 810 | ] 811 | }, 812 | "execution_count": 376, 813 | "metadata": {}, 814 | "output_type": "execute_result" 815 | } 816 | ], 817 | "source": [ 818 | "fell = knn.predict(simple_test_dtm)[0]\n", 819 | "fell" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 377, 825 | "metadata": { 826 | "collapsed": false 827 | }, 828 | "outputs": [ 829 | { 830 | "name": "stdout", 831 | "output_type": "stream", 832 | "text": [ 833 | "Bom sentimento\n" 834 | ] 835 | } 836 | ], 837 | "source": [ 838 | "if fell==1:\n", 839 | " print(\"Bom sentimento\")\n", 840 | "else:\n", 841 | " print(\"Mal sentimento\")" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "metadata": { 848 | "collapsed": true 849 | }, 850 | "outputs": [], 851 | "source": [ 852 | "" 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": null, 858 | "metadata": { 859 | "collapsed": true 860 | }, 861 | "outputs": [], 862 | "source": [ 863 | "" 864 | ] 865 | }, 866 | { 867 | 
"cell_type": "code", 868 | "execution_count": 369, 869 | "metadata": { 870 | "collapsed": true 871 | }, 872 | "outputs": [], 873 | "source": [ 874 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": 370, 880 | "metadata": { 881 | "collapsed": false 882 | }, 883 | "outputs": [ 884 | { 885 | "data": { 886 | "text/html": [ 887 | "
\n", 888 | "\n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | " \n", 925 | " \n", 926 | " \n", 927 | " \n", 928 | " \n", 929 | " \n", 930 | " \n", 931 | " \n", 932 | " \n", 933 | " \n", 934 | " \n", 935 | " \n", 936 | " \n", 937 | " \n", 938 | " \n", 939 | " \n", 940 | " \n", 941 | " \n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
5spamFreeMsg Hey there darling it's been 3 week's n...
6hamEven my brother is not like to speak with me. ...
7hamAs per your request 'Melle Melle (Oru Minnamin...
8spamWINNER!! As a valued network customer you have...
9spamHad your mobile 11 months or more? U R entitle...
\n", 949 | "
" 950 | ], 951 | "text/plain": [ 952 | " label message\n", 953 | "0 ham Go until jurong point, crazy.. Available only ...\n", 954 | "1 ham Ok lar... Joking wif u oni...\n", 955 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 956 | "3 ham U dun say so early hor... U c already then say...\n", 957 | "4 ham Nah I don't think he goes to usf, he lives aro...\n", 958 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", 959 | "6 ham Even my brother is not like to speak with me. ...\n", 960 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", 961 | "8 spam WINNER!! As a valued network customer you have...\n", 962 | "9 spam Had your mobile 11 months or more? U R entitle..." 963 | ] 964 | }, 965 | "execution_count": 370, 966 | "metadata": {}, 967 | "output_type": "execute_result" 968 | } 969 | ], 970 | "source": [ 971 | "sms.head(10)" 972 | ] 973 | }, 974 | { 975 | "cell_type": "code", 976 | "execution_count": 371, 977 | "metadata": { 978 | "collapsed": false 979 | }, 980 | "outputs": [ 981 | { 982 | "data": { 983 | "text/plain": [ 984 | "ham 4825\n", 985 | "spam 747\n", 986 | "Name: label, dtype: int64" 987 | ] 988 | }, 989 | "execution_count": 371, 990 | "metadata": {}, 991 | "output_type": "execute_result" 992 | } 993 | ], 994 | "source": [ 995 | "sms.label.value_counts()" 996 | ] 997 | }, 998 | { 999 | "cell_type": "code", 1000 | "execution_count": null, 1001 | "metadata": { 1002 | "collapsed": true 1003 | }, 1004 | "outputs": [], 1005 | "source": [ 1006 | "" 1007 | ] 1008 | } 1009 | ], 1010 | "metadata": { 1011 | "kernelspec": { 1012 | "display_name": "Python [conda root]", 1013 | "language": "python", 1014 | "name": "conda-root-py" 1015 | }, 1016 | "language_info": { 1017 | "codemirror_mode": { 1018 | "name": "ipython", 1019 | "version": 3.0 1020 | }, 1021 | "file_extension": ".py", 1022 | "mimetype": "text/x-python", 1023 | "name": "python", 1024 | "nbconvert_exporter": "python", 1025 | "pygments_lexer": "ipython3", 1026 | "version": "3.5.2" 1027 | } 1028 | }, 1029 | "nbformat": 4, 1030 | "nbformat_minor": 0 1031 | } -------------------------------------------------------------------------------- /AnaliseTexto/.ipynb_checkpoints/tutorial-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. 
Comparing models" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# for Python 2: use print only as a function\n", 33 | "from __future__ import print_function" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1: Model building in scikit-learn (refresher)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# load the iris dataset as an example\n", 52 | "from sklearn.datasets import load_iris\n", 53 | "iris = load_iris()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "# store the feature matrix (X) and response vector (y)\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "(150, 4)\n", 88 | "(150,)\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "# check the shapes of X and y\n", 94 | "print(X.shape)\n", 95 | "print(y.shape)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "**\"Observations\"** are also known as samples, instances, or records." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 5, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/html": [ 115 | "
\n", 116 | "\n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", 164 | "
" 165 | ], 166 | "text/plain": [ 167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 168 | "0 5.1 3.5 1.4 0.2\n", 169 | "1 4.9 3.0 1.4 0.2\n", 170 | "2 4.7 3.2 1.3 0.2\n", 171 | "3 4.6 3.1 1.5 0.2\n", 172 | "4 5.0 3.6 1.4 0.2" 173 | ] 174 | }, 175 | "execution_count": 5, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 182 | "import pandas as pd\n", 183 | "pd.DataFrame(X, columns=iris.feature_names).head()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", 199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", 200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", 201 | " 2 2]\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# examine the response vector\n", 207 | "print(y)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 7, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/plain": [ 227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", 229 | " weights='uniform')" 230 | ] 231 | }, 232 | "execution_count": 7, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "# import the class\n", 239 | "from sklearn.neighbors import KNeighborsClassifier\n", 240 | "\n", 241 | "# instantiate the model (with the default parameters)\n", 242 | "knn = KNeighborsClassifier()\n", 243 | "\n", 244 | "# fit the model with data (occurs in-place)\n", 245 | "knn.fit(X, y)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 8, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [ 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "array([1])" 266 | ] 267 | }, 268 | "execution_count": 8, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "# predict the response for a new observation\n", 275 | "knn.predict([[3, 5, 4, 2]])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Part 2: Representing text as numerical data" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 9, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "# example text for model training (SMS messages)\n", 294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... 
PLEASE!']" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 10, 300 | "metadata": { 301 | "collapsed": true 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "# example response vector\n", 306 | "is_desperate = [0, 0, 1]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 314 | "\n", 315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 316 | "\n", 317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 11, 323 | "metadata": { 324 | "collapsed": true 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "# import and instantiate CountVectorizer (with the default parameters)\n", 329 | "from sklearn.feature_extraction.text import CountVectorizer\n", 330 | "vect = CountVectorizer()" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 12, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 344 | " dtype=, encoding='utf-8', input='content',\n", 345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 348 | " tokenizer=None, vocabulary=None)" 349 | ] 350 | }, 351 | "execution_count": 12, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 358 | "vect.fit(simple_train)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 13, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [ 368 | { 369 | "data": { 370 | "text/plain": [ 371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']" 372 | ] 373 | }, 374 | "execution_count": 13, 375 | "metadata": {}, 376 | "output_type": "execute_result" 377 | } 378 | ], 379 | "source": [ 380 | "# examine the fitted vocabulary\n", 381 | "vect.get_feature_names()" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 14, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "<3x6 sparse matrix of type ''\n", 395 | "\twith 9 stored elements in Compressed Sparse Row format>" 396 | ] 397 | }, 398 | "execution_count": 14, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "# transform training data into a 'document-term matrix'\n", 405 | "simple_train_dtm = vect.transform(simple_train)\n", 406 | "simple_train_dtm" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 15, 412 | "metadata": { 413 | "collapsed": false 414 | }, 415 | "outputs": [ 416 | { 417 | "data": { 418 | "text/plain": [ 419 | "array([[0, 1, 0, 0, 1, 1],\n", 420 | " [1, 1, 1, 0, 0, 
0],\n", 421 | " [0, 1, 1, 2, 0, 0]])" 422 | ] 423 | }, 424 | "execution_count": 15, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "# convert sparse matrix to a dense matrix\n", 431 | "simple_train_dtm.toarray()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 16, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/html": [ 444 | "
\n", 445 | "\n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", 487 | "
" 488 | ], 489 | "text/plain": [ 490 | " cab call me please tonight you\n", 491 | "0 0 1 0 0 1 1\n", 492 | "1 1 1 1 0 0 0\n", 493 | "2 0 1 1 2 0 0" 494 | ] 495 | }, 496 | "execution_count": 16, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "# examine the vocabulary and document-term matrix together\n", 503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 511 | "\n", 512 | "> In this scheme, features and samples are defined as follows:\n", 513 | "\n", 514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 516 | "\n", 517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 518 | "\n", 519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 17, 525 | "metadata": { 526 | "collapsed": false 527 | }, 528 | "outputs": [ 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "scipy.sparse.csr.csr_matrix" 533 | ] 534 | }, 535 | "execution_count": 17, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "# check the type of the document-term matrix\n", 542 | "type(simple_train_dtm)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 18, 548 | "metadata": { 549 | "collapsed": false, 550 | "scrolled": true 551 | }, 552 | "outputs": [ 553 | { 554 | "name": "stdout", 555 | "output_type": "stream", 556 | "text": [ 557 | " (0, 1)\t1\n", 558 | " (0, 4)\t1\n", 559 | " (0, 5)\t1\n", 560 | " (1, 0)\t1\n", 561 | " (1, 1)\t1\n", 562 | " (1, 2)\t1\n", 563 | " (2, 1)\t1\n", 564 | " (2, 2)\t1\n", 565 | " (2, 3)\t2\n" 566 | ] 567 | } 568 | ], 569 | "source": [ 570 | "# examine the sparse matrix contents\n", 571 | "print(simple_train_dtm)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 579 | "\n", 580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 581 | "\n", 582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 583 | "\n", 584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in 
the `scipy.sparse` package." 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 19, 590 | "metadata": { 591 | "collapsed": false 592 | }, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 599 | " weights='uniform')" 600 | ] 601 | }, 602 | "execution_count": 19, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "# build a model to predict desperation\n", 609 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 610 | "knn.fit(simple_train_dtm, is_desperate)" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 20, 616 | "metadata": { 617 | "collapsed": true 618 | }, 619 | "outputs": [], 620 | "source": [ 621 | "# example text for model testing\n", 622 | "simple_test = [\"please don't call me\"]" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 21, 635 | "metadata": { 636 | "collapsed": false 637 | }, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/plain": [ 642 | "array([[0, 1, 1, 1, 0, 0]])" 643 | ] 644 | }, 645 | "execution_count": 21, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | } 649 | ], 650 | "source": [ 651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 652 | "simple_test_dtm = vect.transform(simple_test)\n", 653 | "simple_test_dtm.toarray()" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 22, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/html": [ 666 | "
\n", 667 | "\n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | "
cabcallmepleasetonightyou
0011100
\n", 691 | "
" 692 | ], 693 | "text/plain": [ 694 | " cab call me please tonight you\n", 695 | "0 0 1 1 1 0 0" 696 | ] 697 | }, 698 | "execution_count": 22, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# examine the vocabulary and document-term matrix together\n", 705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 23, 711 | "metadata": { 712 | "collapsed": false 713 | }, 714 | "outputs": [ 715 | { 716 | "data": { 717 | "text/plain": [ 718 | "array([1])" 719 | ] 720 | }, 721 | "execution_count": 23, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# predict whether simple_test is desperate\n", 728 | "knn.predict(simple_test_dtm)" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "**Summary:**\n", 736 | "\n", 737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "## Part 3: Reading a text-based dataset into pandas" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 24, 752 | "metadata": { 753 | "collapsed": true 754 | }, 755 | "outputs": [], 756 | "source": [ 757 | "# read file into pandas from the working directory\n", 758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 25, 764 | "metadata": { 765 | "collapsed": false 766 | }, 767 | "outputs": [], 768 | "source": [ 769 | "# alternative: read file into pandas from a URL\n", 770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n", 771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 26, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/plain": [ 784 | "(5572, 2)" 785 | ] 786 | }, 787 | "execution_count": 26, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "# examine the shape\n", 794 | "sms.shape" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": 27, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [ 804 | { 805 | "data": { 806 | "text/html": [ 807 | "
\n", 808 | "\n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
5spamFreeMsg Hey there darling it's been 3 week's n...
6hamEven my brother is not like to speak with me. ...
7hamAs per your request 'Melle Melle (Oru Minnamin...
8spamWINNER!! As a valued network customer you have...
9spamHad your mobile 11 months or more? U R entitle...
\n", 869 | "
" 870 | ], 871 | "text/plain": [ 872 | " label message\n", 873 | "0 ham Go until jurong point, crazy.. Available only ...\n", 874 | "1 ham Ok lar... Joking wif u oni...\n", 875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 876 | "3 ham U dun say so early hor... U c already then say...\n", 877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n", 878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", 879 | "6 ham Even my brother is not like to speak with me. ...\n", 880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", 881 | "8 spam WINNER!! As a valued network customer you have...\n", 882 | "9 spam Had your mobile 11 months or more? U R entitle..." 883 | ] 884 | }, 885 | "execution_count": 27, 886 | "metadata": {}, 887 | "output_type": "execute_result" 888 | } 889 | ], 890 | "source": [ 891 | "# examine the first 10 rows\n", 892 | "sms.head(10)" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 28, 898 | "metadata": { 899 | "collapsed": false 900 | }, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "ham 4825\n", 906 | "spam 747\n", 907 | "Name: label, dtype: int64" 908 | ] 909 | }, 910 | "execution_count": 28, 911 | "metadata": {}, 912 | "output_type": "execute_result" 913 | } 914 | ], 915 | "source": [ 916 | "# examine the class distribution\n", 917 | "sms.label.value_counts()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 29, 923 | "metadata": { 924 | "collapsed": true 925 | }, 926 | "outputs": [], 927 | "source": [ 928 | "# convert label to a numerical variable\n", 929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": 30, 935 | "metadata": { 936 | "collapsed": false 937 | }, 938 | "outputs": [ 939 | { 940 | "data": { 941 | "text/html": [ 942 | "
\n", 943 | "\n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | "
labelmessagelabel_num
0hamGo until jurong point, crazy.. Available only ...0
1hamOk lar... Joking wif u oni...0
2spamFree entry in 2 a wkly comp to win FA Cup fina...1
3hamU dun say so early hor... U c already then say...0
4hamNah I don't think he goes to usf, he lives aro...0
5spamFreeMsg Hey there darling it's been 3 week's n...1
6hamEven my brother is not like to speak with me. ...0
7hamAs per your request 'Melle Melle (Oru Minnamin...0
8spamWINNER!! As a valued network customer you have...1
9spamHad your mobile 11 months or more? U R entitle...1
\n", 1015 | "
" 1016 | ], 1017 | "text/plain": [ 1018 | " label message label_num\n", 1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n", 1020 | "1 ham Ok lar... Joking wif u oni... 0\n", 1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n", 1022 | "3 ham U dun say so early hor... U c already then say... 0\n", 1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n", 1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n", 1025 | "6 ham Even my brother is not like to speak with me. ... 0\n", 1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n", 1027 | "8 spam WINNER!! As a valued network customer you have... 1\n", 1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1" 1029 | ] 1030 | }, 1031 | "execution_count": 30, 1032 | "metadata": {}, 1033 | "output_type": "execute_result" 1034 | } 1035 | ], 1036 | "source": [ 1037 | "# check that the conversion worked\n", 1038 | "sms.head(10)" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "execution_count": 31, 1044 | "metadata": { 1045 | "collapsed": false 1046 | }, 1047 | "outputs": [ 1048 | { 1049 | "name": "stdout", 1050 | "output_type": "stream", 1051 | "text": [ 1052 | "(150, 4)\n", 1053 | "(150,)\n" 1054 | ] 1055 | } 1056 | ], 1057 | "source": [ 1058 | "# how to define X and y (from the iris data) for use with a MODEL\n", 1059 | "X = iris.data\n", 1060 | "y = iris.target\n", 1061 | "print(X.shape)\n", 1062 | "print(y.shape)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": 32, 1068 | "metadata": { 1069 | "collapsed": false 1070 | }, 1071 | "outputs": [ 1072 | { 1073 | "name": "stdout", 1074 | "output_type": "stream", 1075 | "text": [ 1076 | "(5572,)\n", 1077 | "(5572,)\n" 1078 | ] 1079 | } 1080 | ], 1081 | "source": [ 1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 1083 | "X = sms.message\n", 1084 | "y = sms.label_num\n", 1085 | "print(X.shape)\n", 1086 | "print(y.shape)" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": 33, 1092 | "metadata": { 1093 | "collapsed": false 1094 | }, 1095 | "outputs": [ 1096 | { 1097 | "name": "stdout", 1098 | "output_type": "stream", 1099 | "text": [ 1100 | "(4179,)\n", 1101 | "(1393,)\n", 1102 | "(4179,)\n", 1103 | "(1393,)\n" 1104 | ] 1105 | } 1106 | ], 1107 | "source": [ 1108 | "# split X and y into training and testing sets\n", 1109 | "from sklearn.cross_validation import train_test_split\n", 1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 1111 | "print(X_train.shape)\n", 1112 | "print(X_test.shape)\n", 1113 | "print(y_train.shape)\n", 1114 | "print(y_test.shape)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "## Part 4: Vectorizing our dataset" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": 34, 1127 | "metadata": { 1128 | "collapsed": true 1129 | }, 1130 | "outputs": [], 1131 | "source": [ 1132 | "# instantiate the vectorizer\n", 1133 | "vect = CountVectorizer()" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 35, 1139 | "metadata": { 1140 | "collapsed": true 1141 | }, 1142 | "outputs": [], 1143 | "source": [ 1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 1145 | "vect.fit(X_train)\n", 1146 | "X_train_dtm = vect.transform(X_train)" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | 
"execution_count": 36, 1152 | "metadata": { 1153 | "collapsed": true 1154 | }, 1155 | "outputs": [], 1156 | "source": [ 1157 | "# equivalently: combine fit and transform into a single step\n", 1158 | "X_train_dtm = vect.fit_transform(X_train)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": 37, 1164 | "metadata": { 1165 | "collapsed": false 1166 | }, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/plain": [ 1171 | "<4179x7456 sparse matrix of type ''\n", 1172 | "\twith 55209 stored elements in Compressed Sparse Row format>" 1173 | ] 1174 | }, 1175 | "execution_count": 37, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "# examine the document-term matrix\n", 1182 | "X_train_dtm" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 38, 1188 | "metadata": { 1189 | "collapsed": false 1190 | }, 1191 | "outputs": [ 1192 | { 1193 | "data": { 1194 | "text/plain": [ 1195 | "<1393x7456 sparse matrix of type ''\n", 1196 | "\twith 17604 stored elements in Compressed Sparse Row format>" 1197 | ] 1198 | }, 1199 | "execution_count": 38, 1200 | "metadata": {}, 1201 | "output_type": "execute_result" 1202 | } 1203 | ], 1204 | "source": [ 1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 1206 | "X_test_dtm = vect.transform(X_test)\n", 1207 | "X_test_dtm" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "## Part 5: Building and evaluating a model\n", 1215 | "\n", 1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 1217 | "\n", 1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 39, 1224 | "metadata": { 1225 | "collapsed": true 1226 | }, 1227 | "outputs": [], 1228 | "source": [ 1229 | "# import and instantiate a Multinomial Naive Bayes model\n", 1230 | "from sklearn.naive_bayes import MultinomialNB\n", 1231 | "nb = MultinomialNB()" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "code", 1236 | "execution_count": 40, 1237 | "metadata": { 1238 | "collapsed": false 1239 | }, 1240 | "outputs": [ 1241 | { 1242 | "name": "stdout", 1243 | "output_type": "stream", 1244 | "text": [ 1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 1246 | "Wall time: 2.78 ms\n" 1247 | ] 1248 | }, 1249 | { 1250 | "data": { 1251 | "text/plain": [ 1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1253 | ] 1254 | }, 1255 | "execution_count": 40, 1256 | "metadata": {}, 1257 | "output_type": "execute_result" 1258 | } 1259 | ], 1260 | "source": [ 1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 1262 | "%time nb.fit(X_train_dtm, y_train)" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": 41, 1268 | "metadata": { 1269 | "collapsed": true 1270 | }, 1271 | "outputs": [], 1272 | "source": [ 1273 | "# make class predictions for X_test_dtm\n", 1274 | "y_pred_class = nb.predict(X_test_dtm)" 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 42, 1280 | "metadata": { 1281 | "collapsed": false 1282 | }, 1283 | "outputs": [ 1284 | { 1285 | "data": { 1286 | "text/plain": [ 1287 | "0.98851399856424982" 1288 | ] 1289 | }, 1290 | "execution_count": 42, 1291 | "metadata": {}, 1292 | "output_type": "execute_result" 1293 | } 1294 | ], 1295 | "source": [ 1296 | "# calculate accuracy of class predictions\n", 1297 | "from sklearn import metrics\n", 1298 | "metrics.accuracy_score(y_test, y_pred_class)" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "code", 1303 | "execution_count": 43, 1304 | "metadata": { 1305 | "collapsed": false 1306 | }, 1307 | "outputs": [ 1308 | { 1309 | "data": { 1310 | "text/plain": [ 1311 | "array([[1203, 5],\n", 1312 | " [ 11, 174]])" 1313 | ] 1314 | }, 1315 | "execution_count": 43, 1316 | "metadata": {}, 1317 | "output_type": "execute_result" 1318 | } 1319 | ], 1320 | "source": [ 1321 | "# print the confusion matrix\n", 1322 | "metrics.confusion_matrix(y_test, y_pred_class)" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": 44, 1328 | "metadata": { 1329 | "collapsed": false 1330 | }, 1331 | "outputs": [], 1332 | "source": [ 1333 | "# print message text for the false positives (ham incorrectly classified as spam)\n" 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "code", 1338 | "execution_count": 45, 1339 | "metadata": { 1340 | "collapsed": false, 1341 | "scrolled": true 1342 | }, 1343 | "outputs": [], 1344 | "source": [ 1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\n" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": 46, 1351 | "metadata": { 1352 | "collapsed": false, 1353 | "scrolled": true 1354 | }, 1355 | "outputs": [ 1356 | { 1357 | "data": { 1358 | "text/plain": [ 1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? 
Why not send a video in a MMSto 32323.\"" 1360 | ] 1361 | }, 1362 | "execution_count": 46, 1363 | "metadata": {}, 1364 | "output_type": "execute_result" 1365 | } 1366 | ], 1367 | "source": [ 1368 | "# example false negative\n", 1369 | "X_test[3132]" 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "code", 1374 | "execution_count": 47, 1375 | "metadata": { 1376 | "collapsed": false 1377 | }, 1378 | "outputs": [ 1379 | { 1380 | "data": { 1381 | "text/plain": [ 1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n", 1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])" 1384 | ] 1385 | }, 1386 | "execution_count": 47, 1387 | "metadata": {}, 1388 | "output_type": "execute_result" 1389 | } 1390 | ], 1391 | "source": [ 1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 1394 | "y_pred_prob" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "execution_count": 48, 1400 | "metadata": { 1401 | "collapsed": false 1402 | }, 1403 | "outputs": [ 1404 | { 1405 | "data": { 1406 | "text/plain": [ 1407 | "0.98664310005369604" 1408 | ] 1409 | }, 1410 | "execution_count": 48, 1411 | "metadata": {}, 1412 | "output_type": "execute_result" 1413 | } 1414 | ], 1415 | "source": [ 1416 | "# calculate AUC\n", 1417 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "metadata": {}, 1423 | "source": [ 1424 | "## Part 6: Comparing models\n", 1425 | "\n", 1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 1427 | "\n", 1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 
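,
"\n",
"For context, the logistic function mentioned above maps a linear score $z = \\mathbf{w}^{T}\\mathbf{x} + b$ to a probability via the standard definition\n",
"\n",
"$$\\sigma(z) = \\frac{1}{1 + e^{-z}},$$\n",
"\n",
"so the well-calibrated probabilities computed below are simply this squashing of the linear model's score."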
1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "code", 1433 | "execution_count": 49, 1434 | "metadata": { 1435 | "collapsed": true 1436 | }, 1437 | "outputs": [], 1438 | "source": [ 1439 | "# import and instantiate a logistic regression model\n", 1440 | "from sklearn.linear_model import LogisticRegression\n", 1441 | "logreg = LogisticRegression()" 1442 | ] 1443 | }, 1444 | { 1445 | "cell_type": "code", 1446 | "execution_count": 50, 1447 | "metadata": { 1448 | "collapsed": false 1449 | }, 1450 | "outputs": [ 1451 | { 1452 | "name": "stdout", 1453 | "output_type": "stream", 1454 | "text": [ 1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n", 1456 | "Wall time: 273 ms\n" 1457 | ] 1458 | }, 1459 | { 1460 | "data": { 1461 | "text/plain": [ 1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 1465 | " verbose=0, warm_start=False)" 1466 | ] 1467 | }, 1468 | "execution_count": 50, 1469 | "metadata": {}, 1470 | "output_type": "execute_result" 1471 | } 1472 | ], 1473 | "source": [ 1474 | "# train the model using X_train_dtm\n", 1475 | "%time logreg.fit(X_train_dtm, y_train)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": 51, 1481 | "metadata": { 1482 | "collapsed": true 1483 | }, 1484 | "outputs": [], 1485 | "source": [ 1486 | "# make class predictions for X_test_dtm\n", 1487 | "y_pred_class = logreg.predict(X_test_dtm)" 1488 | ] 1489 | }, 1490 | { 1491 | "cell_type": "code", 1492 | "execution_count": 52, 1493 | "metadata": { 1494 | "collapsed": false 1495 | }, 1496 | "outputs": [ 1497 | { 1498 | "data": { 1499 | "text/plain": [ 1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n", 1501 | " 0.99725053, 0.00157706])" 1502 | ] 1503 | }, 1504 | "execution_count": 52, 1505 | "metadata": {}, 1506 | "output_type": "execute_result" 1507 | } 1508 | ], 1509 | "source": [ 1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 1512 | "y_pred_prob" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": 53, 1518 | "metadata": { 1519 | "collapsed": false 1520 | }, 1521 | "outputs": [ 1522 | { 1523 | "data": { 1524 | "text/plain": [ 1525 | "0.9877961234745154" 1526 | ] 1527 | }, 1528 | "execution_count": 53, 1529 | "metadata": {}, 1530 | "output_type": "execute_result" 1531 | } 1532 | ], 1533 | "source": [ 1534 | "# calculate accuracy\n", 1535 | "metrics.accuracy_score(y_test, y_pred_class)" 1536 | ] 1537 | }, 1538 | { 1539 | "cell_type": "code", 1540 | "execution_count": 54, 1541 | "metadata": { 1542 | "collapsed": false 1543 | }, 1544 | "outputs": [ 1545 | { 1546 | "data": { 1547 | "text/plain": [ 1548 | "0.99368176123143015" 1549 | ] 1550 | }, 1551 | "execution_count": 54, 1552 | "metadata": {}, 1553 | "output_type": "execute_result" 1554 | } 1555 | ], 1556 | "source": [ 1557 | "# calculate AUC\n", 1558 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1559 | ] 1560 | } 1561 | ], 1562 | "metadata": { 1563 | "kernelspec": { 1564 | "display_name": "Python [conda root]", 1565 | "language": "python", 1566 | "name": "conda-root-py" 1567 | }, 1568 | "language_info": { 1569 | "codemirror_mode": { 1570 | "name": "ipython", 1571 | "version": 3 1572 | }, 1573 | "file_extension": ".py", 1574 | "mimetype": "text/x-python", 1575 | "name": "python", 1576 | 
"nbconvert_exporter": "python", 1577 | "pygments_lexer": "ipython3", 1578 | "version": "3.5.2" 1579 | } 1580 | }, 1581 | "nbformat": 4, 1582 | "nbformat_minor": 0 1583 | } 1584 | -------------------------------------------------------------------------------- /textAnalisis/.ipynb_checkpoints/tutorial-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Tutorial: Machine Learning with Text in scikit-learn" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Agenda\n", 15 | "\n", 16 | "1. Model building in scikit-learn (refresher)\n", 17 | "2. Representing text as numerical data\n", 18 | "3. Reading a text-based dataset into pandas\n", 19 | "4. Vectorizing our dataset\n", 20 | "5. Building and evaluating a model\n", 21 | "6. Comparing models" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "# for Python 2: use print only as a function\n", 33 | "from __future__ import print_function" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Part 1: Model building in scikit-learn (refresher)" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": true 48 | }, 49 | "outputs": [], 50 | "source": [ 51 | "# load the iris dataset as an example\n", 52 | "from sklearn.datasets import load_iris\n", 53 | "iris = load_iris()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 3, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "# store the feature matrix (X) and response vector (y)\n", 65 | "X = iris.data\n", 66 | "y = iris.target" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "**\"Features\"** are also known as predictors, inputs, or attributes. The **\"response\"** is also known as the target, label, or output." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": false 81 | }, 82 | "outputs": [ 83 | { 84 | "name": "stdout", 85 | "output_type": "stream", 86 | "text": [ 87 | "(150, 4)\n", 88 | "(150,)\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "# check the shapes of X and y\n", 94 | "print(X.shape)\n", 95 | "print(y.shape)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "**\"Observations\"** are also known as samples, instances, or records." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 5, 108 | "metadata": { 109 | "collapsed": false 110 | }, 111 | "outputs": [ 112 | { 113 | "data": { 114 | "text/html": [ 115 | "
\n", 116 | "\n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)
05.13.51.40.2
14.93.01.40.2
24.73.21.30.2
34.63.11.50.2
45.03.61.40.2
\n", 164 | "
" 165 | ], 166 | "text/plain": [ 167 | " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", 168 | "0 5.1 3.5 1.4 0.2\n", 169 | "1 4.9 3.0 1.4 0.2\n", 170 | "2 4.7 3.2 1.3 0.2\n", 171 | "3 4.6 3.1 1.5 0.2\n", 172 | "4 5.0 3.6 1.4 0.2" 173 | ] 174 | }, 175 | "execution_count": 5, 176 | "metadata": {}, 177 | "output_type": "execute_result" 178 | } 179 | ], 180 | "source": [ 181 | "# examine the first 5 rows of the feature matrix (including the feature names)\n", 182 | "import pandas as pd\n", 183 | "pd.DataFrame(X, columns=iris.feature_names).head()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "metadata": { 190 | "collapsed": false 191 | }, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", 198 | " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", 199 | " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", 200 | " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", 201 | " 2 2]\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# examine the response vector\n", 207 | "print(y)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 7, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/plain": [ 227 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 228 | " metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n", 229 | " weights='uniform')" 230 | ] 231 | }, 232 | "execution_count": 7, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "# import the class\n", 239 | "from sklearn.neighbors import KNeighborsClassifier\n", 240 | "\n", 241 | "# instantiate the model (with the default parameters)\n", 242 | "knn = KNeighborsClassifier()\n", 243 | "\n", 244 | "# fit the model with data (occurs in-place)\n", 245 | "knn.fit(X, y)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 8, 258 | "metadata": { 259 | "collapsed": false 260 | }, 261 | "outputs": [ 262 | { 263 | "data": { 264 | "text/plain": [ 265 | "array([1])" 266 | ] 267 | }, 268 | "execution_count": 8, 269 | "metadata": {}, 270 | "output_type": "execute_result" 271 | } 272 | ], 273 | "source": [ 274 | "# predict the response for a new observation\n", 275 | "knn.predict([[3, 5, 4, 2]])" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## Part 2: Representing text as numerical data" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 9, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "# example text for model training (SMS messages)\n", 294 | "simple_train = ['call you tonight', 'Call me a cab', 'please call me... 
PLEASE!']" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 10, 300 | "metadata": { 301 | "collapsed": true 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "# example response vector\n", 306 | "is_desperate = [0, 0, 1]" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 314 | "\n", 315 | "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n", 316 | "\n", 317 | "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 11, 323 | "metadata": { 324 | "collapsed": true 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "# import and instantiate CountVectorizer (with the default parameters)\n", 329 | "from sklearn.feature_extraction.text import CountVectorizer\n", 330 | "vect = CountVectorizer()" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 12, 336 | "metadata": { 337 | "collapsed": false 338 | }, 339 | "outputs": [ 340 | { 341 | "data": { 342 | "text/plain": [ 343 | "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", 344 | " dtype=, encoding='utf-8', input='content',\n", 345 | " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", 346 | " ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", 347 | " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 348 | " tokenizer=None, vocabulary=None)" 349 | ] 350 | }, 351 | "execution_count": 12, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "# learn the 'vocabulary' of the training data (occurs in-place)\n", 358 | "vect.fit(simple_train)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 13, 364 | "metadata": { 365 | "collapsed": false 366 | }, 367 | "outputs": [ 368 | { 369 | "data": { 370 | "text/plain": [ 371 | "['cab', 'call', 'me', 'please', 'tonight', 'you']" 372 | ] 373 | }, 374 | "execution_count": 13, 375 | "metadata": {}, 376 | "output_type": "execute_result" 377 | } 378 | ], 379 | "source": [ 380 | "# examine the fitted vocabulary\n", 381 | "vect.get_feature_names()" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 14, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "<3x6 sparse matrix of type ''\n", 395 | "\twith 9 stored elements in Compressed Sparse Row format>" 396 | ] 397 | }, 398 | "execution_count": 14, 399 | "metadata": {}, 400 | "output_type": "execute_result" 401 | } 402 | ], 403 | "source": [ 404 | "# transform training data into a 'document-term matrix'\n", 405 | "simple_train_dtm = vect.transform(simple_train)\n", 406 | "simple_train_dtm" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 15, 412 | "metadata": { 413 | "collapsed": false 414 | }, 415 | "outputs": [ 416 | { 417 | "data": { 418 | "text/plain": [ 419 | "array([[0, 1, 0, 0, 1, 1],\n", 420 | " [1, 1, 1, 0, 0, 
0],\n", 421 | " [0, 1, 1, 2, 0, 0]])" 422 | ] 423 | }, 424 | "execution_count": 15, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "# convert sparse matrix to a dense matrix\n", 431 | "simple_train_dtm.toarray()" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 16, 437 | "metadata": { 438 | "collapsed": false 439 | }, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/html": [ 444 | "
\n", 445 | "\n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | "
cabcallmepleasetonightyou
0010011
1111000
2011200
\n", 487 | "
" 488 | ], 489 | "text/plain": [ 490 | " cab call me please tonight you\n", 491 | "0 0 1 0 0 1 1\n", 492 | "1 1 1 1 0 0 0\n", 493 | "2 0 1 1 2 0 0" 494 | ] 495 | }, 496 | "execution_count": 16, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "# examine the vocabulary and document-term matrix together\n", 503 | "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())" 504 | ] 505 | }, 506 | { 507 | "cell_type": "markdown", 508 | "metadata": {}, 509 | "source": [ 510 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 511 | "\n", 512 | "> In this scheme, features and samples are defined as follows:\n", 513 | "\n", 514 | "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n", 515 | "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n", 516 | "\n", 517 | "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n", 518 | "\n", 519 | "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document." 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 17, 525 | "metadata": { 526 | "collapsed": false 527 | }, 528 | "outputs": [ 529 | { 530 | "data": { 531 | "text/plain": [ 532 | "scipy.sparse.csr.csr_matrix" 533 | ] 534 | }, 535 | "execution_count": 17, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "# check the type of the document-term matrix\n", 542 | "type(simple_train_dtm)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 18, 548 | "metadata": { 549 | "collapsed": false, 550 | "scrolled": true 551 | }, 552 | "outputs": [ 553 | { 554 | "name": "stdout", 555 | "output_type": "stream", 556 | "text": [ 557 | " (0, 1)\t1\n", 558 | " (0, 4)\t1\n", 559 | " (0, 5)\t1\n", 560 | " (1, 0)\t1\n", 561 | " (1, 1)\t1\n", 562 | " (1, 2)\t1\n", 563 | " (2, 1)\t1\n", 564 | " (2, 2)\t1\n", 565 | " (2, 3)\t2\n" 566 | ] 567 | } 568 | ], 569 | "source": [ 570 | "# examine the sparse matrix contents\n", 571 | "print(simple_train_dtm)" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n", 579 | "\n", 580 | "> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n", 581 | "\n", 582 | "> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n", 583 | "\n", 584 | "> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in 
the `scipy.sparse` package." 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 19, 590 | "metadata": { 591 | "collapsed": false 592 | }, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", 598 | " metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n", 599 | " weights='uniform')" 600 | ] 601 | }, 602 | "execution_count": 19, 603 | "metadata": {}, 604 | "output_type": "execute_result" 605 | } 606 | ], 607 | "source": [ 608 | "# build a model to predict desperation\n", 609 | "knn = KNeighborsClassifier(n_neighbors=1)\n", 610 | "knn.fit(simple_train_dtm, is_desperate)" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 20, 616 | "metadata": { 617 | "collapsed": true 618 | }, 619 | "outputs": [], 620 | "source": [ 621 | "# example text for model testing\n", 622 | "simple_test = [\"please don't call me\"]" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning." 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 21, 635 | "metadata": { 636 | "collapsed": false 637 | }, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/plain": [ 642 | "array([[0, 1, 1, 1, 0, 0]])" 643 | ] 644 | }, 645 | "execution_count": 21, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | } 649 | ], 650 | "source": [ 651 | "# transform testing data into a document-term matrix (using existing vocabulary)\n", 652 | "simple_test_dtm = vect.transform(simple_test)\n", 653 | "simple_test_dtm.toarray()" 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 22, 659 | "metadata": { 660 | "collapsed": false 661 | }, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/html": [ 666 | "
\n", 667 | "\n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | "
cabcallmepleasetonightyou
0011100
\n", 691 | "
" 692 | ], 693 | "text/plain": [ 694 | " cab call me please tonight you\n", 695 | "0 0 1 1 1 0 0" 696 | ] 697 | }, 698 | "execution_count": 22, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# examine the vocabulary and document-term matrix together\n", 705 | "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 23, 711 | "metadata": { 712 | "collapsed": false 713 | }, 714 | "outputs": [ 715 | { 716 | "data": { 717 | "text/plain": [ 718 | "array([1])" 719 | ] 720 | }, 721 | "execution_count": 23, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# predict whether simple_test is desperate\n", 728 | "knn.predict(simple_test_dtm)" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "**Summary:**\n", 736 | "\n", 737 | "- `vect.fit(train)` **learns the vocabulary** of the training data\n", 738 | "- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n", 739 | "- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "## Part 3: Reading a text-based dataset into pandas" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 24, 752 | "metadata": { 753 | "collapsed": true 754 | }, 755 | "outputs": [], 756 | "source": [ 757 | "# read file into pandas from the working directory\n", 758 | "sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 25, 764 | "metadata": { 765 | "collapsed": false 766 | }, 767 | "outputs": [], 768 | "source": [ 769 | "# alternative: read file into pandas from a URL\n", 770 | "# url = 'https://raw.githubusercontent.com/justmarkham/pydata-dc-2016-tutorial/master/sms.tsv'\n", 771 | "# sms = pd.read_table(url, header=None, names=['label', 'message'])" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 26, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/plain": [ 784 | "(5572, 2)" 785 | ] 786 | }, 787 | "execution_count": 26, 788 | "metadata": {}, 789 | "output_type": "execute_result" 790 | } 791 | ], 792 | "source": [ 793 | "# examine the shape\n", 794 | "sms.shape" 795 | ] 796 | }, 797 | { 798 | "cell_type": "code", 799 | "execution_count": 27, 800 | "metadata": { 801 | "collapsed": false 802 | }, 803 | "outputs": [ 804 | { 805 | "data": { 806 | "text/html": [ 807 | "
\n", 808 | "\n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
5spamFreeMsg Hey there darling it's been 3 week's n...
6hamEven my brother is not like to speak with me. ...
7hamAs per your request 'Melle Melle (Oru Minnamin...
8spamWINNER!! As a valued network customer you have...
9spamHad your mobile 11 months or more? U R entitle...
\n", 869 | "
" 870 | ], 871 | "text/plain": [ 872 | " label message\n", 873 | "0 ham Go until jurong point, crazy.. Available only ...\n", 874 | "1 ham Ok lar... Joking wif u oni...\n", 875 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", 876 | "3 ham U dun say so early hor... U c already then say...\n", 877 | "4 ham Nah I don't think he goes to usf, he lives aro...\n", 878 | "5 spam FreeMsg Hey there darling it's been 3 week's n...\n", 879 | "6 ham Even my brother is not like to speak with me. ...\n", 880 | "7 ham As per your request 'Melle Melle (Oru Minnamin...\n", 881 | "8 spam WINNER!! As a valued network customer you have...\n", 882 | "9 spam Had your mobile 11 months or more? U R entitle..." 883 | ] 884 | }, 885 | "execution_count": 27, 886 | "metadata": {}, 887 | "output_type": "execute_result" 888 | } 889 | ], 890 | "source": [ 891 | "# examine the first 10 rows\n", 892 | "sms.head(10)" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 28, 898 | "metadata": { 899 | "collapsed": false 900 | }, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "ham 4825\n", 906 | "spam 747\n", 907 | "Name: label, dtype: int64" 908 | ] 909 | }, 910 | "execution_count": 28, 911 | "metadata": {}, 912 | "output_type": "execute_result" 913 | } 914 | ], 915 | "source": [ 916 | "# examine the class distribution\n", 917 | "sms.label.value_counts()" 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "execution_count": 29, 923 | "metadata": { 924 | "collapsed": true 925 | }, 926 | "outputs": [], 927 | "source": [ 928 | "# convert label to a numerical variable\n", 929 | "sms['label_num'] = sms.label.map({'ham':0, 'spam':1})" 930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": 30, 935 | "metadata": { 936 | "collapsed": false 937 | }, 938 | "outputs": [ 939 | { 940 | "data": { 941 | "text/html": [ 942 | "
\n", 943 | "\n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | " \n", 971 | " \n", 972 | " \n", 973 | " \n", 974 | " \n", 975 | " \n", 976 | " \n", 977 | " \n", 978 | " \n", 979 | " \n", 980 | " \n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | "
labelmessagelabel_num
0hamGo until jurong point, crazy.. Available only ...0
1hamOk lar... Joking wif u oni...0
2spamFree entry in 2 a wkly comp to win FA Cup fina...1
3hamU dun say so early hor... U c already then say...0
4hamNah I don't think he goes to usf, he lives aro...0
5spamFreeMsg Hey there darling it's been 3 week's n...1
6hamEven my brother is not like to speak with me. ...0
7hamAs per your request 'Melle Melle (Oru Minnamin...0
8spamWINNER!! As a valued network customer you have...1
9spamHad your mobile 11 months or more? U R entitle...1
\n", 1015 | "
" 1016 | ], 1017 | "text/plain": [ 1018 | " label message label_num\n", 1019 | "0 ham Go until jurong point, crazy.. Available only ... 0\n", 1020 | "1 ham Ok lar... Joking wif u oni... 0\n", 1021 | "2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n", 1022 | "3 ham U dun say so early hor... U c already then say... 0\n", 1023 | "4 ham Nah I don't think he goes to usf, he lives aro... 0\n", 1024 | "5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n", 1025 | "6 ham Even my brother is not like to speak with me. ... 0\n", 1026 | "7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n", 1027 | "8 spam WINNER!! As a valued network customer you have... 1\n", 1028 | "9 spam Had your mobile 11 months or more? U R entitle... 1" 1029 | ] 1030 | }, 1031 | "execution_count": 30, 1032 | "metadata": {}, 1033 | "output_type": "execute_result" 1034 | } 1035 | ], 1036 | "source": [ 1037 | "# check that the conversion worked\n", 1038 | "sms.head(10)" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "execution_count": 31, 1044 | "metadata": { 1045 | "collapsed": false 1046 | }, 1047 | "outputs": [ 1048 | { 1049 | "name": "stdout", 1050 | "output_type": "stream", 1051 | "text": [ 1052 | "(150, 4)\n", 1053 | "(150,)\n" 1054 | ] 1055 | } 1056 | ], 1057 | "source": [ 1058 | "# how to define X and y (from the iris data) for use with a MODEL\n", 1059 | "X = iris.data\n", 1060 | "y = iris.target\n", 1061 | "print(X.shape)\n", 1062 | "print(y.shape)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": 32, 1068 | "metadata": { 1069 | "collapsed": false 1070 | }, 1071 | "outputs": [ 1072 | { 1073 | "name": "stdout", 1074 | "output_type": "stream", 1075 | "text": [ 1076 | "(5572,)\n", 1077 | "(5572,)\n" 1078 | ] 1079 | } 1080 | ], 1081 | "source": [ 1082 | "# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n", 1083 | "X = sms.message\n", 1084 | "y = sms.label_num\n", 1085 | "print(X.shape)\n", 1086 | "print(y.shape)" 1087 | ] 1088 | }, 1089 | { 1090 | "cell_type": "code", 1091 | "execution_count": 33, 1092 | "metadata": { 1093 | "collapsed": false 1094 | }, 1095 | "outputs": [ 1096 | { 1097 | "name": "stdout", 1098 | "output_type": "stream", 1099 | "text": [ 1100 | "(4179,)\n", 1101 | "(1393,)\n", 1102 | "(4179,)\n", 1103 | "(1393,)\n" 1104 | ] 1105 | } 1106 | ], 1107 | "source": [ 1108 | "# split X and y into training and testing sets\n", 1109 | "from sklearn.cross_validation import train_test_split\n", 1110 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", 1111 | "print(X_train.shape)\n", 1112 | "print(X_test.shape)\n", 1113 | "print(y_train.shape)\n", 1114 | "print(y_test.shape)" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "## Part 4: Vectorizing our dataset" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "code", 1126 | "execution_count": 34, 1127 | "metadata": { 1128 | "collapsed": true 1129 | }, 1130 | "outputs": [], 1131 | "source": [ 1132 | "# instantiate the vectorizer\n", 1133 | "vect = CountVectorizer()" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": 35, 1139 | "metadata": { 1140 | "collapsed": true 1141 | }, 1142 | "outputs": [], 1143 | "source": [ 1144 | "# learn training data vocabulary, then use it to create a document-term matrix\n", 1145 | "vect.fit(X_train)\n", 1146 | "X_train_dtm = vect.transform(X_train)" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | 
"execution_count": 36, 1152 | "metadata": { 1153 | "collapsed": true 1154 | }, 1155 | "outputs": [], 1156 | "source": [ 1157 | "# equivalently: combine fit and transform into a single step\n", 1158 | "X_train_dtm = vect.fit_transform(X_train)" 1159 | ] 1160 | }, 1161 | { 1162 | "cell_type": "code", 1163 | "execution_count": 37, 1164 | "metadata": { 1165 | "collapsed": false 1166 | }, 1167 | "outputs": [ 1168 | { 1169 | "data": { 1170 | "text/plain": [ 1171 | "<4179x7456 sparse matrix of type ''\n", 1172 | "\twith 55209 stored elements in Compressed Sparse Row format>" 1173 | ] 1174 | }, 1175 | "execution_count": 37, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "# examine the document-term matrix\n", 1182 | "X_train_dtm" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 38, 1188 | "metadata": { 1189 | "collapsed": false 1190 | }, 1191 | "outputs": [ 1192 | { 1193 | "data": { 1194 | "text/plain": [ 1195 | "<1393x7456 sparse matrix of type ''\n", 1196 | "\twith 17604 stored elements in Compressed Sparse Row format>" 1197 | ] 1198 | }, 1199 | "execution_count": 38, 1200 | "metadata": {}, 1201 | "output_type": "execute_result" 1202 | } 1203 | ], 1204 | "source": [ 1205 | "# transform testing data (using fitted vocabulary) into a document-term matrix\n", 1206 | "X_test_dtm = vect.transform(X_test)\n", 1207 | "X_test_dtm" 1208 | ] 1209 | }, 1210 | { 1211 | "cell_type": "markdown", 1212 | "metadata": {}, 1213 | "source": [ 1214 | "## Part 5: Building and evaluating a model\n", 1215 | "\n", 1216 | "We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n", 1217 | "\n", 1218 | "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work." 
1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 39, 1224 | "metadata": { 1225 | "collapsed": true 1226 | }, 1227 | "outputs": [], 1228 | "source": [ 1229 | "# import and instantiate a Multinomial Naive Bayes model\n", 1230 | "from sklearn.naive_bayes import MultinomialNB\n", 1231 | "nb = MultinomialNB()" 1232 | ] 1233 | }, 1234 | { 1235 | "cell_type": "code", 1236 | "execution_count": 40, 1237 | "metadata": { 1238 | "collapsed": false 1239 | }, 1240 | "outputs": [ 1241 | { 1242 | "name": "stdout", 1243 | "output_type": "stream", 1244 | "text": [ 1245 | "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n", 1246 | "Wall time: 2.78 ms\n" 1247 | ] 1248 | }, 1249 | { 1250 | "data": { 1251 | "text/plain": [ 1252 | "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" 1253 | ] 1254 | }, 1255 | "execution_count": 40, 1256 | "metadata": {}, 1257 | "output_type": "execute_result" 1258 | } 1259 | ], 1260 | "source": [ 1261 | "# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n", 1262 | "%time nb.fit(X_train_dtm, y_train)" 1263 | ] 1264 | }, 1265 | { 1266 | "cell_type": "code", 1267 | "execution_count": 41, 1268 | "metadata": { 1269 | "collapsed": true 1270 | }, 1271 | "outputs": [], 1272 | "source": [ 1273 | "# make class predictions for X_test_dtm\n", 1274 | "y_pred_class = nb.predict(X_test_dtm)" 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": 42, 1280 | "metadata": { 1281 | "collapsed": false 1282 | }, 1283 | "outputs": [ 1284 | { 1285 | "data": { 1286 | "text/plain": [ 1287 | "0.98851399856424982" 1288 | ] 1289 | }, 1290 | "execution_count": 42, 1291 | "metadata": {}, 1292 | "output_type": "execute_result" 1293 | } 1294 | ], 1295 | "source": [ 1296 | "# calculate accuracy of class predictions\n", 1297 | "from sklearn import metrics\n", 1298 | "metrics.accuracy_score(y_test, y_pred_class)" 1299 | ] 1300 | }, 1301 | { 1302 | "cell_type": "code", 1303 | "execution_count": 43, 1304 | "metadata": { 1305 | "collapsed": false 1306 | }, 1307 | "outputs": [ 1308 | { 1309 | "data": { 1310 | "text/plain": [ 1311 | "array([[1203, 5],\n", 1312 | " [ 11, 174]])" 1313 | ] 1314 | }, 1315 | "execution_count": 43, 1316 | "metadata": {}, 1317 | "output_type": "execute_result" 1318 | } 1319 | ], 1320 | "source": [ 1321 | "# print the confusion matrix\n", 1322 | "metrics.confusion_matrix(y_test, y_pred_class)" 1323 | ] 1324 | }, 1325 | { 1326 | "cell_type": "code", 1327 | "execution_count": 44, 1328 | "metadata": { 1329 | "collapsed": false 1330 | }, 1331 | "outputs": [], 1332 | "source": [ 1333 | "# print message text for the false positives (ham incorrectly classified as spam)\n", "X_test[y_pred_class > y_test]  # boolean mask: predicted 1 (spam) where the true label is 0 (ham)" 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "code", 1338 | "execution_count": 45, 1339 | "metadata": { 1340 | "collapsed": false, 1341 | "scrolled": true 1342 | }, 1343 | "outputs": [], 1344 | "source": [ 1345 | "# print message text for the false negatives (spam incorrectly classified as ham)\n", "X_test[y_pred_class < y_test]  # boolean mask: predicted 0 (ham) where the true label is 1 (spam)" 1346 | ] 1347 | }, 1348 | { 1349 | "cell_type": "code", 1350 | "execution_count": 46, 1351 | "metadata": { 1352 | "collapsed": false, 1353 | "scrolled": true 1354 | }, 1355 | "outputs": [ 1356 | { 1357 | "data": { 1358 | "text/plain": [ 1359 | "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? 
Why not send a video in a MMSto 32323.\"" 1360 | ] 1361 | }, 1362 | "execution_count": 46, 1363 | "metadata": {}, 1364 | "output_type": "execute_result" 1365 | } 1366 | ], 1367 | "source": [ 1368 | "# example false negative\n", 1369 | "X_test[3132]" 1370 | ] 1371 | }, 1372 | { 1373 | "cell_type": "code", 1374 | "execution_count": 47, 1375 | "metadata": { 1376 | "collapsed": false 1377 | }, 1378 | "outputs": [ 1379 | { 1380 | "data": { 1381 | "text/plain": [ 1382 | "array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n", 1383 | " 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])" 1384 | ] 1385 | }, 1386 | "execution_count": 47, 1387 | "metadata": {}, 1388 | "output_type": "execute_result" 1389 | } 1390 | ], 1391 | "source": [ 1392 | "# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n", 1393 | "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n", 1394 | "y_pred_prob" 1395 | ] 1396 | }, 1397 | { 1398 | "cell_type": "code", 1399 | "execution_count": 48, 1400 | "metadata": { 1401 | "collapsed": false 1402 | }, 1403 | "outputs": [ 1404 | { 1405 | "data": { 1406 | "text/plain": [ 1407 | "0.98664310005369604" 1408 | ] 1409 | }, 1410 | "execution_count": 48, 1411 | "metadata": {}, 1412 | "output_type": "execute_result" 1413 | } 1414 | ], 1415 | "source": [ 1416 | "# calculate AUC\n", 1417 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "markdown", 1422 | "metadata": {}, 1423 | "source": [ 1424 | "## Part 6: Comparing models\n", 1425 | "\n", 1426 | "We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n", 1427 | "\n", 1428 | "> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function." 
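,
    "\n",
    "A minimal sketch of that logistic function, which squashes a real-valued linear score into the (0, 1) probabilities that `predict_proba` returns below (a stand-alone illustration; the function name `logistic` is ours, not a scikit-learn API):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def logistic(z):\n",
    "    # maps any real-valued score z to a probability strictly between 0 and 1\n",
    "    return 1.0 / (1.0 + np.exp(-z))\n",
    "\n",
    "print(logistic(0.0))  # 0.5: a score of zero is maximally uncertain\n",
    "print(logistic(4.0))  # ~0.982: large positive scores approach 1\n",
    "```"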
1429 | ] 1430 | }, 1431 | { 1432 | "cell_type": "code", 1433 | "execution_count": 49, 1434 | "metadata": { 1435 | "collapsed": true 1436 | }, 1437 | "outputs": [], 1438 | "source": [ 1439 | "# import and instantiate a logistic regression model\n", 1440 | "from sklearn.linear_model import LogisticRegression\n", 1441 | "logreg = LogisticRegression()" 1442 | ] 1443 | }, 1444 | { 1445 | "cell_type": "code", 1446 | "execution_count": 50, 1447 | "metadata": { 1448 | "collapsed": false 1449 | }, 1450 | "outputs": [ 1451 | { 1452 | "name": "stdout", 1453 | "output_type": "stream", 1454 | "text": [ 1455 | "CPU times: user 56 ms, sys: 0 ns, total: 56 ms\n", 1456 | "Wall time: 273 ms\n" 1457 | ] 1458 | }, 1459 | { 1460 | "data": { 1461 | "text/plain": [ 1462 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 1463 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 1464 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 1465 | " verbose=0, warm_start=False)" 1466 | ] 1467 | }, 1468 | "execution_count": 50, 1469 | "metadata": {}, 1470 | "output_type": "execute_result" 1471 | } 1472 | ], 1473 | "source": [ 1474 | "# train the model using X_train_dtm\n", 1475 | "%time logreg.fit(X_train_dtm, y_train)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "code", 1480 | "execution_count": 51, 1481 | "metadata": { 1482 | "collapsed": true 1483 | }, 1484 | "outputs": [], 1485 | "source": [ 1486 | "# make class predictions for X_test_dtm\n", 1487 | "y_pred_class = logreg.predict(X_test_dtm)" 1488 | ] 1489 | }, 1490 | { 1491 | "cell_type": "code", 1492 | "execution_count": 52, 1493 | "metadata": { 1494 | "collapsed": false 1495 | }, 1496 | "outputs": [ 1497 | { 1498 | "data": { 1499 | "text/plain": [ 1500 | "array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n", 1501 | " 0.99725053, 0.00157706])" 1502 | ] 1503 | }, 1504 | "execution_count": 52, 1505 | "metadata": {}, 1506 | "output_type": "execute_result" 1507 | } 1508 | ], 1509 | "source": [ 1510 | "# calculate predicted probabilities for X_test_dtm (well calibrated)\n", 1511 | "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n", 1512 | "y_pred_prob" 1513 | ] 1514 | }, 1515 | { 1516 | "cell_type": "code", 1517 | "execution_count": 53, 1518 | "metadata": { 1519 | "collapsed": false 1520 | }, 1521 | "outputs": [ 1522 | { 1523 | "data": { 1524 | "text/plain": [ 1525 | "0.9877961234745154" 1526 | ] 1527 | }, 1528 | "execution_count": 53, 1529 | "metadata": {}, 1530 | "output_type": "execute_result" 1531 | } 1532 | ], 1533 | "source": [ 1534 | "# calculate accuracy\n", 1535 | "metrics.accuracy_score(y_test, y_pred_class)" 1536 | ] 1537 | }, 1538 | { 1539 | "cell_type": "code", 1540 | "execution_count": 54, 1541 | "metadata": { 1542 | "collapsed": false 1543 | }, 1544 | "outputs": [ 1545 | { 1546 | "data": { 1547 | "text/plain": [ 1548 | "0.99368176123143015" 1549 | ] 1550 | }, 1551 | "execution_count": 54, 1552 | "metadata": {}, 1553 | "output_type": "execute_result" 1554 | } 1555 | ], 1556 | "source": [ 1557 | "# calculate AUC\n", 1558 | "metrics.roc_auc_score(y_test, y_pred_prob)" 1559 | ] 1560 | } 1561 | ], 1562 | "metadata": { 1563 | "kernelspec": { 1564 | "display_name": "Python [conda root]", 1565 | "language": "python", 1566 | "name": "conda-root-py" 1567 | }, 1568 | "language_info": { 1569 | "codemirror_mode": { 1570 | "name": "ipython", 1571 | "version": 3 1572 | }, 1573 | "file_extension": ".py", 1574 | "mimetype": "text/x-python", 1575 | "name": "python", 1576 | 
"nbconvert_exporter": "python", 1577 | "pygments_lexer": "ipython3", 1578 | "version": "3.5.2" 1579 | } 1580 | }, 1581 | "nbformat": 4, 1582 | "nbformat_minor": 0 1583 | } 1584 | --------------------------------------------------------------------------------