├── ACADEMIA ├── BOOK │ ├── README.md │ ├── Transformers for NLP.pdf │ └── long-short-term-memory-networks-with-python.pdf └── PAPERS │ ├── Attention Is All You Need.pdf │ ├── Improving Language Understanding by Generative Pre-Training (GPT 1).pdf │ └── README.md ├── Attention Based Models └── README.md ├── MODEL ├── Advanced_RNN.ipynb ├── BiLSTM │ ├── README.md │ └── imdb dataset.py ├── GRU │ ├── README.md │ └── imdb dataset.py ├── Long-Short Term Memory │ └── README.md ├── README.md └── Recurrent Neural Network │ ├── Encoding the words.py │ ├── README.md │ ├── RNN_reviews_label.ipynb │ ├── TEXT CLASSIFICATION │ ├── README.md │ ├── RNN Text Classification.ipynb │ └── train.csv │ ├── corpus.py │ ├── data │ ├── README.md │ └── labels.txt │ └── padding.py ├── Notebooks ├── Word2vec_Google_News_300.ipynb └── ty.py ├── Pre Processing ├── Basic Cleaning │ ├── DealingWIthEmoji.ipynb │ ├── README.md │ └── StopWords.py ├── Co-occurrence matrix │ ├── README.md │ ├── concur.jpg │ └── concurrence.jpg ├── Lemmatization │ ├── Lemmatization.ipynb │ └── README.md ├── PRE PROCESSING STEP 01.py ├── README.md ├── RemovePunctuation.py ├── Stemming │ ├── PorterStemmer.ipynb │ ├── README.md │ └── Stemming.py ├── Stop Words │ ├── README.md │ └── Remove default stopwords.py ├── Text_Processing_in_NLP.ipynb ├── Tokenizer and padding.ipynb ├── Word2Vec │ ├── Google_News_word2vec.ipynb │ └── README.md ├── number_to_word.py └── tokenizer │ ├── README.md │ ├── Spacy.py │ └── nltk_tokenize.ipynb ├── Projects └── dataset │ ├── BBC News Train.csv │ └── README.md ├── README.md └── Transfer Learning ├── Projects ├── Quora Insincere Questions Classification │ ├── Quora_Insincere_Questions_Classification_(using_Transfer_Learning).ipynb │ ├── README.md │ └── images │ │ └── final_accuracy_matrics.png └── README.md └── README.md /ACADEMIA/BOOK/README.md: -------------------------------------------------------------------------------- 1 | # BOOK 2 | -------------------------------------------------------------------------------- /ACADEMIA/BOOK/Transformers for NLP.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/ACADEMIA/BOOK/Transformers for NLP.pdf -------------------------------------------------------------------------------- /ACADEMIA/BOOK/long-short-term-memory-networks-with-python.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/ACADEMIA/BOOK/long-short-term-memory-networks-with-python.pdf -------------------------------------------------------------------------------- /ACADEMIA/PAPERS/Attention Is All You Need.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/ACADEMIA/PAPERS/Attention Is All You Need.pdf -------------------------------------------------------------------------------- /ACADEMIA/PAPERS/Improving Language Understanding by Generative Pre-Training (GPT 1).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/ACADEMIA/PAPERS/Improving Language Understanding by Generative Pre-Training (GPT 1).pdf 
-------------------------------------------------------------------------------- /ACADEMIA/PAPERS/README.md: -------------------------------------------------------------------------------- 1 | # PAPERS 2 | -------------------------------------------------------------------------------- /Attention Based Models/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | [Transformers for Natural Language Processing](https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing) 5 | -------------------------------------------------------------------------------- /MODEL/Advanced_RNN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "u4V9VGBGO0na" 7 | }, 8 | "source": [ 9 | "# Advanced RNN - 3\n", 10 | "- CuDNNGRU & CuDNNLSTM implementation\n", 11 | "- Note that you need to install Tensorflow > 1.4 & Keras > 2.08 to implement \n", 12 | "- This source code is running on i5-7500 & GTX 1060 6GB" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "id": "Lu6ok4j1O0ns" 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "from keras.datasets import imdb\n", 24 | "from keras.layers import GRU, LSTM, CuDNNGRU, CuDNNLSTM, Activation\n", 25 | "from keras.preprocessing.sequence import pad_sequences\n", 26 | "from keras.models import Sequential" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": { 32 | "id": "ohWqA2sEO0n0" 33 | }, 34 | "source": [ 35 | "### Import dataset\n", 36 | "- IMDB dataset in Keras datasets\n", 37 | "- doc: https://keras.io/datasets/" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true, 45 | "id": "C6XUZUkHO0n2" 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "num_words = 30000\n", 50 | "maxlen = 300" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": { 57 | "collapsed": true, 58 | "colab": { 59 | "base_uri": "https://localhost:8080/" 60 | }, 61 | "id": "11gT4IUEO0n5", 62 | "outputId": "7d850376-d5cd-4051-ae2e-66a51360790a" 63 | }, 64 | "outputs": [ 65 | { 66 | "output_type": "stream", 67 | "name": "stdout", 68 | "text": [ 69 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz\n", 70 | "17465344/17464789 [==============================] - 0s 0us/step\n", 71 | "17473536/17464789 [==============================] - 0s 0us/step\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = num_words)" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 4, 82 | "metadata": { 83 | "colab": { 84 | "base_uri": "https://localhost:8080/" 85 | }, 86 | "id": "H5ZLz3X7O0n7", 87 | "outputId": "2d378bb9-8c3c-4e8d-c086-8ba76b336df0" 88 | }, 89 | "outputs": [ 90 | { 91 | "output_type": "stream", 92 | "name": "stdout", 93 | "text": [ 94 | "(25000,)\n", 95 | "(25000,)\n", 96 | "(25000,)\n", 97 | "(25000,)\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "print(X_train.shape)\n", 103 | "print(X_test.shape)\n", 104 | "print(y_train.shape)\n", 105 | "print(y_test.shape)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": { 112 | "collapsed": true, 113 | "id": "JeoXMrVkO0n-" 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "# pad the sequences with zeros \n", 118 | "# padding parameter is 
set to 'post' => 0's are appended to end of sequences\n", 119 | "X_train = pad_sequences(X_train, maxlen = maxlen, padding = 'post')\n", 120 | "X_test = pad_sequences(X_test, maxlen = maxlen, padding = 'post')" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 6, 126 | "metadata": { 127 | "collapsed": true, 128 | "id": "N_qepFcfO0oB" 129 | }, 130 | "outputs": [], 131 | "source": [ 132 | "X_train = X_train.reshape(X_train.shape + (1,))\n", 133 | "X_test = X_test.reshape(X_test.shape + (1,))" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 7, 139 | "metadata": { 140 | "colab": { 141 | "base_uri": "https://localhost:8080/" 142 | }, 143 | "id": "uLblL-RHO0oE", 144 | "outputId": "f1a8db71-579c-4d0f-d50d-6f640ca1973f" 145 | }, 146 | "outputs": [ 147 | { 148 | "output_type": "stream", 149 | "name": "stdout", 150 | "text": [ 151 | "(25000, 300, 1)\n", 152 | "(25000, 300, 1)\n", 153 | "(25000,)\n", 154 | "(25000,)\n" 155 | ] 156 | } 157 | ], 158 | "source": [ 159 | "print(X_train.shape)\n", 160 | "print(X_test.shape)\n", 161 | "print(y_train.shape)\n", 162 | "print(y_test.shape)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": { 168 | "id": "X3UbWz6TO0oH" 169 | }, 170 | "source": [ 171 | "### LSTM\n", 172 | "- Naive LSTM model without CuDNN" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 8, 178 | "metadata": { 179 | "collapsed": true, 180 | "id": "rtQz75-sO0oJ" 181 | }, 182 | "outputs": [], 183 | "source": [ 184 | "def lstm_model():\n", 185 | " model = Sequential()\n", 186 | " model.add(LSTM(50, input_shape = (300,1), return_sequences = True))\n", 187 | " model.add(LSTM(1, return_sequences = False))\n", 188 | " model.add(Activation('sigmoid'))\n", 189 | " \n", 190 | " model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])\n", 191 | " return model" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 9, 197 | "metadata": { 198 | "collapsed": true, 199 | "id": "OlL6XIMEO0oK" 200 | }, 201 | "outputs": [], 202 | "source": [ 203 | "model = lstm_model()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 10, 209 | "metadata": { 210 | "colab": { 211 | "base_uri": "https://localhost:8080/" 212 | }, 213 | "id": "8r9J-yfoO0oL", 214 | "outputId": "b86d20e9-0b92-4ca5-f9ee-f0da10c1ea1d" 215 | }, 216 | "outputs": [ 217 | { 218 | "output_type": "stream", 219 | "name": "stdout", 220 | "text": [ 221 | "CPU times: user 1min 20s, sys: 3.25 s, total: 1min 23s\n", 222 | "Wall time: 1min 15s\n" 223 | ] 224 | }, 225 | { 226 | "output_type": "execute_result", 227 | "data": { 228 | "text/plain": [ 229 | "" 230 | ] 231 | }, 232 | "metadata": {}, 233 | "execution_count": 10 234 | } 235 | ], 236 | "source": [ 237 | "%%time\n", 238 | "model.fit(X_train, y_train, batch_size = 100, epochs = 10, verbose = 0)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 11, 244 | "metadata": { 245 | "colab": { 246 | "base_uri": "https://localhost:8080/" 247 | }, 248 | "id": "FDNJ2ZBGO0oM", 249 | "outputId": "bfe45081-48f1-45fe-970d-32bfe4725c90" 250 | }, 251 | "outputs": [ 252 | { 253 | "output_type": "stream", 254 | "name": "stdout", 255 | "text": [ 256 | "Accuracy: 51.08%\n" 257 | ] 258 | } 259 | ], 260 | "source": [ 261 | "scores = model.evaluate(X_test, y_test, verbose=0)\n", 262 | "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": { 268 | "id": 
"BInIENuqO0oO" 269 | }, 270 | "source": [ 271 | "### GRU\n", 272 | "- Naive GRU model without CuDNN" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 12, 278 | "metadata": { 279 | "collapsed": true, 280 | "id": "-P_jegYiO0oP" 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "def gru_model():\n", 285 | " model = Sequential()\n", 286 | " model.add(GRU(50, input_shape = (300,1), return_sequences = True))\n", 287 | " model.add(GRU(1, return_sequences = False))\n", 288 | " model.add(Activation('sigmoid'))\n", 289 | " \n", 290 | " model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])\n", 291 | " return model" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 13, 297 | "metadata": { 298 | "collapsed": true, 299 | "id": "yiWt4yt2O0oQ" 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "model = gru_model()" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 14, 309 | "metadata": { 310 | "colab": { 311 | "base_uri": "https://localhost:8080/" 312 | }, 313 | "id": "0xGUYNIMO0oQ", 314 | "outputId": "d453ff0f-6f55-4d28-e936-78a8f3f491af" 315 | }, 316 | "outputs": [ 317 | { 318 | "output_type": "stream", 319 | "name": "stdout", 320 | "text": [ 321 | "CPU times: user 1min 26s, sys: 2.43 s, total: 1min 28s\n", 322 | "Wall time: 1min 20s\n" 323 | ] 324 | }, 325 | { 326 | "output_type": "execute_result", 327 | "data": { 328 | "text/plain": [ 329 | "" 330 | ] 331 | }, 332 | "metadata": {}, 333 | "execution_count": 14 334 | } 335 | ], 336 | "source": [ 337 | "%%time\n", 338 | "model.fit(X_train, y_train, batch_size = 100, epochs = 10, verbose = 0)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 15, 344 | "metadata": { 345 | "colab": { 346 | "base_uri": "https://localhost:8080/" 347 | }, 348 | "id": "H_UBe9wQO0oR", 349 | "outputId": "55e19fbb-2603-4063-8006-6ca86eeb0ee1" 350 | }, 351 | "outputs": [ 352 | { 353 | "output_type": "stream", 354 | "name": "stdout", 355 | "text": [ 356 | "Accuracy: 52.01%\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "scores = model.evaluate(X_test, y_test, verbose=0)\n", 362 | "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": { 368 | "collapsed": true, 369 | "id": "fCGw2oeuO0oS" 370 | }, 371 | "source": [ 372 | "### CuDNN LSTM" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 16, 378 | "metadata": { 379 | "collapsed": true, 380 | "id": "YnZAGR5iO0oS" 381 | }, 382 | "outputs": [], 383 | "source": [ 384 | "def cudnn_lstm_model():\n", 385 | " model = Sequential()\n", 386 | " model.add(CuDNNLSTM(50, input_shape = (300,1), return_sequences = True))\n", 387 | " model.add(CuDNNLSTM(1, return_sequences = False))\n", 388 | " model.add(Activation('sigmoid'))\n", 389 | " \n", 390 | " model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])\n", 391 | " return model" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 17, 397 | "metadata": { 398 | "collapsed": true, 399 | "id": "Yiez0iGoO0oT" 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "model = cudnn_lstm_model()" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 18, 409 | "metadata": { 410 | "colab": { 411 | "base_uri": "https://localhost:8080/" 412 | }, 413 | "id": "ueABPBUFO0oT", 414 | "outputId": "b1dbd356-bacd-4d68-a74d-c0f7692ff0f6" 415 | }, 416 | "outputs": [ 417 | { 418 | "output_type": 
"stream", 419 | "name": "stdout", 420 | "text": [ 421 | "CPU times: user 1min 15s, sys: 1.66 s, total: 1min 16s\n", 422 | "Wall time: 1min 12s\n" 423 | ] 424 | }, 425 | { 426 | "output_type": "execute_result", 427 | "data": { 428 | "text/plain": [ 429 | "" 430 | ] 431 | }, 432 | "metadata": {}, 433 | "execution_count": 18 434 | } 435 | ], 436 | "source": [ 437 | "%%time\n", 438 | "model.fit(X_train, y_train, batch_size = 100, epochs = 10, verbose = 0)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 19, 444 | "metadata": { 445 | "colab": { 446 | "base_uri": "https://localhost:8080/" 447 | }, 448 | "id": "W6zndIX5O0oU", 449 | "outputId": "c9cc5cf3-6001-42a3-9ff3-b82328cb6ad0" 450 | }, 451 | "outputs": [ 452 | { 453 | "output_type": "stream", 454 | "name": "stdout", 455 | "text": [ 456 | "Accuracy: 52.24%\n" 457 | ] 458 | } 459 | ], 460 | "source": [ 461 | "scores = model.evaluate(X_test, y_test, verbose=0)\n", 462 | "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": { 468 | "collapsed": true, 469 | "id": "fbXtRg4vO0oV" 470 | }, 471 | "source": [ 472 | "### CuDNN GRU" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 20, 478 | "metadata": { 479 | "collapsed": true, 480 | "id": "y9dUp6TvO0oW" 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "def cudnn_gru_model():\n", 485 | " model = Sequential()\n", 486 | " model.add(CuDNNGRU(50, input_shape = (300,1), return_sequences = True))\n", 487 | " model.add(CuDNNGRU(1, return_sequences = False))\n", 488 | " model.add(Activation('sigmoid'))\n", 489 | " \n", 490 | " model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])\n", 491 | " return model" 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": 21, 497 | "metadata": { 498 | "collapsed": true, 499 | "id": "vfpflYaaO0oW" 500 | }, 501 | "outputs": [], 502 | "source": [ 503 | "model = cudnn_gru_model()" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 22, 509 | "metadata": { 510 | "colab": { 511 | "base_uri": "https://localhost:8080/" 512 | }, 513 | "id": "5nVeX6AdO0oW", 514 | "outputId": "13993709-5ed4-45a7-d39f-0bf70b8f2797" 515 | }, 516 | "outputs": [ 517 | { 518 | "output_type": "stream", 519 | "name": "stdout", 520 | "text": [ 521 | "CPU times: user 1min 21s, sys: 1.46 s, total: 1min 23s\n", 522 | "Wall time: 1min 19s\n" 523 | ] 524 | }, 525 | { 526 | "output_type": "execute_result", 527 | "data": { 528 | "text/plain": [ 529 | "" 530 | ] 531 | }, 532 | "metadata": {}, 533 | "execution_count": 22 534 | } 535 | ], 536 | "source": [ 537 | "%%time\n", 538 | "model.fit(X_train, y_train, batch_size = 100, epochs = 10, verbose = 0)" 539 | ] 540 | }, 541 | { 542 | "cell_type": "code", 543 | "execution_count": 23, 544 | "metadata": { 545 | "colab": { 546 | "base_uri": "https://localhost:8080/" 547 | }, 548 | "id": "9u5_aauKO0oX", 549 | "outputId": "187b49b1-b504-4388-8994-88bfab35dc7d" 550 | }, 551 | "outputs": [ 552 | { 553 | "output_type": "stream", 554 | "name": "stdout", 555 | "text": [ 556 | "Accuracy: 51.84%\n" 557 | ] 558 | } 559 | ], 560 | "source": [ 561 | "scores = model.evaluate(X_test, y_test, verbose=0)\n", 562 | "print(\"Accuracy: %.2f%%\" % (scores[1]*100))" 563 | ] 564 | } 565 | ], 566 | "metadata": { 567 | "kernelspec": { 568 | "display_name": "Python 3", 569 | "language": "python", 570 | "name": "python3" 571 | }, 572 | "language_info": { 573 | "codemirror_mode": { 574 | 
"name": "ipython", 575 | "version": 3 576 | }, 577 | "file_extension": ".py", 578 | "mimetype": "text/x-python", 579 | "name": "python", 580 | "nbconvert_exporter": "python", 581 | "pygments_lexer": "ipython3", 582 | "version": "3.8.5" 583 | }, 584 | "colab": { 585 | "name": "Advanced RNN.ipynb", 586 | "provenance": [], 587 | "machine_shape": "hm" 588 | }, 589 | "accelerator": "GPU" 590 | }, 591 | "nbformat": 4, 592 | "nbformat_minor": 0 593 | } -------------------------------------------------------------------------------- /MODEL/BiLSTM/README.md: -------------------------------------------------------------------------------- 1 | # BiLSTM 2 | 3 | ## What is LSTM and BiLSTM? 4 | 5 | LSTM stands for Long short-term memory, which is a type of **RNN(Recurrent neural network)**. Because of its design characteristics, LSTM is too useful for modeling the time series data, such as text data.BiLSTM is the abbreviation of Bi-directional Long Short-Term Memory, which is a combination of forward LSTM and backward LSTM. Both are often used to **model contextual information** in natural language processing tasks. 6 | 7 | 8 | ## Why use LSTM and BiLSTM? 9 | 10 | Combining the representations of words into the representations of sentences, you can use the addition method, that is, add all the representations of the words, or average them, but these methods do not take into account the order of words in the sentence. Such as the sentence ***"I don't think he is good"***. The word "no" is a negation of the following "good", that is, the emotional polarity of the sentence is derogatory. The LSTM model can better capture the long-distance dependencies. Because LSTM can learn what information to remember and what information to forget through the training process. 11 | 12 | But there is still a problem in modeling sentences with LSTM: it is impossible to encode information from back to front.In more fine-grained classification, such as the five classification tasks for strong meaning, weak meaning, neutral, weak derogation, and strong derogation, attention needs to be paid to the interaction between affective words, degree words, and negative words . For example, "This restaurant is too dirty to be good, not as good as next door". Here, "No" is a modification of the degree of "dirty". BiLSTM can better capture the two-way semantic dependency. 13 | 14 | 15 | The thing involved in Bidirectional Recurrent neural networks(RNN) is preety starightforward.Which involves in making an exact copy of the first recurrent layer in the network then providing the input sequence as it is the input of the first layer and providing the reversed copy of a input sequence to the replicated layer. This get the better of the limitations of the traditional RNN. BRNN(Bidirectional recurrent neural network), which can be trained using all avaiable input information in the past and future of the particular time-step. Split of the state neurons in regular Recurrent neural network(RNN) is responsible for the states(which is in positive time direction) and the part of the backward states(which is in negative time direction). 16 | 17 | In speech recognition domain a context of whole utterance is used to explain what is being said rather than the linear interpretation thus the input squence is feeded bi-directionally. To be accurate, time steps in the input sequence are processed one at a time, but the network steps through the sequence in both direction at the same time. 
18 | 19 | 20 | -------------------------------------------------------------------------------- /MODEL/BiLSTM/imdb dataset.py: -------------------------------------------------------------------------------- 1 | #importing libraries 2 | from __future__ import print_function 3 | import numpy as np 4 | 5 | from keras.preprocessing import sequence 6 | from keras.models import Sequential 7 | from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional 8 | from keras.datasets import imdb 9 | 10 | 11 | max_features = 20000 12 | # cut texts after this number of words 13 | # (among top max_features most common words) 14 | maxlen = 100 15 | batch_size = 32 16 | 17 | print('Loading data...') 18 | (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features) 19 | print(len(x_train), 'train sequences') 20 | print(len(x_test), 'test sequences') 21 | 22 | print('Pad sequences (samples x time)') 23 | x_train = sequence.pad_sequences(x_train, maxlen=maxlen) 24 | x_test = sequence.pad_sequences(x_test, maxlen=maxlen) 25 | print('x_train shape:', x_train.shape) 26 | print('x_test shape:', x_test.shape) 27 | y_train = np.array(y_train) 28 | y_test = np.array(y_test) 29 | 30 | model = Sequential() 31 | model.add(Embedding(max_features, 128, input_length=maxlen)) 32 | model.add(Bidirectional(LSTM(64))) 33 | model.add(Dropout(0.5)) 34 | model.add(Dense(1, activation='sigmoid')) 35 | 36 | # try using different optimizers and different optimizer configs 37 | model.compile('adam', 'binary_crossentropy', metrics=['accuracy']) 38 | 39 | print('Train...') 40 | model.fit(x_train, y_train, 41 | batch_size=batch_size, 42 | epochs=4, 43 | validation_data=[x_test, y_test]) 44 | -------------------------------------------------------------------------------- /MODEL/GRU/README.md: -------------------------------------------------------------------------------- 1 | # GRU(Gated Recurrent Unit) Network 2 | 3 | With the widespread application of LSTMs in natural language processing, especially text classification tasks, people have gradually discovered that LSTMs have the disadvantages of long training time, many parameters, and complex internal calculations. Cho et al. In 2014 further proposed a simpler GRU model that combines the unit state and hidden layer state of the LSTM with some other changes. The forget gate and the input gate are combined into a single update gate . It also mixes cell states and hidden states . The GRU replaces the forget gates and inputs in the LSTM with update gates. Merging the cell state and the hidden state ht, the method of calculating new information at the current moment is different from that of LSTM. 4 | 5 | The GRU model is a model that maintains the LSTM effect, has a simpler structure, fewer parameters, and better convergence. The GRU model consists of an update gate and a reset gate. 6 | 7 | ## Update and reset gates 8 | 9 | A moment before the output of the hidden layer of the current degree of influence on the hidden layer by updating door control, update value the greater the greater the impact of the door hidden layer output current before a timing of the hidden layer; 10 | 11 | The extent to which the hidden layer information at the previous moment is ignored is controlled by the reset gate . The smaller the value of the reset gate, the more it is ignored. GRU structure is more streamlined. 
12 | 13 | One of the reasons for using LSTM is to solve the problem that the gradient errors of a deep RNN accumulate too much, so that the gradient vanishes to zero or explodes to infinity and optimization cannot continue. The construction of the GRU is simpler: it has one less gate than the LSTM, so there are fewer matrix multiplications. GRU can therefore save a lot of time when the training data is large. In short, the GRU simplifies the calculation while still avoiding the vanishing gradient, which is what the LSTM was designed to address. 14 | 15 | ## GRU model 16 | 17 | Unlike LSTM, GRU has only two gates, namely the update gate and the reset gate, i.e. $z_t$ and $r_t$ in the figure. 18 | 19 | The update gate controls the degree to which the state information of the previous moment is brought into the current state. The larger the value of the update gate, the more state information from the previous moment is brought in. 20 | 21 | The reset gate controls the degree to which the state information of the previous moment is ignored. The smaller the value of the reset gate, the more it is ignored. 22 | 23 | 24 | 25 | ## Forward communication 26 | 27 | $r_t = \sigma(W_r .[h_{(t-1)}, x_t])$ 28 | 29 | $z_t = \sigma(W_z .[h_{(t-1)}, x_t])$ 30 | 31 | $\tilde{h_t} = tanh(W_{\tilde h} .[r_t * h_{t-1} , x_t])$ 32 | 33 | $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde {h_t}$ 34 | 35 | $y_t = \sigma(W_o . h_t)$ 36 | 37 | # The Difference between LSTM and GRU 38 | 39 | - GRU has fewer parameters than LSTM, so it is easier to converge. On large data sets, however, the expressive performance of LSTM is still better than GRU. 40 | 41 | - The performance of GRU and LSTM is similar on general data sets. 42 | 43 | - Structurally, the GRU has only two gates (update and reset), while the LSTM has three gates (forget, input, output). The GRU passes the hidden state directly to the next unit, while the LSTM uses the memory cell to carry the hidden state.
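To make the forward-communication equations above concrete, the following is a minimal NumPy sketch of a single GRU step (illustrative only: the weight names, toy dimensions, and omission of bias terms are assumptions, and the repository's scripts use Keras' built-in `GRU` layer instead):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h, W_o):
    """One GRU step following the forward equations above (biases omitted)."""
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                     # reset gate
    z_t = sigmoid(W_z @ concat)                     # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde        # new hidden state
    y_t = sigmoid(W_o @ h_t)                        # output
    return h_t, y_t

# toy dimensions: 3-dim input, 4-dim hidden state, 1-dim output
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=3), np.zeros(4)
W_r = rng.normal(size=(4, 7)); W_z = rng.normal(size=(4, 7))
W_h = rng.normal(size=(4, 7)); W_o = rng.normal(size=(1, 4))
h_t, y_t = gru_step(x_t, h_prev, W_r, W_z, W_h, W_o)
print(h_t.shape, y_t.shape)  # (4,) (1,)
```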
44 | -------------------------------------------------------------------------------- /MODEL/GRU/imdb dataset.py: -------------------------------------------------------------------------------- 1 | from keras.datasets import imdb 2 | from keras.layers import GRU, LSTM, Activation 3 | from keras.preprocessing.sequence import pad_sequences 4 | from keras.models import Sequential 5 | 6 | num_words = 30000 7 | maxlen = 300 8 | 9 | (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = num_words) 10 | 11 | # pad the sequences with zeros 12 | # padding parameter is set to 'post' => 0's are appended to end of sequences 13 | X_train = pad_sequences(X_train, maxlen = maxlen, padding = 'post') 14 | X_test = pad_sequences(X_test, maxlen = maxlen, padding = 'post') 15 | 16 | X_train = X_train.reshape(X_train.shape + (1,)) 17 | X_test = X_test.reshape(X_test.shape + (1,)) 18 | 19 | def gru_model(): 20 | model = Sequential() 21 | model.add(GRU(50, input_shape = (300,1), return_sequences = True)) 22 | model.add(GRU(1, return_sequences = False)) 23 | model.add(Activation('sigmoid')) 24 | 25 | model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy']) 26 | return model 27 | 28 | model = gru_model() 29 | 30 | model.fit(X_train, y_train, batch_size = 100, epochs = 2, verbose = 1) 31 | 32 | scores = model.evaluate(X_test, y_test, verbose=0) 33 | print("Accuracy: %.2f%%" % (scores[1]*100)) 34 | -------------------------------------------------------------------------------- /MODEL/Long-Short Term Memory/README.md: -------------------------------------------------------------------------------- 1 | ## LSTM (Long-Short Term Memory) 2 | 3 | Some Understanding of RNN(Recurrent Neural Network) 4 | 5 | 6 | When we think about one thing, we don't discard everything before, and then think with a blank brain. Human mind has persistence. Consider such a problem, when we talk to others, we want to predict what this person will say next, usually we need to understand what he said in the previous sentence, and then based on past communication experience, we can predict his next sentence. For example, "It's raining today, I" you might guess he would say "no umbrella" or "don't want to go out". Traditional neural networks cannot do this, and recurrent neural networks (RNNs) can solve the problem of correlation between sequence data. 7 | 8 | The main purpose of the neural network is a processing cycle and the predicted sequence data, the network configuration information prior to the neural network memory cycle will, later affect the output node information before using its typical structure as shown below, it can be seen cycle The nodes between the hidden layers of the neural network are connected. The input of the hidden layer includes not only the input of the input layer but also the output of the previous hidden layer. 9 | 10 | 11 | 12 | One disadvantage of the recurrent neural network structure in the figure above is that it uses only the previous information in the sequence to make predictions, and does not use the subsequent information. Because if this sentence is given, "Teddy Roosevelt was a great President." In order to determine whether Teddy is part of a person's name, it is not enough to know only the first two words in the sentence. This is also very useful, because the sentence may also be like, "Teddy bears are on sale!". So if only the first three words are given, it is impossible to know exactly whether Teddy is part of a person’s name. 
The first example is a person’s name, and the second example is not, so you can’t tell just by looking at the first three words.Therefore, BRNN is proposed to solve this problem.BLSTM is a typical representative of BRNN. 13 | 14 | 15 | ### LSTM (Long Short-Term Memory Network) 16 | 17 | "A lot of factories are opened in a certain place, the air pollution is very serious ... the sky has turned gray", if our model is trying to predict the last word of this sentence "gray", it can not be done based on short-term dependence alone, because if We can't tell whether the sky is "blue" or "gray" without looking at the "air pollution is very serious" above. Therefore, the text gap between the current predicted position and related information may become very large. When this gap becomes large, the simple recurrent neural network will lose the ability to learn such far information. LSTM is used to solve such problems. 18 | 19 | LSTM network is a special network structure with three "gates", which are "forget gate", "input gate", and "output gate" in this order.The figure below shows the network structure and formula of LSTM, where c is the memory cell state, x is the input, and a is the output of each layer. 20 | 21 | 22 | 23 | Let's explain these three gates separately. Understanding the role of these three gates is also the key to understanding LSTM. 24 | 25 | **1.Forgotten Gate:** 26 | 27 |    Effect on: memory cell state 28 | 29 |    Effect: Selective forgetting of information in memory cells 30 | 31 | Example: "She is busy today ... I am" When predicting "am" we have to selectively forget the previous subject "She", otherwise a syntax error will occur. 32 | 33 | **2.Input gate:** 34 | 35 |    Effect on: memory cell state 36 | 37 |    Effect: Record new information selectively into new cell states 38 | 39 | Example: In the above sentence, we will update this subject information to the cell state based on "I", so "am" will be predicted at the end. 40 | 41 | **3.Output gate:** 42 | 43 |    Effect on: input and hidden layer output 44 | 45 |    Effect: The final output includes both the cell state and the input, and the result is updated to the next hidden layer. 46 | 47 | Through these three gates, the LSTM can more effectively decide which information is forgotten and which information is retained. Through the forward transmission diagram of the LSTM, we can see that a cell state can be easily transmitted to a long distance to affect the output, so the LSTM can Solve the learning of long distance information. 48 | 49 | 50 | 51 | ## Detail Explaination of LSTM architecture 52 | 53 | LSTM is a very common and useful algorithm in deep learning, especially in natural language processing.What is the internal structure of the LSTM architecture? First, let's look at the overall framework of LSTM: 54 | 55 | 56 | In this picture, there is an LSTM module in the middle, and there are three inputs: $c^{(t-1)}$ , $h^{(t-1)}$ and $x^t$ and then after LSTM, the outputs are $c^t$ , $h^t$ and $y^t$ , where $x^t$ represents the input of this round, $h^{(t-1)}$ represents the state quantity output of the previous round,$c^{(t-1)}$ represents the carrier of a global message in the previous round; then $y^t$ represents the output of this round,$h^t$ represents the state quantity output of this round,$c^t$ represents a global information carrier for this round. So it seems that a general framework of LSTM understands. What does the internal structure of LSTM look like? 
57 | 58 | First, we merge $x^t$ and $h^{(t-1)}$ into a single vector, multiply it by a weight matrix *W*, and pass it through a tanh layer to get a vector z : 59 | 60 | 61 | In the same way, we merge $x^t$ and $h^{(t-1)}$ into a vector, but use *sigmoid* as the activation function; the diagram is as follows: 62 | 63 | Multiplying by the matrices $W^f$ , $W^i$ and $W^o$ gives $z^f$ , $z^i$ and $z^o$. We can then use these vectors together with $c^{(t-1)}$ to obtain $c^t$ ; the formula is: 64 | 65 |
$c^t = z^f . c^{(t-1)} + z^i . z$
66 | 67 | Then, having obtained $c^t$, we can get $h^t$ ; the formula is: 68 | 69 |
$h^t = z^o . tanh (c^t)$
70 | 71 | Finally, we can get the output of this round, $y^t$ ; the formula is: 72 | 73 |
$y^t = \sigma(W' . h^t)$
74 | 75 | In summary, we can get the complete internal structure of the LSTM as shown below: 76 | 77 | 78 | With this structure, we can clearly and intuitively see the internal structure of the LSTM. The green part represents this round's input $x^t$ and output $y^t$ ; the blue part indicates the state output by the previous round $h^{(t-1)}$ and the state output by this round $h^t$ ; the red part represents the information carrier of the previous round $c^{t-1}$ and the information carrier output by this round $c^t$. This is a single LSTM unit. We can cascade multiple LSTM units to form our LSTM deep learning network. The diagram is as follows: 79 | 80 | 81 | After seeing the overall architecture of the LSTM, let's analyze each part in detail. The reason the entire LSTM architecture can remember long-term information lies mainly in the cell state $c^t$: between $c^{t-1}$ and $c^t$ only a small amount of information is exchanged, so information can be carried across the whole chain of LSTM units. The state diagram of $c^t$ is as follows: 82 | 83 | 84 | The reason LSTM can memorize both long-term and short-term information is that it has a "gate" structure to remove and add information to the neuron. A "gate" is a method for selectively passing information. The first is the forget gate. The first step in LSTM is to decide what information we need to forget from the neuron state. As shown below, the two inputs pass through a *sigmoid* function, so the output value is between 0 and 1: 1 means the information is completely retained, 0 means the information is completely forgotten. Through the forget gate, the LSTM can selectively forget meaningless information. The part shown in the box below is the forget gate of the LSTM: 85 | 86 | 87 | This part can be expressed by the formula: 88 | 89 |
$z^f = \sigma(W_f.[h_{t-1}, x_t] + b_f)$
90 | 91 | The next step is to decide what new information to store in the neuron state. This part has two components: a *sigmoid* layer determines which values the LSTM needs to update, and a tanh layer creates a vector of new candidate values that will be added to the state. These two pieces of information together generate the update of the state; this part is called the input gate. The process is as follows: 92 | 93 | The entire process can be expressed by the formulas: 94 | 95 |
$z^i = \sigma(W_i.[h_{t-1}, x_t] + b_i)$
96 | 97 |
$z=tanh(W.[h_{t-1}, x_t] + b)$
98 | 99 | Having identified the information that needs to be updated, we can update the cell state from $c^{t-1}$ to $c^t$, as shown in the previous figure. This 100 | can be expressed by the formula: 101 | 102 |
$c^t = z^f. c^{t-1} + z^i.z$
103 | 104 | In this process, $z^f . c^{t-1}$ takes the previous state information $c^{t-1}$ and forgets the part to be discarded, and then the new candidate value vector $z^i . z$ is added, giving the system's new round of information $c^{t}$. 105 | 106 | Having updated the information state, we also need to update the system's neuron state $h^t$; the whole process is shown in the box below: 107 | 108 | 109 | This is the output gate, which controls the output of the LSTM. The system needs to determine what value to output, and this output is based on the current neuron state. First we use a *sigmoid* layer to determine which parts of the neuron state to output; then we pass the cell state through a *tanh* function and multiply the two, which gives the new state quantity of the LSTM, $h^t$. Passing this through a further *sigmoid* gives the output of this round, $y^t$. This can be written as: 110 |
$z^o = \sigma(W_o.[h_{t-1}, x_t] + b_o)$
111 |
$h^t = z^o . tanh(c^t)$
112 |
$y^t = \sigma(W' . h^t)$
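Putting the gate equations above together, one step of an LSTM cell can be sketched in NumPy as follows (a minimal illustration under assumed toy shapes and weight names, with the final *sigmoid* readout $y^t = \sigma(W' . h^t)$ included; the repository's models rely on Keras' built-in `LSTM` layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, W_f, W_i, W_o, W_out, b, b_f, b_i, b_o):
    """One LSTM step following the gate equations above (z, z^f, z^i, z^o, c^t, h^t, y^t)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    z   = np.tanh(W @ concat + b)            # candidate information
    z_f = sigmoid(W_f @ concat + b_f)        # forget gate
    z_i = sigmoid(W_i @ concat + b_i)        # input gate
    z_o = sigmoid(W_o @ concat + b_o)        # output gate
    c_t = z_f * c_prev + z_i * z             # new cell state
    h_t = z_o * np.tanh(c_t)                 # new hidden state
    y_t = sigmoid(W_out @ h_t)               # this round's output
    return h_t, c_t, y_t

# toy shapes: 3-dim input, 4-dim hidden/cell state, 1-dim output
rng = np.random.default_rng(0)
x_t, h_prev, c_prev = rng.normal(size=3), np.zeros(4), np.zeros(4)
W, W_f, W_i, W_o = (rng.normal(size=(4, 7)) for _ in range(4))
W_out = rng.normal(size=(1, 4))
b = b_f = b_i = b_o = np.zeros(4)
h_t, c_t, y_t = lstm_step(x_t, h_prev, c_prev, W, W_f, W_i, W_o, W_out, b, b_f, b_i, b_o)
print(h_t.shape, c_t.shape, y_t.shape)  # (4,) (4,) (1,)
```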
113 | 114 | So the entire LSTM can be divided into these parts as described above, each part has a different role, I hope you got the idea about the architecture of LSTM this can help you thoroughly understand the structure and principle of the LSTM neural network. If there are any omissions in the text, please don't hesitate to suggest. 115 | -------------------------------------------------------------------------------- /MODEL/README.md: -------------------------------------------------------------------------------- 1 | # MODEL 2 | 3 | #### 1) **RNN** 4 | #### 2) **LSTM** 5 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/Encoding the words.py: -------------------------------------------------------------------------------- 1 | # forms dictionary 2 | from collections import Counter 3 | counts = Counter(words) 4 | counts 5 | 6 | #display vocabulary(unique words in alphabetical order) 7 | sorted(counts) 8 | 9 | 10 | from collections import Counter 11 | counts = Counter(words) 12 | vocab = sorted(counts, key=counts.get, reverse=True) 13 | vocab 14 | 15 | for i, word in enumerate(vocab, 1): 16 | print(i , word) 17 | 18 | 19 | from collections import Counter 20 | counts = Counter(words) 21 | vocab = sorted(counts, key=counts.get, reverse=True) 22 | 23 | # Create your dictionary that maps vocab words to integers here 24 | vocab_to_int = {word: i for i, word in enumerate(vocab, 1)} # start from 1 25 | 26 | # Convert the reviews to integers, same shape as reviews list, but with integers 27 | review_ints = [] # stores review in number form(as each word has its won number) 28 | for each in reviews: 29 | review_ints.append([vocab_to_int[word] for word in each.split()]) 30 | 31 | vocab_to_int = {word: i for i, word in enumerate(vocab, 1)} 32 | vocab_to_int 33 | 34 | review_lens = Counter([len(x) for x in review_ints]) # length of each review 35 | 36 | 37 | print("Zero-length reviews: {}".format(min(review_lens))) 38 | print("Maximum review length: {}".format(max(review_lens))) 39 | 40 | review_ints = [review for review in review_ints if (len(review) > 0)] # filter out review with zero length 41 | 42 | # forming dataset(padding) 43 | seq_len = 200 44 | features = [] 45 | for review in review_ints: 46 | review_len = len(review) 47 | len_diff = seq_len - review_len 48 | if len_diff <= 0: 49 | features.append(review[:seq_len]) 50 | print(review[:seq_len]) 51 | else: 52 | padding = [0] * len_diff 53 | padded_feature = padding + review 54 | features.append(padded_feature) 55 | print() 56 | features = np.asarray(features) 57 | 58 | 59 | features.shape # 2d (number of review, length of each review(fixed after padding(200 in this case) 60 | 61 | 62 | # splitting dataset 63 | split_frac = 0.8 64 | split_idx = int(len(features) * split_frac) 65 | 66 | train_x, val_x = features[:split_idx], features[split_idx:] 67 | train_y, val_y = labels[:split_idx], labels[:split_idx] 68 | 69 | test_idx = int(len(val_x) * 0.5) 70 | val_x, test_x = val_x[:test_idx], val_x[test_idx:] 71 | val_y, test_y = val_y[:test_idx], val_y[test_idx:] 72 | 73 | print("\t\t\tFeature Shapes:") 74 | print("Train set: \t\t{}".format(train_x.shape), 75 | "\nValidation set: \t{}".format(val_x.shape), 76 | "\nTest set: \t\t{}".format(test_x.shape)) 77 | 78 | 79 | train_x.shape #2d 80 | train_y.shape #1d 81 | # we need train_x as 3d tensor and train_y a 2d tensor 82 | 83 | train_x = np.array(train_x).reshape((train_x.shape[0], train_x.shape[1], 1)) # make 3d 84 | train_y = 
to_categorical(train_y) # make 2d (numebr of labels , number of categories) 85 | 86 | 87 | # model 88 | def vanilla_rnn(): 89 | model = Sequential() 90 | model.add(SimpleRNN(50, input_shape = (200,1), return_sequences = True)) 91 | model.add(SimpleRNN(50,return_sequences = True)) 92 | model.add(SimpleRNN(50)) 93 | model.add(Dense(46)) 94 | model.add(Dense(46)) 95 | model.add(Dense(46)) 96 | model.add(Activation('sigmoid')) 97 | 98 | adam = tf.optimizers.Adam(learning_rate=0.001) 99 | model.compile(loss = 'sparse_categorical_crossentropy', optimizer = adam, metrics = ['accuracy']) 100 | 101 | return model 102 | 103 | model = KerasClassifier(build_fn = vanilla_rnn, epochs = 1, batch_size = 50, verbose = 1) 104 | 105 | model.fit(train_x, train_y) 106 | 107 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/README.md: -------------------------------------------------------------------------------- 1 | # Recurrent Neural Network 2 | 3 | ## 1) After CNN, why RNN is there? 4 | 5 | The backpropagation algorithm in CNN (convolutional neural network), we know that their output only considers the influence of the previous input and does not consider 6 | the influence of other moments of input, such as simple cats, dogs, handwritten numbers and other single objects. 7 | 8 | However, for some related to time, such as the prediction of the next moment of the video, the prediction of the content of the previous and subsequent documents, 9 | etc., the performance of these algorithms is not satisfactory. Therefore, RNN should be applied and was born. 10 | 11 | ## 2) What is RNN? 12 | 13 | RNN is a special neural network structure, which is proposed based on the view that " human cognition is based on past experience and memory ". It is different from 14 | DNN and CNN in that it not only considers the input of the previous moment, And it gives the network a 'memory' function of the previous content . 15 | 16 | The reason why RNN is called recurrent neural network is that the current output of a sequence is also related to the previous output. The specific manifestation is 17 | that the network memorizes the previous information and applies it to the current output calculation, that is, the nodes between the hidden layers are connected, and 18 | the input of the hidden layer includes not only the output of the input layer It also includes the output of the hidden layer from the previous moment. 19 | 20 | ## 3) What are the main application areas of RNN ? 21 | 22 | There are many application fields of RNN. It can be said that as long as the problem of chronological order is considered, RNN can be used to solve it. Here are some 23 | common application fields: 24 | 25 |     ① Natural Language Processing (NLP) : There are video processing ,  text generation , language model , image processing 26 | 27 |     ② Machine translation , machine writing novels 28 | 29 |     ③ Speech recognition 30 | 31 |     ④ Image description generation 32 | 33 |     ⑤ Text similarity calculation 34 | 35 |     ⑥ New application areas such as music recommendation , Netease koala product recommendation , Youtube video recommendation, etc. 36 | 37 | ### Different types of RNN 38 | 39 | ![alt](https://miro.medium.com/max/1400/0*1PKOwfxLIg_64TAO.jpeg) 40 | 41 | #### One-to-one: 42 | 43 | This also called as Plain/Vaniall Neural networks. It deals with Fixed size of input to Fixed size of Output where they are independent of previous information/output. 
44 | 45 | Ex: Image classification. 46 | 47 | 48 | #### One-to-Many: 49 | 50 | it deals with fixed size of information as input that gives sequence of data as output. 51 | 52 | Ex:Image Captioning takes image as input and outputs a sentence of words. 53 | 54 | ![alt](https://miro.medium.com/max/1400/0*d9FisCKzVZ29SxUu.png) 55 | 56 | #### Many-to-One: 57 | 58 | It takes Sequence of information as input and ouputs a fixed size of output. 59 | 60 | Ex:sentiment analysis where a given sentence is classified as expressing positive or negative sentiment. 61 | 62 | 63 | #### Many-to-Many: 64 | 65 | It takes a Sequence of information as input and process it recurrently outputs a Sequence of data. 66 | 67 | Ex: Machine Translation, where an RNN reads a sentence in English and then outputs a sentence in French. 68 | 69 | ## RNN model structure 70 | 71 | Earlier we said that RNN has the function of "memory" of time, so how does it realize the so-called "memory"? 72 | 73 | 74 |
Figure 1 RNN structure diagram 
75 | 76 | As shown in Figure 1, we can see that the RNN hierarchy is simpler than CNN.It mainly consists of an input layer , a Hidden Layer , and an output layer . 77 | 78 | And you will find that **there is an arrow in the Hidden Layer to  indicate the cyclic update of the data.This is the method to implement the time memory function.** 79 | 80 | ![alt](https://miro.medium.com/max/1400/1*xn5kA92_J5KLaKcP7BMRLA.gif) 81 | 82 | * t — time step 83 | * X — input 84 | * h — hidden state 85 | * length of X — size/dimension of input 86 | * length of h — no. of hidden units. 87 | 88 | 89 |
Figure 2 Unfolded RNN Diagram 
90 | 91 | Figure 2 shows the hierarchical expansion of the Hidden Layer where 92 | 93 | **T-1, t, t + 1 represent the time series** 94 | 95 | **X represents the input sample** 96 | 97 | **St represents the memory of the sample at time t, St = f (W * St -1 + U * Xt).** 98 | 99 | **W is the weight of the input** 100 | 101 | **U is the weight of the input sample at the moment** 102 | 103 | **V is the weight of the output sample.** 104 | 105 | 106 | ### Feedforward 107 | 108 | At t = 1, the general initialization input S0 = 0, randomly initializes W, U, V, and calculates the following formula: 109 | 110 | 111 | 112 | Among them, f and g are activation functions. Among them, f can be tanh, relu, sigmoid and other activation functions, g is usually softmax or other. 113 | 114 | Time advances, and the state s1 at this time, as the memory state at time 1, will participate in the prediction activity at the next time, that is: 115 | 116 | 117 | 118 | By analogy, you can get the final output value: 119 | 120 | 121 | 122 | 123 | Note : 124 | 125 | 1. Here W, U, V are equal at each moment ( weight sharing ). 126 | 127 | 2. The hidden state can be understood as: S = f (existing input + past memory summary ) 128 | 129 | 130 | ## Back propagation through TIME of RNN 131 | 132 | Earlier we introduced the forward propagation method of RNN, so how are the weight parameters W, U, and V of the RNN updated? 133 | 134 | Each output will have a value Ot error value , Et the total error may be expressed as 135 | 136 | The loss function can use either the cross-entropy loss function or the squared error loss function . 137 | 138 | Because the output of each step does not only depend on the network of the current step, but also the state of the previous steps, then this modified BP algorithm is called Backpropagation Through Time ( BPTT ), which is the reverse transfer of the error value at the output end. The gradient descent method is updated. 139 | 140 | That is, the gradient of the parameter is required: 141 | 142 | 143 | First we solve the update method of W. From the previous update of W, it can be seen that it is the sum of the partial derivatives of the deviations at each moment.  144 | 145 | Here we take time t = 3 as an example.According to the chain derivation rule, we can get the partial derivative at time t = 3 as: 146 | 147 | 148 | 149 | 150 | At this time, according to the formula, ![alt](img/331.png) we will find that in addition to W, S3 is also related to S2 at the previous moment. 151 | 152 | ![alt](https://miro.medium.com/max/1400/0*ENwCVS8XI8cjCy55.jpg) 153 | 154 | 155 | ### Going little deeper 156 | 157 | Let’s focus on one error term et. 158 | 159 | You’ve calculated the cost function et, and now you want to propagate your cost function back through the network because you need to update the weights. 160 | 161 | Essentially, every single neuron that participated in the calculation of the output, associated with this cost function, should have its weight updated in order to minimize that error. And the thing with RNNs is that it’s not just the neurons directly below this output layer that contributed but all of the neurons far back in time. So, you have to propagate all the way back through time to these neurons. 162 | 163 | The problem relates to updating wrec (weight recurring) – the weight that is used to connect the hidden layers to themselves in the unrolled temporal loop. 164 | 165 | For instance, to get from xt-3 to xt-2 we multiply xt-3 by wrec. 
Then, to get from xt-2 to xt-1 we again multiply xt-2 by wrec. So, we multiply with the same exact weight multiple times, and this is where the problem arises: when you multiply something by a small number, your value decreases very quickly. 166 | 167 | As we know, weights are assigned at the start of the neural network with the random values, which are close to zero, and from there the network trains them up. But, when you start with wrec close to zero and multiply xt, xt-1, xt-2, xt-3, … by this value, your gradient becomes less and less with each multiplication. 168 | 169 | 170 | ### Advantages of Recurrent Neural Network 171 | 172 | - RNN can model sequence of data so that each sample can be assumed to be dependent on previous ones 173 | - Recurrent neural network are even used with convolutional layers to extend the effective pixel neighbourhood. 174 | 175 | ### Disadvantages of Recurrent Neural Network 176 | 177 | - Gradient vanishing and exploding problems. 178 | - Training an RNN is a very difficult task. 179 | - It cannot process very long sequences if using tanh or relu as an activation function. 180 | 181 | 182 | 183 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/TEXT CLASSIFICATION/README.md: -------------------------------------------------------------------------------- 1 | # Text Classification using RNN model 2 | 3 | [**NOTEBOOK**](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/MODEL/Recurrent%20Neural%20Network/TEXT%20CLASSIFICATION/RNN%20Text%20Classification.ipynb) 4 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/TEXT CLASSIFICATION/RNN Text Classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "cbfbe48e-ecc6-4311-8adc-0b9dcbff2005", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import numpy as np # linear algebra\n", 11 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 12 | "import os\n", 13 | "import tensorflow as tf\n", 14 | "from tensorflow import keras\n", 15 | "import matplotlib.pyplot as plt\n", 16 | "import time" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "id": "4cb1d9a3-1ae1-4b1b-ba4e-117772ab1af0", 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/html": [ 28 | "
\n", 29 | "\n", 42 | "\n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | "
idkeywordlocationtexttarget
01NaNNaNOur Deeds are the Reason of this #earthquake M...1
14NaNNaNForest fire near La Ronge Sask. Canada1
25NaNNaNAll residents asked to 'shelter in place' are ...1
36NaNNaN13,000 people receive #wildfires evacuation or...1
47NaNNaNJust got sent this photo from Ruby #Alaska as ...1
\n", 96 | "
" 97 | ], 98 | "text/plain": [ 99 | " id keyword location text \\\n", 100 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n", 101 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n", 102 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n", 103 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n", 104 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n", 105 | "\n", 106 | " target \n", 107 | "0 1 \n", 108 | "1 1 \n", 109 | "2 1 \n", 110 | "3 1 \n", 111 | "4 1 " 112 | ] 113 | }, 114 | "execution_count": 2, 115 | "metadata": {}, 116 | "output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "df = pd.read_csv(\"train.csv\")\n", 121 | "df.head()" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 3, 127 | "id": "0759ca8c-2661-4546-929d-029300146164", 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "3271\n", 135 | "4342\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "print((df.target == 1).sum()) # Disaster\n", 141 | "print((df.target == 0).sum()) # No Disaster" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "id": "6783fd40-ac38-402e-88e3-dda778faf9c5", 147 | "metadata": {}, 148 | "source": [ 149 | "# Preprocessing" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 4, 155 | "id": "2c8121be-a8c9-4784-9ad8-8c65c57272d4", 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" 162 | ] 163 | }, 164 | "execution_count": 4, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "# Preprocessing\n", 171 | "import re\n", 172 | "import string\n", 173 | "\n", 174 | "def remove_URL(text):\n", 175 | " url = re.compile(r\"https?://\\S+|www\\.\\S+\")\n", 176 | " return url.sub(r\"\", text)\n", 177 | "\n", 178 | "# https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate/34294022\n", 179 | "def remove_punct(text):\n", 180 | " translator = str.maketrans(\"\", \"\", string.punctuation)\n", 181 | " return text.translate(translator)\n", 182 | "\n", 183 | "string.punctuation" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 5, 189 | "id": "48eeae7a-f254-496f-81b4-624f436a1fdf", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "@bbcmtd Wholesale Markets ablaze http://t.co/lHYXEOHY6C\n", 197 | "t\n", 198 | "@bbcmtd Wholesale Markets ablaze \n" 199 | ] 200 | } 201 | ], 202 | "source": [ 203 | "pattern = re.compile(r\"https?://(\\S+|www)\\.\\S+\")\n", 204 | "for t in df.text:\n", 205 | " matches = pattern.findall(t)\n", 206 | " for match in matches:\n", 207 | " print(t)\n", 208 | " print(match)\n", 209 | " print(pattern.sub(r\"\", t))\n", 210 | " if len(matches) > 0:\n", 211 | " break" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 6, 217 | "id": "c285a352-26d9-4dbe-ab58-659d22a634ce", 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "df[\"text\"] = df.text.map(remove_URL) # map(lambda x: remove_URL(x))\n", 222 | "df[\"text\"] = df.text.map(remove_punct)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 7, 228 | "id": "751ded4d-d685-43b4-b485-4458cd56b41d", 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": 
"stderr", 233 | "output_type": "stream", 234 | "text": [ 235 | "[nltk_data] Downloading package stopwords to\n", 236 | "[nltk_data] C:\\Users\\Vaasu\\AppData\\Roaming\\nltk_data...\n", 237 | "[nltk_data] Package stopwords is already up-to-date!\n" 238 | ] 239 | }, 240 | { 241 | "data": { 242 | "text/plain": [ 243 | "179" 244 | ] 245 | }, 246 | "execution_count": 7, 247 | "metadata": {}, 248 | "output_type": "execute_result" 249 | } 250 | ], 251 | "source": [ 252 | "# remove stopwords\n", 253 | "#!pip install -q nltk\n", 254 | "import nltk\n", 255 | "nltk.download('stopwords')\n", 256 | "from nltk.corpus import stopwords\n", 257 | "\n", 258 | "# Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine\n", 259 | "# has been programmed to ignore, both when indexing entries for searching and when retrieving them \n", 260 | "# as the result of a search query.\n", 261 | "stop = set(stopwords.words(\"english\"))\n", 262 | "\n", 263 | "# https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python\n", 264 | "def remove_stopwords(text):\n", 265 | " filtered_words = [word.lower() for word in text.split() if word.lower() not in stop]\n", 266 | " return \" \".join(filtered_words)\n", 267 | "len(stop)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 8, 273 | "id": "7164ef8e-439c-4b14-b209-966cd2c37eb1", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "df[\"text\"] = df.text.map(remove_stopwords)" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "id": "949f1b00-1dc9-43df-99f7-14bdb0feb22c", 283 | "metadata": {}, 284 | "source": [ 285 | "### Forming Vocabulary" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 9, 291 | "id": "04281c70-c59c-4f78-8340-6a2fc52cf052", 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "name": "stdout", 296 | "output_type": "stream", 297 | "text": [ 298 | "17971\n" 299 | ] 300 | } 301 | ], 302 | "source": [ 303 | "# Counting number of unique words\n", 304 | "from collections import Counter\n", 305 | "# Count unique words\n", 306 | "def counter_word(text_col):\n", 307 | " count = Counter()\n", 308 | " for text in text_col.values:\n", 309 | " for word in text.split():\n", 310 | " count[word] += 1\n", 311 | " return count\n", 312 | "\n", 313 | "counter = counter_word(df.text)\n", 314 | "\n", 315 | "num_unique_words = len(counter)\n", 316 | "print(num_unique_words)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "fcc6043c-0dec-4c96-9ce1-a1540a4f101a", 322 | "metadata": {}, 323 | "source": [ 324 | "### Train and Validation Split" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 10, 330 | "id": "1715a07b-4ea0-4ac8-b290-373e5980b371", 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [ 334 | "# Split dataset into training and validation set\n", 335 | "train_size = int(df.shape[0] * 0.9)\n", 336 | "\n", 337 | "train_df = df[:train_size]\n", 338 | "val_df = df[train_size:]\n", 339 | "\n", 340 | "# split text and labels\n", 341 | "train_sentences = train_df.text.to_numpy()\n", 342 | "train_labels = train_df.target.to_numpy()\n", 343 | "val_sentences = val_df.text.to_numpy()\n", 344 | "val_labels = val_df.target.to_numpy()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 11, 350 | "id": "38623eb0-8224-493a-bcf1-5d18da9a6965", 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "data": { 355 | "text/plain": [ 356 | "((6851,), (762,))" 
357 | ] 358 | }, 359 | "execution_count": 11, 360 | "metadata": {}, 361 | "output_type": "execute_result" 362 | } 363 | ], 364 | "source": [ 365 | "train_sentences.shape, val_sentences.shape" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "id": "6f8963ac-c64c-4e9c-83ce-ff761cc75ea4", 371 | "metadata": {}, 372 | "source": [ 373 | "### Tokenization" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 12, 379 | "id": "bc006b82-9ffc-4279-9b43-15bf69aac354", 380 | "metadata": {}, 381 | "outputs": [], 382 | "source": [ 383 | "# Tokenize\n", 384 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 385 | "\n", 386 | "# vectorize a text corpus by turning each text into a sequence of integers\n", 387 | "tokenizer = Tokenizer(num_words=num_unique_words)\n", 388 | "tokenizer.fit_on_texts(train_sentences) # fit only to training\n", 389 | "\n", 390 | "# each word has unique index\n", 391 | "word_index = tokenizer.word_index # dict- each word as key and value is unique indices " 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 13, 397 | "id": "6355e04e-e14f-4ab4-bfe4-12982d790f39", 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "#word_index = {k:(v+3) for k,v in word_index.items()}\n", 402 | "word_index[\"\"] = 0\n", 403 | "#word_index[\"\"] = 1\n", 404 | "#word_index[\"\"] = 2\n", 405 | "#word_index[\"\"] = 3" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "id": "15d89bca-aa7f-49b1-8388-4e05ec8b6345", 411 | "metadata": {}, 412 | "source": [ 413 | "#### Forming Sequence" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 14, 419 | "id": "947e7e00-163d-4982-aadf-e5caf2c547c2", 420 | "metadata": {}, 421 | "outputs": [], 422 | "source": [ 423 | "train_sequences = tokenizer.texts_to_sequences(train_sentences)\n", 424 | "val_sequences = tokenizer.texts_to_sequences(val_sentences)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": 15, 430 | "id": "dd4e2d1e-a125-4f85-9f34-4772827b0a62", 431 | "metadata": {}, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "['three people died heat wave far'\n", 438 | " 'haha south tampa getting flooded hah wait second live south tampa gonna gonna fvck flooding'\n", 439 | " 'raining flooding florida tampabay tampa 18 19 days ive lost count'\n", 440 | " 'flood bago myanmar arrived bago'\n", 441 | " 'damage school bus 80 multi car crash breaking']\n", 442 | "[[463, 8, 437, 168, 358, 486], [750, 511, 2481, 131, 2482, 3090, 554, 529, 112, 511, 2481, 204, 204, 6151, 137], [2483, 137, 2076, 6152, 2481, 1315, 1605, 530, 179, 629, 3091], [114, 4064, 707, 1606, 4064], [125, 94, 334, 4065, 4066, 53, 18, 335]]\n" 443 | ] 444 | } 445 | ], 446 | "source": [ 447 | "print(train_sentences[10:15])\n", 448 | "print(train_sequences[10:15])" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "id": "3d5fcc2c-78bc-469d-9fc3-cb25e0b94a1e", 454 | "metadata": {}, 455 | "source": [ 456 | "### Padding" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 16, 462 | "id": "bbb97681-19b7-4465-be57-f8c201be0703", 463 | "metadata": {}, 464 | "outputs": [], 465 | "source": [ 466 | "max_length = 25" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 17, 472 | "id": "e074d743-14cc-46d4-89be-72a3e9ae0570", 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "train_padded = 
tf.keras.preprocessing.sequence.pad_sequences(sequences = train_sequences,value=word_index[\"\"],padding=\"post\",maxlen=max_length,truncating='post')\n", 477 | "val_padded = tf.keras.preprocessing.sequence.pad_sequences(sequences = val_sequences,value=word_index[\"\"],padding=\"post\",maxlen=max_length,truncating='post')" 478 | ] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "id": "26c0f4a9-b30c-48bd-b6d4-25b6d4a8d5c3", 483 | "metadata": {}, 484 | "source": [ 485 | "# Model Building" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 18, 491 | "id": "e82e22b0-75cb-481a-8afa-e7d6a41919f1", 492 | "metadata": {}, 493 | "outputs": [ 494 | { 495 | "name": "stdout", 496 | "output_type": "stream", 497 | "text": [ 498 | "Model: \"sequential\"\n", 499 | "_________________________________________________________________\n", 500 | "Layer (type) Output Shape Param # \n", 501 | "=================================================================\n", 502 | "embedding (Embedding) (None, 25, 16) 287536 \n", 503 | "_________________________________________________________________\n", 504 | "simple_rnn (SimpleRNN) (None, 32) 1568 \n", 505 | "_________________________________________________________________\n", 506 | "dense (Dense) (None, 1) 33 \n", 507 | "=================================================================\n", 508 | "Total params: 289,137\n", 509 | "Trainable params: 289,137\n", 510 | "Non-trainable params: 0\n", 511 | "_________________________________________________________________\n" 512 | ] 513 | } 514 | ], 515 | "source": [ 516 | "# Create RNN model\n", 517 | "from tensorflow.keras import layers\n", 518 | "\n", 519 | "# Embedding: https://www.tensorflow.org/tutorials/text/word_embeddings\n", 520 | "# Turns positive integers (indexes) into dense vectors of fixed size. (other approach could be one-hot-encoding)\n", 521 | "\n", 522 | "# Word embeddings give us a way to use an efficient, dense representation in which similar words have \n", 523 | "# a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a \n", 524 | "# dense vector of floating point values (the length of the vector is a parameter you specify).\n", 525 | "\n", 526 | "model = keras.models.Sequential()\n", 527 | "model.add(layers.Embedding(num_unique_words, 16, input_length=max_length))\n", 528 | "\n", 529 | "# The layer will take as input an integer matrix of size (batch, input_length),\n", 530 | "# and the largest integer (i.e. 
word index) in the input should be no larger than num_words (vocabulary size).\n", 531 | "# Now model.output_shape is (None, input_length, 16), where `None` is the batch dimension.\n", 532 | "\n", 533 | "\n", 534 | "model.add(layers.SimpleRNN(32, dropout=0.9))\n", 535 | "model.add(layers.Dense(1, activation=\"sigmoid\"))\n", 536 | "\n", 537 | "model.summary()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 19, 543 | "id": "6d7ee424-e051-43f2-a24c-39ef63deb871", 544 | "metadata": {}, 545 | "outputs": [ 546 | { 547 | "name": "stderr", 548 | "output_type": "stream", 549 | "text": [ 550 | "C:\\Users\\Vaasu\\AppData\\Roaming\\Python\\Python39\\site-packages\\keras\\optimizer_v2\\optimizer_v2.py:355: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.\n", 551 | " warnings.warn(\n" 552 | ] 553 | } 554 | ], 555 | "source": [ 556 | "loss = keras.losses.BinaryCrossentropy(from_logits=False)\n", 557 | "optim = keras.optimizers.Adam(lr=0.001)\n", 558 | "metrics = [\"accuracy\"]\n", 559 | "\n", 560 | "model.compile(loss=loss, optimizer=optim, metrics=metrics)" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 20, 566 | "id": "8a952124-32bb-4099-89df-dfabd885afde", 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "name": "stdout", 571 | "output_type": "stream", 572 | "text": [ 573 | "Epoch 1/5\n", 574 | "215/215 - 3s - loss: 0.6859 - accuracy: 0.5604 - val_loss: 0.6986 - val_accuracy: 0.5341\n", 575 | "Epoch 2/5\n", 576 | "215/215 - 1s - loss: 0.6659 - accuracy: 0.5980 - val_loss: 0.6384 - val_accuracy: 0.6457\n", 577 | "Epoch 3/5\n", 578 | "215/215 - 1s - loss: 0.6207 - accuracy: 0.6622 - val_loss: 0.5976 - val_accuracy: 0.7034\n", 579 | "Epoch 4/5\n", 580 | "215/215 - 2s - loss: 0.5522 - accuracy: 0.7244 - val_loss: 0.5313 - val_accuracy: 0.7717\n", 581 | "Epoch 5/5\n", 582 | "215/215 - 2s - loss: 0.5179 - accuracy: 0.7551 - val_loss: 0.4939 - val_accuracy: 0.7782\n" 583 | ] 584 | }, 585 | { 586 | "data": { 587 | "text/plain": [ 588 | "" 589 | ] 590 | }, 591 | "execution_count": 20, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "model.fit(train_padded, train_labels, epochs=5, validation_data=(val_padded, val_labels), verbose=2)" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "id": "d55d00df-4490-465b-bcc4-358cd3a4d03b", 603 | "metadata": {}, 604 | "source": [ 605 | "# Prediction" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 21, 611 | "id": "e46a4b34-8d3f-459e-a896-5a736ceb507a", 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "predictions = model.predict(train_padded)\n", 616 | "predictions = [1 if p > 0.5 else 0 for p in predictions]" 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "id": "e96896a4-5de3-426a-882e-3a7fc6daf5f8", 622 | "metadata": {}, 623 | "source": [ 624 | "### Decoding sequences" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 22, 630 | "id": "23f60e41-d96b-4f73-8739-0087503cfb4e", 631 | "metadata": {}, 632 | "outputs": [], 633 | "source": [ 634 | "# Check reversing the indices\n", 635 | "\n", 636 | "# flip (key, value)\n", 637 | "reverse_word_index = dict([(idx, word) for (word, idx) in word_index.items()])" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 23, 643 | "id": "b1fab433-bc72-4db9-804e-ca03f7089dce", 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [ 647 | "def decode(sequence):\n", 648 | 
" return \" \".join([reverse_word_index.get(idx, \"?\") for idx in sequence])" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 24, 654 | "id": "39b48cbe-95fc-4e46-968b-5463f0f00548", 655 | "metadata": {}, 656 | "outputs": [ 657 | { 658 | "name": "stdout", 659 | "output_type": "stream", 660 | "text": [ 661 | "[463, 8, 437, 168, 358, 486]\n", 662 | "three people died heat wave far\n" 663 | ] 664 | } 665 | ], 666 | "source": [ 667 | "decoded_text = decode(train_sequences[10])\n", 668 | "\n", 669 | "print(train_sequences[10])\n", 670 | "print(decoded_text)" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 25, 676 | "id": "c86a5665-c2b1-49f9-a6ed-c32c0531e909", 677 | "metadata": {}, 678 | "outputs": [ 679 | { 680 | "name": "stdout", 681 | "output_type": "stream", 682 | "text": [ 683 | "['three people died heat wave far'\n", 684 | " 'haha south tampa getting flooded hah wait second live south tampa gonna gonna fvck flooding'\n", 685 | " 'raining flooding florida tampabay tampa 18 19 days ive lost count'\n", 686 | " 'flood bago myanmar arrived bago'\n", 687 | " 'damage school bus 80 multi car crash breaking' 'whats man' 'love fruits'\n", 688 | " 'summer lovely' 'car fast' 'goooooooaaaaaal']\n", 689 | "[1 1 1 1 1 0 0 0 0 0]\n", 690 | "[1, 0, 0, 1, 1, 0, 0, 0, 0, 0]\n" 691 | ] 692 | } 693 | ], 694 | "source": [ 695 | "print(train_sentences[10:20])\n", 696 | "\n", 697 | "print(train_labels[10:20])\n", 698 | "print(predictions[10:20])" 699 | ] 700 | } 701 | ], 702 | "metadata": { 703 | "kernelspec": { 704 | "display_name": "Python 3 (ipykernel)", 705 | "language": "python", 706 | "name": "python3" 707 | }, 708 | "language_info": { 709 | "codemirror_mode": { 710 | "name": "ipython", 711 | "version": 3 712 | }, 713 | "file_extension": ".py", 714 | "mimetype": "text/x-python", 715 | "name": "python", 716 | "nbconvert_exporter": "python", 717 | "pygments_lexer": "ipython3", 718 | "version": "3.9.12" 719 | } 720 | }, 721 | "nbformat": 4, 722 | "nbformat_minor": 5 723 | } 724 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/corpus.py: -------------------------------------------------------------------------------- 1 | from string import punctuation 2 | all_text = ''.join([c for c in reviews if c not in punctuation]) 3 | reviews = all_text.split('\n') 4 | print(reviews[2:3]) 5 | all_text = ' '.join(reviews) 6 | words = all_text.split() 7 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/data/README.md: -------------------------------------------------------------------------------- 1 | # data 2 | -------------------------------------------------------------------------------- /MODEL/Recurrent Neural Network/padding.py: -------------------------------------------------------------------------------- 1 | seq_len = 200 2 | features = [] 3 | for review in review_ints: 4 | review_len = len(review) 5 | len_diff = seq_len - review_len 6 | if len_diff <= 0: 7 | features.append(review[:seq_len]) 8 | print(review[:seq_len]) 9 | else: 10 | padding = [0] * len_diff 11 | padded_feature = padding + review 12 | features.append(padded_feature) 13 | print() 14 | features = np.asarray(features) 15 | -------------------------------------------------------------------------------- /Notebooks/Word2vec_Google_News_300.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "markdown", 5 | "source": [ 6 | "# Word2vec Google News 300" 7 | ], 8 | "metadata": { 9 | "id": "BQvnCXzHwAH0" 10 | } 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "source": [ 15 | "[Notebook](https://colab.research.google.com/drive/1MXwLGerTJB2WfxQT_kySIlqLWEfO-UtO?usp=sharing)" 16 | ], 17 | "metadata": { 18 | "id": "_4Ambz2G3pi9" 19 | } 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "metadata": { 25 | "id": "4EB1eqf3cQWb" 26 | }, 27 | "outputs": [], 28 | "source": [ 29 | "!pip install -q gensim" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "source": [ 35 | "!gzip -d GoogleNews-vectors-negative300.bin.gz" 36 | ], 37 | "metadata": { 38 | "colab": { 39 | "base_uri": "https://localhost:8080/" 40 | }, 41 | "id": "nch3e7_zxC1A", 42 | "outputId": "244cb60c-2aff-438f-9493-1a3380602b03" 43 | }, 44 | "execution_count": null, 45 | "outputs": [ 46 | { 47 | "output_type": "stream", 48 | "name": "stdout", 49 | "text": [ 50 | "gzip: GoogleNews-vectors-negative300.bin.gz: No such file or directory\n" 51 | ] 52 | } 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": { 59 | "id": "5sUEjvcPcSjR", 60 | "colab": { 61 | "base_uri": "https://localhost:8080/" 62 | }, 63 | "outputId": "4d4c0176-05d9-43e3-e271-ce0876384a9b" 64 | }, 65 | "outputs": [ 66 | { 67 | "output_type": "stream", 68 | "name": "stdout", 69 | "text": [ 70 | "[==================================================] 100.0% 1662.8/1662.8MB downloaded\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "import gensim\n", 76 | "from gensim.models import Word2Vec, KeyedVectors\n", 77 | "from gensim import models\n", 78 | "import gensim.downloader as api\n", 79 | "wv = api.load('word2vec-google-news-300')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "source": [ 85 | "vec_king = wv['king']\n", 86 | "vec_king.shape" 87 | ], 88 | "metadata": { 89 | "id": "yd0uCLv_xVWD" 90 | }, 91 | "execution_count": null, 92 | "outputs": [] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "source": [ 97 | "wv.most_similar('cricket')" 98 | ], 99 | "metadata": { 100 | "colab": { 101 | "base_uri": "https://localhost:8080/" 102 | }, 103 | "id": "JiljDl5Oxwz3", 104 | "outputId": "2dec4358-3df0-41ff-98c5-cc25351b21ff" 105 | }, 106 | "execution_count": null, 107 | "outputs": [ 108 | { 109 | "output_type": "execute_result", 110 | "data": { 111 | "text/plain": [ 112 | "[('cricketing', 0.8372225165367126),\n", 113 | " ('cricketers', 0.8165745735168457),\n", 114 | " ('Test_cricket', 0.8094818592071533),\n", 115 | " ('Twenty##_cricket', 0.8068488240242004),\n", 116 | " ('Twenty##', 0.7624266147613525),\n", 117 | " ('Cricket', 0.7541396617889404),\n", 118 | " ('cricketer', 0.7372579574584961),\n", 119 | " ('twenty##', 0.7316356897354126),\n", 120 | " ('T##_cricket', 0.7304614782333374),\n", 121 | " ('West_Indies_cricket', 0.698798656463623)]" 122 | ] 123 | }, 124 | "metadata": {}, 125 | "execution_count": 10 126 | } 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "source": [ 132 | "wv.most_similar('happy')" 133 | ], 134 | "metadata": { 135 | "colab": { 136 | "base_uri": "https://localhost:8080/" 137 | }, 138 | "id": "Xr8D0wtXxw30", 139 | "outputId": "1409b11c-f015-495a-fb04-126e311ecea2" 140 | }, 141 | "execution_count": null, 142 | "outputs": [ 143 | { 144 | "output_type": "execute_result", 145 | "data": { 146 | "text/plain": [ 147 | "[('glad', 0.7408890128135681),\n", 148 | " ('pleased', 0.6632171273231506),\n", 149 | " ('ecstatic', 0.6626912355422974),\n", 150 | " 
('overjoyed', 0.6599286794662476),\n", 151 | " ('thrilled', 0.6514049768447876),\n", 152 | " ('satisfied', 0.6437950134277344),\n", 153 | " ('proud', 0.636042058467865),\n", 154 | " ('delighted', 0.627237856388092),\n", 155 | " ('disappointed', 0.6269949674606323),\n", 156 | " ('excited', 0.6247666478157043)]" 157 | ] 158 | }, 159 | "metadata": {}, 160 | "execution_count": 11 161 | } 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "source": [ 167 | "wv.similarity(\"hockey\",\"sports\")" 168 | ], 169 | "metadata": { 170 | "colab": { 171 | "base_uri": "https://localhost:8080/" 172 | }, 173 | "id": "_BlMuRP2xw6B", 174 | "outputId": "65be7907-7d13-4331-e79e-1d76729a3fbd" 175 | }, 176 | "execution_count": null, 177 | "outputs": [ 178 | { 179 | "output_type": "execute_result", 180 | "data": { 181 | "text/plain": [ 182 | "0.53541523" 183 | ] 184 | }, 185 | "metadata": {}, 186 | "execution_count": 12 187 | } 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "vec=wv['king']-wv['man']+wv['woman']\n", 194 | "vec.shape" 195 | ], 196 | "metadata": { 197 | "id": "87aVT96Nxw8I" 198 | }, 199 | "execution_count": null, 200 | "outputs": [] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "source": [ 205 | "wv.most_similar([vec]) # Vector passed as list" 206 | ], 207 | "metadata": { 208 | "colab": { 209 | "base_uri": "https://localhost:8080/" 210 | }, 211 | "id": "dTuKmY1lxw9v", 212 | "outputId": "a88ad297-e42d-4a40-c526-2e03d716d64e" 213 | }, 214 | "execution_count": null, 215 | "outputs": [ 216 | { 217 | "output_type": "execute_result", 218 | "data": { 219 | "text/plain": [ 220 | "[('king', 0.8449392318725586),\n", 221 | " ('queen', 0.7300517559051514),\n", 222 | " ('monarch', 0.6454660892486572),\n", 223 | " ('princess', 0.6156251430511475),\n", 224 | " ('crown_prince', 0.5818676948547363),\n", 225 | " ('prince', 0.5777117609977722),\n", 226 | " ('kings', 0.5613663792610168),\n", 227 | " ('sultan', 0.5376776456832886),\n", 228 | " ('Queen_Consort', 0.5344247817993164),\n", 229 | " ('queens', 0.5289887189865112)]" 230 | ] 231 | }, 232 | "metadata": {}, 233 | "execution_count": 14 234 | } 235 | ] 236 | } 237 | ], 238 | "metadata": { 239 | "accelerator": "GPU", 240 | "colab": { 241 | "name": "Word2vec Google News 300.ipynb", 242 | "provenance": [], 243 | "collapsed_sections": [] 244 | }, 245 | "gpuClass": "standard", 246 | "kernelspec": { 247 | "display_name": "Python 3", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "name": "python" 252 | } 253 | }, 254 | "nbformat": 4, 255 | "nbformat_minor": 0 256 | } -------------------------------------------------------------------------------- /Notebooks/ty.py: -------------------------------------------------------------------------------- 1 | d 2 | -------------------------------------------------------------------------------- /Pre Processing/Basic Cleaning/DealingWIthEmoji.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "DealingWIthEmoji.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "source": [ 21 | "## REMOVING EMOJI" 22 | ], 23 | "metadata": { 24 | "id": "R8VJSr-JikjJ" 25 | } 26 | }, 27 | { 28 | "cell_type": "code", 29 | "source": [ 30 | "import re\n", 31 | "def remove_emoji(text):\n", 
32 | " emoji_pattern = re.compile(\"[\"\n", 33 | " u\"\\U0001F600-\\U0001F64F\" # emoticons\n", 34 | " u\"\\U0001F300-\\U0001F5FF\" # symbols & pictographs\n", 35 | " u\"\\U0001F680-\\U0001F6FF\" # transport & map symbols\n", 36 | " u\"\\U0001F1E0-\\U0001F1FF\" # flags (iOS)\n", 37 | " u\"\\U00002702-\\U000027B0\"\n", 38 | " u\"\\U000024C2-\\U0001F251\"\n", 39 | " \"]+\", flags=re.UNICODE)\n", 40 | " return emoji_pattern.sub(r'', text)" 41 | ], 42 | "metadata": { 43 | "id": "NyuxZXdHhozg" 44 | }, 45 | "execution_count": 12, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "source": [ 51 | "remove_emoji(\"Loved the movie. It was 😘😘\")" 52 | ], 53 | "metadata": { 54 | "colab": { 55 | "base_uri": "https://localhost:8080/", 56 | "height": 35 57 | }, 58 | "id": "h0we-16qhlcc", 59 | "outputId": "e399c577-c51d-4b10-cab3-0bb5f8c41bd0" 60 | }, 61 | "execution_count": 13, 62 | "outputs": [ 63 | { 64 | "output_type": "execute_result", 65 | "data": { 66 | "text/plain": [ 67 | "'Loved the movie. It was '" 68 | ], 69 | "application/vnd.google.colaboratory.intrinsic+json": { 70 | "type": "string" 71 | } 72 | }, 73 | "metadata": {}, 74 | "execution_count": 13 75 | } 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "source": [ 81 | "remove_emoji(\"Lmao 😂😂\")" 82 | ], 83 | "metadata": { 84 | "colab": { 85 | "base_uri": "https://localhost:8080/", 86 | "height": 35 87 | }, 88 | "id": "Fgx9G74Bhm1t", 89 | "outputId": "39d923fd-d452-40dd-f78a-d53d3e909af7" 90 | }, 91 | "execution_count": 14, 92 | "outputs": [ 93 | { 94 | "output_type": "execute_result", 95 | "data": { 96 | "text/plain": [ 97 | "'Lmao '" 98 | ], 99 | "application/vnd.google.colaboratory.intrinsic+json": { 100 | "type": "string" 101 | } 102 | }, 103 | "metadata": {}, 104 | "execution_count": 14 105 | } 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "source": [ 111 | "## REPLACE EMOJI" 112 | ], 113 | "metadata": { 114 | "id": "J6oXS2u8inF7" 115 | } 116 | }, 117 | { 118 | "cell_type": "code", 119 | "source": [ 120 | "!pip install emoji" 121 | ], 122 | "metadata": { 123 | "colab": { 124 | "base_uri": "https://localhost:8080/" 125 | }, 126 | "id": "rtW2V3WkisIl", 127 | "outputId": "2e9adfd9-105f-4957-d3e3-d5598c0be857" 128 | }, 129 | "execution_count": 17, 130 | "outputs": [ 131 | { 132 | "output_type": "stream", 133 | "name": "stdout", 134 | "text": [ 135 | "Collecting emoji\n", 136 | " Downloading emoji-1.7.0.tar.gz (175 kB)\n", 137 | "\u001b[?25l\r\u001b[K |█▉ | 10 kB 26.1 MB/s eta 0:00:01\r\u001b[K |███▊ | 20 kB 33.8 MB/s eta 0:00:01\r\u001b[K |█████▋ | 30 kB 23.8 MB/s eta 0:00:01\r\u001b[K |███████▌ | 40 kB 12.7 MB/s eta 0:00:01\r\u001b[K |█████████▍ | 51 kB 11.8 MB/s eta 0:00:01\r\u001b[K |███████████▏ | 61 kB 13.9 MB/s eta 0:00:01\r\u001b[K |█████████████ | 71 kB 13.9 MB/s eta 0:00:01\r\u001b[K |███████████████ | 81 kB 13.1 MB/s eta 0:00:01\r\u001b[K |████████████████▉ | 92 kB 14.6 MB/s eta 0:00:01\r\u001b[K |██████████████████▊ | 102 kB 12.8 MB/s eta 0:00:01\r\u001b[K |████████████████████▌ | 112 kB 12.8 MB/s eta 0:00:01\r\u001b[K |██████████████████████▍ | 122 kB 12.8 MB/s eta 0:00:01\r\u001b[K |████████████████████████▎ | 133 kB 12.8 MB/s eta 0:00:01\r\u001b[K |██████████████████████████▏ | 143 kB 12.8 MB/s eta 0:00:01\r\u001b[K |████████████████████████████ | 153 kB 12.8 MB/s eta 0:00:01\r\u001b[K |█████████████████████████████▉ | 163 kB 12.8 MB/s eta 0:00:01\r\u001b[K |███████████████████████████████▊| 174 kB 12.8 MB/s eta 0:00:01\r\u001b[K |████████████████████████████████| 175 kB 12.8 MB/s \n", 138 | 
"\u001b[?25hBuilding wheels for collected packages: emoji\n", 139 | " Building wheel for emoji (setup.py) ... \u001b[?25l\u001b[?25hdone\n", 140 | " Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171046 sha256=dbf2edb4aa0ac31ab04e0edcb13bc1131f5c7addeed0f837e2fcee7c6487edc6\n", 141 | " Stored in directory: /root/.cache/pip/wheels/8a/4e/b6/57b01db010d17ef6ea9b40300af725ef3e210cb1acfb7ac8b6\n", 142 | "Successfully built emoji\n", 143 | "Installing collected packages: emoji\n", 144 | "Successfully installed emoji-1.7.0\n" 145 | ] 146 | } 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "import emoji\n", 153 | "print(emoji.demojize('Python is 🔥'))" 154 | ], 155 | "metadata": { 156 | "colab": { 157 | "base_uri": "https://localhost:8080/" 158 | }, 159 | "id": "ksFRkx2qipJD", 160 | "outputId": "0a28f057-c141-4edf-bd23-f7f358f1d149" 161 | }, 162 | "execution_count": 18, 163 | "outputs": [ 164 | { 165 | "output_type": "stream", 166 | "name": "stdout", 167 | "text": [ 168 | "Python is :fire:\n" 169 | ] 170 | } 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "source": [ 176 | "print(emoji.demojize('Loved the movie. It was 😘'))" 177 | ], 178 | "metadata": { 179 | "colab": { 180 | "base_uri": "https://localhost:8080/" 181 | }, 182 | "id": "6fH9ifpFiqqw", 183 | "outputId": "70a22baf-688d-418f-b900-a0bc157048d6" 184 | }, 185 | "execution_count": 19, 186 | "outputs": [ 187 | { 188 | "output_type": "stream", 189 | "name": "stdout", 190 | "text": [ 191 | "Loved the movie. It was :face_blowing_a_kiss:\n" 192 | ] 193 | } 194 | ] 195 | } 196 | ] 197 | } -------------------------------------------------------------------------------- /Pre Processing/Basic Cleaning/README.md: -------------------------------------------------------------------------------- 1 | # Cleaning of texy 2 | 3 | ### Lowercasing 4 | ```ruby 5 | df['review'] = df['review'].str.lower() 6 | ``` 7 | 8 | ### Removing HTML tags 9 | ```ruby 10 | import re 11 | def remove_html_tags(text): 12 | pattern = re.compile('<.*?>') 13 | return pattern.sub(r'', text) 14 | 15 | # remove_html_tags(text) 16 | df['review'] = df['review'].apply(remove_html_tags) 17 | ``` 18 | 19 | ### Removing URL 20 | ```ruby 21 | import re 22 | def remove_url(text): 23 | pattern = re.compile(r'https?://\S+|www\.\S+') 24 | return pattern.sub(r'', text) 25 | ``` 26 | 27 | ### Correct Spelling 28 | ```ruby 29 | from textblob import TextBlob 30 | incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.' 
31 | textBlb = TextBlob(incorrect_text) 32 | textBlb.correct().string 33 | ``` 34 | 35 | ### Remove Stopwords 36 | ```ruby 37 | import nltk 38 | nltk.download('stopwords') 39 | from nltk.corpus import stopwords 40 | stopwords.words('english') 41 | 42 | def remove_stopwords(text): 43 | new_text = [] 44 | 45 | for word in text.split(): 46 | if word in stopwords.words('english'): 47 | new_text.append('') 48 | else: 49 | new_text.append(word) 50 | x = new_text[:] 51 | new_text.clear() 52 | return " ".join(x) 53 | 54 | 55 | df['review'].apply(remove_stopwords) 56 | ``` 57 | 58 | ## EMOJI 59 | - Remove Emoji 60 | ```ruby 61 | import re 62 | def remove_emoji(text): 63 | emoji_pattern = re.compile("[" 64 | u"\U0001F600-\U0001F64F" # emoticons 65 | u"\U0001F300-\U0001F5FF" # symbols & pictographs 66 | u"\U0001F680-\U0001F6FF" # transport & map symbols 67 | u"\U0001F1E0-\U0001F1FF" # flags (iOS) 68 | u"\U00002702-\U000027B0" 69 | u"\U000024C2-\U0001F251" 70 | "]+", flags=re.UNICODE) 71 | return emoji_pattern.sub(r'', text) 72 | remove_emoji("Loved the movie. It was 😘😘") 73 | remove_emoji("Lmao 😂😂") 74 | ``` 75 | - Replace Emoji 76 | ```ruby 77 | import emoji 78 | print(emoji.demojize('Python is 🔥')) 79 | 80 | print(emoji.demojize('Loved the movie. It was 😘')) 81 | ``` 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | -------------------------------------------------------------------------------- /Pre Processing/Basic Cleaning/StopWords.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | nltk.download('stopwords') 3 | from nltk.corpus import stopwords 4 | stopwords.words('english') 5 | 6 | def remove_stopwords(text): 7 | new_text = [] 8 | 9 | for word in text.split(): 10 | if word in stopwords.words('english'): 11 | new_text.append('') 12 | else: 13 | new_text.append(word) 14 | x = new_text[:] 15 | new_text.clear() 16 | return " ".join(x) 17 | 18 | 19 | df['review'].apply(remove_stopwords) 20 | 21 | 22 | remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times') 23 | -------------------------------------------------------------------------------- /Pre Processing/Co-occurrence matrix/README.md: -------------------------------------------------------------------------------- 1 | # Co-occurrence matrix 2 | 3 | ### Generated embedding 4 | **Consider relationship between surrounding the word** 5 | 6 | 7 | **The co-occurrence matrix is expressed by considering the relationship between words in the corpus.** 8 | - A very important idea is that we think that the meaning of a word is closely related to the word next to it. This is where we can set a window (the size is generally 5 ~ 10). The size of the window below is 2, so in this window, the words that appear with rests are life, he, in, and peace. Then we use this co-occurrence relationship to generate word vectors. 9 | 10 | 11 | ![alt text](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/Pre%20Processing/Co-occurrence%20matrix/concurrence.jpg) 12 | 13 | 14 | #### I like deep learning. 15 | 16 | #### I like NLP. 17 | 18 | #### I enjoy flying. 19 | 20 | As an example, **we set the window size to 1**, which means that **we only look at the word immediately surrounding a word**. At this point, you will get a symmetric matrix-co-occurrence matrix. 
Because in our corpus, **the number of times I and like appear as neighbors in the window at the same time is 2**, the value where I and like intersect in the table below is 2. 21 | 22 | In this way, the idea of turning words into vectors is done. Each row (or each column) of the co-occurrence matrix is a vector representation of the corresponding word. 23 | 24 | ![alt text](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/Pre%20Processing/Co-occurrence%20matrix/concur.jpg) 25 | 26 | >Although the Cocurrence matrix solves the relative position between words to some extent, this problem should be paid attention to. But it still faces dimensional disaster. 27 | 28 | >In other words, the vector representation of a word is too long. At this time, it is natural to think of some common dimensionality reduction methods such as SVD or PCA. 29 | 30 | >The selection of the window size is the same as determining n in the n-gram. The size of the matrix will also increase when the window is enlarged, so it still has a large amount of calculation in nature, and the SVD algorithm has a large amount of calculation. If the text set is very More, it is not operable. 31 | -------------------------------------------------------------------------------- /Pre Processing/Co-occurrence matrix/concur.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/Pre Processing/Co-occurrence matrix/concur.jpg -------------------------------------------------------------------------------- /Pre Processing/Co-occurrence matrix/concurrence.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/Pre Processing/Co-occurrence matrix/concurrence.jpg -------------------------------------------------------------------------------- /Pre Processing/Lemmatization/Lemmatization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Lemmatization.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "code", 20 | "source": [ 21 | "import nltk\n", 22 | "nltk.download('punkt')" 23 | ], 24 | "metadata": { 25 | "colab": { 26 | "base_uri": "https://localhost:8080/" 27 | }, 28 | "id": "QE6p0awdwtnQ", 29 | "outputId": "ec4fa4d6-78ee-4849-9161-38ef47c2ccb1" 30 | }, 31 | "execution_count": 25, 32 | "outputs": [ 33 | { 34 | "output_type": "stream", 35 | "name": "stdout", 36 | "text": [ 37 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 38 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 39 | ] 40 | }, 41 | { 42 | "output_type": "execute_result", 43 | "data": { 44 | "text/plain": [ 45 | "True" 46 | ] 47 | }, 48 | "metadata": {}, 49 | "execution_count": 25 50 | } 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "source": [ 56 | "import nltk\n", 57 | "nltk.download('wordnet')" 58 | ], 59 | "metadata": { 60 | "colab": { 61 | "base_uri": "https://localhost:8080/" 62 | }, 63 | "outputId": "0fa45e26-5f1e-4ba9-b2ba-efeac659dea4", 64 | "id": "HI0ljLFjw0UK" 65 | }, 66 | "execution_count": 27, 67 | "outputs": [ 68 | { 69 | "output_type": 
"stream", 70 | "name": "stdout", 71 | "text": [ 72 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 73 | "[nltk_data] Unzipping corpora/wordnet.zip.\n" 74 | ] 75 | }, 76 | { 77 | "output_type": "execute_result", 78 | "data": { 79 | "text/plain": [ 80 | "True" 81 | ] 82 | }, 83 | "metadata": {}, 84 | "execution_count": 27 85 | } 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "source": [ 91 | "from nltk.stem import WordNetLemmatizer\n", 92 | "wordnet_lemmatizer = WordNetLemmatizer()\n", 93 | "\n", 94 | "sentence = \"He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun.\"\n", 95 | "punctuations=\"?:!.,;\"\n", 96 | "sentence_words = nltk.word_tokenize(sentence)\n", 97 | "for word in sentence_words:\n", 98 | " if word in punctuations:\n", 99 | " sentence_words.remove(word)\n", 100 | "\n", 101 | "sentence_words\n", 102 | "print(\"{0:20}{1:20}\".format(\"Word\",\"Lemma\"))\n", 103 | "for word in sentence_words:\n", 104 | " print (\"{0:20}{1:20}\".format(word,wordnet_lemmatizer.lemmatize(word,pos='v')))" 105 | ], 106 | "metadata": { 107 | "colab": { 108 | "base_uri": "https://localhost:8080/" 109 | }, 110 | "id": "mhLWNsTZwp_c", 111 | "outputId": "0cb67253-1b37-4831-add2-32b1aa6f3efb" 112 | }, 113 | "execution_count": 28, 114 | "outputs": [ 115 | { 116 | "output_type": "stream", 117 | "name": "stdout", 118 | "text": [ 119 | "Word Lemma \n", 120 | "He He \n", 121 | "was be \n", 122 | "running run \n", 123 | "and and \n", 124 | "eating eat \n", 125 | "at at \n", 126 | "same same \n", 127 | "time time \n", 128 | "He He \n", 129 | "has have \n", 130 | "bad bad \n", 131 | "habit habit \n", 132 | "of of \n", 133 | "swimming swim \n", 134 | "after after \n", 135 | "playing play \n", 136 | "long long \n", 137 | "hours hours \n", 138 | "in in \n", 139 | "the the \n", 140 | "Sun Sun \n" 141 | ] 142 | } 143 | ] 144 | } 145 | ] 146 | } -------------------------------------------------------------------------------- /Pre Processing/Lemmatization/README.md: -------------------------------------------------------------------------------- 1 | ### Lemmatization 2 | 3 | As stemming, lemmatization do the same but the only difference is that lemmatization ensures that root word belongs to the language. Because of the use of 4 | lemmatization we will get the valid words. In NLTK(Natural language Toolkit), we use WordLemmatizer to get the lemmas of words. We also need to provide a context 5 | for the lemmatization.So, we added pos(parts-of-speech) as a parameter. 
6 | ```ruby 7 | import nltk 8 | nltk.download("wordnet") 9 | from nltk.stem import WordNetLemmatizer 10 | from nltk.tokenize import word_tokenize 11 | lemma = WordNetLemmatizer() 12 | ``` 13 | -------------------------------------------------------------------------------- /Pre Processing/PRE PROCESSING STEP 01.py: -------------------------------------------------------------------------------- 1 | # IMPORT LIBRARIES 2 | from string import punctuation 3 | import nltk 4 | nltk.download("stopwords") 5 | from nltk.corpus import stopwords 6 | 7 | # BASIC PRE PROCESSING 8 | def basic_PreProcessing(sentence): 9 | removed_punctuation = all_text = ''.join([c for c in sentence if c not in punctuation]) # text is string 10 | lower = removed_punctuation.lower() 11 | words = lower.split() 12 | stop_words = set(stopwords.words("english")) 13 | remove_stopwords = [word for word in words if word not in stop_words] 14 | 15 | sentence = " ".join(remove_stopwords) 16 | return sentence 17 | 18 | # READ ROWS AND PRE PROCESSING 19 | def parse_data_from_file(filename): 20 | sentences = [] 21 | labels = [] 22 | with open(filename, 'r') as csvfile: 23 | reader = csv.reader(csvfile, delimiter=',') 24 | next(reader) 25 | for row in reader: 26 | # each row is a list with 3 elements(same as number of columns) 27 | labels.append(row[2]) 28 | sentence = row[1] # string 29 | sentence = basic_PreProcessing(sentence) 30 | sentences.append(sentence) 31 | 32 | return sentences, labels 33 | 34 | sentences, labels = parse_data_from_file("/kaggle/input/learn-ai-bbc/BBC News Train.csv") 35 | -------------------------------------------------------------------------------- /Pre Processing/README.md: -------------------------------------------------------------------------------- 1 | # NLP PRE PROCESSING 2 | 3 | **Read Text File** 4 | ```ruby 5 | with open('/content/drive/MyDrive/NLP/text.txt', 'r') as f: 6 | text = f.read() 7 | ``` 8 | ----------------------------------------------------------------------------------------------- 9 | **Read CVS data** 10 | ```ruby 11 | # IMPORT LIBRARIES 12 | from string import punctuation 13 | import csv 14 | import nltk 15 | nltk.download("stopwords") 16 | from nltk.corpus import stopwords 17 | 18 | # BASIC PRE PROCESSING 19 | def basic_PreProcessing(sentence): 20 | removed_punctuation = all_text = ''.join([c for c in sentence if c not in punctuation]) # text is string 21 | lower = removed_punctuation.lower() 22 | words = lower.split() 23 | stop_words = set(stopwords.words("english")) 24 | remove_stopwords = [word for word in words if word not in stop_words] 25 | 26 | sentence = " ".join(remove_stopwords) 27 | return sentence 28 | 29 | # READ ROWS AND PRE PROCESSING 30 | def parse_data_from_file(filename): 31 | sentences = [] 32 | labels = [] 33 | with open(filename, 'r') as csvfile: 34 | reader = csv.reader(csvfile, delimiter=',') 35 | next(reader) 36 | for row in reader: 37 | # each row is a list with 3 elements(same as number of columns) 38 | labels.append(row[2]) 39 | sentence = row[1] # string 40 | sentence = basic_PreProcessing(sentence) 41 | sentences.append(sentence) 42 | 43 | return sentences, labels 44 | 45 | sentences, labels = parse_data_from_file("/kaggle/input/learn-ai-bbc/BBC News Train.csv") 46 | ``` 47 | ------------------------------------------------------------------------------ 48 | -------------------------------------------------------------------------------- /Pre Processing/RemovePunctuation.py: 
-------------------------------------------------------------------------------- 1 | def rem_punct(text): 2 | translator = str.maketrans('', '', string.punctuation) 3 | return text.translate(translator) 4 | 5 | input_str = "Hey, How are you??? I am fine!!!!" 6 | rem_punct(input_str) # OUTPUT:- 'Hey How are you I am fine' 7 | -------------------------------------------------------------------------------- /Pre Processing/Stemming/PorterStemmer.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "PorterStemmer.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "code", 20 | "source": [ 21 | "from nltk.stem.porter import PorterStemmer\n", 22 | "\n", 23 | "ps = PorterStemmer()\n", 24 | "def stem_words(text):\n", 25 | " return \" \".join([ps.stem(word) for word in text.split()])\n", 26 | "\n", 27 | "sample = \"walk walks walking walked\"\n", 28 | "print(stem_words(sample)) \n", 29 | "\n", 30 | "print()\n", 31 | "\n", 32 | "text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'\n", 33 | "print(stem_words(text)) " 34 | ], 35 | "metadata": { 36 | "colab": { 37 | "base_uri": "https://localhost:8080/" 38 | }, 39 | "id": "EXlfAkZZwFAb", 40 | "outputId": "1d798cae-a672-479c-c5cc-eb1997b656c7" 41 | }, 42 | "execution_count": 23, 43 | "outputs": [ 44 | { 45 | "output_type": "stream", 46 | "name": "stdout", 47 | "text": [ 48 | "walk walk walk walk\n", 49 | "\n", 50 | "probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi\n" 51 | ] 52 | } 53 | ] 54 | } 55 | ] 56 | } -------------------------------------------------------------------------------- /Pre Processing/Stemming/README.md: -------------------------------------------------------------------------------- 1 | ### Stemming 2 | 3 | From Stemming we will process of getting the root form of a word. Root or Stem is the part to which inflextional affixes(like -ed, -ize, etc) are added. We would create the stem words by removing the prefix of suffix of a word. So, stemming a word may not result in actual words. 4 | 5 | For Example: Mangoes ---> Mango 6 | 7 | Boys ---> Boy 8 | 9 | going ---> go 10 | 11 | 12 | If our sentences are not in tokens, then we need to convert it into tokens. 
After we converted strings of text into tokens, then we can convert those word tokens into their root form. These are the Porter stemmer, the snowball stemmer, and the Lancaster Stemmer. We usually use Porter stemmer among them. 13 | -------------------------------------------------------------------------------- /Pre Processing/Stemming/Stemming.py: -------------------------------------------------------------------------------- 1 | #importing nltk's porter stemmer 2 | from nltk.stem.porter import PorterStemmer 3 | from nltk.tokenize import word_tokenize 4 | stem1 = PorterStemmer() 5 | 6 | # stem words in the list of tokenised words 7 | def s_words(text): 8 | word_tokens = word_tokenize(text) 9 | stems = [stem1.stem(word) for word in word_tokens] 10 | return stems 11 | 12 | text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.' 13 | s_words(text) 14 | 15 | 16 | 17 | 18 | ''' OUTPUT:- 19 | ['data', 20 | 'is', 21 | 'the', 22 | 'new', 23 | 'revolut', 24 | 'in', 25 | 'the', 26 | 'world', 27 | ',', 28 | 'in', 29 | 'a', 30 | 'day', 31 | 'one', 32 | 'individu', 33 | 'would', 34 | 'gener', 35 | 'terabyt', 36 | 'of', 37 | 'data', 38 | '.'] 39 | ''' 40 | -------------------------------------------------------------------------------- /Pre Processing/Stop Words/README.md: -------------------------------------------------------------------------------- 1 | # Stop Words 2 | A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 3 | We would not want these words to take up space in our database, or taking up valuable processing time. 4 | 5 | ```ruby 6 | nltk.download("stopwords") 7 | nltk.download("punkt") 8 | from nltk.corpus import stopwords 9 | from nltk.tokenize import word_tokenize 10 | ``` 11 | 12 | ```ruby 13 | import nltk 14 | from nltk.corpus import stopwords 15 | print(stopwords.words('english')) 16 | ``` 17 | 18 | ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"] 19 | 
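
One caveat worth noting (the sketch below is an addition, not part of the original README; the helper name and example sentence are illustrative): the default English list printed above includes negation words such as `not`, `no` and `nor`, which often carry the label signal in sentiment-style tasks, so it can be useful to keep them when filtering.

```ruby
# Sketch: remove default NLTK stopwords but keep negation words.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) - {'not', 'no', 'nor'}

def remove_stopwords_keep_negations(text):
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords_keep_negations("This movie was not good at all"))
# expected: 'movie not good'
```
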
-------------------------------------------------------------------------------- /Pre Processing/Stop Words/Remove default stopwords.py: -------------------------------------------------------------------------------- 1 | # importing nltk library 2 | from nltk.corpus import stopwords 3 | from nltk.tokenize import word_tokenize 4 | 5 | # remove stopwords function 6 | def rem_stopwords(text): 7 | text = rem_punct(text) 8 | stop_words = set(stopwords.words("english")) 9 | word_tokens = word_tokenize(text) 10 | filtered_text = [word for word in word_tokens if word not in stop_words] 11 | return filtered_text 12 | 13 | ex_text = "Data is the new oil. A.I is the last invention" 14 | rem_stopwords(ex_text) 15 | 16 | # OUTPUT :- ['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention'] 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | # For Removing numbers 25 | def remove_num(text): 26 | result = re.sub(r'\d+', '', text) 27 | return result 28 | 29 | 30 | 31 | # importing nltk library 32 | from nltk.corpus import stopwords 33 | from nltk.tokenize import word_tokenize 34 | 35 | # remove stopwords function 36 | def rem_stopwords(text): 37 | text = rem_punct(text) 38 | stop_words = set(stopwords.words("english")) 39 | word_tokens = word_tokenize(text) 40 | filtered_text = [word for word in word_tokens if word not in stop_words] 41 | return filtered_text 42 | 43 | ex_text = "Data is the new oil. A.I is the last invention" 44 | rem_stopwords(ex_text) 45 | 46 | 47 | # OUTPUT :- ['Data', 'new', 'oil', 'AI', 'last', 'invention'] 48 | -------------------------------------------------------------------------------- /Pre Processing/Text_Processing_in_NLP.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Text Processing in NLP.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "AlPiVcIdJFpO" 22 | }, 23 | "source": [ 24 | "# Text Preprocessing\n", 25 | "\n", 26 | "Supose we have textual data available, we need to apply many of pre-processing steps to the data to transform those words into numerical features that work with machine learning algorithms.\n", 27 | "\n", 28 | "The pre-processing steps for the problem depend mainly on the domain and the problem itself.We don't need to apply all the steps for every problem.\n", 29 | "\n", 30 | "Here, we're going to see text preprocessing in Python. We'll use NLTK(Natural language toolkit) library here." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "metadata": { 36 | "id": "K4JGBS1TJFpP" 37 | }, 38 | "source": [ 39 | "# import necessary libraries \n", 40 | "import nltk\n", 41 | "import string\n", 42 | "import re" 43 | ], 44 | "execution_count": 12, 45 | "outputs": [] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "id": "FdD68bHhJFpT" 51 | }, 52 | "source": [ 53 | "### Text lowercase\n", 54 | "\n", 55 | "We do lowercase the text to reduce the size of the vocabulary of our text data." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "metadata": { 61 | "id": "YoXbKGUcJFpT", 62 | "outputId": "e752634e-3d0b-4ba1-86b7-89e83b48f3f7", 63 | "colab": { 64 | "base_uri": "https://localhost:8080/", 65 | "height": 36 66 | } 67 | }, 68 | "source": [ 69 | "def lowercase_text(text): \n", 70 | " return text.lower() \n", 71 | " \n", 72 | "input_str = \"Weather is too Cloudy.Possiblity of Rain is High,Today!!\"\n", 73 | "lowercase_text(input_str) " 74 | ], 75 | "execution_count": 13, 76 | "outputs": [ 77 | { 78 | "output_type": "execute_result", 79 | "data": { 80 | "text/plain": [ 81 | "'weather is too cloudy.possiblity of rain is high,today!!'" 82 | ], 83 | "application/vnd.google.colaboratory.intrinsic+json": { 84 | "type": "string" 85 | } 86 | }, 87 | "metadata": {}, 88 | "execution_count": 13 89 | } 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "id": "Kkx5nINIJFpa" 96 | }, 97 | "source": [ 98 | "### Remove numbers\n", 99 | "\n", 100 | "We should either remove the numbers or convert those numbers into textual representations.\n", 101 | "We use regular expressions(re) to remove the numbers." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "metadata": { 107 | "id": "rJxsrbYcJFpc", 108 | "outputId": "4d490643-fa75-478f-9ced-ab0eb04d7fb9", 109 | "colab": { 110 | "base_uri": "https://localhost:8080/", 111 | "height": 36 112 | } 113 | }, 114 | "source": [ 115 | "# For Removing numbers \n", 116 | "def remove_num(text): \n", 117 | " result = re.sub(r'\\d+', '', text) \n", 118 | " return result \n", 119 | " \n", 120 | "input_s = \"You bought 6 candies from shop, and 4 candies are in home.\"\n", 121 | "remove_num(input_s) " 122 | ], 123 | "execution_count": 14, 124 | "outputs": [ 125 | { 126 | "output_type": "execute_result", 127 | "data": { 128 | "text/plain": [ 129 | "'You bought candies from shop, and candies are in home.'" 130 | ], 131 | "application/vnd.google.colaboratory.intrinsic+json": { 132 | "type": "string" 133 | } 134 | }, 135 | "metadata": {}, 136 | "execution_count": 14 137 | } 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": { 143 | "id": "A2dGfGX2JFpg" 144 | }, 145 | "source": [ 146 | "## Convert the numbers into words" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "metadata": { 152 | "id": "VafdwS3cJFph", 153 | "outputId": "b3e2aa86-22b2-4dcc-d249-588d51cb1de8", 154 | "colab": { 155 | "base_uri": "https://localhost:8080/" 156 | } 157 | }, 158 | "source": [ 159 | "import inflect \n", 160 | "q = inflect.engine() \n", 161 | "def convert_num(text): \n", 162 | " temp_string = text.split() \n", 163 | " new_str = [] \n", 164 | " for word in temp_string: \n", 165 | " if word.isdigit(): \n", 166 | " temp = q.number_to_words(word) \n", 167 | " new_str.append(temp) \n", 168 | " else: \n", 169 | " new_str.append(word) \n", 170 | " temp_str = ' '.join(new_str) \n", 171 | " return temp_str \n", 172 | " \n", 173 | "input1 = 'I am 20 years old'\n", 174 | "print(convert_num(input1))\n", 175 | "input2 = 'I was born in 2002'\n", 176 | "print(convert_num(input2))" 177 | ], 178 | "execution_count": 28, 179 | "outputs": [ 180 | { 181 | "output_type": "stream", 182 | "name": "stdout", 183 | "text": [ 184 | "I am twenty years old\n", 185 | "I was born in two thousand and two\n" 186 | ] 187 | } 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": { 193 | "id": "506D4NxOJFpk" 194 | }, 195 | "source": [ 196 | "### Remove Punctuation\n", 197 | "\n", 198 | "We remove punctuations because of that we don't 
have different form of the same word. If we don't remove punctuations, then been, been, and been! will be treated separately." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "metadata": { 204 | "id": "5B5eCetpJFpl", 205 | "outputId": "f1bd340c-2ed5-4eb7-ecf0-f6f6527a2cb8", 206 | "colab": { 207 | "base_uri": "https://localhost:8080/", 208 | "height": 36 209 | } 210 | }, 211 | "source": [ 212 | "# let's remove punctuation \n", 213 | "def rem_punct(text): \n", 214 | " translator = str.maketrans('', '', string.punctuation) \n", 215 | " return text.translate(translator) \n", 216 | " \n", 217 | "input_str = \"Hey, Are you excited??, After a week, we will be in Shimla!!!\"\n", 218 | "rem_punct(input_str) " 219 | ], 220 | "execution_count": 29, 221 | "outputs": [ 222 | { 223 | "output_type": "execute_result", 224 | "data": { 225 | "text/plain": [ 226 | "'Hey Are you excited After a week we will be in Shimla'" 227 | ], 228 | "application/vnd.google.colaboratory.intrinsic+json": { 229 | "type": "string" 230 | } 231 | }, 232 | "metadata": {}, 233 | "execution_count": 29 234 | } 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "id": "99YJy-_ZJFpo" 241 | }, 242 | "source": [ 243 | "### Remove default stopwords:\n", 244 | "\n", 245 | "Stopwords are words that do not contribute to the meaning of the sentence. Hence, they can be safely removed without causing any change in the meaning of a sentence. The NLTK(Natural Language Toolkit) library has the set of stopwords and we can use these to remove stopwords from our text and return a list of word tokens." 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "metadata": { 251 | "id": "Zg0flBIrJFpp", 252 | "outputId": "927339d9-432a-448b-902c-3aa332cc86ad", 253 | "colab": { 254 | "base_uri": "https://localhost:8080/" 255 | } 256 | }, 257 | "source": [ 258 | "# importing nltk library\n", 259 | "from nltk.corpus import stopwords \n", 260 | "from nltk.tokenize import word_tokenize \n", 261 | "\n", 262 | "nltk.download('stopwords')\n", 263 | "nltk.download('punkt')\n", 264 | " \n", 265 | "# remove stopwords function \n", 266 | "def rem_stopwords(text): \n", 267 | " stop_words = set(stopwords.words(\"english\")) \n", 268 | " word_tokens = word_tokenize(text) \n", 269 | " filtered_text = [word for word in word_tokens if word not in stop_words] \n", 270 | " return filtered_text \n", 271 | " \n", 272 | "ex_text = \"Data is the new oil. A.I is the last invention\"\n", 273 | "rem_stopwords(ex_text)" 274 | ], 275 | "execution_count": 17, 276 | "outputs": [ 277 | { 278 | "output_type": "stream", 279 | "name": "stdout", 280 | "text": [ 281 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 282 | "[nltk_data] Package stopwords is already up-to-date!\n", 283 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 284 | "[nltk_data] Package punkt is already up-to-date!\n" 285 | ] 286 | }, 287 | { 288 | "output_type": "execute_result", 289 | "data": { 290 | "text/plain": [ 291 | "['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']" 292 | ] 293 | }, 294 | "metadata": {}, 295 | "execution_count": 17 296 | } 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": { 302 | "id": "uZMQOMO1JFps" 303 | }, 304 | "source": [ 305 | "### Stemming\n", 306 | "\n", 307 | "From Stemming we will process of getting the root form of a word. Root or Stem is the part to which inflextional affixes(like -ed, -ize, etc) are added. 
We would create the stem words by removing the prefix of suffix of a word. So, stemming a word may not result in actual words.\n", 308 | "\n", 309 | "For Example: Mangoes ---> Mango\n", 310 | "\n", 311 | " Boys ---> Boy\n", 312 | " \n", 313 | " going ---> go\n", 314 | " \n", 315 | " \n", 316 | "If our sentences are not in tokens, then we need to convert it into tokens. After we converted strings of text into tokens, then we can convert those word tokens into their root form. These are the Porter stemmer, the snowball stemmer, and the Lancaster Stemmer. We usually use Porter stemmer among them." 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "metadata": { 322 | "id": "iwP4-kAgJFpt", 323 | "outputId": "be532381-0aac-42fa-9f98-a9b9c3689bbe", 324 | "colab": { 325 | "base_uri": "https://localhost:8080/" 326 | } 327 | }, 328 | "source": [ 329 | "#importing nltk's porter stemmer \n", 330 | "from nltk.stem.porter import PorterStemmer \n", 331 | "from nltk.tokenize import word_tokenize \n", 332 | "stem1 = PorterStemmer() \n", 333 | " \n", 334 | "# stem words in the list of tokenised words \n", 335 | "def s_words(text): \n", 336 | " word_tokens = word_tokenize(text) \n", 337 | " stems = [stem1.stem(word) for word in word_tokens] \n", 338 | " return stems \n", 339 | " \n", 340 | "text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'\n", 341 | "s_words(text)" 342 | ], 343 | "execution_count": 18, 344 | "outputs": [ 345 | { 346 | "output_type": "execute_result", 347 | "data": { 348 | "text/plain": [ 349 | "['data',\n", 350 | " 'is',\n", 351 | " 'the',\n", 352 | " 'new',\n", 353 | " 'revolut',\n", 354 | " 'in',\n", 355 | " 'the',\n", 356 | " 'world',\n", 357 | " ',',\n", 358 | " 'in',\n", 359 | " 'a',\n", 360 | " 'day',\n", 361 | " 'one',\n", 362 | " 'individu',\n", 363 | " 'would',\n", 364 | " 'gener',\n", 365 | " 'terabyt',\n", 366 | " 'of',\n", 367 | " 'data',\n", 368 | " '.']" 369 | ] 370 | }, 371 | "metadata": {}, 372 | "execution_count": 18 373 | } 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": { 379 | "id": "J3auiVuIJFpw" 380 | }, 381 | "source": [ 382 | "### Lemmatization\n", 383 | "\n", 384 | "As stemming, lemmatization do the same but the only difference is that lemmatization ensures that root word belongs to the language. Because of the use of lemmatization we will get the valid words. In NLTK(Natural language Toolkit), we use WordLemmatizer to get the lemmas of words. We also need to provide a context for the lemmatization.So, we added pos(parts-of-speech) as a parameter. " 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "metadata": { 390 | "id": "YavFewldJFpx", 391 | "outputId": "4d2a5fc8-8bc7-41b5-e293-b31a5b864f06", 392 | "colab": { 393 | "base_uri": "https://localhost:8080/" 394 | } 395 | }, 396 | "source": [ 397 | "from nltk.stem import wordnet \n", 398 | "from nltk.tokenize import word_tokenize \n", 399 | "lemma = wordnet.WordNetLemmatizer()\n", 400 | "nltk.download('wordnet')\n", 401 | "# lemmatize string \n", 402 | "def lemmatize_word(text): \n", 403 | " word_tokens = word_tokenize(text) \n", 404 | " # provide context i.e. 
part-of-speech(pos)\n", 405 | " lemmas = [lemma.lemmatize(word, pos ='v') for word in word_tokens] \n", 406 | " return lemmas \n", 407 | " \n", 408 | "text = 'Data is the new revolution in the World, in a day one individual would generate terabytes of data.'\n", 409 | "lemmatize_word(text)" 410 | ], 411 | "execution_count": 19, 412 | "outputs": [ 413 | { 414 | "output_type": "stream", 415 | "name": "stdout", 416 | "text": [ 417 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 418 | "[nltk_data] Package wordnet is already up-to-date!\n" 419 | ] 420 | }, 421 | { 422 | "output_type": "execute_result", 423 | "data": { 424 | "text/plain": [ 425 | "['Data',\n", 426 | " 'be',\n", 427 | " 'the',\n", 428 | " 'new',\n", 429 | " 'revolution',\n", 430 | " 'in',\n", 431 | " 'the',\n", 432 | " 'World',\n", 433 | " ',',\n", 434 | " 'in',\n", 435 | " 'a',\n", 436 | " 'day',\n", 437 | " 'one',\n", 438 | " 'individual',\n", 439 | " 'would',\n", 440 | " 'generate',\n", 441 | " 'terabytes',\n", 442 | " 'of',\n", 443 | " 'data',\n", 444 | " '.']" 445 | ] 446 | }, 447 | "metadata": {}, 448 | "execution_count": 19 449 | } 450 | ] 451 | }, 452 | { 453 | "cell_type": "code", 454 | "metadata": { 455 | "id": "Vs9vYMcSQYu-", 456 | "outputId": "7fdb5c11-75b1-4370-b0bb-c608ed4f6d8a", 457 | "colab": { 458 | "base_uri": "https://localhost:8080/" 459 | } 460 | }, 461 | "source": [ 462 | "import nltk\n", 463 | "nltk.download('punkt')" 464 | ], 465 | "execution_count": 20, 466 | "outputs": [ 467 | { 468 | "output_type": "stream", 469 | "name": "stdout", 470 | "text": [ 471 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 472 | "[nltk_data] Package punkt is already up-to-date!\n" 473 | ] 474 | }, 475 | { 476 | "output_type": "execute_result", 477 | "data": { 478 | "text/plain": [ 479 | "True" 480 | ] 481 | }, 482 | "metadata": {}, 483 | "execution_count": 20 484 | } 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "metadata": { 490 | "id": "hCjm73RLJFp1", 491 | "outputId": "04f28ab2-b652-49f8-d85a-28d4b956d005", 492 | "colab": { 493 | "base_uri": "https://localhost:8080/" 494 | } 495 | }, 496 | "source": [ 497 | "# importing tokenize library\n", 498 | "from nltk.tokenize import word_tokenize \n", 499 | "from nltk import pos_tag \n", 500 | "nltk.download('averaged_perceptron_tagger')\n", 501 | " \n", 502 | "# convert text into word_tokens with their tags \n", 503 | "def pos_tagg(text): \n", 504 | " word_tokens = word_tokenize(text) \n", 505 | " return pos_tag(word_tokens) \n", 506 | " \n", 507 | "pos_tagg('Are you afraid of something?') " 508 | ], 509 | "execution_count": 21, 510 | "outputs": [ 511 | { 512 | "output_type": "stream", 513 | "name": "stdout", 514 | "text": [ 515 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n", 516 | "[nltk_data] /root/nltk_data...\n", 517 | "[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n" 518 | ] 519 | }, 520 | { 521 | "output_type": "execute_result", 522 | "data": { 523 | "text/plain": [ 524 | "[('Are', 'NNP'),\n", 525 | " ('you', 'PRP'),\n", 526 | " ('afraid', 'IN'),\n", 527 | " ('of', 'IN'),\n", 528 | " ('something', 'NN'),\n", 529 | " ('?', '.')]" 530 | ] 531 | }, 532 | "metadata": {}, 533 | "execution_count": 21 534 | } 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "metadata": { 540 | "id": "JtdatdAsJFp4" 541 | }, 542 | "source": [ 543 | "In the above example NNP stands for Proper noun, PRP stands for personal noun, IN as Preposition. 
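These Penn Treebank tags can also be fed back into the lemmatizer, so that each token is lemmatized with its own part of speech instead of a fixed `pos='v'`. A minimal sketch, assuming NLTK's WordNet POS constants; the `get_wordnet_pos` helper below is an illustrative mapping, not an NLTK function:

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemma = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (JJ..., VB..., RB..., NN...) to a WordNet POS constant.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # noun is also the lemmatizer's default

def lemmatize_with_pos(text):
    word_tokens = word_tokenize(text)
    tagged = pos_tag(word_tokens)
    return [lemma.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]

lemmatize_with_pos('The leaves are falling and the children were playing')
# e.g. ['The', 'leaf', 'be', 'fall', 'and', 'the', 'child', 'be', 'play']
```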
We can get all the details pos tags using the Penn Treebank tagset." 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "metadata": { 549 | "id": "3Zv_rokfJFp5", 550 | "outputId": "9eabd906-2851-4544-8788-e0a46b1210e4", 551 | "colab": { 552 | "base_uri": "https://localhost:8080/" 553 | } 554 | }, 555 | "source": [ 556 | "# downloading the tagset \n", 557 | "nltk.download('tagsets') \n", 558 | " \n", 559 | "# extract information about the tag \n", 560 | "nltk.help.upenn_tagset('PRP')" 561 | ], 562 | "execution_count": 22, 563 | "outputs": [ 564 | { 565 | "output_type": "stream", 566 | "name": "stdout", 567 | "text": [ 568 | "[nltk_data] Downloading package tagsets to /root/nltk_data...\n", 569 | "[nltk_data] Unzipping help/tagsets.zip.\n", 570 | "PRP: pronoun, personal\n", 571 | " hers herself him himself hisself it itself me myself one oneself ours\n", 572 | " ourselves ownself self she thee theirs them themselves they thou thy us\n" 573 | ] 574 | } 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": { 580 | "id": "tRBPHRjSJFp8" 581 | }, 582 | "source": [ 583 | "### Chunking\n", 584 | "\n", 585 | "Chunking is the process of extracting phrases from the Unstructured text and give them more structure to it. We also called them shallow parsing.We can do it on top of pos tagging. It groups words into chunks mainly for noun phrases. chunking we do by using regular expression. " 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "metadata": { 591 | "id": "lzLmybEuJFp8", 592 | "outputId": "b0937dde-fd78-401a-f21b-30b4fb921b5c", 593 | "colab": { 594 | "base_uri": "https://localhost:8080/" 595 | } 596 | }, 597 | "source": [ 598 | "#importing libraries\n", 599 | "from nltk.tokenize import word_tokenize \n", 600 | "from nltk import pos_tag \n", 601 | " \n", 602 | "# here we define chunking function with text and regular \n", 603 | "# expressions representing grammar as parameter \n", 604 | "def chunking(text, grammar): \n", 605 | " word_tokens = word_tokenize(text) \n", 606 | " \n", 607 | " # label words with pos \n", 608 | " word_pos = pos_tag(word_tokens) \n", 609 | " \n", 610 | " # create chunk parser using grammar \n", 611 | " chunkParser = nltk.RegexpParser(grammar) \n", 612 | " \n", 613 | " # test it on the list of word tokens with tagged pos \n", 614 | " tree = chunkParser.parse(word_pos) \n", 615 | " \n", 616 | " for subtree in tree.subtrees(): \n", 617 | " print(subtree) \n", 618 | " #tree.draw() \n", 619 | " \n", 620 | "sentence = 'the little red parrot is flying in the sky'\n", 621 | "grammar = \"NP: {
?*}\"\n", 622 | "chunking(sentence, grammar) " 623 | ], 624 | "execution_count": 23, 625 | "outputs": [ 626 | { 627 | "output_type": "stream", 628 | "name": "stdout", 629 | "text": [ 630 | "(S\n", 631 | " (NP the/DT little/JJ red/JJ parrot/NN)\n", 632 | " is/VBZ\n", 633 | " flying/VBG\n", 634 | " in/IN\n", 635 | " (NP the/DT sky/NN))\n", 636 | "(NP the/DT little/JJ red/JJ parrot/NN)\n", 637 | "(NP the/DT sky/NN)\n" 638 | ] 639 | } 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "metadata": { 645 | "id": "RYT9PqpwJFqA" 646 | }, 647 | "source": [ 648 | "In the above example, we defined the grammar by using the regular expression rule. This rule tells you that NP(noun phrase) chunk should be formed whenever the chunker find the optional determiner(DJ) followed by any no. of adjectives and then a NN(noun).\n", 649 | "\n", 650 | "Image after running above code.\n", 651 | "\n", 652 | "\n", 653 | "Libraries like Spacy and TextBlob are best for chunking." 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": { 659 | "id": "E6faIGsRJFqB" 660 | }, 661 | "source": [ 662 | "### Named Entity Recognition\n", 663 | "\n", 664 | "It is used to extract information from unstructured text. It is used to classy the entities which is present in the text into categories like a person, organization, event, places, etc. This will give you a detail knowledge about the text and the relationship between the different entities." 665 | ] 666 | }, 667 | { 668 | "cell_type": "code", 669 | "metadata": { 670 | "id": "QeA_JtIBJFqC", 671 | "outputId": "3eb8a1df-6e57-4552-c0fd-a7bf62aa0830", 672 | "colab": { 673 | "base_uri": "https://localhost:8080/" 674 | } 675 | }, 676 | "source": [ 677 | "#Importing tokenization and chunk\n", 678 | "from nltk.tokenize import word_tokenize \n", 679 | "from nltk import pos_tag, ne_chunk \n", 680 | "nltk.download('maxent_ne_chunker')\n", 681 | "nltk.download('words')\n", 682 | " \n", 683 | "def ner(text): \n", 684 | " # tokenize the text \n", 685 | " word_tokens = word_tokenize(text) \n", 686 | " \n", 687 | " # pos tagging of words \n", 688 | " word_pos = pos_tag(word_tokens) \n", 689 | " \n", 690 | " # tree of word entities \n", 691 | " print(ne_chunk(word_pos)) \n", 692 | " \n", 693 | "text = 'Brain Lara scored the highest 400 runs in a test match which played in between WI and England.'\n", 694 | "ner(text) " 695 | ], 696 | "execution_count": 24, 697 | "outputs": [ 698 | { 699 | "output_type": "stream", 700 | "name": "stdout", 701 | "text": [ 702 | "[nltk_data] Downloading package maxent_ne_chunker to\n", 703 | "[nltk_data] /root/nltk_data...\n", 704 | "[nltk_data] Unzipping chunkers/maxent_ne_chunker.zip.\n", 705 | "[nltk_data] Downloading package words to /root/nltk_data...\n", 706 | "[nltk_data] Unzipping corpora/words.zip.\n", 707 | "(S\n", 708 | " (PERSON Brain/NNP)\n", 709 | " (PERSON Lara/NNP)\n", 710 | " scored/VBD\n", 711 | " the/DT\n", 712 | " highest/JJS\n", 713 | " 400/CD\n", 714 | " runs/NNS\n", 715 | " in/IN\n", 716 | " a/DT\n", 717 | " test/NN\n", 718 | " match/NN\n", 719 | " which/WDT\n", 720 | " played/VBD\n", 721 | " in/IN\n", 722 | " between/IN\n", 723 | " (ORGANIZATION WI/NNP)\n", 724 | " and/CC\n", 725 | " (GPE England/NNP)\n", 726 | " ./.)\n" 727 | ] 728 | } 729 | ] 730 | } 731 | ] 732 | } -------------------------------------------------------------------------------- /Pre Processing/Tokenizer and padding.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | 
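One detail worth spelling out before the tokenizer cells below: when `oov_token` is set, the Keras `Tokenizer` reserves index 1 for an explicit out-of-vocabulary sentinel, and a visible marker string such as `'<OOV>'` is the usual choice so that unseen words still map to a valid id. A minimal sketch; the sentences and the sentinel name are illustrative:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog', 'I love my cat']

# Reserve an index for a sentinel that stands in for unseen words.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)   # the sentinel '<OOV>' occupies index 1

# 'parrot' was never seen during fit, so it maps to the OOV index.
seqs = tokenizer.texts_to_sequences(['I love my parrot'])
print(seqs)

padded = pad_sequences(seqs, maxlen=6, padding='post')
print(padded)
```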
"nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Tokenizer and padded.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "code", 21 | "source": [ 22 | "import tensorflow as tf\n", 23 | "from tensorflow import keras\n", 24 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 25 | "from tensorflow.keras.preprocessing.sequence import pad_sequences" 26 | ], 27 | "metadata": { 28 | "id": "cIqsHgign0wE" 29 | }, 30 | "execution_count": 1, 31 | "outputs": [] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "source": [ 36 | "sentences = [\n", 37 | " 'I love my dog',\n", 38 | " 'I love my cat',\n", 39 | " 'You love my dog!',\n", 40 | " 'Do you think my dog is amazing?']" 41 | ], 42 | "metadata": { 43 | "id": "U_XKfuO2n3k6" 44 | }, 45 | "execution_count": 2, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "source": [ 51 | "### Word based encodings\n", 52 | "\n", 53 | "- **OOV ->** instead of just ignoring unseen words, we put a special value in when an unseen word is encountered. oov is used for outer vocabulary to be used for words that aren't in the word indexx" 54 | ], 55 | "metadata": { 56 | "id": "PZXQO9p4sgzM" 57 | } 58 | }, 59 | { 60 | "cell_type": "code", 61 | "source": [ 62 | "tokenizer = Tokenizer(num_words = 100, oov_token=\"\") # OOV -> out of vocabolary\n", 63 | "tokenizer.fit_on_texts(sentences)\n", 64 | "word_index = tokenizer.word_index\n", 65 | "print(word_index) " 66 | ], 67 | "metadata": { 68 | "colab": { 69 | "base_uri": "https://localhost:8080/" 70 | }, 71 | "id": "dRkEIfoln3m5", 72 | "outputId": "425de5c1-e8f6-40af-87f2-eca3e43b64a3" 73 | }, 74 | "execution_count": 3, 75 | "outputs": [ 76 | { 77 | "output_type": "stream", 78 | "name": "stdout", 79 | "text": [ 80 | "{'': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}\n" 81 | ] 82 | } 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "source": [ 88 | "### Text to sequence" 89 | ], 90 | "metadata": { 91 | "id": "ioSv0cmhsefh" 92 | } 93 | }, 94 | { 95 | "cell_type": "code", 96 | "source": [ 97 | "sequences = tokenizer.texts_to_sequences(sentences)\n", 98 | "sequences" 99 | ], 100 | "metadata": { 101 | "colab": { 102 | "base_uri": "https://localhost:8080/" 103 | }, 104 | "id": "P9F8ZONhorUX", 105 | "outputId": "9bdfb66c-395b-49c0-b398-e68a237c775e" 106 | }, 107 | "execution_count": 4, 108 | "outputs": [ 109 | { 110 | "output_type": "execute_result", 111 | "data": { 112 | "text/plain": [ 113 | "[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]" 114 | ] 115 | }, 116 | "metadata": {}, 117 | "execution_count": 4 118 | } 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "source": [ 124 | "# Padding\n", 125 | "- Padding the default is pre, which means that we will lose from the beginning of the sentence.\n", 126 | "- Can change it using from pre to post\n", 127 | "```ruby\n", 128 | "padded = pad_sequences(sentences , padding = 'post')\n", 129 | "```\n", 130 | "- Setting max length in sequenes, it will lose from beginning\n", 131 | "```ruby\n", 132 | "sequences = pad_sequences(sentences, maxlen=4)\n", 133 | "print(sequences)\n", 134 | "```" 135 | ], 136 | "metadata": { 137 | "id": "9w48KJ2HtywK" 138 | } 139 | }, 140 | { 141 | "cell_type": "code", 142 | "source": [ 143 | "padded = 
pad_sequences(sequences, maxlen=5)\n", 144 | "print(padded)" 145 | ], 146 | "metadata": { 147 | "colab": { 148 | "base_uri": "https://localhost:8080/" 149 | }, 150 | "id": "ozSk3Nmhn3or", 151 | "outputId": "138a5263-5a8b-4809-a413-23f93b4c4e79" 152 | }, 153 | "execution_count": 5, 154 | "outputs": [ 155 | { 156 | "output_type": "stream", 157 | "name": "stdout", 158 | "text": [ 159 | "[[ 0 5 3 2 4]\n", 160 | " [ 0 5 3 2 7]\n", 161 | " [ 0 6 3 2 4]\n", 162 | " [ 9 2 4 10 11]]\n" 163 | ] 164 | } 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "source": [ 170 | "padded = pad_sequences(sequences, maxlen=8)\n", 171 | "print(padded)" 172 | ], 173 | "metadata": { 174 | "colab": { 175 | "base_uri": "https://localhost:8080/" 176 | }, 177 | "id": "fxYCW_22n3qn", 178 | "outputId": "ef31e0c8-740e-44e6-8c7c-bffc7f0fdb7a" 179 | }, 180 | "execution_count": 6, 181 | "outputs": [ 182 | { 183 | "output_type": "stream", 184 | "name": "stdout", 185 | "text": [ 186 | "[[ 0 0 0 0 5 3 2 4]\n", 187 | " [ 0 0 0 0 5 3 2 7]\n", 188 | " [ 0 0 0 0 6 3 2 4]\n", 189 | " [ 0 8 6 9 2 4 10 11]]\n" 190 | ] 191 | } 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "source": [ 197 | "test_data = [\n", 198 | " 'i really love my dog',\n", 199 | " 'my dog loves my manatee'\n", 200 | "]\n", 201 | "\n", 202 | "test_seq = tokenizer.texts_to_sequences(test_data)\n", 203 | "test_seq" 204 | ], 205 | "metadata": { 206 | "colab": { 207 | "base_uri": "https://localhost:8080/" 208 | }, 209 | "id": "cpXFeaS6pLb4", 210 | "outputId": "e576e551-5d74-4fe7-c63c-087be3f4ab78" 211 | }, 212 | "execution_count": 7, 213 | "outputs": [ 214 | { 215 | "output_type": "execute_result", 216 | "data": { 217 | "text/plain": [ 218 | "[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]" 219 | ] 220 | }, 221 | "metadata": {}, 222 | "execution_count": 7 223 | } 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "source": [ 229 | "padded = pad_sequences(test_seq, maxlen=10) # default padding is left\n", 230 | "padded" 231 | ], 232 | "metadata": { 233 | "colab": { 234 | "base_uri": "https://localhost:8080/" 235 | }, 236 | "id": "R-fjTRoIpLfp", 237 | "outputId": "970bcd96-30ed-4354-ffde-6d7b84f708d2" 238 | }, 239 | "execution_count": 8, 240 | "outputs": [ 241 | { 242 | "output_type": "execute_result", 243 | "data": { 244 | "text/plain": [ 245 | "array([[0, 0, 0, 0, 0, 5, 1, 3, 2, 4],\n", 246 | " [0, 0, 0, 0, 0, 2, 4, 1, 2, 1]], dtype=int32)" 247 | ] 248 | }, 249 | "metadata": {}, 250 | "execution_count": 8 251 | } 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "source": [ 257 | "padded = pad_sequences(test_seq, maxlen=10,padding = 'post')\n", 258 | "padded" 259 | ], 260 | "metadata": { 261 | "colab": { 262 | "base_uri": "https://localhost:8080/" 263 | }, 264 | "id": "88RL3YLTp2aQ", 265 | "outputId": "73d576e4-24e2-4649-b8e9-86e91102b869" 266 | }, 267 | "execution_count": 9, 268 | "outputs": [ 269 | { 270 | "output_type": "execute_result", 271 | "data": { 272 | "text/plain": [ 273 | "array([[5, 1, 3, 2, 4, 0, 0, 0, 0, 0],\n", 274 | " [2, 4, 1, 2, 1, 0, 0, 0, 0, 0]], dtype=int32)" 275 | ] 276 | }, 277 | "metadata": {}, 278 | "execution_count": 9 279 | } 280 | ] 281 | } 282 | ] 283 | } 284 | -------------------------------------------------------------------------------- /Pre Processing/Word2Vec/Google_News_word2vec.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Google News word2vec.ipynb", 7 | "provenance": [], 8 
| "machine_shape": "hm" 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "accelerator": "GPU" 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "code", 22 | "source": [ 23 | "import gensim.downloader as api\n", 24 | "\n", 25 | "model = api.load('word2vec-google-news-300')" 26 | ], 27 | "metadata": { 28 | "id": "RWTyo8lN83Di", 29 | "colab": { 30 | "base_uri": "https://localhost:8080/" 31 | }, 32 | "outputId": "73a3a1a7-0bf3-4386-87b5-7f5a8547814c" 33 | }, 34 | "execution_count": 1, 35 | "outputs": [ 36 | { 37 | "output_type": "stream", 38 | "name": "stdout", 39 | "text": [ 40 | "[==================================================] 100.0% 1662.8/1662.8MB downloaded\n" 41 | ] 42 | } 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "source": [ 48 | "model['cricket']" 49 | ], 50 | "metadata": { 51 | "colab": { 52 | "base_uri": "https://localhost:8080/" 53 | }, 54 | "id": "fxh78F7TEeEs", 55 | "outputId": "1b418de0-2dfe-41fb-8e68-08c9918c27e2" 56 | }, 57 | "execution_count": 2, 58 | "outputs": [ 59 | { 60 | "output_type": "execute_result", 61 | "data": { 62 | "text/plain": [ 63 | "array([-3.67187500e-01, -1.21582031e-01, 2.85156250e-01, 8.15429688e-02,\n", 64 | " 3.19824219e-02, -3.19824219e-02, 1.34765625e-01, -2.73437500e-01,\n", 65 | " 9.46044922e-03, -1.07421875e-01, 2.48046875e-01, -6.05468750e-01,\n", 66 | " 5.02929688e-02, 2.98828125e-01, 9.57031250e-02, 1.39648438e-01,\n", 67 | " -5.41992188e-02, 2.91015625e-01, 2.85156250e-01, 1.51367188e-01,\n", 68 | " -2.89062500e-01, -3.46679688e-02, 1.81884766e-02, -3.92578125e-01,\n", 69 | " 2.46093750e-01, 2.51953125e-01, -9.86328125e-02, 3.22265625e-01,\n", 70 | " 4.49218750e-01, -1.36718750e-01, -2.34375000e-01, 4.12597656e-02,\n", 71 | " -2.15820312e-01, 1.69921875e-01, 2.56347656e-02, 1.50146484e-02,\n", 72 | " -3.75976562e-02, 6.95800781e-03, 4.00390625e-01, 2.09960938e-01,\n", 73 | " 1.17675781e-01, -4.19921875e-02, 2.34375000e-01, 2.03125000e-01,\n", 74 | " -1.86523438e-01, -2.46093750e-01, 3.12500000e-01, -2.59765625e-01,\n", 75 | " -1.06933594e-01, 1.04003906e-01, -1.79687500e-01, 5.71289062e-02,\n", 76 | " -7.41577148e-03, -5.59082031e-02, 7.61718750e-02, -4.14062500e-01,\n", 77 | " -3.65234375e-01, -3.35937500e-01, -1.54296875e-01, -2.39257812e-01,\n", 78 | " -3.73046875e-01, 2.27355957e-03, -3.51562500e-01, 8.64257812e-02,\n", 79 | " 1.26953125e-01, 2.21679688e-01, -9.86328125e-02, 1.08886719e-01,\n", 80 | " 3.65234375e-01, -5.66406250e-02, 5.66406250e-02, -1.09375000e-01,\n", 81 | " -1.66992188e-01, -4.54101562e-02, -2.00195312e-01, -1.22558594e-01,\n", 82 | " 1.31835938e-01, -1.31835938e-01, 1.03027344e-01, -3.41796875e-01,\n", 83 | " -1.57226562e-01, 2.04101562e-01, 4.39453125e-02, 2.44140625e-01,\n", 84 | " -3.19824219e-02, 3.20312500e-01, -4.41894531e-02, 1.08398438e-01,\n", 85 | " -4.98046875e-02, -9.52148438e-03, 2.46093750e-01, -5.59082031e-02,\n", 86 | " 4.07714844e-02, -1.78222656e-02, -2.95410156e-02, 1.65039062e-01,\n", 87 | " 5.03906250e-01, -2.81250000e-01, 9.81445312e-02, 1.80664062e-02,\n", 88 | " -1.83593750e-01, 2.53906250e-01, 2.25585938e-01, 1.63574219e-02,\n", 89 | " 1.81640625e-01, 1.38671875e-01, 3.33984375e-01, 1.39648438e-01,\n", 90 | " 1.45874023e-02, -2.89306641e-02, -8.39843750e-02, 1.50390625e-01,\n", 91 | " 1.67968750e-01, 2.28515625e-01, 3.59375000e-01, 1.22558594e-01,\n", 92 | " -3.28125000e-01, -1.56250000e-01, 2.77343750e-01, 1.77001953e-02,\n", 93 | " -1.46484375e-01, -4.51660156e-03, 
-4.46777344e-02, 1.75781250e-01,\n", 94 | " -3.75000000e-01, 1.16699219e-01, -1.39648438e-01, 2.55859375e-01,\n", 95 | " -1.96289062e-01, -2.57568359e-02, -5.41992188e-02, -2.51464844e-02,\n", 96 | " -1.93359375e-01, -3.17382812e-02, -8.74023438e-02, -1.32812500e-01,\n", 97 | " -2.12402344e-02, 4.33593750e-01, -5.20019531e-02, 3.46679688e-02,\n", 98 | " 8.00781250e-02, 3.41796875e-02, 1.99218750e-01, -2.39257812e-02,\n", 99 | " -2.37304688e-01, 1.93359375e-01, 7.32421875e-02, -2.87109375e-01,\n", 100 | " 1.25000000e-01, 8.44726562e-02, 1.30859375e-01, -2.19726562e-01,\n", 101 | " -1.61132812e-01, -2.63671875e-01, -5.46875000e-01, -2.96875000e-01,\n", 102 | " 3.44238281e-02, -2.87109375e-01, -1.93359375e-01, -1.61132812e-01,\n", 103 | " -3.84765625e-01, -2.14843750e-01, -6.22558594e-03, -1.27929688e-01,\n", 104 | " -1.00097656e-01, -6.21093750e-01, 3.78906250e-01, -4.58984375e-01,\n", 105 | " 1.44531250e-01, -9.13085938e-02, -3.08593750e-01, 2.23632812e-01,\n", 106 | " 7.86132812e-02, -2.16796875e-01, 8.78906250e-02, -1.66992188e-01,\n", 107 | " 1.14746094e-02, -2.53906250e-01, -6.25000000e-02, 6.04248047e-03,\n", 108 | " 1.56250000e-01, 4.37500000e-01, -2.23632812e-01, -2.32421875e-01,\n", 109 | " 2.75390625e-01, 2.39257812e-01, 4.49218750e-02, -7.51953125e-02,\n", 110 | " 5.74218750e-01, -2.61230469e-02, -1.21582031e-01, 2.44140625e-01,\n", 111 | " -3.37890625e-01, 8.59375000e-02, -7.71484375e-02, 4.85839844e-02,\n", 112 | " 1.43554688e-01, 4.25781250e-01, -4.29687500e-02, -1.08398438e-01,\n", 113 | " 1.19628906e-01, -1.91406250e-01, -2.12890625e-01, -2.87109375e-01,\n", 114 | " -1.14746094e-01, -2.04101562e-01, -2.06298828e-02, -2.53906250e-01,\n", 115 | " 8.25195312e-02, -3.97949219e-02, -1.57226562e-01, 1.34765625e-01,\n", 116 | " 2.08007812e-01, -1.78710938e-01, -2.00195312e-02, -8.34960938e-02,\n", 117 | " -1.20605469e-01, 4.29687500e-02, -1.94335938e-01, -1.32812500e-01,\n", 118 | " -2.17285156e-02, -2.35351562e-01, -3.63281250e-01, 1.51367188e-01,\n", 119 | " 9.32617188e-02, 1.63085938e-01, 1.02050781e-01, -4.27734375e-01,\n", 120 | " 2.83203125e-01, 2.74658203e-04, -3.20312500e-01, 1.68457031e-02,\n", 121 | " 4.06250000e-01, -5.24902344e-02, 7.91015625e-02, -1.41601562e-01,\n", 122 | " 5.27343750e-01, -1.26953125e-01, 4.74609375e-01, -6.64062500e-02,\n", 123 | " 3.41796875e-01, -1.78710938e-01, 3.69140625e-01, -2.05078125e-01,\n", 124 | " 5.82885742e-03, -1.84570312e-01, -8.88671875e-02, -1.81640625e-01,\n", 125 | " -4.80957031e-02, 4.39453125e-01, 2.12890625e-01, -3.07617188e-02,\n", 126 | " 9.32617188e-02, 2.40234375e-01, 2.39257812e-01, 2.51953125e-01,\n", 127 | " -1.98974609e-02, 1.24511719e-01, -4.73632812e-02, -2.13623047e-02,\n", 128 | " 3.12500000e-02, 3.05175781e-02, 2.79296875e-01, 9.08203125e-02,\n", 129 | " -2.02148438e-01, -2.19726562e-02, -2.63671875e-01, 8.78906250e-02,\n", 130 | " -1.07421875e-01, -2.49023438e-01, -1.22070312e-02, 1.73828125e-01,\n", 131 | " -9.91210938e-02, 7.27539062e-02, 2.59765625e-01, -4.60937500e-01,\n", 132 | " 3.59375000e-01, -2.25585938e-01, 1.87988281e-02, -2.19726562e-01,\n", 133 | " -2.08984375e-01, -1.51367188e-01, 8.64257812e-02, 1.11694336e-02,\n", 134 | " 6.93359375e-02, -2.99072266e-02, 1.43554688e-01, 1.89453125e-01,\n", 135 | " -1.32812500e-01, 4.72656250e-01, -1.40625000e-01, -2.52685547e-02,\n", 136 | " 1.91406250e-01, -2.63671875e-01, -1.39648438e-01, 1.09375000e-01,\n", 137 | " 1.97753906e-02, 2.49023438e-01, -1.42578125e-01, 4.15039062e-02],\n", 138 | " dtype=float32)" 139 | ] 140 | }, 141 | "metadata": {}, 142 | 
"execution_count": 2 143 | } 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "source": [ 149 | "model['cricket'].shape" 150 | ], 151 | "metadata": { 152 | "id": "_E8EMuD1L5-W", 153 | "outputId": "6e4a85ff-462b-45b4-b653-83d2c696da91", 154 | "colab": { 155 | "base_uri": "https://localhost:8080/" 156 | } 157 | }, 158 | "execution_count": 3, 159 | "outputs": [ 160 | { 161 | "output_type": "execute_result", 162 | "data": { 163 | "text/plain": [ 164 | "(300,)" 165 | ] 166 | }, 167 | "metadata": {}, 168 | "execution_count": 3 169 | } 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "source": [ 175 | "model.most_similar('oscars')" 176 | ], 177 | "metadata": { 178 | "colab": { 179 | "base_uri": "https://localhost:8080/" 180 | }, 181 | "id": "R0vZan3kH1uQ", 182 | "outputId": "7772dbd6-d827-43b4-d729-70f24b34960d" 183 | }, 184 | "execution_count": 4, 185 | "outputs": [ 186 | { 187 | "output_type": "execute_result", 188 | "data": { 189 | "text/plain": [ 190 | "[('oscar', 0.6721882224082947),\n", 191 | " ('emmy_awards', 0.5683821439743042),\n", 192 | " ('grammys', 0.5674149394035339),\n", 193 | " ('kristen_stewart', 0.5671219229698181),\n", 194 | " ('emmys', 0.5651429891586304),\n", 195 | " ('Oscars', 0.5626826286315918),\n", 196 | " ('sandra_bullock', 0.5551333427429199),\n", 197 | " ('mtv', 0.554526686668396),\n", 198 | " ('emmy', 0.5490261316299438),\n", 199 | " ('OSCARS', 0.5428372025489807)]" 200 | ] 201 | }, 202 | "metadata": {}, 203 | "execution_count": 4 204 | } 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "source": [ 210 | "model.most_similar('man')" 211 | ], 212 | "metadata": { 213 | "id": "xl0sIgrc8kuD", 214 | "colab": { 215 | "base_uri": "https://localhost:8080/" 216 | }, 217 | "outputId": "fc5a932f-757d-432a-9c3b-d0befb0d3b9b" 218 | }, 219 | "execution_count": 5, 220 | "outputs": [ 221 | { 222 | "output_type": "execute_result", 223 | "data": { 224 | "text/plain": [ 225 | "[('woman', 0.7664012908935547),\n", 226 | " ('boy', 0.6824870109558105),\n", 227 | " ('teenager', 0.6586930155754089),\n", 228 | " ('teenage_girl', 0.6147903800010681),\n", 229 | " ('girl', 0.5921714305877686),\n", 230 | " ('suspected_purse_snatcher', 0.5716364979743958),\n", 231 | " ('robber', 0.5585119128227234),\n", 232 | " ('Robbery_suspect', 0.5584409236907959),\n", 233 | " ('teen_ager', 0.5549196600914001),\n", 234 | " ('men', 0.5489763021469116)]" 235 | ] 236 | }, 237 | "metadata": {}, 238 | "execution_count": 5 239 | } 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "source": [ 245 | "model.most_similar('woman')" 246 | ], 247 | "metadata": { 248 | "colab": { 249 | "base_uri": "https://localhost:8080/" 250 | }, 251 | "id": "S8__ZCNCagH3", 252 | "outputId": "b9dac0d6-c6df-4025-d950-02a294389546" 253 | }, 254 | "execution_count": 6, 255 | "outputs": [ 256 | { 257 | "output_type": "execute_result", 258 | "data": { 259 | "text/plain": [ 260 | "[('man', 0.7664012908935547),\n", 261 | " ('girl', 0.7494640946388245),\n", 262 | " ('teenage_girl', 0.7336829900741577),\n", 263 | " ('teenager', 0.631708562374115),\n", 264 | " ('lady', 0.6288785934448242),\n", 265 | " ('teenaged_girl', 0.6141783595085144),\n", 266 | " ('mother', 0.607630729675293),\n", 267 | " ('policewoman', 0.6069462299346924),\n", 268 | " ('boy', 0.5975908041000366),\n", 269 | " ('Woman', 0.5770983695983887)]" 270 | ] 271 | }, 272 | "metadata": {}, 273 | "execution_count": 6 274 | } 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "source": [ 280 | "model.most_similar('tesla')" 281 | ], 282 | "metadata": { 
283 | "colab": { 284 | "base_uri": "https://localhost:8080/" 285 | }, 286 | "id": "wcuq_YPoaziq", 287 | "outputId": "21d34a0c-ce55-493b-ca13-165284d5c332" 288 | }, 289 | "execution_count": 7, 290 | "outputs": [ 291 | { 292 | "output_type": "execute_result", 293 | "data": { 294 | "text/plain": [ 295 | "[('gauss', 0.6623971462249756),\n", 296 | " ('FT_ICR', 0.5639051795005798),\n", 297 | " ('MeV', 0.5619181990623474),\n", 298 | " ('keV', 0.5605964064598083),\n", 299 | " ('superconducting_magnet', 0.5567352175712585),\n", 300 | " ('electron_volt', 0.5503562092781067),\n", 301 | " ('SQUIDs', 0.5393733382225037),\n", 302 | " ('nT', 0.5386143922805786),\n", 303 | " ('electronvolts', 0.5377054810523987),\n", 304 | " ('kelvin', 0.5367921590805054)]" 305 | ] 306 | }, 307 | "metadata": {}, 308 | "execution_count": 7 309 | } 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "source": [ 315 | "model.similarity('man','woman')" 316 | ], 317 | "metadata": { 318 | "colab": { 319 | "base_uri": "https://localhost:8080/" 320 | }, 321 | "id": "LlUXi8KKa33_", 322 | "outputId": "2f5663f2-f101-4cc6-8fbe-16cb8b028baa" 323 | }, 324 | "execution_count": 8, 325 | "outputs": [ 326 | { 327 | "output_type": "execute_result", 328 | "data": { 329 | "text/plain": [ 330 | "0.76640123" 331 | ] 332 | }, 333 | "metadata": {}, 334 | "execution_count": 8 335 | } 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "source": [ 341 | "model.similarity('man','PHP')" 342 | ], 343 | "metadata": { 344 | "colab": { 345 | "base_uri": "https://localhost:8080/" 346 | }, 347 | "id": "pZk8zVzbbE9L", 348 | "outputId": "95d316b1-2c86-4696-eb15-958e947acc3e" 349 | }, 350 | "execution_count": 9, 351 | "outputs": [ 352 | { 353 | "output_type": "execute_result", 354 | "data": { 355 | "text/plain": [ 356 | "-0.032995153" 357 | ] 358 | }, 359 | "metadata": {}, 360 | "execution_count": 9 361 | } 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "source": [ 367 | "model.doesnt_match(['PHP','java','monkey'])" 368 | ], 369 | "metadata": { 370 | "colab": { 371 | "base_uri": "https://localhost:8080/", 372 | "height": 90 373 | }, 374 | "id": "uApAHCO5db2G", 375 | "outputId": "093ce5f4-43cc-48d3-e948-091a5ae919bf" 376 | }, 377 | "execution_count": 10, 378 | "outputs": [ 379 | { 380 | "output_type": "stream", 381 | "name": "stderr", 382 | "text": [ 383 | "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py:895: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. 
Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n", 384 | " vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n" 385 | ] 386 | }, 387 | { 388 | "output_type": "execute_result", 389 | "data": { 390 | "text/plain": [ 391 | "'monkey'" 392 | ], 393 | "application/vnd.google.colaboratory.intrinsic+json": { 394 | "type": "string" 395 | } 396 | }, 397 | "metadata": {}, 398 | "execution_count": 10 399 | } 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "source": [ 405 | "vec = model['king'] - model['man'] + model['woman']\n", 406 | "model.most_similar([vec])" 407 | ], 408 | "metadata": { 409 | "colab": { 410 | "base_uri": "https://localhost:8080/" 411 | }, 412 | "id": "bKZ9ERK1bcGY", 413 | "outputId": "8ee15906-906d-4fd2-c474-8d3f3a37d296" 414 | }, 415 | "execution_count": 11, 416 | "outputs": [ 417 | { 418 | "output_type": "execute_result", 419 | "data": { 420 | "text/plain": [ 421 | "[('king', 0.8449392318725586),\n", 422 | " ('queen', 0.7300517559051514),\n", 423 | " ('monarch', 0.6454660892486572),\n", 424 | " ('princess', 0.6156251430511475),\n", 425 | " ('crown_prince', 0.5818676948547363),\n", 426 | " ('prince', 0.5777117609977722),\n", 427 | " ('kings', 0.5613663792610168),\n", 428 | " ('sultan', 0.5376776456832886),\n", 429 | " ('Queen_Consort', 0.5344247817993164),\n", 430 | " ('queens', 0.5289887189865112)]" 431 | ] 432 | }, 433 | "metadata": {}, 434 | "execution_count": 11 435 | } 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "source": [ 441 | "vec = model['INR'] - model ['India'] + model['England']\n", 442 | "model.most_similar([vec])" 443 | ], 444 | "metadata": { 445 | "colab": { 446 | "base_uri": "https://localhost:8080/" 447 | }, 448 | "id": "_3i6mgOrcCln", 449 | "outputId": "350fad5f-5d14-4071-8be8-40b5d3e96da5" 450 | }, 451 | "execution_count": 12, 452 | "outputs": [ 453 | { 454 | "output_type": "execute_result", 455 | "data": { 456 | "text/plain": [ 457 | "[('INR', 0.6442340612411499),\n", 458 | " ('GBP', 0.5040826201438904),\n", 459 | " ('£_##.###m', 0.4540838599205017),\n", 460 | " ('England', 0.44649264216423035),\n", 461 | " ('£', 0.43340998888015747),\n", 462 | " ('Â_£', 0.430719792842865),\n", 463 | " ('stg###', 0.4299262464046478),\n", 464 | " ('£_#.##m', 0.42561304569244385),\n", 465 | " ('Pounds_Sterling', 0.42512616515159607),\n", 466 | " ('GBP##', 0.42464494705200195)]" 467 | ] 468 | }, 469 | "metadata": {}, 470 | "execution_count": 12 471 | } 472 | ] 473 | } 474 | ] 475 | } -------------------------------------------------------------------------------- /Pre Processing/Word2Vec/README.md: -------------------------------------------------------------------------------- 1 | # Word2Vec 2 | -------------------------------------------------------------------------------- /Pre Processing/number_to_word.py: -------------------------------------------------------------------------------- 1 | # import the library 2 | import inflect 3 | q = inflect.engine() 4 | 5 | # convert number into text 6 | def convert_num(text): 7 | temp_string = text.split() # split strings into list of texts 8 | new_str = [] # initialise empty list 9 | 10 | for word in temp_string: 11 | '''if text is a digit, convert the digit 12 | to numbers and append into the new_str list''' 13 | if word.isdigit(): 14 | temp = q.number_to_words(word) 15 | new_str.append(temp) 16 | else: 17 | new_str.append(word) # append the texts as it is 18 | 19 | temp_str = ' '.join(new_str) # 
join the texts of new_str to form a string 20 | return temp_str 21 | 22 | input_str = '34 45' # OUTPUT:- 'thirty-four forty-five' 23 | convert_num(input_str) 24 | -------------------------------------------------------------------------------- /Pre Processing/tokenizer/README.md: -------------------------------------------------------------------------------- 1 | # tokenizer 2 | 3 | ```ruby 4 | import nltk 5 | from nltk.tokenize import sent_tokenize, word_tokenize 6 | nltk.download('punkt') 7 | 8 | sentence = 'Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string.This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation.' 9 | 10 | sent = sent_tokenize(sentence) 11 | 12 | for i in sent: 13 | print(word_tokenize(i)) 14 | 15 | 16 | [word_tokenize(t) for t in sent_tokenize(sentence)] 17 | 18 | ``` 19 | **LINK** -> https://www.nltk.org/api/nltk.tokenize.html 20 | -------------------------------------------------------------------------------- /Pre Processing/tokenizer/Spacy.py: -------------------------------------------------------------------------------- 1 | import spacy 2 | nlp = spacy.load('en_core_web_sm') 3 | 4 | sent1 = 'I have a Ph.D in A.I' 5 | sent2 = "We're here to help! mail us at nks@gmail.com" 6 | sent3 = 'A 5km ride cost $10.50' 7 | 8 | doc1 = nlp(sent1) 9 | doc2 = nlp(sent2) 10 | doc3 = nlp(sent3) 11 | 12 | for token in doc1: 13 | print(token) 14 | 15 | for token in doc2: 16 | print(token) 17 | 18 | for token in doc3: 19 | print(token) 20 | -------------------------------------------------------------------------------- /Pre Processing/tokenizer/nltk_tokenize.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "nltk.tokenize.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "code", 20 | "source": [ 21 | "import nltk\n", 22 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 23 | "nltk.download('punkt')" 24 | ], 25 | "metadata": { 26 | "colab": { 27 | "base_uri": "https://localhost:8080/" 28 | }, 29 | "id": "igiula48FqPM", 30 | "outputId": "310074c0-be0b-4cf0-8cbb-c8f763df4c3b" 31 | }, 32 | "execution_count": 1, 33 | "outputs": [ 34 | { 35 | "output_type": "stream", 36 | "name": "stdout", 37 | "text": [ 38 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 39 | "[nltk_data] Package punkt is already up-to-date!\n" 40 | ] 41 | }, 42 | { 43 | "output_type": "execute_result", 44 | "data": { 45 | "text/plain": [ 46 | "True" 47 | ] 48 | }, 49 | "metadata": {}, 50 | "execution_count": 1 51 | } 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "source": [ 57 | "sentence = 'Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string.This particular tokenizer requires the Punkt sentence tokenization models to be installed. 
NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation.'\n", 58 | "\n", 59 | "sent = sent_tokenize(sentence)\n", 60 | "sent" 61 | ], 62 | "metadata": { 63 | "colab": { 64 | "base_uri": "https://localhost:8080/" 65 | }, 66 | "id": "YOj-4zmtFvQR", 67 | "outputId": "2347ac6a-7620-43ba-e912-98de95341ef7" 68 | }, 69 | "execution_count": 2, 70 | "outputs": [ 71 | { 72 | "output_type": "execute_result", 73 | "data": { 74 | "text/plain": [ 75 | "['Tokenizers divide strings into lists of substrings.',\n", 76 | " 'For example, tokenizers can be used to find the words and punctuation in a string.This particular tokenizer requires the Punkt sentence tokenization models to be installed.',\n", 77 | " 'NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation.']" 78 | ] 79 | }, 80 | "metadata": {}, 81 | "execution_count": 2 82 | } 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "source": [ 88 | "len(sent)" 89 | ], 90 | "metadata": { 91 | "colab": { 92 | "base_uri": "https://localhost:8080/" 93 | }, 94 | "id": "jr9Q1_hGFx1g", 95 | "outputId": "c6f424b0-7f4b-4d42-acb7-8c5654a56462" 96 | }, 97 | "execution_count": 3, 98 | "outputs": [ 99 | { 100 | "output_type": "execute_result", 101 | "data": { 102 | "text/plain": [ 103 | "3" 104 | ] 105 | }, 106 | "metadata": {}, 107 | "execution_count": 3 108 | } 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "source": [ 114 | "for i in sent:\n", 115 | " print(word_tokenize(i))" 116 | ], 117 | "metadata": { 118 | "colab": { 119 | "base_uri": "https://localhost:8080/" 120 | }, 121 | "id": "qBIGwjefF1KF", 122 | "outputId": "e40a6380-298d-4a89-bf56-32a51a358d5d" 123 | }, 124 | "execution_count": 4, 125 | "outputs": [ 126 | { 127 | "output_type": "stream", 128 | "name": "stdout", 129 | "text": [ 130 | "['Tokenizers', 'divide', 'strings', 'into', 'lists', 'of', 'substrings', '.']\n", 131 | "['For', 'example', ',', 'tokenizers', 'can', 'be', 'used', 'to', 'find', 'the', 'words', 'and', 'punctuation', 'in', 'a', 'string.This', 'particular', 'tokenizer', 'requires', 'the', 'Punkt', 'sentence', 'tokenization', 'models', 'to', 'be', 'installed', '.']\n", 132 | "['NLTK', 'also', 'provides', 'a', 'simpler', ',', 'regular-expression', 'based', 'tokenizer', ',', 'which', 'splits', 'text', 'on', 'whitespace', 'and', 'punctuation', '.']\n" 133 | ] 134 | } 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "source": [ 140 | "word = [word_tokenize(t) for t in sent_tokenize(sentence)]\n", 141 | "word" 142 | ], 143 | "metadata": { 144 | "colab": { 145 | "base_uri": "https://localhost:8080/" 146 | }, 147 | "id": "rnU6Xu_jF1O9", 148 | "outputId": "38e1ce95-b7bb-443d-e6e7-4189a842e49a" 149 | }, 150 | "execution_count": 5, 151 | "outputs": [ 152 | { 153 | "output_type": "execute_result", 154 | "data": { 155 | "text/plain": [ 156 | "[['Tokenizers', 'divide', 'strings', 'into', 'lists', 'of', 'substrings', '.'],\n", 157 | " ['For',\n", 158 | " 'example',\n", 159 | " ',',\n", 160 | " 'tokenizers',\n", 161 | " 'can',\n", 162 | " 'be',\n", 163 | " 'used',\n", 164 | " 'to',\n", 165 | " 'find',\n", 166 | " 'the',\n", 167 | " 'words',\n", 168 | " 'and',\n", 169 | " 'punctuation',\n", 170 | " 'in',\n", 171 | " 'a',\n", 172 | " 'string.This',\n", 173 | " 'particular',\n", 174 | " 'tokenizer',\n", 175 | " 'requires',\n", 176 | " 'the',\n", 177 | " 'Punkt',\n", 178 | " 'sentence',\n", 179 | " 'tokenization',\n", 180 | " 'models',\n", 181 | " 'to',\n", 182 | " 'be',\n", 
183 | " 'installed',\n", 184 | " '.'],\n", 185 | " ['NLTK',\n", 186 | " 'also',\n", 187 | " 'provides',\n", 188 | " 'a',\n", 189 | " 'simpler',\n", 190 | " ',',\n", 191 | " 'regular-expression',\n", 192 | " 'based',\n", 193 | " 'tokenizer',\n", 194 | " ',',\n", 195 | " 'which',\n", 196 | " 'splits',\n", 197 | " 'text',\n", 198 | " 'on',\n", 199 | " 'whitespace',\n", 200 | " 'and',\n", 201 | " 'punctuation',\n", 202 | " '.']]" 203 | ] 204 | }, 205 | "metadata": {}, 206 | "execution_count": 5 207 | } 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "source": [ 213 | "len(word[0])" 214 | ], 215 | "metadata": { 216 | "colab": { 217 | "base_uri": "https://localhost:8080/" 218 | }, 219 | "id": "n0KNqtYHFqS4", 220 | "outputId": "a9bf2f72-cd92-4e33-a12b-d24063ef57c6" 221 | }, 222 | "execution_count": 6, 223 | "outputs": [ 224 | { 225 | "output_type": "execute_result", 226 | "data": { 227 | "text/plain": [ 228 | "8" 229 | ] 230 | }, 231 | "metadata": {}, 232 | "execution_count": 6 233 | } 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "source": [ 239 | "len(word[1])" 240 | ], 241 | "metadata": { 242 | "colab": { 243 | "base_uri": "https://localhost:8080/" 244 | }, 245 | "id": "zrCt8QDqFqU6", 246 | "outputId": "dce2fb84-b688-4be0-84fc-7625088050bb" 247 | }, 248 | "execution_count": 7, 249 | "outputs": [ 250 | { 251 | "output_type": "execute_result", 252 | "data": { 253 | "text/plain": [ 254 | "28" 255 | ] 256 | }, 257 | "metadata": {}, 258 | "execution_count": 7 259 | } 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "source": [ 265 | "len(word[2])" 266 | ], 267 | "metadata": { 268 | "colab": { 269 | "base_uri": "https://localhost:8080/" 270 | }, 271 | "id": "xfwavwxjFqW5", 272 | "outputId": "046a8c53-1355-4521-c59f-eb0dae55c62a" 273 | }, 274 | "execution_count": 8, 275 | "outputs": [ 276 | { 277 | "output_type": "execute_result", 278 | "data": { 279 | "text/plain": [ 280 | "18" 281 | ] 282 | }, 283 | "metadata": {}, 284 | "execution_count": 8 285 | } 286 | ] 287 | } 288 | ] 289 | } -------------------------------------------------------------------------------- /Projects/dataset/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing 2 | 3 | ## Academia 4 | 5 | | **Academia** | **Type** | 6 | | ----- | -----| 7 | | 1. [**LSTM With Python**](https://drive.google.com/file/d/16zxePmb3TWIxIevh2gkeaTbO-ZEVRuKi/view?usp=sharing) | Book | 8 | | 2. [**Attention Is All You Need**](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/ACADEMIA/PAPERS/Attention%20Is%20All%20You%20Need.pdf) | Paper | 9 | | 3. [**GPT 1**](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/ACADEMIA/PAPERS/Improving%20Language%20Understanding%20by%20Generative%20Pre-Training%20(GPT%201).pdf) | Paper | 10 | 11 | -------------------------------------------------------------- 12 | 13 | ## Certifications 14 | 15 | | **Certification** | **Platform** | 16 | | ----- | -----| 17 | | 1. [**DeepLearning.AI Natural Language Processing Tensorflow**](https://www.coursera.org/account/accomplishments/certificate/RXGKSDTK9VCW) | [**DeepLearning.ai**](https://www.deeplearning.ai/) | 18 | | 2. 
[**Natural Language Processing with Classification and Vector Spaces**]() | [**DeepLearning.ai**](https://www.deeplearning.ai/) | 19 | 20 | -------------------------------------------------------------- 21 | 22 | ## Days of Natural Language Processing 23 | 24 | | **Day** | **Topic** | 25 | | ----- | -----| 26 | | **Day 1** | [**Word2vec Google News 300**](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/Notebooks/Word2vec_Google_News_300.ipynb) | 27 | -------------------------------------------------------------------------------- /Transfer Learning/Projects/Quora Insincere Questions Classification/README.md: -------------------------------------------------------------------------------- 1 | # **Quora Insincere Questions Classification** 2 | ## Detect toxic content to improve online conversations 3 | 4 | An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere: 5 | 6 | *Has a non-neutral tone 7 | - Has an exaggerated tone to underscore a point about a group of people 8 | - Is rhetorical and meant to imply a statement about a group of people 9 | * Is disparaging or inflammatory 10 | - Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype 11 | - Makes disparaging attacks/insults against a specific person or group of people 12 | - Based on an outlandish premise about a group of people 13 | - Disparages against a characteristic that is not fixable and not measurable 14 | * Isn't grounded in reality 15 | - Based on false information, or contains absurd assumptions 16 | * Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers 17 | 18 | 19 | The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect. 20 | 21 | Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of the combination of sampling procedures and sanitization measures that have been applied to the final dataset. 
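Before training on this data, a quick check of the label distribution and a stratified train/validation split are common first steps. A minimal sketch; the file name `train.csv` and the columns `question_text` / `target` follow the usual Kaggle release of this dataset and are assumptions, not files in this repository:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed Kaggle-style file and column names (not paths from this repository).
df = pd.read_csv('train.csv')

# Fraction of sincere (target = 0) vs. insincere (target = 1) questions.
print(df['target'].value_counts(normalize=True))

# Stratified split keeps the label distribution consistent across the split.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['question_text'], df['target'],
    test_size=0.1, random_state=42, stratify=df['target'],
)
```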
22 | 23 | 24 | 25 | 26 | ![alt text](https://github.com/vaasu2002/Natural-Language-Processing/blob/main/Transfer%20Learning/Projects/Quora%20Insincere%20Questions%20Classification/images/final_accuracy_matrics.png) 27 | -------------------------------------------------------------------------------- /Transfer Learning/Projects/Quora Insincere Questions Classification/images/final_accuracy_matrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vaasu2002/Natural-Language-Processing/309514bb40042c5c6bdffacfb882164d9b9bac03/Transfer Learning/Projects/Quora Insincere Questions Classification/images/final_accuracy_matrics.png -------------------------------------------------------------------------------- /Transfer Learning/Projects/README.md: -------------------------------------------------------------------------------- 1 | # PROJECTS 2 | 3 | 4 | 1) Quora Insincere Questions Classification -------------------------------------------------------------------------------- /Transfer Learning/README.md: -------------------------------------------------------------------------------- 1 | TensorFlow Hub provides a number of [modules](https://tfhub.dev/s?module-type=text-embedding&tf-version=tf2&q=tf2) to convert sentences into embeddings, such as Universal Sentence Encoders, NNLM, BERT and Wiki-words. 2 | 3 | Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this project, we will demonstrate this by training with several different TF-Hub modules. 4 | 5 | 6 | 7 | --------------------------------------------------------------------------------
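The transfer-learning pattern described in the Transfer Learning README above can be sketched as a TF-Hub text-embedding module feeding a small Keras classifier. The module URL, layer sizes, and toy data below are illustrative choices, not the settings used in the project notebooks:

```python
import tensorflow as tf
import tensorflow_hub as hub

# One commonly used TF2 text-embedding module; Universal Sentence Encoder or
# BERT modules from tfhub.dev follow the same KerasLayer pattern.
EMBEDDING_URL = 'https://tfhub.dev/google/nnlm-en-dim50/2'

def build_model(trainable_embedding=True):
    # hub.KerasLayer maps a batch of raw strings to fixed-size embeddings,
    # so no manual tokenization or padding is required.
    embed = hub.KerasLayer(
        EMBEDDING_URL,
        input_shape=[],                 # each example is a single string
        dtype=tf.string,
        trainable=trainable_embedding,  # optionally fine-tune the embeddings
    )
    model = tf.keras.Sequential([
        embed,
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # sincere vs. insincere
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Toy usage; a real run would train on the Quora question texts and targets.
model = build_model()
texts = tf.constant(['Why is the sky blue?', 'Why is group X ruining everything?'])
labels = tf.constant([0.0, 1.0])
model.fit(texts, labels, epochs=1, verbose=0)
```

Swapping in a different module only requires changing `EMBEDDING_URL`, which is what makes it straightforward to compare several TF-Hub embeddings on the same task.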