├── Sentiment-Analysis-with-BERT-Transformers
│   └── README.md
├── Harry-Potter-Novel-Text-Generation-using-GRU
│   ├── README.md
│   └── HPGEN.ipynb
├── Fake-Disaster-Tweet-Detection-Spacy-Bert-SVM
│   ├── README.md
│   └── bert-spacy-svm.ipynb
├── French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras
│   ├── README.md
│   └── s2s
│       ├── saved_model.pb
│       └── variables
│           ├── variables.index
│           └── variables.data-00000-of-00001
├── LICENSE
├── Review-Classification-Tensorflow
│   ├── TF NLP Reviews.py
│   ├── review-tensorflow.ipynb
│   └── review-tensorflow (1).ipynb
├── Readme.md
├── OCR-Captcha-Cracker-using-CNN-LSTM-CTC
│   └── README.md
├── NLP-Reviews-Classification
│   └── nlp-reviews-classification.ipynb
├── Text-Summarization-using-Transformers-T5
│   ├── README.md
│   └── Tuning Transformer for Summary Generation T5.ipynb
├── Fake-News-Detection
│   └── fake-news-detection.ipynb
└── NLP-LSTM-Alexa-Reviews
    └── alexa-reviews-lstm.ipynb

/Sentiment-Analysis-with-BERT-Transformers/README.md:
--------------------------------------------------------------------------------
 1 | # Sentiment-Analysis-with-BERT-Transformers
--------------------------------------------------------------------------------
/Harry-Potter-Novel-Text-Generation-using-GRU/README.md:
--------------------------------------------------------------------------------
 1 | # Harry-Potter-Novel-Text-Generation-using-GRU
--------------------------------------------------------------------------------
/Fake-Disaster-Tweet-Detection-Spacy-Bert-SVM/README.md:
--------------------------------------------------------------------------------
 1 | # Fake-Disaster-Tweet-Detection-Spacy-Bert-SVM
 2 | 
--------------------------------------------------------------------------------
/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/README.md:
--------------------------------------------------------------------------------
 1 | # Char-Level-Seq2Seq-LSTM-Tensorflow-Keras
--------------------------------------------------------------------------------
/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/saved_model.pb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rohanmistry231/NLP-Projects-02/master/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/saved_model.pb
--------------------------------------------------------------------------------
/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/variables/variables.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rohanmistry231/NLP-Projects-02/master/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/variables/variables.index
--------------------------------------------------------------------------------
/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/variables/variables.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rohanmistry231/NLP-Projects-02/master/French-Translation-Char-Level-Seq2Seq-LSTM-Tensorflow-Keras/s2s/variables/variables.data-00000-of-00001
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Harshit Singh
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/Review-Classification-Tensorflow/TF NLP Reviews.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import tensorflow as tf
 3 | 
 4 | 
 5 | import tensorflow_datasets as tfds
 6 | imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)  # IMDB movie reviews as (text, label) pairs
 7 | 
 8 | 
 9 | import numpy as np
10 | 
11 | train_data, test_data = imdb['train'], imdb['test']
12 | 
13 | training_sentences = []
14 | training_labels = []
15 | 
16 | testing_sentences = []
17 | testing_labels = []
18 | 
19 | 
20 | for s,l in train_data:  # materialize the tf.data datasets into plain Python lists
21 |     training_sentences.append(str(s.numpy()))
22 |     training_labels.append(l.numpy())
23 | 
24 | for s,l in test_data:
25 |     testing_sentences.append(str(s.numpy()))
26 |     testing_labels.append(l.numpy())
27 | 
28 | training_labels_final = np.array(training_labels)
29 | testing_labels_final = np.array(testing_labels)
30 | 
31 | 
32 | 
33 | training_sentences[0]
34 | 
35 | 
36 | training_labels[0]
37 | 
38 | 
39 | vocab_size = 10000     # keep the 10,000 most frequent words
40 | embedding_dim = 16     # dimensionality of the learned word vectors
41 | max_length = 120       # pad/truncate every review to 120 tokens
42 | trunc_type='post'
43 | oov_tok = "<OOV>"      # out-of-vocabulary placeholder (assumed "<OOV>"; the literal was lost in the dump)
44 | 
45 | 
46 | from tensorflow.keras.preprocessing.text import Tokenizer
47 | from tensorflow.keras.preprocessing.sequence import pad_sequences
48 | 
49 | tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
50 | tokenizer.fit_on_texts(training_sentences)
51 | word_index = tokenizer.word_index
52 | sequences = tokenizer.texts_to_sequences(training_sentences)
53 | padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)
54 | 
55 | testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
56 | testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
57 | 
58 | 
59 | 
60 | model = tf.keras.Sequential([  # embedding -> flatten -> small dense classifier
61 |     tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
62 |     tf.keras.layers.Flatten(),
63 |     tf.keras.layers.Dense(6, activation='relu'),
64 |     tf.keras.layers.Dense(1, activation='sigmoid')
65 | ])
66 | model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
67 | model.summary()
68 | 
69 | 
70 | num_epochs = 10
71 | history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
72 | 
73 | 
74 | score = model.evaluate(testing_padded, testing_labels_final)
75 | 
76 | 
77 | 
78 | print('Test accuracy:', score[1])
--------------------------------------------------------------------------------
/Readme.md:
--------------------------------------------------------------------------------
 1 | ## Natural Language Processing
 2 | 
 3 | The field of study that focuses on the interactions between human language and computers is called Natural Language Processing,
 4 | or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.
 5 | 
 6 | Natural Language Processing, usually shortened to NLP, is a branch of artificial intelligence that deals with the interaction
 7 | between computers and humans using natural language.
 8 | The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.
 9 | Most NLP techniques rely on machine learning to derive meaning from human languages.
10 | 
11 | In practice, NLP means computers can analyze, understand, and derive meaning from human language in a smart and useful way.
12 | By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition,
13 | relationship extraction, sentiment analysis, speech recognition, and topic segmentation.
14 | 
15 | NLP algorithms have a variety of uses. At their core, they allow developers to create software that understands human language. Due to the complicated nature of human language,
16 | NLP can be difficult to learn and implement correctly. However, with the knowledge gained from this overview, you will be better equipped to use NLP successfully.
17 | Some of the projects developers can use NLP algorithms for are:
18 | 
19 | * Summarize blocks of text using Summarizer to extract the most important and central ideas while ignoring irrelevant information.
20 | * Create a chatbot using Parsey McParseface, a language parsing deep learning model made by Google that uses Part-of-Speech tagging.
21 | * Automatically generate keyword tags from content using AutoTag, which leverages LDA, a technique that discovers topics contained within a body of text.
22 | * Identify the type of entity extracted, such as a person, place, or organization, using Named Entity Recognition (see the short spaCy sketch below).
23 | * Use Sentiment Analysis to identify the sentiment of a string of text, from very negative to neutral to very positive.
24 | 
25 | 
26 | 
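Two of the building blocks mentioned above, Part-of-Speech tagging and Named Entity Recognition, can be tried in a few lines with spaCy (which the Fake-Disaster-Tweet project in this repository also uses). The snippet below is an illustrative sketch rather than part of any project here; it assumes spaCy and its small English model are installed (`pip install spacy`, then `python -m spacy download en_core_web_sm`).

```python
# Illustrative sketch only: token-level POS tags and document-level entities with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline
doc = nlp("Google open-sourced Parsey McParseface in 2016 to help developers parse English text.")

# Part-of-Speech tagging: one grammatical category per token.
for token in doc:
    print(token.text, token.pos_)

# Named Entity Recognition: spans labelled PERSON, ORG, DATE, and so on.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

The projects in this repository go further and train task-specific models (BERT, GRUs, LSTMs, SVMs, T5) for sentiment analysis, text generation, translation, and summarization on top of such preprocessed text.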

27 | For-the-Badge-Python 28 | 29 |

30 | -------------------------------------------------------------------------------- /OCR-Captcha-Cracker-using-CNN-LSTM-CTC/README.md: -------------------------------------------------------------------------------- 1 | # OCR-Captcha-Cracker-using-CNN-LSTM-CTC 2 | 3 | This is a simple OCR model built with the Functional API. Apart from combining CNN and RNN, it also illustrates how you can instantiate a new layer and use it as an "Endpoint layer" for implementing CTC loss. 4 | 5 | he NN for such use-cases usually consists of convolutional layers (CNN) to extract a sequence of features and recurrent layers (RNN) to propagate information through this sequence. It outputs character-scores for each sequence-element, which simply is represented by a matrix. 6 | This metric is CTC or Connectionist Temporal Classification Now, there are two things we want to do with this matrix: 7 | 8 | * Train: calculate the loss value to train the NN 9 | * Infer: decode the matrix to get the text contained in the input image 10 | 11 | 12 | 13 | We could train a NN to output a character-score for each horizontal position. However, there are two problems with this naive solution: 14 | 15 | * it is very time-consuming (and boring) to annotate a data-set on character-level. 16 | 17 | * we only get character-scores and therefore need some further processing to get the final text from it. A single character can span multiple horizontal positions, e.g. we could get “ ttooo” because the “o” is a wide character. We have to remove all duplicate “t”s and “o”s. But what if the recognized text would have been “too”? Then removing all duplicate “o”s gets us the wrong result 18 | 19 | CTC solves both problems for us: 20 | 21 | * we only have to tell the CTC loss function the text that occurs in the image. Therefore we ignore both the position and width of the characters in the image. 22 | 23 | * no further processing of the recognized text is needed. 24 | 25 | 26 | The NN-training will be guided by the CTC loss function. We only feed the output matrix of the NN and the corresponding ground-truth (GT) text to the CTC loss function. Instead, it tries all possible alignments of the GT text in the image and takes the sum of all scores. This way, the score of a GT text is high if the sum over the alignment-scores has a high value. 27 | 28 | **Loss calculation** 29 | 30 | We need to calculate the loss value for the training samples (pairs of images and GT texts) to train the NN. 
You already know that the NN outputs a matrix containing a score for each character at each time-step 31 | 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /Review-Classification-Tensorflow/review-tensorflow.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"5687da5f-cbf2-4da5-9948-485a0c7853c6","_cell_guid":"5ca7927c-6eaa-4eac-8d8a-8a7639747b01","trusted":true},"cell_type":"code","source":"\nimport tensorflow as tf\n\n\nimport tensorflow_datasets as tfds\nimdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True)\n\n\nimport numpy as np\n\ntrain_data, test_data = imdb['train'], imdb['test']\n\ntraining_sentences = []\ntraining_labels = []\n\ntesting_sentences = []\ntesting_labels = []\n\n\nfor s,l in train_data:\n training_sentences.append(str(s.numpy()))\n training_labels.append(l.numpy())\n \nfor s,l in test_data:\n testing_sentences.append(str(s.numpy()))\n testing_labels.append(l.numpy())\n \ntraining_labels_final = np.array(training_labels)\ntesting_labels_final = np.array(testing_labels)\n\n\n\ntraining_sentences[0]\n\n\ntraining_labels[0]\n\n\nvocab_size = 10000\nembedding_dim = 16\nmax_length = 120\ntrunc_type='post'\noov_tok = \"\"\n\n\nfrom tensorflow.keras.preprocessing.text import Tokenizer\nfrom tensorflow.keras.preprocessing.sequence import pad_sequences\n\ntokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\ntokenizer.fit_on_texts(training_sentences)\nword_index = tokenizer.word_index\nsequences = tokenizer.texts_to_sequences(training_sentences)\npadded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)\n\ntesting_sequences = tokenizer.texts_to_sequences(testing_sentences)\ntesting_padded = pad_sequences(testing_sequences,maxlen=max_length)\n\n\n\nmodel = tf.keras.Sequential([\n tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n tf.keras.layers.Flatten(),\n tf.keras.layers.Dense(6, activation='relu'),\n tf.keras.layers.Dense(1, activation='sigmoid')\n])\nmodel.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\nmodel.summary()\n\n\nnum_epochs = 10\nhistory = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))\n\n\nscore = model.evaluate(testing_padded, testing_labels_final)\n\n\n\nprint('Test accuracy:', score[1])","execution_count":13,"outputs":[{"output_type":"stream","text":"Model: \"sequential_1\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_1 (Embedding) (None, 120, 16) 160000 \n_________________________________________________________________\nflatten_1 (Flatten) (None, 1920) 0 \n_________________________________________________________________\ndense_2 (Dense) (None, 6) 11526 \n_________________________________________________________________\ndense_3 (Dense) (None, 1) 7 \n=================================================================\nTotal params: 171,533\nTrainable params: 171,533\nNon-trainable params: 0\n_________________________________________________________________\nTrain on 25000 samples, validate on 25000 samples\nEpoch 1/10\n25000/25000 [==============================] - 7s 284us/sample - loss: 0.4972 - accuracy: 0.7448 - val_loss: 0.3474 - val_accuracy: 0.8472\nEpoch 2/10\n25000/25000 [==============================] - 6s 
258us/sample - loss: 0.2415 - accuracy: 0.9074 - val_loss: 0.3700 - val_accuracy: 0.8402\nEpoch 3/10\n25000/25000 [==============================] - 6s 246us/sample - loss: 0.0929 - accuracy: 0.9755 - val_loss: 0.4553 - val_accuracy: 0.8235\nEpoch 4/10\n25000/25000 [==============================] - 6s 249us/sample - loss: 0.0245 - accuracy: 0.9967 - val_loss: 0.5405 - val_accuracy: 0.8231\nEpoch 5/10\n25000/25000 [==============================] - 6s 245us/sample - loss: 0.0066 - accuracy: 0.9994 - val_loss: 0.5952 - val_accuracy: 0.8259\nEpoch 6/10\n25000/25000 [==============================] - 7s 276us/sample - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.6491 - val_accuracy: 0.8258\nEpoch 7/10\n25000/25000 [==============================] - 7s 263us/sample - loss: 8.7637e-04 - accuracy: 1.0000 - val_loss: 0.6916 - val_accuracy: 0.8277\nEpoch 8/10\n25000/25000 [==============================] - 6s 255us/sample - loss: 4.9061e-04 - accuracy: 1.0000 - val_loss: 0.7317 - val_accuracy: 0.8282\nEpoch 9/10\n25000/25000 [==============================] - 7s 268us/sample - loss: 2.7768e-04 - accuracy: 1.0000 - val_loss: 0.7689 - val_accuracy: 0.8284\nEpoch 10/10\n25000/25000 [==============================] - 6s 259us/sample - loss: 1.6386e-04 - accuracy: 1.0000 - val_loss: 0.8056 - val_accuracy: 0.8286\n25000/25000 [==============================] - 2s 75us/sample - loss: 0.8056 - accuracy: 0.8286\nTest accuracy: 0.8286\n","name":"stdout"}]}],"metadata":{"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Review-Classification-Tensorflow/review-tensorflow (1).ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"\nimport tensorflow as tf\n\n\nimport tensorflow_datasets as tfds\nimdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True)\n","execution_count":14,"outputs":[]},{"metadata":{"_uuid":"5687da5f-cbf2-4da5-9948-485a0c7853c6","_cell_guid":"5ca7927c-6eaa-4eac-8d8a-8a7639747b01","trusted":true},"cell_type":"code","source":"import numpy as np\n\ntrain_data, test_data = imdb['train'], imdb['test']\n\ntraining_sentences = []\ntraining_labels = []\n\ntesting_sentences = []\ntesting_labels = []\n\n\nfor s,l in train_data:\n training_sentences.append(str(s.numpy()))\n training_labels.append(l.numpy())\n \nfor s,l in test_data:\n testing_sentences.append(str(s.numpy()))\n testing_labels.append(l.numpy())\n \ntraining_labels_final = np.array(training_labels)\ntesting_labels_final = np.array(testing_labels)\n\n\n\ntraining_sentences[0]\n\n\ntraining_labels[0]\n\n\nvocab_size = 10000\nembedding_dim = 16\nmax_length = 120\ntrunc_type='post'\noov_tok = \"\"\n\n\nfrom tensorflow.keras.preprocessing.text import Tokenizer\nfrom tensorflow.keras.preprocessing.sequence import pad_sequences\n\ntokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\ntokenizer.fit_on_texts(training_sentences)\nword_index = tokenizer.word_index\nsequences = tokenizer.texts_to_sequences(training_sentences)\npadded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)\n\ntesting_sequences = 
tokenizer.texts_to_sequences(testing_sentences)\ntesting_padded = pad_sequences(testing_sequences,maxlen=max_length)","execution_count":15,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model = tf.keras.Sequential([\n tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n tf.keras.layers.Flatten(),\n tf.keras.layers.Dense(6, activation='relu'),\n tf.keras.layers.Dense(1, activation='sigmoid')\n])\nmodel.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\nmodel.summary()\n\n","execution_count":16,"outputs":[{"output_type":"stream","text":"Model: \"sequential_2\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_2 (Embedding) (None, 120, 16) 160000 \n_________________________________________________________________\nflatten_2 (Flatten) (None, 1920) 0 \n_________________________________________________________________\ndense_4 (Dense) (None, 6) 11526 \n_________________________________________________________________\ndense_5 (Dense) (None, 1) 7 \n=================================================================\nTotal params: 171,533\nTrainable params: 171,533\nNon-trainable params: 0\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"num_epochs = 10\nhistory = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))","execution_count":17,"outputs":[{"output_type":"stream","text":"Train on 25000 samples, validate on 25000 samples\nEpoch 1/10\n25000/25000 [==============================] - 7s 279us/sample - loss: 0.4851 - accuracy: 0.7530 - val_loss: 0.3460 - val_accuracy: 0.8496\nEpoch 2/10\n25000/25000 [==============================] - 6s 260us/sample - loss: 0.2395 - accuracy: 0.9073 - val_loss: 0.3701 - val_accuracy: 0.8394\nEpoch 3/10\n25000/25000 [==============================] - 6s 255us/sample - loss: 0.0928 - accuracy: 0.9754 - val_loss: 0.4493 - val_accuracy: 0.8282\nEpoch 4/10\n25000/25000 [==============================] - 6s 249us/sample - loss: 0.0221 - accuracy: 0.9974 - val_loss: 0.5352 - val_accuracy: 0.8260\nEpoch 5/10\n25000/25000 [==============================] - 6s 259us/sample - loss: 0.0053 - accuracy: 0.9998 - val_loss: 0.6013 - val_accuracy: 0.8250\nEpoch 6/10\n25000/25000 [==============================] - 6s 252us/sample - loss: 0.0018 - accuracy: 1.0000 - val_loss: 0.6519 - val_accuracy: 0.8269\nEpoch 7/10\n25000/25000 [==============================] - 6s 253us/sample - loss: 8.5006e-04 - accuracy: 1.0000 - val_loss: 0.6894 - val_accuracy: 0.8286\nEpoch 8/10\n25000/25000 [==============================] - 7s 273us/sample - loss: 4.6341e-04 - accuracy: 1.0000 - val_loss: 0.7299 - val_accuracy: 0.8280\nEpoch 9/10\n25000/25000 [==============================] - 7s 263us/sample - loss: 2.6600e-04 - accuracy: 1.0000 - val_loss: 0.7658 - val_accuracy: 0.8300\nEpoch 10/10\n25000/25000 [==============================] - 6s 260us/sample - loss: 1.5968e-04 - accuracy: 1.0000 - val_loss: 0.8035 - val_accuracy: 0.8286\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"score = model.evaluate(testing_padded, testing_labels_final)","execution_count":18,"outputs":[{"output_type":"stream","text":"25000/25000 [==============================] - 2s 77us/sample - loss: 0.8035 - accuracy: 
0.8286\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('Test accuracy:', score[1])","execution_count":19,"outputs":[{"output_type":"stream","text":"Test accuracy: 0.82864\n","name":"stdout"}]}],"metadata":{"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /NLP-Reviews-Classification/nlp-reviews-classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" 9 | }, 10 | "outputs": [ 11 | { 12 | "name": "stdout", 13 | "output_type": "stream", 14 | "text": [ 15 | "/kaggle/input/restaurant-reviews/Restaurant_Reviews.tsv\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 21 | "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n", 22 | "# For example, here's several helpful packages to load in \n", 23 | "\n", 24 | "import numpy as np # linear algebra\n", 25 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 26 | "\n", 27 | "# Input data files are available in the \"../input/\" directory.\n", 28 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", 29 | "\n", 30 | "import os\n", 31 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n", 32 | " for filename in filenames:\n", 33 | " print(os.path.join(dirname, filename))\n", 34 | "\n", 35 | "# Any results you write to the current directory are saved as output.\n", 36 | "import warnings\n", 37 | "warnings.filterwarnings(\"ignore\")" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 45 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "# Importing the dataset\n", 50 | "dataset = pd.read_csv('../input/restaurant-reviews/Restaurant_Reviews.tsv', delimiter = '\\t', quoting = 3)\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 3, 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "data": { 60 | "text/html": [ 61 | "
\n", 62 | "\n", 75 | "\n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | "
ReviewLiked
0Wow... Loved this place.1
1Crust is not good.0
2Not tasty and the texture was just nasty.0
3Stopped by during the late May bank holiday of...1
4The selection on the menu was great and so wer...1
\n", 111 | "
" 112 | ], 113 | "text/plain": [ 114 | " Review Liked\n", 115 | "0 Wow... Loved this place. 1\n", 116 | "1 Crust is not good. 0\n", 117 | "2 Not tasty and the texture was just nasty. 0\n", 118 | "3 Stopped by during the late May bank holiday of... 1\n", 119 | "4 The selection on the menu was great and so wer... 1" 120 | ] 121 | }, 122 | "execution_count": 3, 123 | "metadata": {}, 124 | "output_type": "execute_result" 125 | } 126 | ], 127 | "source": [ 128 | "dataset.head()" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 4, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "[nltk_data] Downloading package stopwords to /usr/share/nltk_data...\n", 141 | "[nltk_data] Package stopwords is already up-to-date!\n" 142 | ] 143 | } 144 | ], 145 | "source": [ 146 | "# Cleaning the texts\n", 147 | "import re\n", 148 | "import nltk\n", 149 | "nltk.download('stopwords') \n", 150 | "from nltk.corpus import stopwords\n", 151 | "from nltk.stem.porter import PorterStemmer\n", 152 | "corpus = []\n", 153 | "for i in range(0, 1000):\n", 154 | " review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])\n", 155 | " review = review.lower()\n", 156 | " review = review.split()\n", 157 | " ps = PorterStemmer()\n", 158 | " review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]\n", 159 | " review = ' '.join(review)\n", 160 | " corpus.append(review)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 5, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "# Creating the Bag of Words model\n", 170 | "from sklearn.feature_extraction.text import CountVectorizer\n", 171 | "cv = CountVectorizer(max_features = 1500)\n", 172 | "X = cv.fit_transform(corpus).toarray() \n", 173 | "y = dataset.iloc[:, 1].values" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 6, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "#splitting data sets \n", 183 | "from sklearn.model_selection import train_test_split\n", 184 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 7, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "GaussianNB(priors=None, var_smoothing=1e-09)" 196 | ] 197 | }, 198 | "execution_count": 7, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "# Fitting Naive Bayes to the Training set\n", 205 | "from sklearn.naive_bayes import GaussianNB\n", 206 | "classifier = GaussianNB()\n", 207 | "classifier.fit(X_train, y_train)\n" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 8, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "y_pred = classifier.predict(X_test)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": 9, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [ 225 | "# Making the Confusion Matrix\n", 226 | "from sklearn.metrics import confusion_matrix\n", 227 | "cm = confusion_matrix(y_test, y_pred)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 10, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "data": { 237 | "text/plain": [ 238 | "array([[55, 42],\n", 239 | " [12, 91]])" 240 | ] 241 | }, 242 | "execution_count": 10, 243 | "metadata": {}, 244 | "output_type": 
"execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "cm" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "" 260 | ] 261 | }, 262 | "execution_count": 11, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | }, 266 | { 267 | "data": { 268 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAVoAAAD6CAYAAADgOo8sAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAATfklEQVR4nO3de5BcZZnH8e+TBAh3SDBxTLhFIhdForKUiLosEUG8JF5iochGNuWwCiLqKtF1BV0VqFVhdRF3lqiRewQxKWqNC+GqSLgEFDBqMGISExIuCXeFmXn2j2lwgGS6B/qd7px8P9Sp7j6n5+2HKupXD+95zzmRmUiSyhnW6gIkqeoMWkkqzKCVpMIMWkkqzKCVpMIMWkkqzKCVpA2IiE9ExJ0RcVdEnFjbNyoiroiIJbXXHeuOU3od7e/3PtyFunqeqx/ZqdUlqA0du+K8eLFjPHX/0oYzZ7OdJmzw9yLiVcBFwAHAk8B84KPAR4AHM/O0iJgJ7JiZJw30O3a0krR+ewM3ZubjmdkNXAu8G5gCzK59ZzYwtd5ABq2kauntaXwb2J3AmyNidERsBRwB7AyMzcxVALXXMfUGGvEi/5Ukqb30dDf81YjoBDr77erKzC6AzFwcEacDVwCPAr8CGh+8H4NWUqVk9g7iu9kFdA1wfBYwCyAivgasAFZHREdmroqIDmBNvd9x6kBStfT2Nr7VERFjaq+7AO8BLgTmAdNrX5kOzK03jh2tpGoZREfbgEsjYjTwFHBcZq6NiNOAORExA1gGTKs3iEErqVrqn+RqWGa+aT37HgAmD2Ycg1ZStTS3o20Kg1ZSpeQgVh0MFYNWUrU0cJJrqBm0kqrFqQNJKqyJJ8OaxaCVVC12tJJUmCfDJKkwT4ZJUlmZztFKUlnO0UpSYU4dSFJhdrSSVFjPU62u4HkMWknV4tSBJBXm1IEkFWZHK0mFGbSSVFZ6MkySCmvDOVqfgiupWpr7FNxPRsRdEXFnRFwYESMjYveIWBgRSyLi4ojYvN44Bq2kasnexrcBRMQ44ARg/8x8FTAcOBI4HTgjMycCa4EZ9UoyaCVVSxM7WvqmV7eMiBHAVsAq4BDgktrx2cDUeoMYtJKqpUkdbWb+Gfg6sIy+gH0IuBVYl5lP3/R2BTCuXkkGraRq6e5ueIuIzoi4pd/W+fQwEbEjMAXYHXgZsDXwtvX8YtYryVUHkqplEKsOMrML6NrA4bcAf8zM+wAi4sfAG4AdImJErasdD6ys9zt2tJKqpXlztMuA10fEVhERwGTgN8DVwPtq35kOzK03kEErqVqaN0e7kL6TXouAO+jLyy7gJOBTEXE3MBqYVa8kpw4kVUsTL8HNzJOBk5+zeylwwGDGMWglVUsbXhlm0Eqqlm4fNy5JZWXd1VZDzqCVVC3eJlGSCjNoJakwT4ZJUmE9Pa2u4HkMWknV4tSBJBVm0EpSYc7RSlJZ2es6Wkkqy6kDSSrMVQeSVJgd7aZl9ytn0/vY42RPL/T0sGzaCYw+7kNsP+1wuh98CIAHzvwBj113c4sr1VCKYcF7/vffeezetcz/8Dc45Nsf5SWvnkDvU92suX0p18/8Hr3d7deVbTQM2k3P8ukn0bvu4WftWzv7MtZ+/9IWVaRWe9WMw1l790o232ZLAJZcdgNXffxsACb/13Hs9YGD+c25C1pZ4sZtY7ypTETsRd8DysbR9xCylcC8zFxcuDapcrbuGMWukyex6FtzeXVn33P+ll/1q2eOr7n9D2zdMapV5VVDG3a0Az7KJiJOAi4CArgJuLn2/sKImFm+vI1cJuNnfY1dLvk220/728MzdzjqXez6k7MZ+5VPMmy7bVpYoIbaG075EDd+9UJyPV3XsBHDmfjeN7L8ml+3oLIK6c3GtyFSr6OdAbwyM5/qvzMivgncBZxWqrAqWPbBT9Fz34MMH7U942edypN/XM66iy7ngbMvgExGn/CPvOSzH2H1F85odakaArtMnsQT9z/M/XfcQ8eBez/v+Bu/9mHuXfhb7r3pdy2orkKatOogIvYELu63awLwReCHtf27AfcA78/MtQONVe/hjL30Pc/8uTpqxzZU4DPPSr943fI6P1FdPfc92Pf64EM8euUNjNx3T3oeWNf3vzaZPPSj+Yx89Z4trlJD5aV/9wp2fetr+eAvz+AtZx3Hyw7ah0O+9VEAXvfJdzNy1Lbc8KXzW1zlxi97exveBhwn83eZOSkzJwGvAx4HLgNmAgsycyKwoPZ5QPU62hOBBRGxBHg6MXcB9gCOH6DAZ56V/vu9D2+/mekhEFtuATGMfPwJYsst2Oqg1/LAd85n+EtGPRPA2xz6Bv665J7WFqohc9Npc7jptDkAdBy4N/sdewRXnXA2e33gYMb//b5cfuSpbXkiZ6NTZkpgMvCHzPxTREwBDq7tnw1cQ9+TcTdowKDNzPkR8Qr6nvg4jr752RXAzZnp+pMBjBi9Iy/79hdrH4bzyOVX8/jPb+Wlp3+GLfaaAAlP/Xk1q0/5VmsLVcu96dRjeGTF/UydewoAf/zpzSw68yetLWpjVuZeB0cCF9bej83MVQCZuSoixtT741jfpHwzbaodrQZ29SM7tboEtaFjV5wXL3aMx758VMOZs83JFxwLdPbb1VX7P/JnRMTm9K22emVmro6IdZm5Q7/jazNzx4F+x3W0kqplEBd79J/mHMDbgEWZubr2eXVEdNS62Q5gTb3fqXcyTJI2Ltnb+NaYD/C3aQOAecD02vvpwNx6A9jRSqqWJp4Mi4itgEOBY/vtPg2YExEzgGXAtHrjGLSSKqXesq1BjZX5ODD6OfseoG8VQsMMWknV4o2/Jakwg1aSCvPG35JUls8Mk6TSDFpJKqwN70dr0EqqFjtaSSrMoJWksrLHqQNJKsuOVpLKcnmXJJVm0EpSYe03RWvQSqqW7G6/pDVoJVVL++WsQSupWjwZJkml2dFKUll2tJJUWht2tD4FV1KlZHfjWz0RsUNEXBIRv42IxRFxYESMiogrImJJ7XXHeuMYtJIqpclPG/9PYH5m7gXsBywGZgILMnMisKD2eUAGraRq6R3ENoCI2A54MzALIDOfzMx1wBRgdu1rs4Gp9UoyaCVVShM72gnAfcD3I+K2iDgnIrYGxmbm
KoDa65h6Axm0kiplMEEbEZ0RcUu/rbPfUCOA1wJnZ+ZrgMdoYJpgfVx1IKlSsica/25mF9C1gcMrgBWZubD2+RL6gnZ1RHRk5qqI6ADW1PsdO1pJldKsqYPMvBdYHhF71nZNBn4DzAOm1/ZNB+bWq8mOVlKlZG/jHW0DPg6cHxGbA0uBY+hrUOdExAxgGTCt3iAGraRKaXDZVmNjZd4O7L+eQ5MHM45BK6lSMpva0TaFQSupUprZ0TaLQSupUnoHsepgqBi0kiqlySfDmsKglVQpBq0kFZbtdztag1ZStdjRSlJhLu+SpMJ6XHUgSWXZ0UpSYc7RSlJhrjqQpMLsaCWpsJ7e9rvNtkErqVKcOpCkwnpddSBJZbm8S5IK2ySnDvb5wx2lf0IboSdWXt/qElRRzZw6iIh7gEeAHqA7M/ePiFHAxcBuwD3A+zNz7UDjtN/pOUl6EXp6hzW8NegfMnNSZj797LCZwILMnAgsqH0ekEErqVJyENsLNAWYXXs/G5ha7w8MWkmV0pvR8NaABP4vIm6NiM7avrGZuQqg9jqm3iCeDJNUKYNZdVALz85+u7oys6vf54Myc2VEjAGuiIjfvpCaDFpJlTKYh+DWQrVrgOMra69rIuIy4ABgdUR0ZOaqiOgA1tT7HacOJFVKEg1vA4mIrSNi26ffA28F7gTmAdNrX5sOzK1Xkx2tpErpbt7yrrHAZREBfVl5QWbOj4ibgTkRMQNYBkyrN5BBK6lS6nWqDY+TuRTYbz37HwAmD2Ysg1ZSpQxmjnaoGLSSKqVZHW0zGbSSKsWOVpIK67GjlaSy2vBJNgatpGrptaOVpLLa8Ha0Bq2kavFkmCQV1htOHUhSUT2tLmA9DFpJleKqA0kqzFUHklSYqw4kqTCnDiSpMJd3SVJhPXa0klSWHa0kFWbQSlJhzXtkWPP4FFxJldI7iK0RETE8Im6LiMtrn3ePiIURsSQiLo6IzeuNYdBKqpSeQWwN+gSwuN/n04EzMnMisBaYUW8Ag1ZSpfRG41s9ETEeeDtwTu1zAIcAl9S+MhuYWm8c52glVUqTT4adCXwW2Lb2eTSwLjO7a59XAOPqDWJHK6lSBjNHGxGdEXFLv63z6XEi4h3Amsy8td/w6+uD6171a0crqVIGc6+DzOwCujZw+CDgXRFxBDAS2I6+DneHiBhR62rHAyvr/Y4draRKadYcbWZ+LjPHZ+ZuwJHAVZl5FHA18L7a16YDc+vVZNBKqpQCqw6e6yTgUxFxN31ztrPq/YFTB5IqpbfAjRIz8xrgmtr7pcABg/l7g1ZSpXgJriQV5o2/JakwO1pJKqw72q+nNWglVUr7xaxBK6linDqQpMJKLO96sQxaSZXSfjFr0EqqGKcOJKmwnjbsaQ1aSZViRytJhaUdrSSV1Y4drbdJLOR/ur7ByhW/4vbbFjyz7/RTv8Cdd1zLoluv4JIfncP222/XwgrVCufO+QlTP/TPTDnqWM69+DIAfnbV9Uw56lj2feMR3Ln49y2ucOPXSza8DRWDtpAf/nAOb3/HUc/ad+WC69hv0iG89nWHsmTJUmaedHyLqlMrLFl6D5fOm8+F55zJpbO/w7U33MSflv+ZPSbsyplf+zdeN+lVrS6xEnIQ21AxaAu5/ucLeXDtumftu+LK6+jp6bvd8I0LFzFuXEcrSlOLLL1nOa9+5V5sOXIkI0YMZ/9J+7Lguht4+W67sPuu41tdXmV0kw1vQ+UFB21EHNPMQjY1x3z4SOb/7OpWl6EhtMeEXbn1V3ey7qGHeeIvf+H6X97Mvavva3VZlZOD+GeovJiTYV8Cvr++A7UnSXYCxPDtGTZs6xfxM9XzuZkn0N3dzQUX/LjVpWgIvXy3Xfino6bxkRM/z1Zbbskr9pjA8OHDW11W5bTjybABgzYifr2hQ8DYDf1d/ydLjth8XPuttWiho4+extuPeAuHHvb+VpeiFnjvOw/jve88DIAzv/sDXjpmpxZXVD3N6lQjYiRwHbAFfVl5SWaeHBG7AxcBo4BFwNGZ+eRAY9XraMcChwFrn1sDcMMLqH2TdthbD+Yz//IxDpn8Xp544i+tLkct8MDadYzecQdW3buGBdf+gvP++5utLqlymtjR/hU4JDMfjYjNgJ9HxE+BTwFnZOZFEfFdYAZw9kAD1Qvay4FtMvP25x6IiGteUOmbiPPOPYu/f/OB7LTTKO5Zegtf+vLXOemzx7PFFlsw/6cXAbBw4SKOO35miyvVUPrk57/CuocfZsSIEfzrpz/G9ttty5XX/oJTzzibB9c9xMc+czJ7TZxA1xlfbXWpG62ebE5Hm5kJPFr7uFltS+AQ4IO1/bOBU6gTtJFNKmpDnDrQ+jyx8vpWl6A2tNlOE+LFjvHBXd/dcOZc8KfLBvy9iBgO3ArsAZwF/AdwY2buUTu+M/DTzBxwbZ7LuyRVymBWHUREZ0Tc0m/rfNZYmT2ZOQkYT98jxvde70/W4SW4kiplMHO0/U/c1/neutp06euBHSJiRGZ20xfAK+v9vR2tpEpp1iW4EfGSiNih9n5L4C3AYuBq4H21r00H5taryY5WUqU08UKEDmB2bZ52GDAnMy+PiN8AF0XEV4DbgFn1BjJoJVVKE1cd/Bp4zXr2L6VvvrZhBq2kSvHhjJJU2EZ3Ca4kbWx8woIkFebUgSQVVvpq1xfCoJVUKT5uXJIKc+pAkgpz6kCSCrOjlaTCXN4lSYU16xLcZjJoJVWKUweSVJhBK0mFuepAkgqzo5Wkwlx1IEmF9WT73SjRoJVUKc7RSlJh7ThH61NwJVVKDuKfgUTEzhFxdUQsjoi7IuITtf2jIuKKiFhSe92xXk0GraRK6c1seKujG/h0Zu4NvB44LiL2AWYCCzJzIrCg9nlABq2kSmlWR5uZqzJzUe39I8BiYBwwBZhd+9psYGq9mpyjlVQpJVYdRMRu9D16fCEwNjNXQV8YR8SYen9vRyupUgYzdRARnRFxS7+t87njRcQ2wKXAiZn58AupyY5WUqUM5oKFzOwCujZ0PCI2oy9kz8/MH9d2r46Ijlo32wGsqfc7drSSKqVZJ8MiIoBZwOLM/Ga/Q/OA6bX304G59Wqyo5VUKU28BPcg4Gjgjoi4vbbv88BpwJyImAEsA6bVG8iglVQpPdnTlHEy8+dAbODw5MGMZdBKqhQvwZWkwtrxElyDVlKl2NFKUmENXFo75AxaSZXijb8lqTBv/C1JhTlHK0mFOUcrSYXZ0UpSYa6jlaTC7GglqTBXHUhSYZ4Mk6TCnDqQpMK8MkySCrOjlaTC2nGONtox/asqIjprD4OTnuF/F9XnwxmH1vMeZSzhfxeVZ9BKUmEGrSQVZtAOLefhtD7+d1FxngyTpMLsaCWpMIN2iETE4RHxu4i4OyJmtroetV5EfC8i1kTEna2uRWUZtEMgIoYDZwFvA/YBPhAR+7S2KrWBHwCHt7oIlWfQDo0DgLszc2lmPglcBExpcU1qscy8Dniw1XWoPIN2aIwDlvf7vKK2T9ImwKAdGrGefS73kDYRBu3QWAHs3O/zeGBli2qRNMQM2qF
xMzAxInaPiM2BI4F5La5J0hAxaIdAZnYDxwM/AxYDczLzrtZWpVaLiAuBXwJ7RsSKiJjR6ppUhleGSVJhdrSSVJhBK0mFGbSSVJhBK0mFGbSSVJhBK0mFGbSSVJhBK0mF/T82hPey6+wSgAAAAABJRU5ErkJggg==\n", 269 | "text/plain": [ 270 | "
" 271 | ] 272 | }, 273 | "metadata": { 274 | "needs_background": "light" 275 | }, 276 | "output_type": "display_data" 277 | } 278 | ], 279 | "source": [ 280 | "import seaborn as sns\n", 281 | "sns.heatmap(cm,annot=True)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 12, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/plain": [ 292 | "0.73" 293 | ] 294 | }, 295 | "execution_count": 12, 296 | "metadata": {}, 297 | "output_type": "execute_result" 298 | } 299 | ], 300 | "source": [ 301 | "from sklearn.metrics import accuracy_score\n", 302 | "accuracy_score(y_test,y_pred)" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 13, 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "data": { 312 | "text/plain": [ 313 | "73.0" 314 | ] 315 | }, 316 | "execution_count": 13, 317 | "metadata": {}, 318 | "output_type": "execute_result" 319 | } 320 | ], 321 | "source": [ 322 | "accuracy = accuracy_score(y_test,y_pred)*100\n", 323 | "accuracy" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": null, 329 | "metadata": {}, 330 | "outputs": [], 331 | "source": [] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": null, 343 | "metadata": {}, 344 | "outputs": [], 345 | "source": [] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [] 367 | } 368 | ], 369 | "metadata": { 370 | "kernelspec": { 371 | "display_name": "Python 3", 372 | "language": "python", 373 | "name": "python3" 374 | }, 375 | "language_info": { 376 | "codemirror_mode": { 377 | "name": "ipython", 378 | "version": 3 379 | }, 380 | "file_extension": ".py", 381 | "mimetype": "text/x-python", 382 | "name": "python", 383 | "nbconvert_exporter": "python", 384 | "pygments_lexer": "ipython3", 385 | "version": "3.6.6" 386 | } 387 | }, 388 | "nbformat": 4, 389 | "nbformat_minor": 4 390 | } 391 | -------------------------------------------------------------------------------- /Harry-Potter-Novel-Text-Generation-using-GRU/HPGEN.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session","execution_count":1,"outputs":[{"output_type":"stream","text":"/kaggle/input/hp2chamberofsecrets/6TheHalfBloodPrince.txt\n/kaggle/input/hp2chamberofsecrets/2ChamberofSecrets.txt\n/kaggle/input/hp2chamberofsecrets/1SorcerersStone.txt\n/kaggle/input/hp2chamberofsecrets/5OrderofthePhoenix.txt\n/kaggle/input/hp2chamberofsecrets/3ThePrisonerOfAzkaban.txt\n/kaggle/input/hp2chamberofsecrets/7DeathlyHollows.txt\n/kaggle/input/hp2chamberofsecrets/4TheGobletOfFire.txt\n","name":"stdout"}]},{"metadata":{"id":"yG_n40gFzf9s","trusted":true},"cell_type":"code","source":"import tensorflow as tf\nimport numpy as np\nimport os\nimport time","execution_count":2,"outputs":[]},{"metadata":{"id":"aavnuByVymwK","trusted":true},"cell_type":"code","source":"files= ['/kaggle/input/hp2chamberofsecrets/1SorcerersStone.txt', '/kaggle/input/hp2chamberofsecrets/2ChamberofSecrets.txt', '/kaggle/input/hp2chamberofsecrets/3ThePrisonerOfAzkaban.txt', '/kaggle/input/hp2chamberofsecrets/4TheGobletOfFire.txt', '/kaggle/input/hp2chamberofsecrets/5OrderofthePhoenix.txt', '/kaggle/input/hp2chamberofsecrets/6TheHalfBloodPrince.txt', '/kaggle/input/hp2chamberofsecrets/7DeathlyHollows.txt']\nwith open('harrypotter.txt', 'w') as outfile:\n for file in files:\n with open(file) as infile:\n outfile.write(infile.read())\n\ntext = open('harrypotter.txt').read()\nprint ('Length of text: {} characters'.format(len(text)))","execution_count":3,"outputs":[{"output_type":"stream","text":"Length of text: 6251651 characters\n","name":"stdout"}]},{"metadata":{"id":"Duhg9NrUymwO","trusted":true},"cell_type":"code","source":"# Taking a look at the text\nprint(text[:300])","execution_count":4,"outputs":[{"output_type":"stream","text":"Harry Potter and the Sorcerer's Stone \n\nCHAPTER ONE \n\nTHE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they \n","name":"stdout"}]},{"metadata":{"id":"IlCgQBRVymwR","trusted":true},"cell_type":"code","source":"# The unique characters in the file\nvocab = sorted(set(text))\nprint ('{} unique characters'.format(len(vocab)))","execution_count":5,"outputs":[{"output_type":"stream","text":"106 unique characters\n","name":"stdout"}]},{"metadata":{"id":"IalZLbvOzf-F","trusted":true},"cell_type":"code","source":"# Creating a mapping from unique characters to indices\nchar2index = {u:i for i, u in enumerate(vocab)}\nindex2char = np.array(vocab)\n\ntext_as_int = np.array([char2index[c] for c in text])\n\nprint(text_as_int)","execution_count":6,"outputs":[{"output_type":"stream","text":"[39 64 81 ... 
75 75 15]\n","name":"stdout"}]},{"metadata":{"id":"l1VKcQHcymwb","trusted":true},"cell_type":"code","source":"# Show how the first 13 characters from the text are mapped to integers\nprint ('{} -- characters mapped to int -- > {}'.format(repr(text[:13]), text_as_int[:13]))","execution_count":7,"outputs":[{"output_type":"stream","text":"'Harry Potter ' -- characters mapped to int -- > [39 64 81 81 88 3 47 78 83 83 68 81 3]\n","name":"stdout"}]},{"metadata":{"id":"0UHJDA39zf-O","trusted":true},"cell_type":"code","source":"# The maximum length sentence we want for a single input in characters\nseq_length = 100\nexamples_per_epoch = len(text)//(seq_length+1)\n\n# Create training examples / targets\nchar_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)\n\nfor i in char_dataset.take(5):\n print(index2char[i.numpy()])\n","execution_count":8,"outputs":[{"output_type":"stream","text":"H\na\nr\nr\ny\n","name":"stdout"}]},{"metadata":{"id":"l4hkDU3i7ozi","trusted":true},"cell_type":"code","source":"sequences = char_dataset.batch(seq_length+1, drop_remainder=True)\n\nfor item in sequences.take(5):\n print(repr(''.join(index2char[item.numpy()])))\n\n","execution_count":9,"outputs":[{"output_type":"stream","text":"\"Harry Potter and the Sorcerer's Stone \\n\\nCHAPTER ONE \\n\\nTHE BOY WHO LIVED \\n\\nMr. and Mrs. Dursley, of nu\"\n'mber four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They'\n\" were the last people you'd expect to be involved in anything strange or mysterious, because they jus\"\n\"t didn't hold with such nonsense. \\n\\nMr. Dursley was the director of a firm called Grunnings, which ma\"\n'de drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. '\n","name":"stdout"}]},{"metadata":{"id":"9NGu-FkO_kYU","trusted":true},"cell_type":"code","source":"def split_input_target(chunk):\n input_text = chunk[:-1]\n target_text = chunk[1:]\n return input_text, target_text\n\ndataset = sequences.map(split_input_target)","execution_count":10,"outputs":[]},{"metadata":{"id":"p2pGotuNzf-S","trusted":true},"cell_type":"code","source":"# Batch size\nBATCH_SIZE = 64\n\n# Buffer size to shuffle the dataset\n# (TF data is designed to work with possibly infinite sequences,\n# so it doesn't attempt to shuffle the entire sequence in memory. 
Instead,\n# it maintains a buffer in which it shuffles elements).\nBUFFER_SIZE = 10000\n\ndataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)\n\ndataset","execution_count":11,"outputs":[{"output_type":"execute_result","execution_count":11,"data":{"text/plain":""},"metadata":{}}]},{"metadata":{"id":"zHT8cLh7EAsg","trusted":true},"cell_type":"code","source":"# Length of the vocabulary in chars\nvocab_size = len(vocab)\n\n# The embedding dimension\nembedding_dim = 300 #256\n\n# Number of RNN units\nrnn_units1 = 1024\nrnn_units2 = 1024\nrnn_units=[rnn_units1, rnn_units2]\nprint(vocab_size)","execution_count":12,"outputs":[{"output_type":"stream","text":"106\n","name":"stdout"}]},{"metadata":{"id":"MtCrdfzEI2N0","trusted":true},"cell_type":"code","source":"def build_model(vocab_size, embedding_dim, rnn_units, batch_size):\n model = tf.keras.Sequential([\n tf.keras.layers.Embedding(vocab_size, embedding_dim,\n batch_input_shape=[batch_size, None]),\n tf.keras.layers.GRU(rnn_units1,\n return_sequences=True,\n stateful=True,\n recurrent_initializer='glorot_uniform'),\n tf.keras.layers.GRU(rnn_units2,\n return_sequences=True,\n stateful=True,\n recurrent_initializer='glorot_uniform'),\n tf.keras.layers.Dense(vocab_size)\n ])\n return model","execution_count":13,"outputs":[]},{"metadata":{"id":"wwsrpOik5zhv","trusted":true},"cell_type":"code","source":"model = build_model(\n vocab_size = vocab_size,\n embedding_dim=embedding_dim,\n rnn_units=rnn_units,\n batch_size=BATCH_SIZE)","execution_count":14,"outputs":[]},{"metadata":{"id":"vPGmAAXmVLGC","trusted":true},"cell_type":"code","source":"model.summary()","execution_count":15,"outputs":[{"output_type":"stream","text":"Model: \"sequential\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding (Embedding) (64, None, 300) 31800 \n_________________________________________________________________\ngru (GRU) (64, None, 1024) 4073472 \n_________________________________________________________________\ngru_1 (GRU) (64, None, 1024) 6297600 \n_________________________________________________________________\ngru_2 (GRU) (64, None, 512) 2362368 \n_________________________________________________________________\ndense (Dense) (64, None, 106) 54378 \n=================================================================\nTotal params: 12,819,618\nTrainable params: 12,819,618\nNon-trainable params: 0\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{"id":"4HrXTACTdzY-","trusted":true},"cell_type":"code","source":"def loss(labels, logits):\n return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)","execution_count":16,"outputs":[]},{"metadata":{"id":"DDl1_Een6rL0","trusted":true},"cell_type":"code","source":"model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])","execution_count":17,"outputs":[]},{"metadata":{"id":"W6fWTriUZP-n","trusted":true},"cell_type":"code","source":"# Directory where the checkpoints will be saved\ncheckpoint_dir = './training_checkpoints'\n# Name of the checkpoint files\ncheckpoint_prefix = os.path.join(checkpoint_dir, \"ckpt_{epoch}\")\n\ncheckpoint_callback=tf.keras.callbacks.ModelCheckpoint(\n filepath=checkpoint_prefix,\n 
save_weights_only=True)","execution_count":18,"outputs":[]},{"metadata":{"id":"7yGBE2zxMMHs","trusted":true},"cell_type":"code","source":"EPOCHS=20","execution_count":19,"outputs":[]},{"metadata":{"id":"UK-hmKjYVoll","trusted":true},"cell_type":"code","source":"history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])","execution_count":20,"outputs":[{"output_type":"stream","text":"Epoch 1/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.5973 - accuracy: 0.5394\nEpoch 2/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.2154 - accuracy: 0.6327\nEpoch 3/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.1615 - accuracy: 0.6469\nEpoch 4/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.1328 - accuracy: 0.6549\nEpoch 5/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.1135 - accuracy: 0.6601\nEpoch 6/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0994 - accuracy: 0.6640\nEpoch 7/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0895 - accuracy: 0.6668\nEpoch 8/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0822 - accuracy: 0.6688\nEpoch 9/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0764 - accuracy: 0.6706\nEpoch 10/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0730 - accuracy: 0.6714\nEpoch 11/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.0715 - accuracy: 0.6719\nEpoch 12/20\n967/967 [==============================] - 118s 122ms/step - loss: 1.0708 - accuracy: 0.6720\nEpoch 13/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0717 - accuracy: 0.6717\nEpoch 14/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0735 - accuracy: 0.6713\nEpoch 15/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0762 - accuracy: 0.6702\nEpoch 16/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0794 - accuracy: 0.6694\nEpoch 17/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0838 - accuracy: 0.6680\nEpoch 18/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0886 - accuracy: 0.6667\nEpoch 19/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0939 - accuracy: 0.6650\nEpoch 20/20\n967/967 [==============================] - 117s 121ms/step - loss: 1.0994 - accuracy: 0.6633\n","name":"stdout"}]},{"metadata":{"id":"zk2WJ2-XjkGz","trusted":true},"cell_type":"code","source":"latest_check= tf.train.latest_checkpoint(checkpoint_dir)","execution_count":21,"outputs":[]},{"metadata":{"id":"LycQ-ot_jjyu","trusted":true},"cell_type":"code","source":"model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)\n\nmodel.load_weights(latest_check)\n\nmodel.build(tf.TensorShape([1, None]))","execution_count":22,"outputs":[]},{"metadata":{"id":"71xa6jnYVrAN","trusted":true},"cell_type":"code","source":"model.summary()","execution_count":23,"outputs":[{"output_type":"stream","text":"Model: \"sequential_1\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_1 (Embedding) (1, None, 300) 31800 \n_________________________________________________________________\ngru_3 (GRU) (1, None, 1024) 4073472 
\n_________________________________________________________________\ngru_4 (GRU) (1, None, 1024) 6297600 \n_________________________________________________________________\ngru_5 (GRU) (1, None, 512) 2362368 \n_________________________________________________________________\ndense_1 (Dense) (1, None, 106) 54378 \n=================================================================\nTotal params: 12,819,618\nTrainable params: 12,819,618\nNon-trainable params: 0\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{"id":"WvuwZBX5Ogfd","trusted":true},"cell_type":"code","source":"def generate_text(model, start_string):\n\n # Number of characters to generate\n num_generate = 1000\n\n # Converting our start string to numbers (vectorizing)\n input_eval = [char2index[s] for s in start_string]\n input_eval = tf.expand_dims(input_eval, 0)\n\n # Empty string to store our results\n text_generated = []\n\n # Low results in more predictable text.\n # Higher results in more surprising text.\n # Experiment to find the best setting.\n scaling = 0.5 #1\n\n # batch size == 1\n \n model.reset_states()\n for i in range(num_generate):\n predictions = model(input_eval)\n # remove the batch dimension\n predictions = tf.squeeze(predictions, 0)\n\n # using a categorical distribution to predict the character returned by the model\n predictions = predictions / scaling\n predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()\n\n # We pass the predicted character as the next input to the model\n # along with the previous hidden state\n input_eval = tf.expand_dims([predicted_id], 0)\n\n text_generated.append(index2char[predicted_id])\n\n return (start_string + ''.join(text_generated))","execution_count":24,"outputs":[]},{"metadata":{"id":"ktovv0RFhrkn","outputId":"8c156bc5-5951-49dc-bb11-634fcb08a1e0","trusted":true},"cell_type":"code","source":"print(generate_text(model, start_string=u\"Dumbledore \"))","execution_count":25,"outputs":[{"output_type":"stream","text":"Dumbledore had wanted to go back to her father and soul, and you know I haven't think we were so long as he says he's gone and he was not the only one.\n\"You listen to your broomstick, and the werewolf had been here at the last of the Death Eaters had been able to find a long time at all. He was the only one he was wrong with the contents of the class and a group of conduction on her hand the second floor with a friendly undertone good in a long dark card from the time that it was worthless and position, the floor had been the only ones who had been the one who had insed one another word. He looked around the marble, and the scream of flames the parchment and the crowd took out a book of flowered cloaks, and the others and examined the gargoyle to the grounds. \"The Prophet has come to a spell and can do it, he thought you'd been a big wand at the tournament of the wizard.\"\n\n\"And what are you doing what?\" said Harry, turning the fake wand. \"I thought you got to his feet the first thing that made th\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(generate_text(model, start_string=u\"Harry killed Ron \"))","execution_count":28,"outputs":[{"output_type":"stream","text":"Harry killed Ron and he jumped as he had decided that the three body else had flitted and he was sure that the world was thinking that he was safe. 
The child was seeing the process of the search of the castle, and whether you can accept that the snake was brandishing his stomach and the only words he had never seen him and he had seen in the forest, and he could hear her alone, so that the starry sky let out a few moments of the banister and the scratching of red lay the other side of the room.\n\"He do,\" said Dumbledore. \"I was a great serpents and the process that was the first time for the portrait of the ground as he could not believe that it was a pretty good look.\n\"The Death Eaters have to kill them when I must be a protection,\" said Ron, looking down at the door and the more to help him bending the moment they had been achieved in the darkness. Harry saw that he was really lived about the moment or two four feet away. He was standing at a small look of what looked like a dream, and he was standing\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(generate_text(model, start_string=u\"Harry killed Ron \"))","execution_count":29,"outputs":[{"output_type":"stream","text":"Harry has elderwand and drew his voice from the window. He saw a few seconds before he could recognize the newspaper to Harry. The last time he had found the connection was on the branches of the edging statue that nobody else was closed behind him as the three of them had the last time he had not gone and say he can have to say that. You see, he's coming to make our prisoners that was concerned. He stood up, looking at him the chair in the corner that was the sign of a great surprise, and the other the second of the finer languages were rang out of the edge of the ground and he could not see anything to discover the tops of his mouth with a bang and he was glad to come to the corner, and as he looked at each other. The wizard could hardly scream and cast into the middle to the pont of cold and shabby ranks and shallow seemed to be the sight of the bank of a corner, and a shower of laughter and the Death Eaters were the onlooks.\nSnape seemed to have found the others and the trees to his face. \"He was forc\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print(generate_text(model, start_string=u\"Hermoine Killed Snape \"))","execution_count":30,"outputs":[{"output_type":"stream","text":"Hermoine Killed Snape of the first. The ground was still clutching its head and clambered onto the dungeon door.\n\"Harry,\" said Ron, looking for a moment she had ever seen a good death in the room and off the boys in the morning. He could not see him when he did not really dead it was the beast with a man who was sorting and scrambled over the table and pulled him on the ground and uncomfortable that had been straight to the castle. The portrait of the last word we had just entered the garden at the time was seemed to have forced to see her from the thing and of course not to tell them the dementors at the time.\n\"He was the real one in the country were the moment or the moment he had said it was the sword of Gryffindor and he saw him as far as if he said it had been some of the few weeks of the force of a goat of red eyes and the more spells when he had never seen getting the elf to the orphanage. 
He could hardly make a sudden gold and delighted was a broom and pulled out the bed for a second to see anything\n","name":"stdout"}]}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.7.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Fake-Disaster-Tweet-Detection-Spacy-Bert-SVM/bert-spacy-svm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "source": [ 8 | "# Spacy and SVM" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", 16 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np \n", 21 | "import pandas as pd\n", 22 | "\n", 23 | "import spacy\n", 24 | "from spacy.matcher import Matcher\n", 25 | "from spacy.tokens import Span\n", 26 | "from spacy import displacy" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "nlp=spacy.load(\"en_core_web_sm\")" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "train=pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')\n", 45 | "test=pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 4, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/html": [ 56 | "
\n", 57 | "\n", 70 | "\n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | "
idkeywordlocationtexttarget
01NaNNaNOur Deeds are the Reason of this #earthquake M...1
14NaNNaNForest fire near La Ronge Sask. Canada1
25NaNNaNAll residents asked to 'shelter in place' are ...1
36NaNNaN13,000 people receive #wildfires evacuation or...1
47NaNNaNJust got sent this photo from Ruby #Alaska as ...1
\n", 124 | "
" 125 | ], 126 | "text/plain": [ 127 | " id keyword location text \\\n", 128 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n", 129 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n", 130 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n", 131 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n", 132 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n", 133 | "\n", 134 | " target \n", 135 | "0 1 \n", 136 | "1 1 \n", 137 | "2 1 \n", 138 | "3 1 \n", 139 | "4 1 " 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "train.head()" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 5, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "from spacy.lang.en.stop_words import STOP_WORDS\n", 158 | "stopwords = list(STOP_WORDS)\n", 159 | "import string\n", 160 | "punct=string.punctuation\n", 161 | "\n", 162 | "def text_data_cleaning(sentence):\n", 163 | " doc = nlp(sentence)\n", 164 | " \n", 165 | " tokens = []\n", 166 | " for token in doc:\n", 167 | " if token.lemma_ != \"-PRON-\":\n", 168 | " temp = token.lemma_.lower().strip()\n", 169 | " else:\n", 170 | " temp = token.lower_\n", 171 | " tokens.append(temp)\n", 172 | " \n", 173 | " cleaned_tokens = []\n", 174 | " for token in tokens:\n", 175 | " if token not in stopwords and token not in punct:\n", 176 | " cleaned_tokens.append(token)\n", 177 | " return cleaned_tokens" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 6, 183 | "metadata": {}, 184 | "outputs": [], 185 | "source": [ 186 | "from sklearn.svm import LinearSVC\n", 187 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 188 | "from sklearn.pipeline import Pipeline\n", 189 | "from sklearn.model_selection import train_test_split\n", 190 | "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 7, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)\n", 200 | "classifier = LinearSVC()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 8, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "x = train['text']\n", 210 | "y = train['target']" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 10, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 11, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "Pipeline(memory=None,\n", 240 | " steps=[('tfidf',\n", 241 | " TfidfVectorizer(analyzer='word', binary=False,\n", 242 | " decode_error='strict',\n", 243 | " dtype=,\n", 244 | " encoding='utf-8', input='content',\n", 245 | " lowercase=True, max_df=1.0, max_features=None,\n", 246 | " min_df=1, ngram_range=(1, 1), norm='l2',\n", 247 | " preprocessor=None, smooth_idf=True,\n", 248 | " stop_words=None, strip_accents=None,\n", 249 | " sublinear_tf=False,\n", 250 | " 
token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", 251 | " tokenizer=,\n", 252 | " use_idf=True, vocabulary=None)),\n", 253 | " ('clf',\n", 254 | " LinearSVC(C=1.0, class_weight=None, dual=True,\n", 255 | " fit_intercept=True, intercept_scaling=1,\n", 256 | " loss='squared_hinge', max_iter=1000,\n", 257 | " multi_class='ovr', penalty='l2', random_state=None,\n", 258 | " tol=0.0001, verbose=0))],\n", 259 | " verbose=False)" 260 | ] 261 | }, 262 | "execution_count": 11, 263 | "metadata": {}, 264 | "output_type": "execute_result" 265 | } 266 | ], 267 | "source": [ 268 | "clf.fit(X_train,y_train)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 12, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [ 277 | "y_pred = clf.predict(X_test)" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 13, 283 | "metadata": {}, 284 | "outputs": [ 285 | { 286 | "name": "stdout", 287 | "output_type": "stream", 288 | "text": [ 289 | " precision recall f1-score support\n", 290 | "\n", 291 | " 0 0.80 0.83 0.81 874\n", 292 | " 1 0.76 0.72 0.74 649\n", 293 | "\n", 294 | " accuracy 0.78 1523\n", 295 | " macro avg 0.78 0.78 0.78 1523\n", 296 | "weighted avg 0.78 0.78 0.78 1523\n", 297 | "\n" 298 | ] 299 | } 300 | ], 301 | "source": [ 302 | "print(classification_report(y_test, y_pred))" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 14, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "y_pred=clf.predict(test['text'])" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "source": [ 319 | "# BERT" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 15, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 16, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "import numpy as np\n", 338 | "import pandas as pd\n", 339 | "import tensorflow as tf\n", 340 | "from tensorflow.keras.layers import Dense, Input\n", 341 | "from tensorflow.keras.optimizers import Adam\n", 342 | "from tensorflow.keras.models import Model\n", 343 | "from tensorflow.keras.callbacks import ModelCheckpoint\n", 344 | "import tensorflow_hub as hub\n", 345 | "\n", 346 | "import tokenization" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 17, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "def bert_encode(texts, tokenizer, max_len=512):\n", 356 | " all_tokens = []\n", 357 | " all_masks = []\n", 358 | " all_segments = []\n", 359 | " \n", 360 | " for text in texts:\n", 361 | " text = tokenizer.tokenize(text)\n", 362 | " \n", 363 | " text = text[:max_len-2]\n", 364 | " input_sequence = [\"[CLS]\"] + text + [\"[SEP]\"]\n", 365 | " pad_len = max_len - len(input_sequence)\n", 366 | " \n", 367 | " tokens = tokenizer.convert_tokens_to_ids(input_sequence)\n", 368 | " tokens += [0] * pad_len\n", 369 | " pad_masks = [1] * len(input_sequence) + [0] * pad_len\n", 370 | " segment_ids = [0] * max_len\n", 371 | " \n", 372 | " all_tokens.append(tokens)\n", 373 | " all_masks.append(pad_masks)\n", 374 | " all_segments.append(segment_ids)\n", 375 | " \n", 376 | " return np.array(all_tokens), np.array(all_masks), np.array(all_segments)" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 18, 382 | 
"metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "def build_model(bert_layer, max_len=512):\n", 386 | " input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name=\"input_word_ids\")\n", 387 | " input_mask = Input(shape=(max_len,), dtype=tf.int32, name=\"input_mask\")\n", 388 | " segment_ids = Input(shape=(max_len,), dtype=tf.int32, name=\"segment_ids\")\n", 389 | "\n", 390 | " _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])\n", 391 | " clf_output = sequence_output[:, 0, :]\n", 392 | " out = Dense(1, activation='sigmoid')(clf_output)\n", 393 | " \n", 394 | " model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)\n", 395 | " model.compile(Adam(lr=2e-6), loss='binary_crossentropy', metrics=['accuracy'])\n", 396 | " \n", 397 | " return model" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 19, 403 | "metadata": {}, 404 | "outputs": [ 405 | { 406 | "name": "stdout", 407 | "output_type": "stream", 408 | "text": [ 409 | "CPU times: user 1min 27s, sys: 8.68 s, total: 1min 35s\n", 410 | "Wall time: 1min 39s\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "%%time\n", 416 | "module_url = \"https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1\"\n", 417 | "bert_layer = hub.KerasLayer(module_url, trainable=True)" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": 20, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()\n", 427 | "do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()\n", 428 | "tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)" 429 | ] 430 | }, 431 | { 432 | "cell_type": "code", 433 | "execution_count": 21, 434 | "metadata": {}, 435 | "outputs": [], 436 | "source": [ 437 | "train_input = bert_encode(train.text.values, tokenizer, max_len=160)\n", 438 | "test_input = bert_encode(test.text.values, tokenizer, max_len=160)\n", 439 | "train_labels = train.target.values" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 22, 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "name": "stdout", 449 | "output_type": "stream", 450 | "text": [ 451 | "Model: \"model\"\n", 452 | "__________________________________________________________________________________________________\n", 453 | "Layer (type) Output Shape Param # Connected to \n", 454 | "==================================================================================================\n", 455 | "input_word_ids (InputLayer) [(None, 160)] 0 \n", 456 | "__________________________________________________________________________________________________\n", 457 | "input_mask (InputLayer) [(None, 160)] 0 \n", 458 | "__________________________________________________________________________________________________\n", 459 | "segment_ids (InputLayer) [(None, 160)] 0 \n", 460 | "__________________________________________________________________________________________________\n", 461 | "keras_layer (KerasLayer) [(None, 1024), (None 335141889 input_word_ids[0][0] \n", 462 | " input_mask[0][0] \n", 463 | " segment_ids[0][0] \n", 464 | "__________________________________________________________________________________________________\n", 465 | "tf_op_layer_strided_slice (Tens [(None, 1024)] 0 keras_layer[0][1] \n", 466 | "__________________________________________________________________________________________________\n", 467 | "dense (Dense) (None, 1) 1025 
tf_op_layer_strided_slice[0][0] \n", 468 | "==================================================================================================\n", 469 | "Total params: 335,142,914\n", 470 | "Trainable params: 335,142,913\n", 471 | "Non-trainable params: 1\n", 472 | "__________________________________________________________________________________________________\n" 473 | ] 474 | } 475 | ], 476 | "source": [ 477 | "model = build_model(bert_layer, max_len=160)\n", 478 | "model.summary()" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 23, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "name": "stdout", 488 | "output_type": "stream", 489 | "text": [ 490 | "Train on 6090 samples, validate on 1523 samples\n", 491 | "Epoch 1/10\n", 492 | "6090/6090 [==============================] - 419s 69ms/sample - loss: 0.4772 - accuracy: 0.7803 - val_loss: 0.4256 - val_accuracy: 0.8214\n", 493 | "Epoch 2/10\n", 494 | "6090/6090 [==============================] - 375s 62ms/sample - loss: 0.3421 - accuracy: 0.8599 - val_loss: 0.4120 - val_accuracy: 0.8293\n", 495 | "Epoch 3/10\n", 496 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.2521 - accuracy: 0.8998 - val_loss: 0.4459 - val_accuracy: 0.8273\n", 497 | "Epoch 4/10\n", 498 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.1618 - accuracy: 0.9452 - val_loss: 0.4865 - val_accuracy: 0.8240\n", 499 | "Epoch 5/10\n", 500 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0975 - accuracy: 0.9660 - val_loss: 0.5522 - val_accuracy: 0.8201\n", 501 | "Epoch 6/10\n", 502 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0628 - accuracy: 0.9791 - val_loss: 0.6277 - val_accuracy: 0.8148\n", 503 | "Epoch 7/10\n", 504 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0431 - accuracy: 0.9833 - val_loss: 0.6753 - val_accuracy: 0.8181\n", 505 | "Epoch 8/10\n", 506 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0295 - accuracy: 0.9874 - val_loss: 0.7520 - val_accuracy: 0.8162\n", 507 | "Epoch 9/10\n", 508 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0234 - accuracy: 0.9908 - val_loss: 0.7655 - val_accuracy: 0.8214\n", 509 | "Epoch 10/10\n", 510 | "6090/6090 [==============================] - 374s 61ms/sample - loss: 0.0176 - accuracy: 0.9926 - val_loss: 0.8624 - val_accuracy: 0.8096\n" 511 | ] 512 | } 513 | ], 514 | "source": [ 515 | "train_history = model.fit(\n", 516 | " train_input, train_labels,\n", 517 | " validation_split=0.2,\n", 518 | " epochs=10,\n", 519 | " batch_size=16\n", 520 | ")" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": null, 526 | "metadata": {}, 527 | "outputs": [], 528 | "source": [] 529 | } 530 | ], 531 | "metadata": { 532 | "kernelspec": { 533 | "display_name": "Python 3", 534 | "language": "python", 535 | "name": "python3" 536 | }, 537 | "language_info": { 538 | "codemirror_mode": { 539 | "name": "ipython", 540 | "version": 3 541 | }, 542 | "file_extension": ".py", 543 | "mimetype": "text/x-python", 544 | "name": "python", 545 | "nbconvert_exporter": "python", 546 | "pygments_lexer": "ipython3", 547 | "version": "3.6.6" 548 | } 549 | }, 550 | "nbformat": 4, 551 | "nbformat_minor": 4 552 | } 553 | -------------------------------------------------------------------------------- /Text-Summarization-using-Transformers-T5/README.md: 
-------------------------------------------------------------------------------- 1 | # Text-Summarization-using-Transformers-T5 2 | 3 | 4 | ### Introduction 5 | 6 | We will fine-tune a transformer model for the **Summarization Task**. 7 | In this task, a summary of a given article/document is generated when it is passed through a network. There are two types of summary generation mechanisms: 8 | 9 | 1. ***Extractive Summary:*** the network selects the most important sentences from the article and stitches them together to convey the most meaningful information from the article. 10 | 2. ***Abstractive Summary***: the network creates new sentences to encapsulate the gist of the article and generates them as output. The sentences in the summary may or may not be contained in the article. 11 | 12 | We will be generating an ***Abstractive Summary***. 13 | 14 | 15 | * We will be using the [Weights and Biases Service](https://www.wandb.com/), WandB in short. 16 | * It is an experiment tracking, parameter optimization and artifact management service that can be easily integrated with any of the deep learning or machine learning frameworks. 17 | 18 | The notebook is divided into separate sections to provide an organized walkthrough of the process used. This process can be modified for individual use cases. The sections are: 19 | 20 | 1. [Preparing Environment and Importing Libraries](#section01) 21 | 2. [Preparing the Dataset for data processing: Class](#section02) 22 | 3. [Fine Tuning the Model: Function](#section03) 23 | 4. [Validating the Model Performance: Function](#section04) 24 | 5. [Main Function](#section05) 25 | * [Initializing WandB](#section501) 26 | * [Importing and Pre-Processing the domain data](#section502) 27 | * [Creation of Dataset and Dataloader](#section503) 28 | * [Neural Network and Optimizer](#section504) 29 | * [Training Model and Logging to WandB](#section505) 30 | * [Validation and generation of Summary](#section506) 31 | 6. [Examples of the Summary Generated from the model](#section06) 32 | 33 | 34 | #### Technical Details 35 | 36 | This script leverages multiple tools designed by other teams. Details of the tools used are given below. Please ensure that these elements are present in your setup to successfully run this script. 37 | 38 | - **Data**: 39 | - We are using the News Summary dataset available at [Kaggle](https://www.kaggle.com/sunnysai12345/news-summary) 40 | - This dataset is a collection created from newspapers published in India, with the details listed below. We are referring only to the first csv file from the data dump: `news_summary.csv` 41 | - There are `4514` rows of data, where each row has the following data points: 42 | - **author**: Author of the article 43 | - **date**: Date the article was published 44 | - **headline**: Headline for the published article 45 | - **read_more**: URL for the article to follow online 46 | - **text**: This is the summary of the article 47 | - **ctext**: This is the complete article 48 | 49 | 50 | - **Language Model Used**: 51 | - This notebook uses one of the most recent and novel transformer models, ***T5***. [Research Paper](https://arxiv.org/abs/1910.10683) 52 | - ***T5*** is in many ways a one-of-its-kind transformer architecture that not only gives state-of-the-art results on many NLP tasks, but also takes a radically different approach to them. 53 | - **Text-2-Text** - As illustrated in the T5 paper, all NLP tasks are converted into a **text-to-text** problem. Tasks such as translation, classification, summarization and question answering are all treated as text-to-text conversion problems, rather than as separate, unique problem statements. 54 | - **Unified approach for NLP Deep Learning** - Since the task is reflected purely in the text input and output, the same model, objective, training procedure, and decoding process can be applied to ANY task - Q&A, summarization, etc. 55 | - We take cues from the T5 paper to prepare our dataset prior to fine-tuning and training. 56 | - [Documentation for python](https://huggingface.co/transformers/model_doc/t5.html) 57 | 58 | 59 | 60 | 61 | - Hardware and Software Requirements: 62 | - Python 3.6 and above 63 | - PyTorch and Transformers 64 | - All the stock Python ML libraries 65 | - GPU-enabled setup 66 | 67 | 68 | - **Script Objective**: 69 | - The objective of this script is to fine-tune ***T5*** so that it generates summaries that are close to, or better than, the actual summaries, while ensuring that the important information from the article is not lost. 70 | 71 | --- 72 | 73 | 74 | 75 | ### Preparing the Dataset for data processing: Class 76 | 77 | We will start with the creation of the Dataset class - this defines how the text is pre-processed before it is sent to the neural network. This dataset will be used by the Dataloader, which feeds the data in batches to the neural network for training and processing. 78 | The Dataloader and Dataset will be used inside the `main()`. 79 | Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. Refer to the PyTorch documentation for further reading on Dataset and Dataloader. 80 | 81 | #### *CustomDataset* Dataset Class 82 | - This class is defined to accept the dataframe as input and generate the tokenized output that is used by the **T5** model for training. 83 | - We are using the **T5** tokenizer to tokenize the data in the `text` and `ctext` columns of the dataframe. 84 | - The tokenizer uses the `batch_encode_plus` method to perform tokenization and generate the necessary outputs, namely: `source_id` and `source_mask` from the article text, and `target_id` and `target_mask` from the summary text. 85 | - The *CustomDataset* class is used to create two datasets, one for training and one for validation. 86 | - *Training Dataset* is used to fine-tune the model: **80% of the original data** 87 | - *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. 88 | 89 | #### Dataloader: Called inside the `main()` 90 | - The Dataloader is used to create the training and validation dataloaders that load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled. 91 | - This control is achieved using parameters such as `batch_size` and `max_len`. 92 | - Training and validation dataloaders are used in the training and validation parts of the flow respectively. A minimal sketch of the dataset class and dataloaders follows this section. 93 | 94 | 95 |
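Below is a minimal sketch of the `CustomDataset` class and the dataloaders described above. It is illustrative rather than a copy of the notebook: the returned field names, the sequence lengths and the batch sizes are assumptions, and the exact `batch_encode_plus` arguments (`padding`/`truncation` here) depend on the installed `transformers` version.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    """Accepts the dataframe and returns tokenized article/summary pairs for T5."""

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len   # max token length of the article input
        self.summ_len = summ_len       # max token length of the target summary
        self.ctext = self.data.ctext   # complete article (the "summarize: " prefix is added during pre-processing, described later)
        self.text = self.data.text     # reference summary

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = " ".join(str(self.ctext[index]).split())
        text = " ".join(str(self.text[index]).split())

        # batch_encode_plus produces input_ids and attention_mask for each side
        source = self.tokenizer.batch_encode_plus(
            [ctext], max_length=self.source_len, padding="max_length",
            truncation=True, return_tensors="pt")
        target = self.tokenizer.batch_encode_plus(
            [text], max_length=self.summ_len, padding="max_length",
            truncation=True, return_tensors="pt")

        return {
            "source_ids": source["input_ids"].squeeze().to(dtype=torch.long),
            "source_mask": source["attention_mask"].squeeze().to(dtype=torch.long),
            "target_ids": target["input_ids"].squeeze().to(dtype=torch.long),
            "target_mask": target["attention_mask"].squeeze().to(dtype=torch.long),
        }

# Inside main(): wrap the two dataframe splits and hand them to DataLoader.
# The batch sizes and max lengths below are placeholders for the config values.
# training_set = CustomDataset(train_dataset, tokenizer, source_len=512, summ_len=150)
# val_set = CustomDataset(val_dataset, tokenizer, source_len=512, summ_len=150)
# training_loader = DataLoader(training_set, batch_size=2, shuffle=True, num_workers=0)
# val_loader = DataLoader(val_set, batch_size=2, shuffle=False, num_workers=0)
```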
96 | ### Fine Tuning the Model: Function 97 | 98 | Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (EPOCH). An epoch defines how many times the complete dataset is passed through the network. 99 | 100 | This function is called in the `main()`. 101 | 102 | The following events happen in this function to fine-tune the neural network: 103 | - The epoch, tokenizer, model, device details, training dataloader and optimizer are passed to `train()` when it is called from the `main()`. 104 | - The dataloader passes data to the model based on the batch size. 105 | - The language-model labels are calculated from the `target_ids`; the `source_id` and `attention_mask` are extracted as well. 106 | - The first element of the model output gives the loss for the forward pass. 107 | - The loss value is used to optimize the weights of the neurons in the network. 108 | - After every 10 steps, the loss value is logged to the WandB service. This log is then used to generate graphs for analysis. 109 | - After every 500 steps, the loss value is printed to the console. A sketch of such a training loop follows. 110 | 111 | 112 | 113 |
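A minimal sketch of the training loop just described, reusing the field names from the dataset sketch above. It is an illustration under assumptions rather than the notebook's exact code; in particular, the `labels` keyword is named `lm_labels` in older `transformers` releases, and the logging intervals simply mirror the bullets above.

```python
import torch
import wandb  # assumes wandb.init(...) has already been called in main()

def train(epoch, tokenizer, model, device, loader, optimizer):
    """Run one epoch of fine-tuning over the training dataloader."""
    model.train()
    for step, data in enumerate(loader, 0):
        y = data["target_ids"].to(device, dtype=torch.long)
        y_ids = y[:, :-1].contiguous()         # teacher-forced decoder inputs
        lm_labels = y[:, 1:].clone().detach()  # shifted targets used as labels
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # ignore padding in the loss
        ids = data["source_ids"].to(device, dtype=torch.long)
        mask = data["source_mask"].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask,
                        decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]  # first element of the model output is the loss

        if step % 10 == 0:
            wandb.log({"Training Loss": loss.item()})      # logged every 10 steps
        if step % 500 == 0:
            print(f"Epoch: {epoch}, Loss: {loss.item()}")  # printed every 500 steps

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```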
114 | ### Validating the Model Performance: Function 115 | 116 | During the validation stage we pass the unseen data (the testing dataset), the trained model, the tokenizer and the device details to the function to perform the validation run. This step generates new summaries for data that the model has not seen during training. 117 | 118 | This function is called in the `main()`. 119 | 120 | This unseen data is the 20% of `news_summary.csv` which was separated out during the dataset creation stage. 121 | During the validation stage the weights of the model are not updated. We use the `generate` method to produce the new summary text. 122 | 123 | It relies on the beam-search decoding method developed for sequence generation with models that have a language-model head. 124 | 125 | The generated text and the original summary are decoded from tokens to text and returned to the `main()`. A sketch of this validation loop follows. 126 | 127 |
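A minimal sketch of the validation/generation loop described in this section, again reusing the assumed field names from the dataset sketch. The decoding parameters passed to `generate` (beam width, maximum summary length, penalties) are illustrative values chosen to show beam-search decoding, not the notebook's exact settings.

```python
import torch

def validate(epoch, tokenizer, model, device, loader):
    """Generate summaries for the held-out split; no weight updates happen here."""
    model.eval()
    predictions, actuals = [], []
    with torch.no_grad():
        for step, data in enumerate(loader, 0):
            y = data["target_ids"].to(device, dtype=torch.long)
            ids = data["source_ids"].to(device, dtype=torch.long)
            mask = data["source_mask"].to(device, dtype=torch.long)

            # Beam-search decoding through the language-model head
            generated_ids = model.generate(
                input_ids=ids,
                attention_mask=mask,
                max_length=150,
                num_beams=2,
                repetition_penalty=2.5,
                length_penalty=1.0,
                early_stopping=True,
            )
            preds = [tokenizer.decode(g, skip_special_tokens=True,
                                      clean_up_tokenization_spaces=True)
                     for g in generated_ids]
            targets = [tokenizer.decode(t, skip_special_tokens=True,
                                        clean_up_tokenization_spaces=True)
                       for t in y]

            if step % 100 == 0:
                print(f"Completed {step} validation steps")

            predictions.extend(preds)
            actuals.extend(targets)
    return predictions, actuals
```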
128 | ### Main Function 129 | 130 | The `main()`, as the name suggests, is the central place where all the functions/flows created above in the notebook are executed. The following steps are executed in the `main()`: 131 | 132 | 133 | 134 | #### Initializing WandB 135 | 136 | * The `main()` begins by initializing a WandB run under a specific project. A new run is initiated for each execution of this command. 137 | 138 | **[WandB Service](https://www.wandb.com/)** 139 | 140 | * This service has been created to track ML experiments, optimize them and save artifacts. It is designed to integrate seamlessly with all the major machine learning and deep learning frameworks. Each script can be organized into a *Project*, and each execution of the script is registered as a *run* in the respective project. 141 | 142 | * The service can be configured to log several default metrics, such as hardware usage and the gradients and weights of the network. 143 | 144 | * It can also be used to log user-defined metrics, such as the loss in `train()`. 145 | 146 | 147 | 148 | 149 | * Visit the project page to see the details of the different runs and what information is logged by the service. 150 | 151 | * Following the initialization of the WandB service, we define configuration parameters that will be used across the tutorial, such as `batch_size`, `epoch`, `learning_rate` etc. 152 | 153 | * These parameters are also passed to the WandB config. The config construct with all the parameters can be optimized using the Sweep service from WandB. Currently, that is out of scope for this tutorial. 154 | 155 | * Next we define seed values so that the experiment and its results can be reproduced. 156 | 157 | 158 | 159 | #### Importing and Pre-Processing the domain data 160 | 161 | We will be working with the data and preparing it for fine-tuning purposes. 162 | *Assuming that the `news_summary.csv` is already downloaded in your `data` folder* 163 | 164 | * The file is imported as a dataframe and given headers as per the documentation. 165 | * The file is cleaned to remove the unwanted columns. 166 | * A new string, `summarize: `, is prepended to the main article column, before the actual article text. This is done because **T5** uses the same formatting for its summarization training data. 167 | * The final dataframe will be something like this: 168 | 169 | |text|ctext| 170 | |--|--| 171 | |summary-1|summarize: article 1| 172 | |summary-2|summarize: article 2| 173 | |summary-3|summarize: article 3| 174 | 175 | * The top 5 rows of the dataframe are printed to the console. 176 | 177 | 178 | #### Creation of Dataset and Dataloader 179 | 180 | * The updated dataframe is divided in an 80-20 ratio between training and validation. 181 | * Both dataframes are passed to the `CustomDataset` class for tokenization of the news articles and their summaries. 182 | * The tokenization is done using the length parameters passed to the class. 183 | * Train and validation parameters are defined and passed to the PyTorch `DataLoader` construct to create the `train` and `validation` data loaders. 184 | * These dataloaders will be passed to `train()` and `validate()` respectively for the training and validation steps. 185 | * The shapes of the datasets are printed to the console. 186 | 187 | 188 | 189 | #### Neural Network and Optimizer 190 | 191 | * In this stage we define the model and the optimizer that will be used for training and for updating the weights of the network. 192 | * We are using the `t5-base` transformer model for our project. You can read about the `T5 model` and its features above. 193 | * We use the `T5ForConditionalGeneration.from_pretrained("t5-base")` command to define our model. `T5ForConditionalGeneration` adds a language-model head to our `T5 model`. The language-model head allows us to generate text based on the training of the `T5 model`. 194 | * We are using the `Adam` optimizer for our project. This has been a standard for all our tutorials and is something that can be changed to see how different optimizers perform with different learning rates. 195 | * There is also scope for doing more with the optimizer, such as weight decay and momentum, to dynamically update the learning rate and other parameters. All those concepts have been kept out of scope for these tutorials. 196 | 197 | 198 | 199 | #### Training Model and Logging to WandB 200 | 201 | * Now we log all the metrics to the WandB project that we initialized above. 202 | * After that we call `train()` with all the necessary parameters. 203 | * The loss at every 500th step is printed to the console. 204 | * The loss at every 10th step is logged as Loss in the WandB service. 205 | 206 | 207 | 208 | #### Validation and generation of Summary 209 | 210 | * After the training is completed, the validation step is initiated. 211 | * As defined in the validation function, the model weights are not updated. We use the fine-tuned model to generate new summaries based on the article text. 212 | * After every 100th step, a count of how many steps are complete is printed to the console. 213 | * The original summaries and the generated summaries are converted into lists and returned to the main function.
214 | * Both the lists are used to create the final dataframe with 2 columns **Generated Summary** and **Actual Summary** 215 | * The dataframe is saved as a csv file in the local drive. 216 | * A qualitative analysis can be done with the Dataframe. 217 | 218 | 219 | 220 | 221 | ### Examples of the Summary Generated from the model 222 | 223 | ##### Example 1 224 | 225 | **Original Text** 226 | New Delhi, Apr 25 (PTI) Union minister Vijay Goel today batted for the unification of the three municipal corporations in the national capital saying a discussion over the issue was pertinent. The BJP leader, who was confident of a good show by his party in the MCD polls, the results of which will be declared tomorrow, said the civic bodies needed to be "revamped" in order to deliver the services to the people more effectively. The first thing needed was a discussion on the unification of the three municipal corporations and there should also be an end to the practice of sending Delhi government officials to serve in the civic bodies, said the Union Minister of State (Independent Charge) for Youth Affairs and Sports. "Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged," he said, referring to the north, south and east Delhi municipal corporations. The erstwhile Municipal Corporation of Delhi (MCD) was trifurcated into NDMC, SDMC and EDMC by the then Sheila Dikshit-led Delhi government in 2012. Goel predicted a "thumping" victory for the BJP in the MCD polls. He said the newly-elected BJP councillors will be trained on the functioning of the civic bodies and dealing with the bureaucracy. 227 | 228 | 229 | **Original Summary** 230 | Union Minister Vijay Goel has favoured unification of three MCDs ? North, South and East ? in order to deliver the services more effectively. "Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged," he said. MCD was trifurcated into EDMC, NDMC and SDMC in 2012. 231 | 232 | **Generated Summary** 233 | BJP leader Vijay Goel on Saturday batted for the unification of three municipal corporations in the national capital saying a discussion over this was pertinent. "Barring one, two other civic bodies have been incurring losses," said Goels. The erstwhile Municipal Corporations of Delhi (MCD) were trifurcated into NDMC and SDMC by the then Sheilha Dikshi-led government in 2012. Notably, the MCD poll results will be declared tomorrow. 234 | 235 | 236 | 237 | 238 | ##### Example 2 239 | 240 | **Original Text** 241 | After much wait, the first UDAN flight took off from Shimla today after being flagged off by Prime Minister Narendra Modi.The flight will be operated by Alliance Air, the regional arm of Air India. PM Narendra Modi handed over boarding passes to some of passengers travelling via the first UDAN flight at the Shimla airport.Tomorrow PM @narendramodi will flag off the first UDAN flight under the Regional Connectivity Scheme, on Shimla-Delhi sector.Air India yesterday opened bookings for the first launch flight from Shimla to Delhi with all inclusive fares starting at Rs2,036.THE GREAT 'UDAN'The UDAN (Ude Desh ka Aam Naagrik) scheme seeks to make flying more affordable for the common people, holding a plan to connect over 45 unserved and under-served airports.Under UDAN, 50 per cent of the seats on each flight would have a cap of Rs 2,500 per seat/hour. 
The government has also extended subsidy in the form of viability gap funding to the operators flying on these routes.The scheme was launched to "make air travel accessible to citizens in regionally important cities," and has been described as "a first-of-its-kind scheme globally to stimulate regional connectivity through a market-based mechanism." Report have it the first flight today will not be flying at full capacity on its 70-seater ATR airplane because of payload restrictions related to the short Shimla airfield.|| Read more ||Udan scheme: Now you can fly to these 43 cities, see the full list hereUDAN scheme to fly hour-long flights capped at Rs 2,500 to smaller cities 242 | 243 | 244 | **Original Summary** 245 | PM Narendra Modi on Thursday launched Ude Desh ka Aam Nagrik (UDAN) scheme for regional flight connectivity by flagging off the inaugural flight from Shimla to Delhi. Under UDAN, government will connect small towns by air with 50% plane seats' fare capped at?2,500 for a one-hour journey of 500 kilometres. UDAN will connect over 45 unserved and under-served airports. 246 | 247 | **Generated Summary** 248 | UDAN (Ude Desh Ka Aam Naagrik) scheme, launched to make air travel accessible in regionally important cities under the Regional Connectivity Scheme, took off from Shimla on Tuesday. The first flight will be operated by Alliance Air, which is the regional arm of India's Air India. Under the scheme, 50% seats would have?2,500 per seat/hour and 50% of the seats would have capped at this rate. It was also extended subsidy in form-based funding for operators flying these routes as well. 249 | 250 | 251 | 252 | ##### Example 3 253 | 254 | **Original Text** 255 | New Delhi, Apr 25 (PTI) The Income Tax department has issued a Rs 24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting a special audit of the company. The department, as part of a special investigation and audit into the account books of AVL, found that an income of over Rs 48,000 crore for a particular assessment year was allegedly not reflected in the record books of the firm and hence it raised a fresh tax demand and penalty amount on it. A Sahara Group spokesperson confirmed the development to PTI. "Yes, the Income Tax Department has raised Rs 48,085.79 crores to the income of the Aamby Valley Limited with a total demand of income tax of Rs 24,646.96 crores on the Aamby Valley Limited," the spokesperson said in a brief statement. Officials said the notice was issued by the taxman in January this year after the special audit of AVLs income for the Assessment Year 2012-13 found that the parent firm had allegedly floated a clutch of Special Purpose Vehicles whose incomes were later accounted on the account of AVL as they were merged with the former in due course of time. The AVL, in its income return filed for AY 2012-13, had reflected a loss of few crores but the special I-T audit brought up the added income, a senior official said. The Supreme Court, last week, had asked the Bombay High Courts official liquidator to sell the Rs 34,000 crore worth of properties of Aamby Valley owned by the Sahara Group and directed its chief Subrata Roy to personally appear before it on April 28. 256 | 257 | 258 | **Original Summary** 259 | The Income Tax Department has issued a ?24,646 crore tax demand notice to Sahara Group's Aamby Valley Limited. The department's audit found that an income of over ?48,000 crore for the assessment year 2012-13 was not reflected in the record books of the firm. 
A week ago, the SC ordered Bombay HC to auction Sahara's Aamby Valley worth ?34,000 crore. 260 | 261 | **Generated Summary** 262 | the Income Tax department has issued a?24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting an audit of the company. The notice was issued in January this year after the special audit found that the parent firm had floated Special Purpose Vehicle income for the Assessment Year 2012-13 and later accounted on its account as they were merged with the former. "Yes...the Income Tax Department raised Rs48,085.79 crores to the income," he added earlier said at the notice. 263 | -------------------------------------------------------------------------------- /Fake-News-Detection/fake-news-detection.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nimport tensorflow as tf\nfrom tensorflow import keras\n\nfrom keras.preprocessing.text import Tokenizer\nfrom keras.preprocessing.sequence import pad_sequences\nfrom sklearn.model_selection import train_test_split\nimport matplotlib.pyplot as plt\n\n# Input data files are available in the \"../input/\" directory.\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# Any results you write to the current directory are saved as output.","execution_count":1,"outputs":[{"output_type":"stream","text":"Using TensorFlow backend.\n","name":"stderr"},{"output_type":"stream","text":"/kaggle/input/fake-and-real-news-dataset/Fake.csv\n/kaggle/input/fake-and-real-news-dataset/True.csv\n","name":"stdout"}]},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true},"cell_type":"code","source":"import warnings\nwarnings.filterwarnings('ignore')","execution_count":2,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"true=pd.read_csv('../input/fake-and-real-news-dataset/True.csv')\nfake=pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')\n","execution_count":3,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"true","execution_count":4,"outputs":[{"output_type":"execute_result","execution_count":4,"data":{"text/plain":" title \\\n0 As U.S. budget fight looms, Republicans flip t... \n1 U.S. military to accept transgender recruits o... \n2 Senior U.S. Republican senator: 'Let Mr. Muell... \n3 FBI Russia probe helped by Australian diplomat... \n4 Trump wants Postal Service to charge 'much mor... \n... ... \n21412 'Fully committed' NATO backs new U.S. approach... \n21413 LexisNexis withdrew two products from Chinese ... \n21414 Minsk cultural hub becomes haven from authorities \n21415 Vatican upbeat on possibility of Pope Francis ... \n21416 Indonesia to buy $1.14 billion worth of Russia... \n\n text subject \\\n0 WASHINGTON (Reuters) - The head of a conservat... politicsNews \n1 WASHINGTON (Reuters) - Transgender people will... politicsNews \n2 WASHINGTON (Reuters) - The special counsel inv... politicsNews \n3 WASHINGTON (Reuters) - Trump campaign adviser ... 
politicsNews \n4 SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews \n... ... ... \n21412 BRUSSELS (Reuters) - NATO allies on Tuesday we... worldnews \n21413 LONDON (Reuters) - LexisNexis, a provider of l... worldnews \n21414 MINSK (Reuters) - In the shadow of disused Sov... worldnews \n21415 MOSCOW (Reuters) - Vatican Secretary of State ... worldnews \n21416 JAKARTA (Reuters) - Indonesia will buy 11 Sukh... worldnews \n\n date \n0 December 31, 2017 \n1 December 29, 2017 \n2 December 31, 2017 \n3 December 30, 2017 \n4 December 29, 2017 \n... ... \n21412 August 22, 2017 \n21413 August 22, 2017 \n21414 August 22, 2017 \n21415 August 22, 2017 \n21416 August 22, 2017 \n\n[21417 rows x 4 columns]","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
titletextsubjectdate
0As U.S. budget fight looms, Republicans flip t...WASHINGTON (Reuters) - The head of a conservat...politicsNewsDecember 31, 2017
1U.S. military to accept transgender recruits o...WASHINGTON (Reuters) - Transgender people will...politicsNewsDecember 29, 2017
2Senior U.S. Republican senator: 'Let Mr. Muell...WASHINGTON (Reuters) - The special counsel inv...politicsNewsDecember 31, 2017
3FBI Russia probe helped by Australian diplomat...WASHINGTON (Reuters) - Trump campaign adviser ...politicsNewsDecember 30, 2017
4Trump wants Postal Service to charge 'much mor...SEATTLE/WASHINGTON (Reuters) - President Donal...politicsNewsDecember 29, 2017
...............
21412'Fully committed' NATO backs new U.S. approach...BRUSSELS (Reuters) - NATO allies on Tuesday we...worldnewsAugust 22, 2017
21413LexisNexis withdrew two products from Chinese ...LONDON (Reuters) - LexisNexis, a provider of l...worldnewsAugust 22, 2017
21414Minsk cultural hub becomes haven from authoritiesMINSK (Reuters) - In the shadow of disused Sov...worldnewsAugust 22, 2017
21415Vatican upbeat on possibility of Pope Francis ...MOSCOW (Reuters) - Vatican Secretary of State ...worldnewsAugust 22, 2017
21416Indonesia to buy $1.14 billion worth of Russia...JAKARTA (Reuters) - Indonesia will buy 11 Sukh...worldnewsAugust 22, 2017
\n

21417 rows × 4 columns

\n
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"fake","execution_count":5,"outputs":[{"output_type":"execute_result","execution_count":5,"data":{"text/plain":" title \\\n0 Donald Trump Sends Out Embarrassing New Year’... \n1 Drunk Bragging Trump Staffer Started Russian ... \n2 Sheriff David Clarke Becomes An Internet Joke... \n3 Trump Is So Obsessed He Even Has Obama’s Name... \n4 Pope Francis Just Called Out Donald Trump Dur... \n... ... \n23476 McPain: John McCain Furious That Iran Treated ... \n23477 JUSTICE? Yahoo Settles E-mail Privacy Class-ac... \n23478 Sunnistan: US and Allied ‘Safe Zone’ Plan to T... \n23479 How to Blow $700 Million: Al Jazeera America F... \n23480 10 U.S. Navy Sailors Held by Iranian Military ... \n\n text subject \\\n0 Donald Trump just couldn t wish all Americans ... News \n1 House Intelligence Committee Chairman Devin Nu... News \n2 On Friday, it was revealed that former Milwauk... News \n3 On Christmas day, Donald Trump announced that ... News \n4 Pope Francis used his annual Christmas Day mes... News \n... ... ... \n23476 21st Century Wire says As 21WIRE reported earl... Middle-east \n23477 21st Century Wire says It s a familiar theme. ... Middle-east \n23478 Patrick Henningsen 21st Century WireRemember ... Middle-east \n23479 21st Century Wire says Al Jazeera America will... Middle-east \n23480 21st Century Wire says As 21WIRE predicted in ... Middle-east \n\n date \n0 December 31, 2017 \n1 December 31, 2017 \n2 December 30, 2017 \n3 December 29, 2017 \n4 December 25, 2017 \n... ... \n23476 January 16, 2016 \n23477 January 16, 2016 \n23478 January 15, 2016 \n23479 January 14, 2016 \n23480 January 12, 2016 \n\n[23481 rows x 4 columns]","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
titletextsubjectdate
0Donald Trump Sends Out Embarrassing New Year’...Donald Trump just couldn t wish all Americans ...NewsDecember 31, 2017
1Drunk Bragging Trump Staffer Started Russian ...House Intelligence Committee Chairman Devin Nu...NewsDecember 31, 2017
2Sheriff David Clarke Becomes An Internet Joke...On Friday, it was revealed that former Milwauk...NewsDecember 30, 2017
3Trump Is So Obsessed He Even Has Obama’s Name...On Christmas day, Donald Trump announced that ...NewsDecember 29, 2017
4Pope Francis Just Called Out Donald Trump Dur...Pope Francis used his annual Christmas Day mes...NewsDecember 25, 2017
...............
23476McPain: John McCain Furious That Iran Treated ...21st Century Wire says As 21WIRE reported earl...Middle-eastJanuary 16, 2016
23477JUSTICE? Yahoo Settles E-mail Privacy Class-ac...21st Century Wire says It s a familiar theme. ...Middle-eastJanuary 16, 2016
23478Sunnistan: US and Allied ‘Safe Zone’ Plan to T...Patrick Henningsen 21st Century WireRemember ...Middle-eastJanuary 15, 2016
23479How to Blow $700 Million: Al Jazeera America F...21st Century Wire says Al Jazeera America will...Middle-eastJanuary 14, 2016
2348010 U.S. Navy Sailors Held by Iranian Military ...21st Century Wire says As 21WIRE predicted in ...Middle-eastJanuary 12, 2016
\n

23481 rows × 4 columns

\n
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"true['result']=1\nfake['result']=0","execution_count":6,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"true.head()","execution_count":7,"outputs":[{"output_type":"execute_result","execution_count":7,"data":{"text/plain":" title \\\n0 As U.S. budget fight looms, Republicans flip t... \n1 U.S. military to accept transgender recruits o... \n2 Senior U.S. Republican senator: 'Let Mr. Muell... \n3 FBI Russia probe helped by Australian diplomat... \n4 Trump wants Postal Service to charge 'much mor... \n\n text subject \\\n0 WASHINGTON (Reuters) - The head of a conservat... politicsNews \n1 WASHINGTON (Reuters) - Transgender people will... politicsNews \n2 WASHINGTON (Reuters) - The special counsel inv... politicsNews \n3 WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews \n4 SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews \n\n date result \n0 December 31, 2017 1 \n1 December 29, 2017 1 \n2 December 31, 2017 1 \n3 December 30, 2017 1 \n4 December 29, 2017 1 ","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
titletextsubjectdateresult
0As U.S. budget fight looms, Republicans flip t...WASHINGTON (Reuters) - The head of a conservat...politicsNewsDecember 31, 20171
1U.S. military to accept transgender recruits o...WASHINGTON (Reuters) - Transgender people will...politicsNewsDecember 29, 20171
2Senior U.S. Republican senator: 'Let Mr. Muell...WASHINGTON (Reuters) - The special counsel inv...politicsNewsDecember 31, 20171
3FBI Russia probe helped by Australian diplomat...WASHINGTON (Reuters) - Trump campaign adviser ...politicsNewsDecember 30, 20171
4Trump wants Postal Service to charge 'much mor...SEATTLE/WASHINGTON (Reuters) - President Donal...politicsNewsDecember 29, 20171
\n
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"fake.head()","execution_count":8,"outputs":[{"output_type":"execute_result","execution_count":8,"data":{"text/plain":" title \\\n0 Donald Trump Sends Out Embarrassing New Year’... \n1 Drunk Bragging Trump Staffer Started Russian ... \n2 Sheriff David Clarke Becomes An Internet Joke... \n3 Trump Is So Obsessed He Even Has Obama’s Name... \n4 Pope Francis Just Called Out Donald Trump Dur... \n\n text subject \\\n0 Donald Trump just couldn t wish all Americans ... News \n1 House Intelligence Committee Chairman Devin Nu... News \n2 On Friday, it was revealed that former Milwauk... News \n3 On Christmas day, Donald Trump announced that ... News \n4 Pope Francis used his annual Christmas Day mes... News \n\n date result \n0 December 31, 2017 0 \n1 December 31, 2017 0 \n2 December 30, 2017 0 \n3 December 29, 2017 0 \n4 December 25, 2017 0 ","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
titletextsubjectdateresult
0Donald Trump Sends Out Embarrassing New Year’...Donald Trump just couldn t wish all Americans ...NewsDecember 31, 20170
1Drunk Bragging Trump Staffer Started Russian ...House Intelligence Committee Chairman Devin Nu...NewsDecember 31, 20170
2Sheriff David Clarke Becomes An Internet Joke...On Friday, it was revealed that former Milwauk...NewsDecember 30, 20170
3Trump Is So Obsessed He Even Has Obama’s Name...On Christmas day, Donald Trump announced that ...NewsDecember 29, 20170
4Pope Francis Just Called Out Donald Trump Dur...Pope Francis used his annual Christmas Day mes...NewsDecember 25, 20170
\n
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df=pd.concat([true,fake])","execution_count":9,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.isna().sum()","execution_count":10,"outputs":[{"output_type":"execute_result","execution_count":10,"data":{"text/plain":"title 0\ntext 0\nsubject 0\ndate 0\nresult 0\ndtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df['text']=df['title']+\"\"+df['text']+\"\"+df['subject']\ndel df['title']\ndel df['date']\ndel df['subject']\n","execution_count":11,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df","execution_count":12,"outputs":[{"output_type":"execute_result","execution_count":12,"data":{"text/plain":" text result\n0 As U.S. budget fight looms, Republicans flip t... 1\n1 U.S. military to accept transgender recruits o... 1\n2 Senior U.S. Republican senator: 'Let Mr. Muell... 1\n3 FBI Russia probe helped by Australian diplomat... 1\n4 Trump wants Postal Service to charge 'much mor... 1\n... ... ...\n23476 McPain: John McCain Furious That Iran Treated ... 0\n23477 JUSTICE? Yahoo Settles E-mail Privacy Class-ac... 0\n23478 Sunnistan: US and Allied ‘Safe Zone’ Plan to T... 0\n23479 How to Blow $700 Million: Al Jazeera America F... 0\n23480 10 U.S. Navy Sailors Held by Iranian Military ... 0\n\n[44898 rows x 2 columns]","text/html":"
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
textresult
0As U.S. budget fight looms, Republicans flip t...1
1U.S. military to accept transgender recruits o...1
2Senior U.S. Republican senator: 'Let Mr. Muell...1
3FBI Russia probe helped by Australian diplomat...1
4Trump wants Postal Service to charge 'much mor...1
.........
23476McPain: John McCain Furious That Iran Treated ...0
23477JUSTICE? Yahoo Settles E-mail Privacy Class-ac...0
23478Sunnistan: US and Allied ‘Safe Zone’ Plan to T...0
23479How to Blow $700 Million: Al Jazeera America F...0
2348010 U.S. Navy Sailors Held by Iranian Military ...0
\n

44898 rows × 2 columns

\n
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"sentence = df['text'].values.tolist()\nresult= df['result'].values.tolist()","execution_count":13,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"\nX_train, X_test, Y_train,Y_test= train_test_split(sentence, result, test_size=0.2)","execution_count":14,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"Y_train=np.array(Y_train)\nY_test=np.array(Y_test)","execution_count":15,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"tokenizer=Tokenizer(num_words=10000, oov_token='')\ntokenizer.fit_on_texts(X_train)\nword_index=tokenizer.word_index\nsequences=tokenizer.texts_to_sequences(X_train)\npadded_train=pad_sequences(sequences,5000,truncating='post')","execution_count":16,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"sequences_test=tokenizer.texts_to_sequences(X_test)\npadded_test=pad_sequences(sequences_test,5000,truncating='post')","execution_count":17,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"padded_test.shape","execution_count":18,"outputs":[{"output_type":"execute_result","execution_count":18,"data":{"text/plain":"(8980, 5000)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"Y_test.shape","execution_count":19,"outputs":[{"output_type":"execute_result","execution_count":19,"data":{"text/plain":"(8980,)"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"model= tf.keras.Sequential([\n tf.keras.layers.Embedding(10000,16,input_length=5000),\n tf.keras.layers.Flatten(),\n tf.keras.layers.Dense(6, activation='relu'),\n tf.keras.layers.Dense(1,activation='sigmoid')\n])","execution_count":20,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\nmodel.summary()","execution_count":21,"outputs":[{"output_type":"stream","text":"Model: \"sequential\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding (Embedding) (None, 5000, 16) 160000 \n_________________________________________________________________\nflatten (Flatten) (None, 80000) 0 \n_________________________________________________________________\ndense (Dense) (None, 6) 480006 \n_________________________________________________________________\ndense_1 (Dense) (None, 1) 7 \n=================================================================\nTotal params: 640,013\nTrainable params: 640,013\nNon-trainable params: 0\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"history=model.fit(padded_train, Y_train, epochs=10, validation_data=(padded_test, Y_test))","execution_count":22,"outputs":[{"output_type":"stream","text":"Train on 35918 samples, validate on 8980 samples\nEpoch 1/10\n35918/35918 [==============================] - 27s 746us/sample - loss: 0.0729 - accuracy: 0.9732 - val_loss: 0.0019 - val_accuracy: 0.9998\nEpoch 2/10\n35918/35918 [==============================] - 24s 674us/sample - loss: 8.5399e-04 - accuracy: 0.9999 - val_loss: 6.5356e-04 - val_accuracy: 0.9999\nEpoch 3/10\n35918/35918 [==============================] - 25s 690us/sample - loss: 1.9084e-04 - accuracy: 1.0000 - val_loss: 4.1836e-04 - val_accuracy: 0.9998\nEpoch 4/10\n35918/35918 
[==============================] - 24s 661us/sample - loss: 7.5011e-05 - accuracy: 1.0000 - val_loss: 3.2069e-04 - val_accuracy: 0.9999\nEpoch 5/10\n35918/35918 [==============================] - 23s 646us/sample - loss: 2.8669e-05 - accuracy: 1.0000 - val_loss: 2.5717e-04 - val_accuracy: 0.9999\nEpoch 6/10\n35918/35918 [==============================] - 24s 678us/sample - loss: 1.4836e-05 - accuracy: 1.0000 - val_loss: 2.1666e-04 - val_accuracy: 0.9999\nEpoch 7/10\n35918/35918 [==============================] - 23s 644us/sample - loss: 7.4957e-06 - accuracy: 1.0000 - val_loss: 2.0827e-04 - val_accuracy: 1.0000\nEpoch 8/10\n35918/35918 [==============================] - 24s 672us/sample - loss: 3.6462e-06 - accuracy: 1.0000 - val_loss: 1.7504e-04 - val_accuracy: 0.9999\nEpoch 9/10\n35918/35918 [==============================] - 23s 649us/sample - loss: 2.0849e-06 - accuracy: 1.0000 - val_loss: 1.6841e-04 - val_accuracy: 0.9999\nEpoch 10/10\n35918/35918 [==============================] - 23s 641us/sample - loss: 1.0102e-06 - accuracy: 1.0000 - val_loss: 1.5886e-04 - val_accuracy: 0.9999\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"score = model.evaluate(padded_test, Y_test)","execution_count":24,"outputs":[{"output_type":"stream","text":"8980/8980 [==============================] - 2s 180us/sample - loss: 1.5886e-04 - accuracy: 0.9999\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"print('Test accuracy:', score[1])","execution_count":25,"outputs":[{"output_type":"stream","text":"Test accuracy: 0.99988866\n","name":"stdout"}]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /NLP-LSTM-Alexa-Reviews/alexa-reviews-lstm.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session","execution_count":1,"outputs":[{"output_type":"stream","text":"/kaggle/input/amazon-alexa-reviews/amazon_alexa.tsv\n","name":"stdout"}]},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true},"cell_type":"code","source":"# Importing the dataset\ndf = pd.read_csv('/kaggle/input/amazon-alexa-reviews/amazon_alexa.tsv', delimiter = '\\t', quoting = 3)","execution_count":3,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df","execution_count":4,"outputs":[{"output_type":"execute_result","execution_count":4,"data":{"text/plain":" rating date variation \\\n0 5 31-Jul-18 Charcoal Fabric \n1 5 31-Jul-18 Charcoal Fabric \n2 4 31-Jul-18 Walnut Finish \n3 5 31-Jul-18 Charcoal Fabric \n4 5 31-Jul-18 Charcoal Fabric \n... ... ... ... \n3145 5 30-Jul-18 Black Dot \n3146 5 30-Jul-18 Black Dot \n3147 5 30-Jul-18 Black Dot \n3148 5 30-Jul-18 White Dot \n3149 4 29-Jul-18 Black Dot \n\n verified_reviews feedback \n0 Love my Echo! 1 \n1 Loved it! 1 \n2 \"Sometimes while playing a game, you can answe... 1 \n3 \"I have had a lot of fun with this thing. My 4... 1 \n4 Music 1 \n... ... ... \n3145 \"Perfect for kids, adults and everyone in betw... 1 \n3146 \"Listening to music, searching locations, chec... 1 \n3147 \"I do love these things, i have them running m... 1 \n3148 \"Only complaint I have is that the sound quali... 1 \n3149 Good 1 \n\n[3150 rows x 5 columns]","text/html":"
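Aside (not part of the original notebook): in the `pd.read_csv` call above, `quoting=3` is the integer value of `csv.QUOTE_NONE`, so pandas treats quote characters inside the review text as literal characters rather than as field delimiters. A minimal equivalent sketch using the named constant:

```python
# Equivalent to the cell above, with the quoting mode spelled out explicitly.
# csv.QUOTE_NONE has the integer value 3, which is what quoting=3 means.
import csv
import pandas as pd

df = pd.read_csv('/kaggle/input/amazon-alexa-reviews/amazon_alexa.tsv',
                 delimiter='\t', quoting=csv.QUOTE_NONE)
```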
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.isna().sum()","execution_count":5,"outputs":[{"output_type":"execute_result","execution_count":5,"data":{"text/plain":"rating 0\ndate 0\nvariation 0\nverified_reviews 0\nfeedback 0\ndtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.columns","execution_count":7,"outputs":[{"output_type":"execute_result","execution_count":7,"data":{"text/plain":"Index(['rating', 'date', 'variation', 'verified_reviews', 'feedback'], dtype='object')"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"del df['date']","execution_count":9,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"del df['variation']\ndel df['feedback']\n","execution_count":10,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df","execution_count":11,"outputs":[{"output_type":"execute_result","execution_count":11,"data":{"text/plain":" rating verified_reviews\n0 5 Love my Echo!\n1 5 Loved it!\n2 4 \"Sometimes while playing a game, you can answe...\n3 5 \"I have had a lot of fun with this thing. My 4...\n4 5 Music\n... ... ...\n3145 5 \"Perfect for kids, adults and everyone in betw...\n3146 5 \"Listening to music, searching locations, chec...\n3147 5 \"I do love these things, i have them running m...\n3148 5 \"Only complaint I have is that the sound quali...\n3149 4 Good\n\n[3150 rows x 2 columns]","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"def sentiment_rating(rating):\n \n if(int(rating) == 1 or int(rating) == 2 or int(rating) == 3):\n return 0\n else: \n return 1\ndf.rating = df.rating.apply(sentiment_rating)","execution_count":16,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.rating.value_counts()","execution_count":17,"outputs":[{"output_type":"execute_result","execution_count":17,"data":{"text/plain":"1 2741\n0 409\nName: rating, dtype: int64"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"df.columns = ['Liked','Review']","execution_count":18,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"df","execution_count":19,"outputs":[{"output_type":"execute_result","execution_count":19,"data":{"text/plain":" Liked Review\n0 1 Love my Echo!\n1 1 Loved it!\n2 1 \"Sometimes while playing a game, you can answe...\n3 1 \"I have had a lot of fun with this thing. My 4...\n4 1 Music\n... ... ...\n3145 1 \"Perfect for kids, adults and everyone in betw...\n3146 1 \"Listening to music, searching locations, chec...\n3147 1 \"I do love these things, i have them running m...\n3148 1 \"Only complaint I have is that the sound quali...\n3149 1 Good\n\n[3150 rows x 2 columns]","text/html":"
"},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"X=df.Review.astype('str')\ny=df.Liked","execution_count":20,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nX_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)","execution_count":21,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.preprocessing.text import Tokenizer\nfrom keras.preprocessing.sequence import pad_sequences\nvocab=1000\ntokenizer=Tokenizer(vocab,oov_token=\"\")\ntokenizer.fit_on_texts(X_train)\ntrain_sequence=tokenizer.texts_to_sequences(X_train)\ntest_sequence=tokenizer.texts_to_sequences(X_test)\npadded_train=pad_sequences(train_sequence,maxlen=500)\npadded_test=pad_sequences(test_sequence,maxlen=500)","execution_count":22,"outputs":[{"output_type":"stream","text":"Using TensorFlow backend.\n","name":"stderr"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"from keras.models import Sequential\nfrom keras.layers import Dense,LSTM,Embedding,GlobalAveragePooling1D,Bidirectional\nfrom keras.optimizers import Adam","execution_count":24,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model=Sequential()\nmodel.add(Embedding(vocab,1000))\nmodel.add(Bidirectional(LSTM(units = 32)))\nmodel.add(Dense(128,activation='relu'))\nmodel.add(Dense(128,activation='relu'))\nmodel.add(Dense(1,activation='sigmoid'))\nmodel.compile(optimizer=Adam(lr=0.001),loss='binary_crossentropy',metrics=['accuracy'])","execution_count":25,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"model.summary()","execution_count":26,"outputs":[{"output_type":"stream","text":"Model: \"sequential_1\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_1 (Embedding) (None, None, 1000) 1000000 \n_________________________________________________________________\nbidirectional_1 (Bidirection (None, 64) 264448 \n_________________________________________________________________\ndense_1 (Dense) (None, 128) 8320 \n_________________________________________________________________\ndense_2 (Dense) (None, 128) 16512 \n_________________________________________________________________\ndense_3 (Dense) (None, 1) 129 \n=================================================================\nTotal params: 1,289,409\nTrainable params: 1,289,409\nNon-trainable params: 0\n_________________________________________________________________\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"history = model.fit(padded_train,y_train,validation_data=(padded_test,y_test),epochs=10)","execution_count":27,"outputs":[{"output_type":"stream","text":"/opt/conda/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
\"\n","name":"stderr"},{"output_type":"stream","text":"Train on 2520 samples, validate on 630 samples\nEpoch 1/10\n2520/2520 [==============================] - 91s 36ms/step - loss: 0.3688 - accuracy: 0.8683 - val_loss: 0.2895 - val_accuracy: 0.8905\nEpoch 2/10\n2520/2520 [==============================] - 90s 36ms/step - loss: 0.1908 - accuracy: 0.9179 - val_loss: 0.2617 - val_accuracy: 0.8921\nEpoch 3/10\n2520/2520 [==============================] - 89s 35ms/step - loss: 0.1199 - accuracy: 0.9508 - val_loss: 0.2580 - val_accuracy: 0.8952\nEpoch 4/10\n2520/2520 [==============================] - 90s 36ms/step - loss: 0.0845 - accuracy: 0.9694 - val_loss: 0.3596 - val_accuracy: 0.8873\nEpoch 5/10\n2520/2520 [==============================] - 89s 35ms/step - loss: 0.0670 - accuracy: 0.9718 - val_loss: 0.3868 - val_accuracy: 0.9032\nEpoch 6/10\n2520/2520 [==============================] - 89s 35ms/step - loss: 0.0519 - accuracy: 0.9774 - val_loss: 0.3672 - val_accuracy: 0.9016\nEpoch 7/10\n2520/2520 [==============================] - 88s 35ms/step - loss: 0.0374 - accuracy: 0.9825 - val_loss: 0.4777 - val_accuracy: 0.9032\nEpoch 8/10\n2520/2520 [==============================] - 88s 35ms/step - loss: 0.0409 - accuracy: 0.9821 - val_loss: 0.4225 - val_accuracy: 0.9079\nEpoch 9/10\n2520/2520 [==============================] - 89s 35ms/step - loss: 0.0440 - accuracy: 0.9798 - val_loss: 0.4930 - val_accuracy: 0.9032\nEpoch 10/10\n2520/2520 [==============================] - 90s 36ms/step - loss: 0.0291 - accuracy: 0.9837 - val_loss: 0.5865 - val_accuracy: 0.9063\n","name":"stdout"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"import matplotlib.pyplot as plt\n# summarize history for accuracy\nplt.plot(history.history['accuracy'])\nplt.plot(history.history['val_accuracy'])\nplt.title('model accuracy')\nplt.ylabel('accuracy')\nplt.xlabel('epoch')\nplt.legend(['train', 'test'], loc='upper left')\nplt.show()\n","execution_count":28,"outputs":[{"output_type":"display_data","data":{"text/plain":"
","image/png":"iVBORw0KGgoAAAANSUhEUgAAAYgAAAEWCAYAAAB8LwAVAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXxU9dX48c/JvhCSkASUhH1HRZYQQVzrBmqr1qUu1Kq1aFfbX9vH5Wm1y/O09mm1ta2tte5VoYhYbUsVtS5VkEAAZV+FLGxJICFkT+b8/rg3ZBIGGCCTO5k579drXszdZs7Mi9wz3++59/sVVcUYY4zpLMbrAIwxxoQnSxDGGGMCsgRhjDEmIEsQxhhjArIEYYwxJiBLEMYYYwKyBGEMICLPiMj/BLnvNhG5MNQxGeM1SxDGGGMCsgRhTAQRkTivYzCRwxKE6THcrp3vi8gnIlIrIk+KSD8R+ZeI1IjIWyKS6bf/50RkjYhUici7IjLGb9sEEVnuHvdXIKnTe10uIivdYxeJyLggY7xMRFaIyH4RKRGRH3Xafpb7elXu9lvc9cki8pCIbBeRahH5wF13noiUBvgeLnSf/0hE5onI8yKyH7hFRApEZLH7HjtF5PcikuB3/Cki8qaI7BWR3SJyn4icJCJ1IpLlt98kESkXkfhgPruJPJYgTE9zNXARMBL4LPAv4D4gG+f/87cARGQkMBv4NpADLAD+LiIJ7snyb8BfgD7AS+7r4h47EXgKuAPIAv4EvCYiiUHEVwvcDGQAlwFfFZEr3dcd6Mb7Ozem8cBK97hfAZOAM92Y/gvwBfmdXAHMc9/zBaAV+I77nUwFLgC+5saQBrwFvA70B4YDb6vqLuBd4Dq/150JzFHV5iDjMBHGEoTpaX6nqrtVtQz4D7BEVVeoaiPwCjDB3e8LwD9V9U33BPcrIBnnBDwFiAd+o6rNqjoPWOr3Hl8B/qSqS1S1VVWfBRrd445IVd9V1VWq6lPVT3CS1Lnu5puAt1R1tvu+laq6UkRigNuAu1S1zH3PRe5nCsZiVf2b+571qlqkqh+paouqbsNJcG0xXA7sUtWHVLVBVWtUdYm77VmcpICIxAI34CRRE6UsQZieZrff8/oAy73c5/2B7W0bVNUHlAC57rYy7ThS5Xa/54OA77pdNFUiUgUMcI87IhE5Q0TecbtmqoE7cX7J477GlgCHZeN0cQXaFoySTjGMFJF/iMgut9vpZ0HEAPAqMFZEhuK00qpVtfA4YzIRwBKEiVQ7cE70AIiI4Jwcy4CdQK67rs1Av+clwP+qaobfI0VVZwfxvi8CrwEDVDUdeAxoe58SYFiAYyqAhsNsqwVS/D5HLE73lL/OQzL/EVgPjFDV3jhdcEeLAVVtAObitHS+iLUeop4lCBOp5gKXicgFbpH1uzjdRIuAxUAL8C0RiRORzwMFfsf+GbjTbQ2IiKS6xee0IN43Ddirqg0iUgDc6LftBeBCEbnOfd8sERnvtm6eAh4Wkf4iEisiU92ax0YgyX3/eOAHwNFqIWnAfuCAiIwGvuq37R/ASSLybRFJFJE0ETnDb/tzwC3A54Dng/i8JoJZgjARSVU34PSn/w7nF/pngc+qapOqNgGfxzkR7sOpV8z3O3YZTh3i9+72ze6+wfga8BMRqQHux0lUba9bDFyKk6z24hSoT3c3fw9YhVML2Qv8AohR1Wr3NZ/Aaf3UAh2uagrgeziJqQYn2f3VL4YanO6jzwK7gE3A+X7bP8Qpji936xcmiolNGGSM8Sci/wZeVNUnvI7FeMsShDHmIBGZDLyJU0Op8Toe4y3rYjLGACAiz+LcI/FtSw4GrAVhjDHmMKwFYYwxJqCIGtgrOztbBw8e7HUYxhjTYxQVFVWoaud7a4AISxCDBw9m2bJlXodhjDE9hohsP9w262IyxhgTkCUIY4wxAVmCMMYYE1BE1SACaW5uprS0lIaGBq9DCamkpCTy8vKIj7e5XYwxXSPiE0RpaSlpaWkMHjyYjoN3Rg5VpbKyktLSUoYMGeJ1OMaYCBHxXUwNDQ1kZWVFbHIAEBGysrIivpVkjOleEZ8ggIhODm2i4TMaY7pXxHcxGWNMJGpq8bFxdw2ryqqpqmvmq+cFnAfqhFiCCLGqqipefPFFvva1rx3TcZdeeikvvvgiGRkZIYrMGNNTNLe6yaC0mlVlzmP9zhqaWn0A9E1L5I5zhhIT07U9CZYgQqyqqoo//OEPhySI1tZWYmNjD3vcggULQh2aMUHz+ZT1u2pYum0vDc2txMXGEB8rxMfGEBfj/hsrxMU464+2vW05Psb9190vNkaivru0udXHpt0HWFVW5SaD/azbuZ+mFicZpCXGcWpuOrdMG8xpuemclpvOoKyUkHxvliBC7J577mHLli2MHz+e+Ph4evXqxcknn8zKlStZu3YtV155JSUlJTQ0NHDXXXcxa9YsoH3YkAMHDjBjxgzOOussFi1aRG5uLq+++irJyckefzITyVSV7ZV1fLilgkWbK1m8tZK9tU3d8t4JB5PJ4RJLDAlxMZzUO5H+GcnkZiSTl5lMbkYKuZnJZKbE95gk09LqY9OeA04icFsHa/2SQa/EOE7N7c2Xpg7itLwMJxn0SenylsLhhDRBiMh04BEgFnhCVR/stD0TZy7eYTiTtt+mqqvdbd8BbseZkH0VcKs7qfpx+/Hf17B2x/4TeYlDjO3fmwc+e8phtz/44IOsXr2alStX8u6773LZZZexevXqg5ejPvXUU/Tp04f6+nomT57M1VdfTVZWVofX2LRpE7Nnz+bPf/4z1113HS+//DIzZ87s0s9hzJ79DSzaUsmHmytYtKWSsqp6APr1TuS8kTmcOTybqcOySE+Op6XVR3Or0uLz0dKqNLf6aPEpTS3Ov4fb3ty2vtVHs7tfS6vS7PPR3OLs37a9bf+245v9XrexpZUt5bW8v7GC+ubWDp8jOT6W3EwncfQ/mDySD67r1zuJ2G46wfprafWxufxAh26itTv20+gmg9SEWE7NTefmKYM4Lc9pGQzOSu22ZBBIyBKEiMQCj+LMf1sKLBWR11R1rd9u9wErVfUqd3L1R4ELRCQX+BYwVlXrRWQucD3wTKji7S4FBQUd7lX47W9/yyuvvAJASUkJmzZtOiRBDBkyhPHjxwMwadIktm3b1m3xmshVXd/MR1srWbS5gg+3VLJ5zwEA0pPjmTo0izvOHcqZw7IZlpMatr/IVZWqumbKquop3VdPWVU9ZfvqKauqo6yqnlVl1Ye0fOJihJPSk5zk4Zc4/JNKUvzhu3+D0dLqY0t5rdsyqDrYMmhobk8Gp/RPZ+aUQU43UV46QzxOBoGEsgVRAGxW1a0AIjIHuALwTxBjgZ8DqOp6ERksIv38YksWkWYgBdhxogEd6Zd+d0lNTT34/N133+Wtt95i8eLFpKSkcN555wW8lyExMfHg89jYWOrr67slVhNZ6ptaWbZ9Lx9urmTxlgpWlVXjU0iKj2Hy4D5cMymPacOyGdu/tye/sI+H
iJCZmkBmagKn5qYH3KeuqYUdnRLIjirn+UdbK9m1vwFfp3nTsnsldEgazvOUg897J8cdTJqtPmVLgJZBW8smJSGWU/r35saCQZyW15vTctMZkt2rR3zHoUwQuUCJ33IpcEanfT4GPg98ICIFwCAgT1WLRORXQDFQDyxU1YUhjDVk0tLSqKkJPHtjdXU1mZmZpKSksH79ej766KNujs5EspZWHx+XVrsthAqWb6+iqdVHXIwwfkAG3/jMCKYNy2L8wAwS407sF3M4S0mIY3jfNIb3TQu4vbnVx67qhkOSR1lVPet31vD2uj0Hu4Ha9EqMIzcjmZTEWDbsqqGuyUkGyfFOMri+YMDBAvLQnJ6RDAIJZYII9I10nt/0QeAREVmJU2dYAbS4tYkrgCFAFfCSiMxU1ecPeRORWcAsgIEDB3Zh+F0jKyuLadOmceqpp5KcnEy/fv0Obps+fTqPPfYY48aNY9SoUUyZMsXDSE1P5/MpG3bXHKwhFH66lwONLQCMPbk3N08dxLTh2Uwe0odeiXZ9Spv42BgG9ElhQJ+UgNtVlYoDTe2Jw22JlO6r50BjM9flDzjYTTSsByeDQEI2J7WITAV+pKqXuMv3Aqjqzw+zvwCfAuOAS4Dpqvpld9vNwBRVPeLNBPn5+dp5wqB169YxZsyYE/w0PUM0fVbjnLiK99bx4eZKFm2pYPGWSird/vbBWSmcOTybacOcwnKf1ASPozXhSkSKVDU/0LZQ/oxYCowQkSFAGU6R+cZOgWUAdarahHPF0vuqul9EioEpIpKC08V0AWBTxZmot6emgcXulUYfbm6/0qhvWiLnjMzhzGFZnDk8m9wMuwzanLiQJQhVbRGRbwBv4Fzm+pSqrhGRO93tjwFjgOdEpBWneP1ld9sSEZkHLAdacLqeHg9VrMaEq4bmVj7aWsl7G8v5cHMFG3c7Vxr1Topj6rAsZp0zlGnDsxiW0ytsrzQyPVdIOyJVdQGwoNO6x/yeLwZGHObYB4AHQhmfMeFGVdlSXst7G8t5b2M5S7ZW0tjiIzEuhoIhfbhqQh7ThmdxSv/0iOrrNuHJKlXGeKymoZlFW5xWwnsbyg92Gw3LSeWmMwZx7qgczhjS54SvzTfmWFmCMKabqSprd+4/mBCKtu+jxaekJsQybXg2Xzt/GOeMyDnsVTXGdBdLEMZ0g321TfxncwXvbSjn/U3llNc0As7lp185Zyjnjsxh4sBMEuKiYooW00NYggix4x3uG+A3v/kNs2bNIiXFfkn2NK0+5ePSKt7b4NQSPi6tQhUyUuI5e0QO547M4ZwR2fTtneR1qMYcliWIEDvccN/B+M1vfsPMmTMtQfQQe/Y3HCwu/2dTBdX1zYjA+AEZ3HXBCM4dmcO4vAwrLpsewxJEiPkP933RRRfRt29f5s6dS2NjI1dddRU//vGPqa2t5brrrqO0tJTW1lZ++MMfsnv3bnbs2MH5559PdnY277zzjtcfxXTS1OKjaPu+g0lh3U5npOCctEQuGtuPc0fmcNbwbDLtJjXTQ0VXgvjXPbBrVde+5kmnwYwHD7vZf7jvhQsXMm/ePAoLC1FVPve5z/H+++9TXl5O//79+ec//wk4YzSlp6fz8MMP884775Cdnd21MZvjVrK3jnc3lvP+xnIWba6gtqmVuBghf3Amd08fzbkjcxhzcprdk2AiQnQlCI8tXLiQhQsXMmHCBAAOHDjApk2bOPvss/ne977H3XffzeWXX87ZZ5/tcaQGnKuNquubWVlSxbsbnKSwtaIWgNyMZK6ckMu57jwJNraRiUTR9b/6CL/0u4Oqcu+993LHHXccsq2oqIgFCxZw7733cvHFF3P//fd7EGF08fmUPTWNlFXVBRwKumxfPbXuKJ2JcTFMGZrFzCnOfQlDs8N3jgRjukp0JQgP+A/3fckll/DDH/6Qm266iV69elFWVkZ8fDwtLS306dOHmTNn0qtXL5555pkOx1oX0/FpbGllZ1X7MM6lfpPJ7KhqYGd1Pc2tHQerzEiJp396MoOyUjlzWDZ5mcmM6JdmN6qZqGQJIsT8h/ueMWMGN954I1OnTgWgV69ePP/882zevJnvf//7xMTEEB8fzx//+EcAZs2axYwZMzj55JOtSB3A/oZm59e+36//Ur/ltnsN2ohAv7Qk+mckcfqADC497WRyM9tnFeufkWxdRcb4Cdlw316w4b4j67OW1zR2nELSbxz+sqp6ahpaOuyfEBtD/4wk52Sf3nEqybyMFE5KT7Ib0YzpxKvhvo05Li2tPu57ZRVzl5V2WJ+WGHfwpF8wpA/9M/wTQDLZvRLDbk5fY3oySxAmrDS2tPLtOSv51+pd3HLmYKa5cxvkZiaTnhzvdXjGRJWoSBCqGvFXnERCV2F9Uyt3PF/E+xvLuf/ysdx21hCvQzImqkV8h2xSUhKVlZURcQI9HFWlsrKSpKSeO67P/oZmbn5qCR9sKuf/rh5nycGYMBDxLYi8vDxKS0spLy/3OpSQSkpKIi8vz+swjsve2iZufmoJ63fW8NsbJnD5uP5eh2SMIQoSRHx8PEOG2K/RcLV7fwMzn1hC8d46/nxzPueP7ut1SMYYV8QnCBO+iivruOnJj9h7oIlnbytgytAsr0MyxvixBGE8sWl3DTOfXEJDs48XvjKF8QMyvA7JGNOJJQjT7VaXVXPzU4XExghz75jKqJPSvA7JGBNAxF/FZMLL0m17ueHxj0iOj+UlSw7GhDVrQZhu8/7Gcmb9ZRn905N5/vYz6J+R7HVIxpgjsARhusXrq3fxrdkrGNa3F3/5cgHZvRK9DskYcxSWIEzIvVxUyn+9/Amn56Xz9C0FpKfYkBnG9ASWIExIPbd4G/e/uoZpw7N4/Iv5pNpw2sb0GCEtUovIdBHZICKbReSeANszReQVEflERApF5FS/bRkiMk9E1ovIOhGZGspYTdd79J3N3P/qGi4a248nvzTZkoMxPUzIEoSIxAKPAjOAscANIjK20273AStVdRxwM/CI37ZHgNdVdTRwOrAuVLGarqWq/OL19fzyjQ1cMb4/f7hpos3GZkwPFMoWRAGwWVW3qmoTMAe4otM+Y4G3AVR1PTBYRPqJSG/gHOBJd1uTqlaFMFbTRXw+5f5X1/DHd7dw4xkD+fV144mPtaupjemJQvmXmwuU+C2Xuuv8fQx8HkBECoBBQB4wFCgHnhaRFSLyhIikBnoTEZklIstEZFmkD8gX7lpafXzvpY/5y0fbueOcofzvlafaBD7G9GChTBCBzgydx9x+EMgUkZXAN4EVQAtO8Xwi8EdVnQDUAofUMABU9XFVzVfV/JycnC4L3hybxpZWvv7icuavKON7F4/knhmjI34ODmMiXSirhqXAAL/lPGCH/w6quh+4FUCcs8mn7iMFKFXVJe6u8zhMgjDeq2tq4Y6/FPGfTRU88Nmx3DrNRs81JhKEsgWxFBghIkNEJAG4HnjNfwf3SqUEd/F24H1V3a+qu4ASERnlbrsAWBvCWM1xqq5v5uYnC/lwcwW/vGacJQdjIkjIWhCq2iIi3wDeAGKBp1R1jYj
c6W5/DBgDPCcirTgJ4Mt+L/FN4AU3gWzFbWmY8FF5oJGbnypk4+4afn/jRC497WSvQzLGdCGJpKk48/PzddmyZV6HERV2VTdw0xMfUbqvnse+OInzR9lEP8b0RCJSpKr5gbbZnUvmmG2vrOWmJ5ZQVdfMc7cVcIZN9GNMRLIEYY7Jxt01zHxiCU2tPl78yhmMy7OJfoyJVJYgTNBWlVZz81NLiI+NYe4dUxnZz+ZyMCaSWYIwQSn8dC+3PbOUjJR4Xrj9DAZlBbxv0RgTQSxBmKN6d8Me7ny+iNwMZ6Kfk9Ntoh9jooElCHNEC1bt5K45KxjRN43nbKIfY6KKJQhzWC8tK+Hulz9hwsBMnrplMunJNtGPMdHEEoQJ6JkPP+VHf1/L2SOy+dMXJ5GSYP9VjIk29ldvOlBV/vDuFn75xgYuHtuP3904gcQ4m8vBmGhkCcJ0MH95Gb98YwNXTcjll9eMI87mcjAmalmCMAdV1zfzswXrmDgwg4euPd3mcjAmylmCMAc9vHAD++qaePa2AksOxpiQDvdtepA1O6r5y0fbmTllEKfmpnsdjjEmDFiCMAfnkc5MSeC7F406+gHGmKhgCcLw8vJSirbv4+4Zo0lPsXsdjDEOSxBRrrq+mQf/tZ6JAzO4ZmKe1+EYY8KIFamjnBWmjTGHYy2IKGaFaWPMkViCiFJWmDbGHI0liCg1f0WZFaaNMUdkCSIKVdc383P3jmkrTBtjDseK1FHo129utMK0MeaorAURZdbsqOa5xdusMG2MOSpLEFHE51MesMK0MSZIliCiyPwVZSyzwrQxJkghTRAiMl1ENojIZhG5J8D2TBF5RUQ+EZFCETm10/ZYEVkhIv8IZZzRoK0wPcEK08aYIIUsQYhILPAoMAMYC9wgImM77XYfsFJVxwE3A4902n4XsC5UMUaTtsL0T6841QrTxpighLIFUQBsVtWtqtoEzAGu6LTPWOBtAFVdDwwWkX4AIpIHXAY8EcIYo0JbYfqmM6wwbYwJXigTRC5Q4rdc6q7z9zHweQARKQAGAW39H78B/gvwHelNRGSWiCwTkWXl5eVdEXdEaStMZ6Qk8L2LrTBtjAleKBNEoH4M7bT8IJApIiuBbwIrgBYRuRzYo6pFR3sTVX1cVfNVNT8nJ+eEg440bYXpe6wwbYw5RqG8Ua4UGOC3nAfs8N9BVfcDtwKIiACfuo/rgc+JyKVAEtBbRJ5X1ZkhjDfiOEN5W2HaGHN8QtmCWAqMEJEhIpKAc9J/zX8HEclwtwHcDryvqvtV9V5VzVPVwe5x/7bkcOx+/eZG9tZaYdoYc3yCShAi8rKIXCYiQScUVW0BvgG8gXMl0lxVXSMid4rIne5uY4A1IrIe52qnu44tfHM4a3fst8K0MeaEiGrnskCAnUQuxOkKmgK8BDzjXnUUVvLz83XZsmVeh+E5VeXaxxaztaKWd757ntUejDGHJSJFqpofaFtQLQJVfUtVbwImAtuAN0VkkYjcKiJ29gkz85e7henpVpg2xhy/oLuMRCQLuAWnVrAC56a2icCbIYnMHJfq+mZ+3laYnmSFaWPM8QvqKiYRmQ+MBv4CfFZVd7qb/ioi1qcTRn795kYqa5t45lYbytsYc2KCvcz196r670AbDtd3ZbpfW2F6phWmjTFdINgupjEiktG24A6y97UQxWSOg6py/6ur7Y5pY0yXCTZBfEVVq9oWVHUf8JXQhGSOhxWmjTFdLdgEEePe6QwcHKk14Qj7m25khWljTCgEW4N4A5grIo/hjKd0J/B6yKIyx8QK08aYUAg2QdwN3AF8FWcQvoXYMNxhwQrTxphQCSpBqKoP+KP7MGHCCtPGmFAK9j6IEcDPcSb4SWpbr6pDQxSXCUJbYfr/rh5nhWljTJcLtkj9NE7roQU4H3gO56Y545G2wvT4AVaYNsaERrAJIllV38YZ3G+7qv4I+EzowjJH01aY/p8rbShvY0xoBFukbnCH+t4kIt8AyoC+oQvLHEn7UN4DrTBtjAmZYFsQ3wZSgG8Bk4CZwJdCFZQ5PCtMG2O6y1FbEO5Ncdep6veBA7hThBpv+BemM1LsXkVjTOgctQWhqq3AJP87qY03rDBtjOlOwdYgVgCvishLQG3bSlWdH5KoTEB2x7QxpjsFmyD6AJV0vHJJAUsQ3cQK08aY7hbsndRWd/CQqvLAa1aYNsZ0r2DvpH4ap8XQgare1uURmUO8sqKMpdv28YurT7PCtDGm2wTbxfQPv+dJwFXAjq4Px3S2v6GZny1Yz/gBGVw7aYDX4RhjokiwXUwv+y+LyGzgrZBEZDpwCtONPH3LZCtMG2O6VbA3ynU2AhjYlYGYQ63buZ9nFzmF6dPyrDBtjOlewdYgauhYg9iFM0eECZG2O6bTk+OtMG2M8URQLQhVTVPV3n6PkZ27nQIRkekiskFENovIPQG2Z4rIKyLyiYgUisip7voBIvKOiKwTkTUictexf7Sera0wfc+M0VaYNsZ4IqgEISJXiUi633KGiFx5lGNigUeBGTjzSNwgImM77XYfsFJVxwE3A4+461uA76rqGGAK8PUAx0YsK0wbY8JBsDWIB1S1um1BVauAB45yTAGwWVW3qmoTMAe4otM+Y4G33ddcDwwWkX6qulNVl7vra4B1QG6QsfZ4bYXpn15hQ3kbY7wTbIIItN/R6he5QInfcimHnuQ/Bj4PICIFwCCgwyBDIjIYmAAsCfQmIjJLRJaJyLLy8vKjhBT+rDBtjAkXwSaIZSLysIgME5GhIvJroOgoxwT66dv5ZrsHgUwRWQl8E2fMp5aDLyDSC3gZ+Laq7g/0Jqr6uKrmq2p+Tk5OkB8nPFlh2hgTToK9Ue6bwA+Bv7rLC4EfHOWYUsC/Az2PTjfXuSf9WwHc0WI/dR+ISDxOcnghWgYFtDumjTHhJNgb5WqBQ65COoqlwAgRGYIzA931wI3+O4hIBlDn1ihuB95X1f1usngSWKeqDx/j+/ZItY0tVpg2xoSVYK9ietM9mbctZ4rIG0c6RlVbgG8Ab+AUmeeq6hoRuVNE7nR3GwOsEZH1OFc7tV3OOg34IvAZEVnpPi49pk/Ww7z28Q4qDjTy35eNscK0MSYsBNvFlO1euQSAqu4TkaPOSa2qC4AFndY95vd8Mc5d2Z2P+4DANYyINbuwmNEnpZE/KNPrUIwxBgi+SO0TkYNDa7hXFh0yuqs5PqvLqvmktJobCgZiE/cZY8JFsC2I/wY+EJH33OVzgFmhCSn6zC4sJjEuhisnRM2tHsaYHiDYIvXrIpKPkxRWAq8C9aEMLFrUNrbw6sodXD6uP+nJ8V6HY4wxBwU7WN/tOAXkPJwEMQVYTMcpSM1x+McnOzjQ2MKNZ9iVS8aY8BJsDeIuYDKwXVXPx7mzueffthwGXiwsYWS/XkwcaMVpY0x4CTZBNKhqA4CIJLrjJtmtvidozY5qPi6psuK0MSYsBVukLnXvg/gb8KaI7MOmHD1hcwpLSIyL4SorThtjwlCwReqr3Kc/EpF3gHTg9ZBFFQXqmlr424oyLj3tZBtWwxgTloJtQR
ykqu8dfS9zNP/4ZCc1jS3cUGAztxpjwtPxzkltTtDswmKG5aQyebAVp40x4ckShAfW7dzPimIrThtjwpslCA/MKSwmITaGqyfmHX1nY4zxiCWIblbf1Mr8FWXMOO0kMlOtOG2MCV+WILrZP1ftpKbBitPGmPB3zFcxmRMzu7CYodmpnDGkj9ehGNOz1O+D4o9g+4ewey30HQODzoSBUyHF/p5CwRJEN9qwq4ai7fv470vHWHHamKOp2QXbF7U/9qwFFGITIGsEbPsAFv/e2TfHTRZtj979PQ09UliC6Eaz24rTk6w4bUwHqrBvGxQvdloI2xfB3q3OtvhUGHgGnHKVc/LPnQTxSdDSCGXLnf2LF8Mnc2HZk84xmYNh0DSndTHoTOgzFCL1R5kq1O2F1Kwuf2lLEN2kobmVV1aUccmpJ9HHitMm2vl8ULHBTQaLnYRQ447ek5wJA8+E/NuckzkvIW8AABQqSURBVPtJp0NsgFNVXCIMmuo8AFpbYPcq9/U+hA3/gpUvONt6ndSxhZEzBmJ6WAm2tQX2fQrlG5zvrnyj82/FJkjKgP+3psvf0hJEN/nX6p1U1zdzQ4EN622iUGsL7PqkvbuoeDHU73W2pZ3cfuIeeCbkjD6+k3dsHPSf4Dymfs35ZV2+AYr9uqnWzHf2Tcpor18MmgYnj4PYMJmPpbneOelXbOyYDCo3g6+5fb+0/pAzEsbfBDmjnM/bxa0kSxDdZPaSEgZnpTB1aNc3A40JO80NsGN5e3dRSSE0HXC2ZQ6BUZe6SWGqsxyK7h8R6DvaeeTf5pxAq4rdZOHGtWGBs298KgwoaE9UuZMgPrnrY/JXX3VoEihf78TYNqOzxDjdZdmjYOTFzr85oyB7BCSlhzY+LEF0i817aijctpd7Z4y24rSJTI01ULKkvbuobBm0Njnb+o6F069vbyH0PtmbGEUgc5DzGH+Ds65mt18LYzG88zMOFsL7T3QTxjQneST1Pvb3VIUDu90k4J8MNjjr28QmQtZwyJ0Ip9/gtAxyRkOfYU69xSOWILrB7MIS4mPFitMmctRWOt1EbUXlnZ+AtoLEQv/xUDDLLRJPCe9LUNP6OcXvU9wBq+v3QfGS9sL3ot/CBw87v+RPOs35TG1dU6nZ7a/j80HV9kOTQPlGaKxu3y8hzTn5D78Qske6rYGRTishJrZbP3owLEGEWENzKy8vL+XiU04iu1ei1+GYY7XmFdj2IaRkOY9U99+UbOcEkdwH4iLsooOmOqirgLpKJxHUVTrLtRVwYA+UFUH5Omff2ETImwxn/z/nxJlXAIm9vI3/RCRnwqjpzgOgqRZKl7YXvpc9BR/9wdmWPco52e/dBpWboKWh/XVS+zon/9OuaU8COaOceksP6kWwBBFib6zZRVVdMzfandM9S2szLPwBLHnM6Z9urj38vonpzq/k1Gy/5OGXSFKy/LZlQWJa950kfD5oqHJP9hV+J/4K59LIthN/XWX7o7ku8GtJrBP/yeNg3LVOd1HuROdqokiVkApDz3MeAC1NsHNlew1jzzqnhjL03I4tgnBuNR0DSxAh9uKSYgZZcbpnqa2AuV+C7R/AlK/DRT9x1tfv63SCrez0vAL2lzndLXUV7X3wncUm+CWPtsSSHbiFkpLltFLaLvNsaez4fv6/8APFVb8X1Bc4jvjU9vdLzXHuTG5LYgeTml9ciek979LQrhaX4NQjBhTAWd/xOpqQC2mCEJHpwCNALPCEqj7YaXsm8BQwDGgAblPV1cEc2xNsKT/Akk/3cvf00cTE9JxmZVTbsRL+OhNqy+Gqx+H0L7Rv65XjPIKh6ly14/9L/ZBf8e6JfMcK59+G6sO/XlIG+FrarwQ6hDjJpi2xZI9wh6Do1Hrxfx7qq3RMjxeyBCEiscCjwEVAKbBURF5T1bV+u90HrFTVq0RktLv/BUEeG/bmFBYTFyNcY8XpnuGTufDaN52T7G2vO9fTHy8RpyspMQ36DAnumNbmAK0Sv+ex8e1JoPMv/OSMsCxymp4tlC2IAmCzqm4FEJE5wBWA/0l+LPBzAFVdLyKDRaQfMDSIY8NaY0sr84pKufiUfuSkRXAfbSRobYG3HnDG9Rl0Flz7TPAtha4UGw9pJzkPY8JAKDsUc4ESv+VSd52/j4HPA4hIATAIyAvy2LD2xprd7KtrtmG9w13dXnj+805yKLgDbv6bN8nBmDAUyhZEoE537bT8IPCIiKwEVgErgJYgj3XeRGQWMAtg4MDwORnPXlLMgD7JTBuWffSdjTd2rYI5Nzqjhl7xKEyY6XVExoSVUCaIUsB/4KE8YIf/Dqq6H7gVQJxbjD91HylHO9bvNR4HHgfIz88PmES629byAyzeWsn3Lxllxelwtfpl+NvXneveb30d8iZ5HZExYSeUXUxLgREiMkREEoDrgdf8dxCRDHcbwO3A+27SOOqx4eyvS0uIixGuzbfidNjxtcKb98O82+Dk02HWu5YcjDmMkLUgVLVFRL4BvIFzqepTqrpGRO50tz8GjAGeE5FWnAL0l490bKhi7UqNLa28VFTKhWP60TfNuzFUTAB1e+Hl22HL287gbdN/EXl3QRvThUJ6H4SqLgAWdFr3mN/zxcCIYI/tCd5cu5u9tU3ccEb41EMMsHuNU2+oLoPPPgKTbvE6ImPCnt1J3cVmFxaTm5HM2cOtOB021r4Kr3zVGSPoln86s5MZY44qyu+b71rbKmr5cHMl108e0LOL0xoWtf4T52uFt38Kc2+GfmNh1nuWHIw5BtaC6EJzlpYQGyNcm98DZo1Tde7QbRuWuG2Y4vINzlAQoy+DSbfCkHN61OiTB9VXwfyvwKaFMOGLcNlDkT2onDEhYAmiizS1+JhXVMJnRvflpPQwKk77fLC/tH3+2rYkULHBGXyuTXyqM37PkLOdMXrWvOI8+gxz+uvH3xSSSdFDonwDzL7BGZ//socg/8s9M8kZ4zFLEF3k7XW7qTjQ5N2w3q3NsPdTNwms7zihuf/wzSlZzjj2Y69oH88+exT0zu04Uuf0B2HN36DoaXjzh/Dvn8KYzzqtisFnhe8Jd/0/Yf4dzixcX/q7M0eBMea4WILoIi8WFtM/PYlzRoZ4mIamOmdyEv+WQPlG2LvFGe2zTe885+Q/8cz2JJAzquMsWEcSn+xMyzj+Bti9FoqegY/nODeYZY1wWxU3hs+49z4fvPcLeO9BZ5C9L7wA6T1qdBZjwo5opBQkce6kXrZsWbe/b8neOs7+v3f4zoUjuevCgFftHru6vX7TF/pNY1hVQvuE5rHOSKFtJ/+2yUqyR4ZmVq+mOqfbqehpZ5at2ESnJZJ/qzO0tFetiob98ModzgT0p98Il//a03l8jelJRKRIVfMDbbMWRBeYs7SYGIHrJgdx53RL46HDObct1+6Byi1OMqjd035MXJJTH8grcAqubTNX9RnWvTd6JaTAhJucx67VTqL4ZC6smuskqUm3OJPTd2eromKTc39D5RaY8X/OXMjh2v1lTA9jLYgToUpzXRXX/fofTM7xcd95fQNPCOO/3FRzmBcTpz7QZ2jHL
qHskZAxMHzH+m+qhdXznWRRVuQks7FXOq2KAWeE9mS98Q3nzujYeLj2WafAbow5JkdqQViC8NfS5EzR6D+lY12n5c4zhPn3+/uLSzrM3MSB5inOjowJX3Z+4rYqXnISYc4YJ1GMu84ZFK+r+Hzwn4fgnf+Fk06D619wkqgx5phZgjgSVfj9ZDiwBxqPMOVjcmbAE/2Lq2vZdCCRH1x7FrG9ctpP/AmpJ/ZherLGA04xu+hpZzrNuGQ45SonWeRNPrFWRWMN/O2rsO7vcNp1zrAZCSldF7sxUcZqEEci4lwKGZfk/qLvPKVjp0nj/ZTsreO/336Hb31mBLGjRnoQfJhK7AWTvuQ8dqx0EsWqefDxi9D3lPZWRVL6sb1u5RaYc5NTrL/4f2Hq163eYEwIWQviBDy0cAOPvrOZ/9z9GXIzbAL4I2qsgVUvwbKnYdcnTqvi1KudZJE76egn+k1vwcu3gcTANU/DsPO7J25jIpy1IEKgpdXHX5eWcN6ovpYcgpGY5gyxPelWp9up6GlY9TKsfB76nQb5tzhdRkm9Ox6nCh/8Gt7+CfQ7xak3ZA724hMYE3VssL7j9O/1e9hT02hzTh8rEcidCJ/7HXx3vTMUBsA/vwsPjYbXvglly511TbUw71Z4+8dODePLCy05GNONrAVxnGYXFtOvdyLnj7IJ7o9bUm+YfLszVlJZUfsVUMufc2Z7a22GPevgwh/DtLus3mBMN7MEcRzKqup5d2M53zx/OHGx1gg7YSKQl+88LvmZc/Pdsqedy4hvmgcjLvQ6QmOikiWI4/DXpSUAXDe5Bwzr3dMkpUPBV5yWhbUYjPGU/fw9Ri2tPuYuLeHckTnkZdr19yFjycEYz1mCOEbvbihn1/4GK04bYyKeJYhjNLuwmL5piXxmdF+vQzHGmJCyBHEMdlTV886GPVyXP4B4K04bYyKcneWOwdxlJSjwBStOG2OigCWIILX6lLlLSzh7RA4D+lhx2hgT+SxBBOn9jeXsqG7gxgJrPRhjooMliCC9WFhMdq9ELhjTz+tQjDGmW4Q0QYjIdBHZICKbReSeANvTReTvIvKxiKwRkVv9tn3HXbdaRGaLiGeTDO+qbuDf6/dwXX6eFaeNMVEjZGc7EYkFHgVmAGOBG0RkbKfdvg6sVdXTgfOAh0QkQURygW8B+ap6KhALXB+qWI/mpWUltPrUitPGmKgSyp/DBcBmVd2qqk3AHOCKTvsokCYiAvQC9gJtc3jGAckiEgekADtCGOthtfqUOUtLOGt4NoOyoniWOGNM1AllgsgFSvyWS911/n4PjME5+a8C7lJVn6qWAb8CioGdQLWqLgz0JiIyS0SWiciy8vLyrv4M/GdTOWVV9XbntDEm6oQyQQQaTKfz9HWXACuB/sB44Pci0ltEMnFaG0PcbakiMjPQm6jq46qar6r5OTldP/T27MJislITuGisFaeNMdEllAmiFPDvtM/j0G6iW4H56tgMfAqMBi4EPlXVclVtBuYDZ4Yw1oD27G/grXV7uCY/j4Q4K04bY6JLKM96S4ERIjJERBJwisyvddqnGLgAQET6AaOAre76KSKS4tYnLgDWhTDWgF4qKqXVp1w/2bqXjDHRJ2TzQahqi4h8A3gD5yqkp1R1jYjc6W5/DPgp8IyIrMLpkrpbVSuAChGZByzHKVqvAB4PVayB+HzK7MJizhyWxZBsK04bY6JPSCcMUtUFwIJO6x7ze74DuPgwxz4APBDK+I7kg80VlO6r5+7po70KwRhjPGUd64cxu7CYPqkJXHyKFaeNMdHJEkQAe2oaeHPtbq6ZlEdiXKzX4RhjjCcsQQQwr6iUFp9yvd05bYyJYpYgOvH5lDmFJUwZ2oehOb28DscYYzxjCaKTRVsqKd5bZ3dOG2OiniWITmYXFpOZEs8lp5zkdSjGGOMpSxB+ymsaeWPNLq6emEdSvBWnjTHRzRKEn5eXu8Vp614yxhhLEG1UlTmFxRQM6cPwvlacNsYYSxCuxVsr2VZZx43WejDGGMASxEGzC0tIT45n+qlWnDbGGLAEAUDlgUbeWG3FaWOM8WcJApi/vIymVh83FNid08YY0ybqE4SqM6z35MGZjOiX5nU4xhgTNkI63HdPUNfUSsGQPpw1ItvrUIwxJqxEfYJITYzjwavHeR2GMcaEnajvYjLGGBOYJQhjjDEBWYIwxhgTkCUIY4wxAVmCMMYYE5AlCGOMMQFZgjDGGBOQJQhjjDEBiap6HUOXEZFyYPtxHp4NVHRhOD2ZfRcd2ffRkX0f7SLhuxikqjmBNkRUgjgRIrJMVfO9jiMc2HfRkX0fHdn30S7SvwvrYjLGGBOQJQhjjDEBWYJo97jXAYQR+y46su+jI/s+2kX0d2E1CGOMMQFZC8IYY0xAliCMMcYEFPUJQkSmi8gGEdksIvd4HY+XRGSAiLwjIutEZI2I3OV1TF4TkVgRWSEi//A6Fq+JSIaIzBOR9e7/kalex+QlEfmO+3eyWkRmi0iS1zF1tahOECISCzwKzADGAjeIyFhvo/JUC/BdVR0DTAG+HuXfB8BdwDqvgwgTjwCvq+po4HSi+HsRkVzgW0C+qp4KxALXextV14vqBAEUAJtVdauqNgFzgCs8jskzqrpTVZe7z2twTgC53kblHRHJAy4DnvA6Fq+JSG/gHOBJAFVtUtUqb6PyXByQLCJxQAqww+N4uly0J4hcoMRvuZQoPiH6E5HBwARgibeReOo3wH8BPq8DCQNDgXLgabfL7QkRSfU6KK+oahnwK6AY2AlUq+pCb6PqetGeICTAuqi/7ldEegEvA99W1f1ex+MFEbkc2KOqRV7HEibigInAH1V1AlALRG3NTkQycXobhgD9gVQRmeltVF0v2hNEKTDAbzmPCGwmHgsRicdJDi+o6nyv4/HQNOBzIrINp+vxMyLyvLcheaoUKFXVthblPJyEEa0uBD5V1XJVbQbmA2d6HFOXi/YEsRQYISJDRCQBp8j0mscxeUZEBKePeZ2qPux1PF5S1XtVNU9VB+P8v/i3qkbcL8RgqeouoERERrmrLgDWehiS14qBKSKS4v7dXEAEFu3jvA7AS6raIiLfAN7AuQrhKVVd43FYXpoGfBFYJSIr3XX3qeoCD2My4eObwAvuj6mtwK0ex+MZVV0iIvOA5ThX/60gAofdsKE2jDHGBBTtXUzGGGMOwxKEMcaYgCxBGGOMCcgShDHGmIAsQRhjjAnIEoQxYUBEzrMRY024sQRhjDEmIEsQxhwDEZkpIoUislJE/uTOF3FARB4SkeUi8raI5Lj7jheRj0TkExF5xR2/BxEZLiJvicjH7jHD3Jfv5TffwgvuHbrGeMYShDFBEpExwBeAaao6HmgFbgJSgeWqOhF4D3jAPeQ54G5VHQes8lv/AvCoqp6OM37PTnf9BODbOHOTDMW5s90Yz0T1UBvGHKMLgEnAUvfHfTKwB2c48L+6+zwPzBeRdCBDVd9z1z8LvCQiaUCuqr4CoKoNAO7rFapqqbu8EhgMfBD6j2VMYJYgjAmeAM+q6r0dVor8sNN+Rxq/5kjd
Ro1+z1uxv0/jMetiMiZ4bwPXiEhfABHpIyKDcP6OrnH3uRH4QFWrgX0icra7/ovAe+78GqUicqX7GokiktKtn8KYINkvFGOCpKprReQHwEIRiQGaga/jTJ5ziogUAdU4dQqALwGPuQnAf/TTLwJ/EpGfuK9xbTd+DGOCZqO5GnOCROSAqvbyOg5jupp1MRljjAnIWhDGGGMCshaEMcaYgCxBGGOMCcgShDHGmIAsQRhjjAnIEoQxxpiA/j9q4C56tVF9tQAAAABJRU5ErkJggg==\n"},"metadata":{"needs_background":"light"}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /Text-Summarization-using-Transformers-T5/Tuning Transformer for Summary Generation T5.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{"trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session","execution_count":1,"outputs":[{"output_type":"stream","text":"/kaggle/input/news-summary/news_summary.csv\n/kaggle/input/news-summary/news_summary_more.csv\n","name":"stdout"}]},{"metadata":{"id":"zsyIfDk4fh8_"},"cell_type":"markdown","source":"# Fine Tuning Transformer for Summary Generation"},{"metadata":{"id":"nK-xHQAqfh9E"},"cell_type":"markdown","source":"\n### Introduction\n\nwe will fine tuning a transformer model for **Summarization Task**. \nIn this task a summary of a given article/document is generated when passed through a network. There are 2 types of summary generation mechanisms:\n\n1. ***Extractive Summary:*** the network calculates the most important sentences from the article and gets them together to provide the most meaningful information from the article.\n2. ***Abstractive Summary***: The network creates new sentences to encapsulate maximum gist of the article and generates that as output. The sentences in the summary may or may not be contained in the article. \n\nwe will be generating ***Abstractive Summary***. \n\n\n* We will be using : [Weights and Biases Service](https://www.wandb.com/) WandB in short.\n* It is a experiment tracking, parameter optimization and artifact management service. That can be very easily integrated to any of the Deep learning or Machine learning frameworks. \n\nThe notebook will be divided into separate sections to provide a organized walk through for the process used. This process can be modified for individual use cases. The sections are:\n\n1. 
[Preparing Environment and Importing Libraries](#section01)\n2. [Preparing the Dataset for data processing: Class](#section02)\n3. [Fine Tuning the Model: Function](#section03)\n4. [Validating the Model Performance: Function](#section04)\n5. [Main Function](#section05)\n * [Initializing WandB](#section501)\n * [Importing and Pre-Processing the domain data](#section502)\n * [Creation of Dataset and Dataloader](#section503)\n * [Neural Network and Optimizer](#section504)\n * [Training Model and Logging to WandB](#section505)\n * [Validation and generation of Summary](#section506)\n6. [Examples of the Summary Generated from the model](#section06)\n\n\n#### Technical Details\n\nThis script leverages on multiple tools designed by other teams. Details of the tools used below. Please ensure that these elements are present in your setup to successfully implement this script.\n\n- **Data**:\n\t- We are using the News Summary dataset available at [Kaggle](https://www.kaggle.com/sunnysai12345/news-summary)\n\t- This dataset is the collection created from Newspapers published in India, extracting, details that are listed below. We are referring only to the first csv file from the data dump: `news_summary.csv`\n\t- There are`4514` rows of data. Where each row has the following data-point:\n\t\t- **author** : Author of the article\n\t\t- **date** : Date the article was published\n\t\t- **headline**: Headline for the published article\n\t\t- **read_more** : URL for the article to follow online\n\t\t- **text**: This is the summary of the article\n\t\t- **ctext**: This is the complete article\n\n\n- **Language Model Used**: \n - This notebook uses one of the most recent and novel transformers model ***T5***. [Research Paper](https://arxiv.org/abs/1910.10683) \n - ***T5*** in many ways is one of its kind transformers architecture that not only gives state of the art results in many NLP tasks, but also has a very radical approach to NLP tasks.\n - **Text-2-Text** - According to the graphic taken from the T5 paper. All NLP tasks are converted to a **text-to-text** problem. Tasks such as translation, classification, summarization and question answering, all of them are treated as a text-to-text conversion problem, rather than seen as separate unique problem statements.\n - **Unified approach for NLP Deep Learning** - Since the task is reflected purely in the text input and output, you can use the same model, objective, training procedure, and decoding process to ANY task. Above framework can be used for any task - show Q&A, summarization, etc. \n - We will be taking inputs from the T5 paper to prepare our dataset prior to fine tuning and training. \n - [Documentation for python](https://huggingface.co/transformers/model_doc/t5.html)\n\n\n\n\n- Hardware Requirements: \n\t- Python 3.6 and above\n\t- Pytorch, Transformers and\n\t- All the stock Python ML Library\n\t- GPU enabled setup \n \n\n- **Script Objective**:\n\t- The objective of this script is to fine tune ***T5 *** to be able to generate summary, that a close to or better than the actual summary while ensuring the important information from the article is not lost.\n\n---\n"},{"metadata":{"id":"Wqb-siZ-fh9G"},"cell_type":"markdown","source":"\n### Preparing Environment and Importing Libraries\n\nAt this step we will be installing the necessary libraries followed by importing the libraries and modules needed to run our script. 
\nWe will be installing:\n* transformers\n* wandb\n\nLibraries imported are:\n* Pandas\n* Pytorch\n* Pytorch Utils for Dataset and Dataloader\n* Transformers\n* T5 Model and Tokenizer\n* wandb\n\nFollowed by that we will preapre the device for CUDA execeution. This configuration is needed if you want to leverage on onboard GPU. First we will check the GPU avaiable to us, using the nvidia command followed by defining our device.\n\nFinally, we will be logging into the [wandb](https://www.wandb.com/) serice using the login command"},{"metadata":{"id":"WD_vnyLXZQzD","outputId":"b2ff57b8-a147-4893-80bd-e40d18042f98","trusted":true},"cell_type":"code","source":"!pip install transformers -q\n!pip install wandb -q\n\n# Code for TPU packages install\n# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py\n# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev","execution_count":2,"outputs":[{"output_type":"stream","text":"\u001b[33mWARNING: You are using pip version 20.2.1; however, version 20.2.2 is available.\nYou should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.\u001b[0m\n\u001b[33mWARNING: You are using pip version 20.2.1; however, version 20.2.2 is available.\nYou should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.\u001b[0m\n","name":"stdout"}]},{"metadata":{"id":"pzM1_ykHaFur","outputId":"58fa0ba8-b486-4b26-aaea-c0331b343b70","trusted":true},"cell_type":"code","source":"# Importing stock libraries\nimport numpy as np\nimport pandas as pd\nimport torch\nimport torch.nn.functional as F\nfrom torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler\n\n# Importing the T5 modules from huggingface/transformers\nfrom transformers import T5Tokenizer, T5ForConditionalGeneration\n\n# WandB – Import the wandb library\nimport wandb","execution_count":3,"outputs":[{"output_type":"stream","text":"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m W&B installed but not logged in. Run `wandb login` or set the WANDB_API_KEY env variable.\n","name":"stderr"}]},{"metadata":{"id":"KvPxXdKJguYB","outputId":"6c523635-a25a-429b-cbd8-7b8bf9636972","trusted":true},"cell_type":"code","source":"# Checking out the GPU we have access to.\n!nvidia-smi","execution_count":4,"outputs":[{"output_type":"stream","text":"Fri Aug 14 05:53:29 2020 \r\n+-----------------------------------------------------------------------------+\r\n| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |\r\n|-------------------------------+----------------------+----------------------+\r\n| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n|===============================+======================+======================|\r\n| 0 Tesla P100-PCIE... 
Off | 00000000:00:04.0 Off | 0 |\r\n| N/A 36C P0 28W / 250W | 0MiB / 16280MiB | 6% Default |\r\n+-------------------------------+----------------------+----------------------+\r\n \r\n+-----------------------------------------------------------------------------+\r\n| Processes: GPU Memory |\r\n| GPU PID Type Process name Usage |\r\n|=============================================================================|\r\n| No running processes found |\r\n+-----------------------------------------------------------------------------+\r\n","name":"stdout"}]},{"metadata":{"id":"NLxxwd1scQNv","trusted":true},"cell_type":"code","source":"# # Setting up the device for GPU usage\nfrom torch import cuda\ndevice = 'cuda' if cuda.is_available() else 'cpu'\n","execution_count":5,"outputs":[]},{"metadata":{"id":"L-ePh9dEKXMw","outputId":"a35fd305-1c09-48ff-978c-fa1d0762c5e2","trusted":true},"cell_type":"code","source":"# Login to wandb to log the model run and all the parameters\n! wandb login ","execution_count":8,"outputs":[{"output_type":"stream","text":"\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\r\n\u001b[32mSuccessfully logged in to Weights & Biases!\u001b[0m\r\n","name":"stdout"}]},{"metadata":{"id":"V68izHXPfh-b"},"cell_type":"markdown","source":"\n### Preparing the Dataset for data processing: Class\n\nWe will start with creation of Dataset class - This defines how the text is pre-processed before sending it to the neural network. This dataset will be used the the Dataloader method that will feed the data in batches to the neural network for suitable training and processing. \nThe Dataloader and Dataset will be used inside the `main()`.\nDataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to neural network. For further reading into Dataset and Dataloader .\n\n#### *CustomDataset* Dataset Class\n- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the **T5** model for training. \n- We are using the **T5** tokenizer to tokenize the data in the `text` and `ctext` column of the dataframe. \n- The tokenizer uses the ` batch_encode_plus` method to perform tokenization and generate the necessary outputs, namely: `source_id`, `source_mask` from the actual text and `target_id` and `target_mask` from the summary text.\n- The *CustomDataset* class is used to create 2 datasets, for training and for validation.\n- *Training Dataset* is used to fine tune the model: **80% of the original data**\n- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. \n\n#### Dataloader: Called inside the `main()`\n- Dataloader is used to for creating training and validation dataloader that load data to the neural network in a defined manner. 
This is needed because all the data from the dataset cannot be loaded to the memory at once, hence the amount of data loaded to the memory and then passed to the neural network needs to be controlled.\n- This control is achieved using the parameters such as `batch_size` and `max_len`.\n- Training and Validation dataloaders are used in the training and validation part of the flow respectively"},{"metadata":{"id":"932p8NhxeNw4","trusted":true},"cell_type":"code","source":"# Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions\n\nclass CustomDataset(Dataset):\n\n def __init__(self, dataframe, tokenizer, source_len, summ_len):\n self.tokenizer = tokenizer\n self.data = dataframe\n self.source_len = source_len\n self.summ_len = summ_len\n self.text = self.data.text\n self.ctext = self.data.ctext\n\n def __len__(self):\n return len(self.text)\n\n def __getitem__(self, index):\n ctext = str(self.ctext[index])\n ctext = ' '.join(ctext.split())\n\n text = str(self.text[index])\n text = ' '.join(text.split())\n\n source = self.tokenizer.batch_encode_plus([ctext], max_length= self.source_len, pad_to_max_length=True,return_tensors='pt')\n target = self.tokenizer.batch_encode_plus([text], max_length= self.summ_len, pad_to_max_length=True,return_tensors='pt')\n\n source_ids = source['input_ids'].squeeze()\n source_mask = source['attention_mask'].squeeze()\n target_ids = target['input_ids'].squeeze()\n target_mask = target['attention_mask'].squeeze()\n\n return {\n 'source_ids': source_ids.to(dtype=torch.long), \n 'source_mask': source_mask.to(dtype=torch.long), \n 'target_ids': target_ids.to(dtype=torch.long),\n 'target_ids_y': target_ids.to(dtype=torch.long)\n }","execution_count":9,"outputs":[]},{"metadata":{"id":"ZhF8xH46fh-p"},"cell_type":"markdown","source":"\n### Fine Tuning the Model: Function\n\nHere we define a training function that trains the model on the training dataset created above, specified number of times (EPOCH), An epoch defines how many times the complete data will be passed through the network. \n\nThis function is called in the `main()`\n\nFollowing events happen in this function to fine tune the neural network:\n- The epoch, tokenizer, model, device details, testing_ dataloader and optimizer are passed to the `train ()` when its called from the `main()`\n- The dataloader passes data to the model based on the batch size.\n- `language_model_labels` are calculated from the `target_ids` also, `source_id` and `attention_mask` are extracted.\n- The model outputs first element gives the loss for the forward pass. \n- Loss value is used to optimize the weights of the neurons in the network.\n- After every 10 steps the loss value is logged in the wandb service. This log is then used to generate graphs for analysis.\n- After every 500 steps the loss value is printed in the console."},{"metadata":{"id":"SaPAR7TWmxoM","trusted":true},"cell_type":"code","source":"# Creating the training function. This will be called in the main function. 
It is run depending on the epoch value.\n# The model is put into train mode and then we wnumerate over the training loader and passed to the defined network \n\ndef train(epoch, tokenizer, model, device, loader, optimizer):\n model.train()\n for _,data in enumerate(loader, 0):\n y = data['target_ids'].to(device, dtype = torch.long)\n y_ids = y[:, :-1].contiguous()\n lm_labels = y[:, 1:].clone().detach()\n lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100\n ids = data['source_ids'].to(device, dtype = torch.long)\n mask = data['source_mask'].to(device, dtype = torch.long)\n\n outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, lm_labels=lm_labels)\n loss = outputs[0]\n \n if _%10 == 0:\n wandb.log({\"Training Loss\": loss.item()})\n\n if _%500==0:\n print(f'Epoch: {epoch}, Loss: {loss.item()}')\n \n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n ","execution_count":10,"outputs":[]},{"metadata":{"id":"RFZ6ZAR7fh-z"},"cell_type":"markdown","source":"\n### Validating the Model Performance: Function\n\nDuring the validation stage we pass the unseen data(Testing Dataset), trained model, tokenizer and device details to the function to perform the validation run. This step generates new summary for dataset that it has not seen during the training session. \n\nThis function is called in the `main()`\n\nThis unseen data is the 20% of `news_summary.csv` which was seperated during the Dataset creation stage. \nDuring the validation stage the weights of the model are not updated. We use the generate method for generating new text for the summary. \n\nIt depends on the `Beam-Search coding` method developed for sequence generation for models with LM head. \n\nThe generated text and originally summary are decoded from tokens to text and returned to the `main()`"},{"metadata":{"id":"j9TNdHlQ0CLz","trusted":true},"cell_type":"code","source":"def validate(epoch, tokenizer, model, device, loader):\n model.eval()\n predictions = []\n actuals = []\n with torch.no_grad():\n for _, data in enumerate(loader, 0):\n y = data['target_ids'].to(device, dtype = torch.long)\n ids = data['source_ids'].to(device, dtype = torch.long)\n mask = data['source_mask'].to(device, dtype = torch.long)\n\n generated_ids = model.generate(\n input_ids = ids,\n attention_mask = mask, \n max_length=150, \n num_beams=2,\n repetition_penalty=2.5, \n length_penalty=1.0, \n early_stopping=True\n )\n preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]\n target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]\n if _%100==0:\n print(f'Completed {_}')\n\n predictions.extend(preds)\n actuals.extend(target)\n return predictions, actuals","execution_count":11,"outputs":[]},{"metadata":{"id":"gYs-iVVYfh-5"},"cell_type":"markdown","source":"\n### Main Function\n\nThe `main()` as the name suggests is the central location to execute all the functions/flows created above in the notebook. The following steps are executed in the `main()`:\n\n\n\n#### Initializing WandB \n\n* The `main()` begins with initializing WandB run under a specific project. This command initiates a new run for each execution of this command. \n\n**[WandB Service](https://www.wandb.com/)**\n\n* This service has been created to track ML experiments, Optimize the experiments and save artifacts. It is designed to seamlessly integrate with all the Machine Learning and Deep Learning Frameworks. 
Each script can be organized into *Project* and each execution of the script will be registered as a *run* in the respective project.\n\n* The service can be configured to log several default metrics, such a network weights, hardware usage, gradients and weights of the network. \n\n* It can also be used to log user defined metrics, such a loss in the `train()`.\n\n\n\n\n* Visit the project page to see the details of different runs and what information is logged by the service. \n\n* Following the initialization of the WandB service we define configuration parameters that will be used across the tutorial such as `batch_size`, `epoch`, `learning_rate` etc.\n\n* These parameters are also passed to the WandB config. The config construct with all the parameters can be optimized using the Sweep service from WandB. Currently, that is outof scope of this tutorial. \n\n* Next we defining seed values so that the experiment and results can be reproduced.\n\n\n\n#### Importing and Pre-Processing the domain data\n\nWe will be working with the data and preparing it for fine tuning purposes. \n*Assuming that the `news_summary.csv` is already downloaded in your `data` folder*\n\n* The file is imported as a dataframe and give it the headers as per the documentation.\n* Cleaning the file to remove the unwanted columns.\n* A new string is added to the main article column `summarize: ` prior to the actual article. This is done because **T5** had similar formatting for the summarization dataset. \n* The final Dataframe will be something like this:\n\n|text|ctext|\n|--|--|\n|summary-1|summarize: article 1|\n|summary-2|summarize: article 2|\n|summary-3|summarize: article 3|\n\n* Top 5 rows of the dataframe are printed on the console.\n\n\n#### Creation of Dataset and Dataloader\n\n* The updated dataframe is divided into 80-20 ratio for test and validation. \n* Both the data-frames are passed to the `CustomerDataset` class for tokenization of the new articles and their summaries.\n* The tokenization is done using the length parameters passed to the class.\n* Train and Validation parameters are defined and passed to the `pytorch Dataloader contstruct` to create `train` and `validation` data loaders.\n* These dataloaders will be passed to `train()` and `validate()` respectively for training and validation action.\n* The shape of datasets is printed in the console.\n\n\n\n#### Neural Network and Optimizer\n\n* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. \n* We are using the `t5-base` transformer model for our project. You can read about the `T5 model` and its features above. \n* We use the `T5ForConditionalGeneration.from_pretrained(\"t5-base\")` commad to define our model. The `T5ForConditionalGeneration` adds a Language Model head to our `T5 model`. The Language Model head allows us to generate text based on the training of `T5 model`.\n* We are using the `Adam` optimizer for our project. This has been a standard for all our tutorials and is something that can be changed updated to see how different optimizer perform with different learning rates. \n* There is also a scope for doing more with Optimizer such a decay, momentum to dynamically update the Learning rate and other parameters. All those concepts have been kept out of scope for these tutorials. 
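Aside (not part of the tutorial's `main()`): the note above mentions that learning-rate decay and similar optimizer refinements are out of scope. For completeness, a minimal sketch of how a decay schedule could be attached to the same Adam optimizer; the step size and gamma below are illustrative assumptions, not values used in this notebook:

```python
# Illustrative only - the tutorial itself keeps a fixed learning rate of 1e-4.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate after every epoch (step_size/gamma are assumptions).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

# for epoch in range(config.TRAIN_EPOCHS):
#     train(epoch, tokenizer, model, device, training_loader, optimizer)
#     scheduler.step()
```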
\n\n\n\n#### Training Model and Logging to WandB\n\n* Now we log all the metrics in WandB project that we have initialized above.\n* Followed by that we call the `train()` with all the necessary parameters.\n* Loss at every 500th step is printed on the console.\n* Loss at every 10th step is logged as Loss in the WandB service.\n\n\n\n#### Validation and generation of Summary\n\n* After the training is completed, the validation step is initiated.\n* As defined in the validation function, the model weights are not updated. We use the fine tuned model to generate new summaries based on the article text.\n* An output is printed on the console giving a count of how many steps are complete after every 100th step. \n* The original summary and generated summary are converted into a list and returned to the main function. \n* Both the lists are used to create the final dataframe with 2 columns **Generated Summary** and **Actual Summary**\n* The dataframe is saved as a csv file in the local drive.\n* A qualitative analysis can be done with the Dataframe. "},{"metadata":{"id":"ZtNs9ytpCow2","outputId":"80545587-0a82-455a-a9ba-13eb3fcb1550","trusted":true},"cell_type":"code","source":"def main():\n # WandB – Initialize a new run\n wandb.init(project=\"transformers_tutorials_summarization\")\n\n # WandB – Config is a variable that holds and saves hyperparameters and inputs\n # Defining some key variables that will be used later on in the training \n config = wandb.config # Initialize config\n config.TRAIN_BATCH_SIZE = 2 # input batch size for training (default: 64)\n config.VALID_BATCH_SIZE = 2 # input batch size for testing (default: 1000)\n config.TRAIN_EPOCHS = 2 # number of epochs to train (default: 10)\n config.VAL_EPOCHS = 1 \n config.LEARNING_RATE = 1e-4 # learning rate (default: 0.01)\n config.SEED = 42 # random seed (default: 42)\n config.MAX_LEN = 512\n config.SUMMARY_LEN = 150 \n\n # Set random seeds and deterministic pytorch for reproducibility\n torch.manual_seed(config.SEED) # pytorch random seed\n np.random.seed(config.SEED) # numpy random seed\n torch.backends.cudnn.deterministic = True\n\n # tokenzier for encoding the text\n tokenizer = T5Tokenizer.from_pretrained(\"t5-base\")\n \n\n # Importing and Pre-Processing the domain data\n # Selecting the needed columns only. \n # Adding the summarzie text in front of the text. This is to format the dataset similar to how T5 model was trained for summarization task. \n df = pd.read_csv('/kaggle/input/news-summary/news_summary.csv',encoding='latin-1')\n df = df[['text','ctext']]\n df.ctext = 'summarize: ' + df.ctext\n print(df.head())\n\n \n # Creation of Dataset and Dataloader\n # Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
\n train_size = 0.8\n train_dataset=df.sample(frac=train_size,random_state = config.SEED)\n val_dataset=df.drop(train_dataset.index).reset_index(drop=True)\n train_dataset = train_dataset.reset_index(drop=True)\n\n print(\"FULL Dataset: {}\".format(df.shape))\n print(\"TRAIN Dataset: {}\".format(train_dataset.shape))\n print(\"TEST Dataset: {}\".format(val_dataset.shape))\n\n\n # Creating the Training and Validation dataset for further creation of Dataloader\n training_set = CustomDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)\n val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)\n\n # Defining the parameters for creation of dataloaders\n train_params = {\n 'batch_size': config.TRAIN_BATCH_SIZE,\n 'shuffle': True,\n 'num_workers': 0\n }\n\n val_params = {\n 'batch_size': config.VALID_BATCH_SIZE,\n 'shuffle': False,\n 'num_workers': 0\n }\n\n # Creation of Dataloaders for testing and validation. This will be used down for training and validation stage for the model.\n training_loader = DataLoader(training_set, **train_params)\n val_loader = DataLoader(val_set, **val_params)\n\n\n \n # Defining the model. We are using t5-base model and added a Language model layer on top for generation of Summary. \n # Further this model is sent to device (GPU/TPU) for using the hardware.\n model = T5ForConditionalGeneration.from_pretrained(\"t5-base\")\n model = model.to(device)\n\n # Defining the optimizer that will be used to tune the weights of the network in the training session. \n optimizer = torch.optim.Adam(params = model.parameters(), lr=config.LEARNING_RATE)\n\n # Log metrics with wandb\n wandb.watch(model, log=\"all\")\n # Training loop\n print('Initiating Fine-Tuning for the model on our dataset')\n\n for epoch in range(config.TRAIN_EPOCHS):\n train(epoch, tokenizer, model, device, training_loader, optimizer)\n\n\n # Validation loop and saving the resulting file with predictions and acutals in a dataframe.\n # Saving the dataframe as predictions.csv\n print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')\n for epoch in range(config.VAL_EPOCHS):\n predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)\n final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})\n final_df.to_csv('./models/predictions.csv')\n print('Output Files generated for review')\n\nif __name__ == '__main__':\n main()","execution_count":null,"outputs":[{"output_type":"display_data","data":{"text/plain":"","text/html":"\n Logging results to Weights & Biases (Documentation).
\n Project page: https://app.wandb.ai/storiesbyharshit/transformers_tutorials_summarization
\n Run page: https://app.wandb.ai/storiesbyharshit/transformers_tutorials_summarization/runs/1vtq7hcf
\n "},"metadata":{}},{"output_type":"stream","text":" text \\\n0 The Administration of Union Territory Daman an... \n1 Malaika Arora slammed an Instagram user who tr... \n2 The Indira Gandhi Institute of Medical Science... \n3 Lashkar-e-Taiba's Kashmir commander Abu Dujana... \n4 Hotels in Maharashtra will train their staff t... \n\n ctext \n0 summarize: The Daman and Diu administration on... \n1 summarize: From her special numbers to TV?appe... \n2 summarize: The Indira Gandhi Institute of Medi... \n3 summarize: Lashkar-e-Taiba's Kashmir commander... \n4 summarize: Hotels in Mumbai and other Indian c... \nFULL Dataset: (4514, 2)\nTRAIN Dataset: (3611, 2)\nTEST Dataset: (903, 2)\n","name":"stdout"},{"output_type":"display_data","data":{"text/plain":"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1199.0, style=ProgressStyle(description…","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"0e378e24a96049b3bdecf85abfc196f5"}},"metadata":{}},{"output_type":"stream","text":"\n","name":"stdout"},{"output_type":"display_data","data":{"text/plain":"HBox(children=(FloatProgress(value=0.0, description='Downloading', max=891691430.0, style=ProgressStyle(descri…","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5bc87207c034420bbaeb07387bf96be8"}},"metadata":{}},{"output_type":"stream","text":"\nInitiating Fine-Tuning for the model on our dataset\nEpoch: 0, Loss: 5.861971378326416\nEpoch: 0, Loss: 1.923175573348999\nEpoch: 0, Loss: 1.489040493965149\nEpoch: 0, Loss: 1.9674493074417114\nEpoch: 1, Loss: 2.0224180221557617\nEpoch: 1, Loss: 1.2057034969329834\nEpoch: 1, Loss: 1.2185782194137573\n","name":"stdout"}]},{"metadata":{"id":"YGVPUIfKfh-_"},"cell_type":"markdown","source":"\n### Examples of the Summary Generated from the model\n\n##### Example 1\n\n**Original Text**\nNew Delhi, Apr 25 (PTI) Union minister Vijay Goel today batted for the unification of the three municipal corporations in the national capital saying a discussion over the issue was pertinent. The BJP leader, who was confident of a good show by his party in the MCD polls, the results of which will be declared tomorrow, said the civic bodies needed to be \"revamped\" in order to deliver the services to the people more effectively. The first thing needed was a discussion on the unification of the three municipal corporations and there should also be an end to the practice of sending Delhi government officials to serve in the civic bodies, said the Union Minister of State (Independent Charge) for Youth Affairs and Sports. \"Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged,\" he said, referring to the north, south and east Delhi municipal corporations. The erstwhile Municipal Corporation of Delhi (MCD) was trifurcated into NDMC, SDMC and EDMC by the then Sheila Dikshit-led Delhi government in 2012. Goel predicted a \"thumping\" victory for the BJP in the MCD polls. He said the newly-elected BJP councillors will be trained on the functioning of the civic bodies and dealing with the bureaucracy. \n\n\n**Original Summary**\nUnion Minister Vijay Goel has favoured unification of three MCDs ? North, South and East ? in order to deliver the services more effectively. \"Barring one, the two other civic bodies have been incurring losses. It would be more fruitful and efficient if all the three were merged,\" he said. 
MCD was trifurcated into EDMC, NDMC and SDMC in 2012.\n\n**Generated Summary**\nBJP leader Vijay Goel on Saturday batted for the unification of three municipal corporations in the national capital saying a discussion over this was pertinent. \"Barring one, two other civic bodies have been incurring losses,\" said Goels. The erstwhile Municipal Corporations of Delhi (MCD) were trifurcated into NDMC and SDMC by the then Sheilha Dikshi-led government in 2012. Notably, the MCD poll results will be declared tomorrow."},{"metadata":{"id":"mcqK9smlfh_A"},"cell_type":"markdown","source":"##### Example 2\n\n**Original Text**\nAfter much wait, the first UDAN flight took off from Shimla today after being flagged off by Prime Minister Narendra Modi.The flight will be operated by Alliance Air, the regional arm of Air India. PM Narendra Modi handed over boarding passes to some of passengers travelling via the first UDAN flight at the Shimla airport.Tomorrow PM @narendramodi will flag off the first UDAN flight under the Regional Connectivity Scheme, on Shimla-Delhi sector.Air India yesterday opened bookings for the first launch flight from Shimla to Delhi with all inclusive fares starting at Rs2,036.THE GREAT 'UDAN'The UDAN (Ude Desh ka Aam Naagrik) scheme seeks to make flying more affordable for the common people, holding a plan to connect over 45 unserved and under-served airports.Under UDAN, 50 per cent of the seats on each flight would have a cap of Rs 2,500 per seat/hour. The government has also extended subsidy in the form of viability gap funding to the operators flying on these routes.The scheme was launched to \"make air travel accessible to citizens in regionally important cities,\" and has been described as \"a first-of-its-kind scheme globally to stimulate regional connectivity through a market-based mechanism.\" Report have it the first flight today will not be flying at full capacity on its 70-seater ATR airplane because of payload restrictions related to the short Shimla airfield.|| Read more ||Udan scheme: Now you can fly to these 43 cities, see the full list hereUDAN scheme to fly hour-long flights capped at Rs 2,500 to smaller cities \n\n\n**Original Summary**\nPM Narendra Modi on Thursday launched Ude Desh ka Aam Nagrik (UDAN) scheme for regional flight connectivity by flagging off the inaugural flight from Shimla to Delhi. Under UDAN, government will connect small towns by air with 50% plane seats' fare capped at?2,500 for a one-hour journey of 500 kilometres. UDAN will connect over 45 unserved and under-served airports.\n\n**Generated Summary**\nUDAN (Ude Desh Ka Aam Naagrik) scheme, launched to make air travel accessible in regionally important cities under the Regional Connectivity Scheme, took off from Shimla on Tuesday. The first flight will be operated by Alliance Air, which is the regional arm of India's Air India. Under the scheme, 50% seats would have?2,500 per seat/hour and 50% of the seats would have capped at this rate. It was also extended subsidy in form-based funding for operators flying these routes as well."},{"metadata":{"id":"8Wdg0drafh_B"},"cell_type":"markdown","source":"##### Example 3\n\n**Original Text**\nNew Delhi, Apr 25 (PTI) The Income Tax department has issued a Rs 24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting a special audit of the company. 
The department, as part of a special investigation and audit into the account books of AVL, found that an income of over Rs 48,000 crore for a particular assessment year was allegedly not reflected in the record books of the firm and hence it raised a fresh tax demand and penalty amount on it. A Sahara Group spokesperson confirmed the development to PTI. \"Yes, the Income Tax Department has raised Rs 48,085.79 crores to the income of the Aamby Valley Limited with a total demand of income tax of Rs 24,646.96 crores on the Aamby Valley Limited,\" the spokesperson said in a brief statement. Officials said the notice was issued by the taxman in January this year after the special audit of AVLs income for the Assessment Year 2012-13 found that the parent firm had allegedly floated a clutch of Special Purpose Vehicles whose incomes were later accounted on the account of AVL as they were merged with the former in due course of time. The AVL, in its income return filed for AY 2012-13, had reflected a loss of few crores but the special I-T audit brought up the added income, a senior official said. The Supreme Court, last week, had asked the Bombay High Courts official liquidator to sell the Rs 34,000 crore worth of properties of Aamby Valley owned by the Sahara Group and directed its chief Subrata Roy to personally appear before it on April 28. \n\n\n**Original Summary**\nThe Income Tax Department has issued a ?24,646 crore tax demand notice to Sahara Group's Aamby Valley Limited. The department's audit found that an income of over ?48,000 crore for the assessment year 2012-13 was not reflected in the record books of the firm. A week ago, the SC ordered Bombay HC to auction Sahara's Aamby Valley worth ?34,000 crore.\n\n**Generated Summary**\nthe Income Tax department has issued a?24,646 crore tax demand notice to Sahara Groups Aamby Valley Limited (AVL) after conducting an audit of the company. The notice was issued in January this year after the special audit found that the parent firm had floated Special Purpose Vehicle income for the Assessment Year 2012-13 and later accounted on its account as they were merged with the former. \"Yes...the Income Tax Department raised Rs48,085.79 crores to the income,\" he added earlier said at the notice."}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat":4,"nbformat_minor":4} --------------------------------------------------------------------------------