├── README.md
└── Text-summarizer.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # Text-summarization-with-Seq2Seq
2 |
3 | ## What is Text Summarization?
4 | Text summarization is the task of algorithmically condensing long prose into a short, concise summary that still covers the key points.
5 | A good summary preserves the order of ideas and does not change the meaning or implications of the original text.
6 |
7 | ## Why do we need it?
8 | * Less skimming through documents, articles and blogs.
9 | * Breaks content down into crisp, bite-sized texts that are easier to follow and retain.
10 | * Takes less time to read.
11 | * Algorithms can be less biased than human summarizers.
12 | * Indexing can be made more effective when it is based on summarized text.
13 | * Summarized data makes the selection process easier.
14 |
15 | ## Architecture
16 | * Sequence-to-sequence (Seq2Seq) modelling.
17 | * We use the encoder-decoder architecture, which is mainly used for Seq2Seq problems where the input and output sequences have different lengths.
18 | * The encoder and decoder are built from Recurrent Neural Network (RNN) variants such as Long Short-Term Memory (LSTM) layers.
19 |
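As a rough illustration of why an encoder-decoder handles inputs and outputs of different lengths, here is a minimal, untrained sketch in plain Python. It is not the notebook's model: it uses a vanilla tanh RNN step instead of LSTM cells, random weights, and hypothetical names (`encode`, `decode`, `EMBED`, `HIDDEN`, `VOCAB`) chosen only for this example.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(x, h, w_x, w_h):
    """One vanilla RNN step: h_new = tanh(W_x x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_x, x), matvec(w_h, h))]


def encode(inputs, w_x, w_h, hidden_size):
    """Fold a variable-length input sequence into one fixed-size state."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h)
    return h


def decode(state, w_x, w_h, w_out, start_token, end_token, max_len):
    """Greedily emit token ids until end_token appears or max_len is reached."""
    vocab_size = len(w_out)
    h, token, output = state, start_token, []
    for _ in range(max_len):
        x = [1.0 if i == token else 0.0 for i in range(vocab_size)]  # one-hot input
        h = rnn_step(x, h, w_x, w_h)
        scores = matvec(w_out, h)
        token = max(range(vocab_size), key=scores.__getitem__)
        if token == end_token:
            break
        output.append(token)
    return output


rng = random.Random(0)
EMBED, HIDDEN, VOCAB = 4, 8, 6  # assumed toy sizes
enc_wx, enc_wh = make_matrix(HIDDEN, EMBED, rng), make_matrix(HIDDEN, HIDDEN, rng)
dec_wx, dec_wh = make_matrix(HIDDEN, VOCAB, rng), make_matrix(HIDDEN, HIDDEN, rng)
dec_out = make_matrix(VOCAB, HIDDEN, rng)

# Seven input steps go in; at most three output tokens come out.
source = [[rng.uniform(-1, 1) for _ in range(EMBED)] for _ in range(7)]
state = encode(source, enc_wx, enc_wh, HIDDEN)
summary_ids = decode(state, dec_wx, dec_wh, dec_out, start_token=0, end_token=1, max_len=3)
```

The only link between the two halves is the fixed-size `state`, so the output length is decided by the decoder loop rather than by the input length.
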
20 | ## Model details
21 | * Since the dataset is large, we split it into train and test sets with a 90:10 ratio.
22 | * The optimizer used for the model is RMSprop.
23 | * We train for up to 50 epochs with a batch size of 128 and a dropout of 0.4.
24 | * We use EarlyStopping based on validation loss; it stopped training at epoch 35 of 50 because the validation loss was no longer decreasing meaningfully.
25 | * We experimented with the number of LSTM layers and tuned hyperparameters such as batch size (64, 128, 256, 512) and dropout (0.2, 0.25, 0.4).
26 |
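The split and the stopping rule described above can be sketched in plain Python. This is a simplified illustration of the general idea, not the notebook's code or the Keras callback itself; the function names and the `patience` and `min_delta` values are assumptions, since the README does not state them.

```python
import random


def split_indices(n_samples, test_ratio=0.1, seed=0):
    """Shuffle sample indices and return (train, test) with a 90:10 split by default."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]


def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by more than min_delta for `patience` epochs in a row.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs were run
```

For example, `early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.6], patience=2)` stops at epoch 5, before the later improvement is ever seen; a larger `patience` trades extra training time for fewer such premature stops.
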
27 | ## Performance metrics
28 | * We calculated the BLEU, GLEU and METEOR scores for the predictions of our model.
29 | * BLEU is used as a benchmark as it is the oldest and most widely adopted metric. It is based on matching n-grams in the predicted summary against n-grams in the actual summary.
30 | * GLEU works similarly to BLEU but remedies a few of its disadvantages at the sentence level.
31 | * METEOR correlates significantly better with human judgement than BLEU. It is based on the harmonic mean of unigram precision and recall.
32 |
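To make the two core ideas concrete, the sketch below computes a clipped n-gram precision (the building block of BLEU) and a recall-weighted harmonic mean of unigram precision and recall (the building block of METEOR). It is deliberately simplified and is not NLTK's implementation: there is no brevity penalty, no stemming or synonym matching, and no fragmentation penalty. The function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def unigram_f_mean(reference, hypothesis, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall.

    With alpha=0.9 this equals 10PR / (R + 9P), which favours recall.
    """
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    matched = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    if matched == 0:
        return 0.0
    precision = matched / len(hypothesis)
    recall = matched / len(reference)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat".split()

bigram_precision = clipped_precision(reference, hypothesis, 2)
f_mean = unigram_f_mean(reference, hypothesis)
```

Here every hypothesis bigram appears in the reference, so the precision is perfect, while the F-mean is pulled down because the short hypothesis covers only half of the reference words. That is the kind of gap between precision-only and recall-aware scoring that the list above alludes to.
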
33 | ## References
34 | * Sequence to Sequence Learning with Neural Networks, by Ilya Sutskever, Oriol Vinyals and Quoc V. Le
35 | * https://arxiv.org/pdf/1409.3215
36 | * Natural Language Toolkit (NLTK)
37 | * https://nltk.org
38 | * ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
39 | * https://arxiv.org/abs/1612.03975
40 |
--------------------------------------------------------------------------------
/Text-summarizer.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# CSE 5368 - Spring 2021 - Final Project\n",
8 | "## Text Summarization using Seq2Seq based model\n",
9 | "### By Karan Jeeten Thakkar (1001852000) and Crupanshu Ashishbhai Udani (1001861781)"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {
16 | "colab": {
17 | "base_uri": "https://localhost:8080/"
18 | },
19 | "id": "lMRKmeTvHMom",
20 | "outputId": "82925117-c105-43b9-bd3a-5e5dfbe297fb"
21 | },
22 | "outputs": [
23 | {
24 | "name": "stdout",
25 | "output_type": "stream",
26 | "text": [
27 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "from google.colab import drive\n",
33 | "drive.mount('/content/drive')"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "colab": {
41 | "base_uri": "https://localhost:8080/",
42 | "height": 324
43 | },
44 | "id": "iCCEUi6o8zFI",
45 | "outputId": "f8d9494f-0a7a-470d-f639-cb8203fd6823"
46 | },
47 | "outputs": [
48 | {
49 | "name": "stdout",
50 | "output_type": "stream",
51 | "text": [
52 | "Collecting nltk\n",
53 | "Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)\n",
197 | "Requirement already satisfied, skipping upgrade: regex in /usr/local/lib/python3.7/dist-packages (from nltk) (2019.12.20)\n",
198 | "Requirement already satisfied, skipping upgrade: click in /usr/local/lib/python3.7/dist-packages (from nltk) (7.1.2)\n",
199 | "Requirement already satisfied, skipping upgrade: joblib in /usr/local/lib/python3.7/dist-packages (from nltk) (1.0.1)\n",
200 | "Requirement already satisfied, skipping upgrade: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk) (4.41.1)\n",
201 | "Installing collected packages: nltk\n",
202 | " Found existing installation: nltk 3.2.5\n",
203 | " Uninstalling nltk-3.2.5:\n",
204 | " Successfully uninstalled nltk-3.2.5\n",
205 | "Successfully installed nltk-3.6.2\n"
206 | ]
207 | },
208 | {
209 | "data": {
210 | "application/vnd.colab-display-data+json": {
211 | "pip_warning": {
212 | "packages": [
213 | "nltk"
214 | ]
215 | }
216 | }
217 | },
218 | "metadata": {
219 | "tags": []
220 | },
221 | "output_type": "display_data"
222 | }
223 | ],
224 | "source": [
225 | "!pip install -U nltk # To Upgrade NLTK to >3.5 for METEOR score"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {
232 | "colab": {
233 | "base_uri": "https://localhost:8080/"
234 | },
235 | "id": "TyDi7xSr8P28",
236 | "outputId": "f6d6eac8-d288-4a06-c3db-92f2334edfa3"
237 | },
238 | "outputs": [
239 | {
240 | "name": "stdout",
241 | "output_type": "stream",
242 | "text": [
243 | "The nltk version is 3.6.2.\n"
244 | ]
245 | }
246 | ],
247 | "source": [
248 | "import nltk\n",
249 | "\n",
250 | "print('The nltk version is {}.'.format(nltk.__version__)) # Verify version >3.5"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": null,
256 | "metadata": {
257 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
258 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
259 | "id": "knLZbk0Y8Ul9"
260 | },
261 | "outputs": [],
262 | "source": [
263 | "import numpy as np \n",
264 | "import pandas as pd"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": null,
270 | "metadata": {
271 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
272 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
273 | "id": "_XlKBJz08UmD"
274 | },
275 | "outputs": [],
276 | "source": [
277 | "summary = pd.read_csv('/content/drive/MyDrive/NN_ProjectData/news_summary.csv', encoding='iso-8859-1')\n",
278 | "raw = pd.read_csv('/content/drive/MyDrive/NN_ProjectData/news_summary_more.csv', encoding='iso-8859-1')"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": null,
284 | "metadata": {
285 | "id": "KQvBjgyt8UmF"
286 | },
287 | "outputs": [],
288 | "source": [
289 | "pre_data1 = raw.iloc[:,0:2].copy()\n",
290 | "\n",
291 | "pre_data2 = summary.iloc[:,0:6].copy()\n",
292 | "pre_data2['text'] = pre_data2['author'].str.cat(pre_data2['date'].str.cat(pre_data2['read_more'].str.cat(pre_data2['text'].str.cat(pre_data2['ctext'], sep = \" \"), sep =\" \"),sep= \" \"), sep = \" \")"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {
299 | "id": "EJKvTL-_8UmG"
300 | },
301 | "outputs": [],
302 | "source": [
303 | "pre_data = pd.DataFrame()\n",
304 | "pre_data['text'] = pd.concat([pre_data1['text'], pre_data2['text']], ignore_index=True)\n",
305 | "pre_data['summary'] = pd.concat([pre_data1['headlines'],pre_data2['headlines']],ignore_index = True)"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "execution_count": null,
311 | "metadata": {
312 | "colab": {
313 | "base_uri": "https://localhost:8080/",
314 | "height": 109
315 | },
316 | "id": "87M8mVa-8UmH",
317 | "outputId": "daa45c5b-c456-4650-b9a5-b0324be0fb42"
318 | },
319 | "outputs": [
320 | {
321 | "data": {
322 | "text/html": [
323 | "<div>\n",
324 | "\n",
337 | "<table border=\"1\" class=\"dataframe\">\n",
338 | "  <thead>\n",
339 | "    <tr style=\"text-align: right;\">\n",
340 | "      <th></th>\n",
341 | "      <th>text</th>\n",
342 | "      <th>summary</th>\n",
343 | "    </tr>\n",
344 | "  </thead>\n",
345 | "  <tbody>\n",
346 | "    <tr>\n",
347 | "      <th>0</th>\n",
348 | "      <td>Saurav Kant, an alumnus of upGrad and IIIT-B's...</td>\n",
349 | "      <td>upGrad learner switches to career in ML &amp; Al w...</td>\n",
350 | "    </tr>\n",
351 | "    <tr>\n",
352 | "      <th>1</th>\n",
353 | "      <td>Kunal Shah's credit card bill payment platform...</td>\n",
354 | "      <td>Delhi techie wins free food from Swiggy for on...</td>\n",
355 | "    </tr>\n",
356 | "  </tbody>\n",
357 | "</table>\n",
358 | "</div>"
359 | ],
360 | "text/plain": [
361 | " text summary\n",
362 | "0 Saurav Kant, an alumnus of upGrad and IIIT-B's... upGrad learner switches to career in ML & Al w...\n",
363 | "1 Kunal Shah's credit card bill payment platform... Delhi techie wins free food from Swiggy for on..."
364 | ]
365 | },
366 | "execution_count": 7,
367 | "metadata": {
368 | "tags": []
369 | },
370 | "output_type": "execute_result"
371 | }
372 | ],
373 | "source": [
374 | "pre_data.head(2)"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {
380 | "id": "WN6ey5ud8UmJ"
381 | },
382 | "source": [
383 | "**Seq2Seq LSTM Modelling**"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {
390 | "colab": {
391 | "base_uri": "https://localhost:8080/"
392 | },
393 | "id": "Y4bsb8lc8UmK",
394 | "outputId": "e33e8c43-fdb6-485d-eb33-6147c16b188c"
395 | },
396 | "outputs": [
397 | {
398 | "data": {
399 | "text/plain": [
400 | "0 Saurav Kant, an alumnus of upGrad and IIIT-B's...\n",
401 | "1 Kunal Shah's credit card bill payment platform...\n",
402 | "2 New Zealand defeated India by 8 wickets in the...\n",
403 | "3 With Aegon Life iTerm Insurance plan, customer...\n",
404 | "4 Speaking about the sexual harassment allegatio...\n",
405 | "5 Pakistani singer Rahat Fateh Ali Khan has deni...\n",
406 | "6 India recorded their lowest ODI total in New Z...\n",
407 | "7 Weeks after ex-CBI Director Alok Verma told th...\n",
408 | "8 Andhra Pradesh CM N Chandrababu Naidu has said...\n",
409 | "9 Congress candidate Shafia Zubair won the Ramga...\n",
410 | "Name: text, dtype: object"
411 | ]
412 | },
413 | "execution_count": 8,
414 | "metadata": {
415 | "tags": []
416 | },
417 | "output_type": "execute_result"
418 | }
419 | ],
420 | "source": [
421 | "pre_data['text'][:10]"
422 | ]
423 | },
424 | {
425 | "cell_type": "markdown",
426 | "metadata": {
427 | "id": "qE2PJvQZ8UmN"
428 | },
429 | "source": [
430 | "> **Perform Data Cleansing**"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": null,
436 | "metadata": {
437 | "id": "mVfZEWHF8UmP"
438 | },
439 | "outputs": [],
440 | "source": [
441 | "import re\n",
442 | "\n",
443 | "# Cleans raw text: lowercases it and removes escape characters, repeated punctuation and stray symbols.\n",
444 | "def text_strip(column):\n",
445 | "    for row in column:\n",
446 | "\n",
447 | "        row = re.sub(\"(\\\\t)\", ' ', str(row)).lower()  # delete escape characters\n",
448 | "        row = re.sub(\"(\\\\r)\", ' ', str(row)).lower()\n",
449 | "        row = re.sub(\"(\\\\n)\", ' ', str(row)).lower()\n",
450 | "\n",
451 | "        row = re.sub(\"(__+)\", ' ', str(row)).lower()  # delete runs of two or more _\n",
452 | "        row = re.sub(\"(--+)\", ' ', str(row)).lower()  # delete runs of two or more -\n",
453 | "        row = re.sub(\"(~~+)\", ' ', str(row)).lower()  # delete runs of two or more ~\n",
454 | "        row = re.sub(r\"(\\+\\++)\", ' ', str(row)).lower()  # delete runs of two or more +\n",
455 | "        row = re.sub(r\"(\\.\\.+)\", ' ', str(row)).lower()  # delete runs of two or more .\n",
456 | "\n",
457 | "        row = re.sub(r\"[<>()|&©ø\\[\\]\\'\\\",;?~*!]\", ' ', str(row)).lower()  # delete <>()|&©ø[]'\",;?~*!\n",
458 | "\n",
459 | "        row = re.sub(\"(mailto:)\", ' ', str(row)).lower()  # delete mailto:\n",
460 | "        row = re.sub(r\"(\\\\x9\\d)\", ' ', str(row)).lower()  # delete \\x9* in text\n",
461 | "        row = re.sub(r\"([iI][nN][cC]\\d+)\", 'INC_NUM', str(row)).lower()  # replace INC numbers with INC_NUM (lowercased afterwards)\n",
462 | "        row = re.sub(r\"([cC][mM]\\d+)|([cC][hH][gG]\\d+)\", 'CM_NUM', str(row)).lower()  # replace CM# and CHG# with CM_NUM (lowercased afterwards)\n",
463 | "\n",
464 | "\n",
465 | "        row = re.sub(r\"(\\.\\s+)\", ' ', str(row)).lower()  # delete full stop at end of words (not between)\n",
466 | "        row = re.sub(r\"(\\-\\s+)\", ' ', str(row)).lower()  # delete - at end of words (not between)\n",
467 | "        row = re.sub(r\"(\\:\\s+)\", ' ', str(row)).lower()  # delete : at end of words (not between)\n",
468 | "\n",
469 | "        row = re.sub(r\"(\\s+.\\s+)\", ' ', str(row)).lower()  # delete any single character hanging between 2 spaces\n",
470 | "\n",
471 | "        # Change url http://www.youtube.com/watch/43865346kcre8375 ====> www.youtube.com\n",
472 | "        try:\n",
473 | "            url = re.search(r'((https*:\\/*)([^\\/\\s]+))(.[^\\s]+)', str(row))\n",
474 | "            repl_url = url.group(3)\n",
475 | "            row = re.sub(r'((https*:\\/*)([^\\/\\s]+))(.[^\\s]+)', repl_url, str(row))\n",
476 | "        except AttributeError:\n",
477 | "            pass  # re.search returned None: this row contains no URL\n",
478 | "\n",
479 | "\n",
480 | "\n",
481 | "        row = re.sub(r\"(\\s+)\", ' ', str(row)).lower()  # collapse multiple spaces\n",
482 | "\n",
483 | "        # Should always be last\n",
484 | "        row = re.sub(r\"(\\s+.\\s+)\", ' ', str(row)).lower()  # delete any single character hanging between 2 spaces\n",
485 | "\n",
486 | "\n",
487 | "\n",
488 | "        yield row\n",
489 | "\n",
490 | "\n"
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": null,
496 | "metadata": {
497 | "id": "hBcX-lkv8UmR"
498 | },
499 | "outputs": [],
500 | "source": [
501 | "cleaning1 = text_strip(pre_data['text'])\n",
502 | "cleaning2 = text_strip(pre_data['summary'])"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": null,
508 | "metadata": {
509 | "id": "msfddO-W8UmV"
510 | },
511 | "outputs": [],
512 | "source": [
513 | "from time import time\n",
514 | "import spacy\n",
515 | "nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])  # the 'en' shortcut was removed in spaCy 3\n",
516 | "\n",
517 | "t = time()\n",
518 | "\n",
519 | "text = [str(doc) for doc in nlp.pipe(cleaning1, batch_size=5000)] #spaCy.pipe() to speed-up cleaning; n_threads is no longer accepted in spaCy 3\n",
520 | "\n",
521 | "#Takes 40 mins\n",
522 | "print('Cleaning time for text: {} mins'.format(round((time() - t) / 60, 2)))"
523 | ]
524 | },
525 | {
526 | "cell_type": "code",
527 | "execution_count": null,
528 | "metadata": {
529 | "colab": {
530 | "base_uri": "https://localhost:8080/"
531 | },
532 | "id": "Y0fkm0tq8UmV",
533 | "outputId": "4d8b57c1-f84a-486f-bce4-6b11e6c39796"
534 | },
535 | "outputs": [
536 | {
537 | "name": "stdout",
538 | "output_type": "stream",
539 | "text": [
540 | "Time to clean up everything: 35.4 mins\n"
541 | ]
542 | }
543 | ],
544 | "source": [
545 | "t = time()\n",
546 | "\n",
547 | "summary = ['_START_ '+ str(doc) + ' _END_' for doc in nlp.pipe(cleaning2, batch_size=5000)]\n",
548 | "\n",
549 | "#Takes 40 mins\n",
550 | "print('Cleaning time for summary: {} mins'.format(round((time() - t) / 60, 2)))"
551 | ]
552 | },
553 | {
554 | "cell_type": "code",
555 | "execution_count": null,
556 | "metadata": {
557 | "id": "xHFeD5xt8UmX"
558 | },
559 | "outputs": [],
560 | "source": [
561 | "text[0]"
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": null,
567 | "metadata": {
568 | "id": "LWU_-qlE8UmX"
569 | },
570 | "outputs": [],
571 | "source": [
572 | "summary[0]"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {
578 | "id": "jk2Gu1nQGgv4"
579 | },
580 | "source": [
581 | "### Save the clean data to files"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": null,
587 | "metadata": {
588 | "id": "cYdoKApZRGd5"
589 | },
590 | "outputs": [],
591 | "source": [
592 | "with open('/content/drive/MyDrive/NN_ProjectData/text.txt', \"w\") as f:\n",
593 | "    for item in text:\n",
594 | "        f.write(item + '\\n')\n",
595 | "\n",
596 | "with open('/content/drive/MyDrive/NN_ProjectData/summary.txt', \"w\") as f:\n",
597 | "    for item in summary:\n",
598 | "        f.write(item + '\\n')"
599 | ]
600 | },
601 | {
602 | "cell_type": "markdown",
603 | "metadata": {
604 | "id": "d93AuqQvGsr6"
605 | },
606 | "source": [
607 | "### Read clean data from files"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": null,
613 | "metadata": {
614 | "id": "Ug2RSpL9WyUR"
615 | },
616 | "outputs": [],
617 | "source": [
618 | "text1 = []\n",
619 | "summary1 = []\n",
620 | "text = []\n",
621 | "summary = []\n",
622 | "\n",
623 | "with open('/content/drive/MyDrive/NN_ProjectData/text.txt', \"r\") as f:\n",
624 | "    for line in f.readlines():\n",
625 | "        text1.append(line)\n",
626 | "\n",
627 | "with open('/content/drive/MyDrive/NN_ProjectData/summary.txt', \"r\") as f:\n",
628 | "    for line in f.readlines():\n",
629 | "        summary1.append(line)\n",
630 | "\n",
631 | "for item in text1:\n",
632 | "    text.append(item.replace('\\n', ''))\n",
633 | "\n",
634 | "for item in summary1:\n",
635 | "    summary.append(item.replace('\\n', ''))"
636 | ]
637 | },
638 | {
639 | "cell_type": "code",
640 | "execution_count": null,
641 | "metadata": {
642 | "id": "zW240FS58UmY"
643 | },
644 | "outputs": [],
645 | "source": [
646 | "pre_data['cleaned_text'] = pd.Series(text)\n",
647 | "pre_data['cleaned_summary'] = pd.Series(summary)"
648 | ]
649 | },
650 | {
651 | "cell_type": "code",
652 | "execution_count": null,
653 | "metadata": {
654 | "id": "FrQNadK28UmZ"
655 | },
656 | "outputs": [],
657 | "source": [
658 | "text_count = []\n",
659 | "summary_count = []"
660 | ]
661 | },
662 | {
663 | "cell_type": "code",
664 | "execution_count": null,
665 | "metadata": {
666 | "id": "1i_pUmai8UmZ"
667 | },
668 | "outputs": [],
669 | "source": [
670 | "for sent in pre_data['cleaned_text']:\n",
671 | "    text_count.append(len(sent.split()))\n",
672 | "for sent in pre_data['cleaned_summary']:\n",
673 | "    summary_count.append(len(sent.split()))"
674 | ]
675 | },
676 | {
677 | "cell_type": "code",
678 | "execution_count": null,
679 | "metadata": {
680 | "id": "wQFH3JUG8Uma"
681 | },
682 | "outputs": [],
683 | "source": [
684 | "graph_df= pd.DataFrame()\n",
685 | "graph_df['text']=text_count\n",
686 | "graph_df['summary']=summary_count"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": null,
692 | "metadata": {
693 | "colab": {
694 | "base_uri": "https://localhost:8080/",
695 | "height": 281
696 | },
697 | "id": "HxyIJDBa8Uma",
698 | "outputId": "72ae9b27-8dec-470c-a7d0-7582ede36e5f"
699 | },
700 | "outputs": [
701 | {
702 | "data": {
703 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEICAYAAACqMQjAAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAcv0lEQVR4nO3df5BV5Z3n8fcnEJUQFdRMh4AzmEglRWSN2qvMms30qsHWZAe3KmZ13AFdKtRWdGJWJhFndwsnxgxurTFijLskMoBDJK6JAxsxyKhd2aldUIg/EI1LixjoQomCGEw0g/nuH+fpeOy+D930vX3v7cvnVXXqnvM9zznneS7n3u9znnO4rYjAzMyskvc0ugJmZta8nCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsy0nCzMyynCRahKTtks5rlv2YWWtwkjAzO0SSRje6DvXiJNECJN0F/CHwvyTtl/RVSdMl/R9Jr0l6UlJHKvsvJL0i6cS0fKqkvZI+Vmk/DWuUtTxJ10rqkfQrSc9JOlfSUklfL5XpkLSztLxd0lckPSXpDUl3SmqT9EDazz9IGp/KTpYUkq6QtCOd5/9B0j9P278m6dulfX9E0sOSXk2fkRWSxvU59rWSngLeSPX4YZ82LZJ067C+cfUWEZ5aYAK2A+el+YnAq8CFFB2BT6flD6T1NwIPA2OAzcBVlfbjydNwTcBHgR3Ah9LyZOAjwFLg66VyHcDO0vJ2YD3Qls7z3cDPgNOAo9J5vaC0zwD+e1o3A3gT+HvgD0rb/0kqf3L6rBwJfAD4KfCtPsd+AjgxfXYmAG8A49L60Wl/ZzT6/a3l5CuJ1vTvgDURsSYifhcR64CNFEkD4HrgWOBRoAe4vSG1tMPZ2xRfxlMlvTcitkfE84Pc9raIeDkieoD/DWyIiMcj4k3gPoqEUXZDRLwZEQ9SfKnfHRG7S9ufBhAR3RGxLiLeiohfAt8E/qTPvhZFxI6I+E1E7KJIJBendZ3AKxGx6ZDeiSbnJNGa/gi4OF1OvybpNeCTFD0fIuKfKHpspwA3R+oGmdVLRHQDX6bosOyWtFLShwa5+cul+d9UWH7/UMqnYauVaQjsdeDvgBP67GtHn+VlFJ0y0utdg2zDiOEk0TrKX/Q7gLsiYlxpGhsRCwEkTQQWAH8L3CzpyMx+zIZNRHw/Ij5J0akJ4CaKnv77SsU+WMcqfSPVY1pEHEPxpa8+Zfp+Pv4e+GeSTgE+C6wY9lrWmZNE63gZ+HCa/zvgX0s6X9IoSUelG4CTJIniKuJOYA6wC7ghsx+zYSHpo5LOSR2UNyl69L+jGPO/UNJxkj5IcbVRL0cD+4F9qSP1lYE2SENc9wLfBx6NiF8MbxXrz0midfwN8J/T0NK/BWYCfwX8kuLK4isU/95forhp91/SMNMVwBWS/mXf/Uj6yzq3wQ4fRwILgVeAlyjOyesohmuepLhJ/CDwgzrW6a+B04F9wP3Ajwa53TJgGi041AQgD0ebmQ2dpD8Efg58MCJeb3R9as1XEmZmQyTpPcA1wMpWTBBQPNdrZmaHSNJYint4L1I8/tqSPNxkZmZZHm4yM7OslhtuOuGEE2Ly5Mn94m+88QZjx46tf4XqpJXbV++2bdq06ZWI+EDdDlil3DlfKyPt3Bpp9YXmqHPuvG+5JDF58mQ2btzYL97V1UVHR0f9K1Qnrdy+erdN0ot1O1gN5M75Whlp59ZIqy80R51z572Hm8zMLMtJwszMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsq+X+x3XO5p59XD7//roca/vCz9TlOGaHo8kDfI7nTTtQk8+6P8cFX0mYmVnWgElC0hJJuyU9XYodJ2mdpK3pdXyKS9IiSd2SnpJ0emmb2an8VkmzS/EzJG1O2yxKf4M5ewwzM6ufwVxJLKX/H9SYDzwUEVOAh9Iy
wAXAlDTNBe6A4gsfWACcBZwJLCh96d8BfKG0XecAxzAzszoZMElExE+BPX3CMyn++Dfp9aJSfHkU1gPjJE0AzgfWRcSeiNgLrAM607pjImJ9FH/9aHmffVU6hpmZ1clQ70m0RcSuNP8S0JbmJwI7SuV2ptjB4jsrxA92DLOGkbQ9DY8+IWljig378KtZo1T9dFNEhKRh/RuoAx1D0lyK4S3a2tro6urqV6ZtTPHUQz1UOv5w279/f0OOWw9N2LZ/FRGvlJZ7h0YXSpqflq/l3cOvZ1EMrZ5VGn5tBwLYJGl1usruHX7dAKyhGH59oD7NMutvqEniZUkTImJXGjLaneI9wImlcpNSrAfo6BPvSvFJFcof7Bj9RMRiYDFAe3t7VPrjHbetWMXNm+vzxO/2y/off7g1wx8tGS4joG0zeef8XkZxbl9LafgVWC+pd/i1gzT8CiCpd/i1izT8muK9w69OEtYwQ/3WXA3MBham11Wl+FWSVlL0nPalL/m1wDdKN6tnANdFxB5Jr0uaTtFzmgXcNsAxzBopgAfTle3/SB2Uegy/vstgrp5rpdmu5AYaEajVqEE929xs73HZgElC0t0UPZ8TJO2kuExeCNwjaQ7wIvD5VHwNcCHQDfwauAIgJYMbgMdSua/19qKAL1I8QTWGosfU22vKHcOskT4ZET2S/gBYJ+nn5ZX1GH5Nxxnw6rlWmu1KbqD/KDdv2oGajBrUc0Sg2d7jsgHfyYi4NLPq3AplA7gys58lwJIK8Y3AKRXir1Y6hlkjRURPet0t6T6KR7rrMfxq1hD+H9dmgyRprKSje+cphk2f5p2hUeg//DorPeU0nTT8CqwFZkgan4ZgZwBr07rXJU1PTzXNwsOs1mCHzW83mdVAG3Bfeip1NPD9iPiJpMcY/uFXs4ZwkjAbpIjYBpxaIV5xaLSWw69mjeLhJjMzy3KSMDOzLCcJMzPL8j0JM7MKBvrjRrW0tHNs3Y51qHwlYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWVUlCUn/UdIWSU9LulvSUZJOkrRBUrekH0g6IpU9Mi13p/WTS/u5LsWfk3R+Kd6ZYt2S5ldTVzMzO3RDThKSJgJfAtoj4hRgFHAJcBNwS0ScDOwF5qRN5gB7U/yWVA5JU9N2Hwc6ge9IGiVpFHA7cAEwFbg0lTVrqHR+Pi7px2nZHSNrWdUON40GxkgaDbwP2AWcA9yb1i8DLkrzM9Myaf25kpTiKyPirYh4AegGzkxTd0Rsi4jfAitTWbNGuxp4trTsjpG1rNFD3TAieiT9N+AXwG+AB4FNwGsRcSAV2wlMTPMTgR1p2wOS9gHHp/j60q7L2+zoEz+rUl0kzQXmArS1tdHV1dWvTNsYmDftQL/4cKh0/OG2f//+hhy3HpqpbZImAZ8BbgSuSR2dc4A/S0WWAdcDd1B0aq5P8XuBb/ftGAEvSOrtGEHqGKVj9XaMnhnmZpllDTlJSBpPcQKfBLwG/E+KXlHdRcRiYDFAe3t7dHR09Ctz24pV3Lx5yM09JNsv63/84dbV1UWldreCJmvbt4CvAken5eNpQMfIrF6q+dY8D3ghIn4JIOlHwNnAOEmj04dmEtCTyvcAJwI70/DUscCrpXiv8ja5uFndSfossDsiNknqaHBdBrx6rpVmupKDgUcE6jlqUCvN9h6XVZMkfgFMl/Q+iuGmc4GNwCPA5yjuIcwGVqXyq9Py/03rH46IkLQa+L6kbwIfAqYAjwICpkg6iSI5XMI7l/RmjXA28KeSLgSOAo4BbqUBHaPBXD3XSpNdyXH5/PsPun7etAN1GzWolaWdY5vqPS4b8o3riNhAMc76M2Bz2tdi
4FqKsdpuikvrO9MmdwLHp/g1wPy0ny3APRTjrj8BroyIt9MH7ipgLcVNwntSWbOGiIjrImJSREym6LQ8HBGX8U7HCCp3jKDUMUrxS9LTTyfxTsfoMVLHKD0hdUkqa9YwVaXbiFgALOgT3sY7N+HKZd8ELs7s50aKG4F942uANdXU0awOrgVWSvo68Djv7hjdlTpGeyi+9ImILZJ6O0YHSB0jAEm9HaNRwBJ3jKzRRtY1mVmTiIguoCvNu2NkLcs/y2FmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpZVVZKQNE7SvZJ+LulZSX8s6ThJ6yRtTa/jU1lJWiSpW9JTkk4v7Wd2Kr9V0uxS/AxJm9M2iySpmvqamdmhqfZK4lbgJxHxMeBU4FlgPvBQREwBHkrLABcAU9I0F7gDQNJxwALgLOBMYEFvYkllvlDarrPK+ppVRdJRkh6V9KSkLZL+OsVPkrQhdWh+IOmIFD8yLXen9ZNL+7ouxZ+TdH4p3pli3ZLm962DWT0NOUlIOhb4FHAnQET8NiJeA2YCy1KxZcBFaX4msDwK64FxkiYA5wPrImJPROwF1gGdad0xEbE+IgJYXtqXWaO8BZwTEacCn6A4V6cDNwG3RMTJwF5gTio/B9ib4rekckiaClwCfJyi8/MdSaMkjQJup+hUTQUuTWXNGmJ0FdueBPwS+FtJpwKbgKuBtojYlcq8BLSl+YnAjtL2O1PsYPGdFeL9SJpLcXVCW1sbXV1d/cq0jYF50w4MvnVVqHT84bZ///6GHLcemqltqcOyPy2+N00BnAP8WYovA66nuBKemeYB7gW+nYZNZwIrI+It4AVJ3RRX0gDdEbENQNLKVPaZ4WuVWV41SWI0cDrwFxGxQdKtvDO0BBQfKElRTQUHIyIWA4sB2tvbo6Ojo1+Z21as4ubN1TR38LZf1v/4w62rq4tK7W4Fzda21NvfBJxM0et/HngtInp7IeUOze87QRFxQNI+4PgUX1/abXmbvp2msyrUYcCOUa00U5KGgTt79ewQ1kqzvcdl1Xxr7gR2RsSGtHwvRZJ4WdKEiNiVhox2p/U9wIml7SelWA/Q0SfeleKTKpQ3a6iIeBv4hKRxwH3AxxpQhwE7RrXSbEn68vn3H3T9vGkH6tYhrJWlnWOb6j0uG/I9iYh4Cdgh6aMpdC7FJfFqoPcJpdnAqjS/GpiVnnKaDuxLw1JrgRmSxqcb1jOAtWnd65Kmp8vzWaV9mTVcugf3CPDHFPfYer+Zyh2a33eO0vpjgVc5eKepUtysIap9uukvgBWSnqK4ifcNYCHwaUlbgfPSMsAaYBvQDXwX+CJAROwBbgAeS9PXUoxU5ntpm+eBB6qsr1lVJH0gXUEgaQzwaYqn+h4BPpeK9e0c9XaaPgc8nO5rrAYuSU8/nUTx9N6jFJ+BKelpqSMobm6vHv6WmVVW1TVZRDwBtFdYdW6FsgFcmdnPEmBJhfhG4JRq6mhWYxOAZem+xHuAeyLix5KeAVZK+jrwOOmpv/R6V7oxvYfiS5+I2CLpHoqr7wPAlWkYC0lXUVxhjwKWRMSW+jXP7N1G1sCdWYNFxFPAaRXi23jn6aRy/E3g4sy+bgRurBBfQ3HlbdZw/lkOMzPLcpIwM7MsJwkzM8tykjAzsywnCTMzy3KSMDOzLCcJMzPLcpIwM7MsJwkzM8tykjAzsywnCTMzy3KSMDOzLCcJMzPLcpIwM7MsJwkzM8tykjAzsywnCTMzy3KSMDOzLCcJMzPLcpIwM7MsJwkzM8tykjAzsywnCTMzy3KSMDOzLCcJMzPLcpIwM7MsJwkz
M8tykjAzsywnCTMzy3KSMDOzrKqThKRRkh6X9OO0fJKkDZK6Jf1A0hEpfmRa7k7rJ5f2cV2KPyfp/FK8M8W6Jc2vtq5mZnZoanElcTXwbGn5JuCWiDgZ2AvMSfE5wN4UvyWVQ9JU4BLg40An8J2UeEYBtwMXAFOBS1NZs4aQdKKkRyQ9I2mLpKtT/DhJ6yRtTa/jU1ySFqVOzlOSTi/ta3Yqv1XS7FL8DEmb0zaLJKn+LTV7R1VJQtIk4DPA99KygHOAe1ORZcBFaX5mWiatPzeVnwmsjIi3IuIFoBs4M03dEbEtIn4LrExlzRrlADAvIqYC04ErU8dlPvBQREwBHkrLUHRwpqRpLnAHFEkFWACcRXGeL+hNLKnMF0rbddahXWZZo6vc/lvAV4Gj0/LxwGsRcSAt7wQmpvmJwA6AiDggaV8qPxFYX9pneZsdfeJnVaqEpLkUH0La2tro6urqV6ZtDMybdqBffDhUOv5w279/f0OOWw/N0raI2AXsSvO/kvQsxbk6E+hIxZYBXcC1Kb48IgJYL2mcpAmp7LqI2AMgaR3QKakLOCYi1qf4copO1gP1aJ9ZJUNOEpI+C+yOiE2SOmpXpUMXEYuBxQDt7e3R0dG/OretWMXNm6vNiYOz/bL+xx9uXV1dVGp3K2jGtqV7aqcBG4C2lEAAXgLa0vzvO0ZJbwfoYPGdFeKVjj9gx6hWmiVJ9xqos1fPDmGtNNt7XFbNt+bZwJ9KuhA4CjgGuBUYJ2l0upqYBPSk8j3AicBOSaOBY4FXS/Fe5W1ycbOGkfR+4IfAlyPi9fJtg4gISTHcdRhMx6hWmi1JXz7//oOunzftQN06hLWytHNsU73HZUO+JxER10XEpIiYTHHj+eGIuAx4BPhcKjYbWJXmV6dl0vqH02X4auCS9PTTSRTjsI8CjwFT0tNSR6RjrB5qfc1qQdJ7KRLEioj4UQq/nIaRSK+7UzzXATpYfFKFuFnDDMf/k7gWuEZSN8U9hztT/E7g+BS/hnRzLyK2APcAzwA/Aa6MiLfTlchVwFqKp6fuSWXNGiI9aHEn8GxEfLO0qtwB6tsxmpWecpoO7EvDUmuBGZLGpxvWM4C1ad3rkqanY80q7cusIWpyTRYRXRQ364iIbRRPbPQt8yZwcWb7G4EbK8TXAGtqUUezGjgb+HNgs6QnUuyvgIXAPZLmAC8Cn0/r1gAXUjyx92vgCoCI2CPpBoqrZYCv9d7EBr4ILAXGUNyw9k1ra6iRNXBn1kAR8Y9A7v8tnFuhfABXZva1BFhSIb4ROKWKaprVlH+Ww8zMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLMtJwszMspwkzMwsy0nCzMyynCTMzCzLScLMzLKcJMzMLGvISULSiZIekfSMpC2Srk7x4yStk7Q1vY5PcUlaJKlb0lOSTi/ta3Yqv1XS7FL8DEmb0zaLJKmaxpqZ2aGp5kriADAvIqYC04ErJU0F5gMPRcQU4KG0DHABMCVNc4E7oEgqwALgLOBMYEFvYkllvlDarrOK+ppVTdISSbslPV2KuWNkLWvISSIidkXEz9L8r4BngYnATGBZKrYMuCjNzwSWR2E9ME7SBOB8YF1E7ImIvcA6oDOtOyYi1kdEAMtL+zJrlKX076y4Y2Qtqyb3JCRNBk4DNgBtEbErrXoJaEvzE4Edpc12ptjB4jsrxM0aJiJ+CuzpE3bHyFrW6Gp3IOn9wA+BL0fE6+Wr44gISVHtMQZRh7kUPTXa2tro6urqV6ZtDMybdmC4qwJQ8fjDbf/+/Q05bj2MgLbVvWM0mHO+Vprt/R/oc1zPz3qtNNt7XFZVkpD0XooEsSIifpTCL0uaEBG7Us9od4r3
ACeWNp+UYj1AR594V4pPqlC+n4hYDCwGaG9vj46Ojn5lbluxips3V50TB2X7Zf2PP9y6urqo1O5WMJLaVq+O0WDO+Vpptvf/8vn3H3T9vGkH6vZZr5WlnWOb6j0uq+bpJgF3As9GxDdLq1YDvTfiZgOrSvFZ6WbedGBf6n2tBWZIGp/GZWcAa9O61yVNT8eaVdqXWTN5OXWIOISOUS4+qI6RWb1Uc0/ibODPgXMkPZGmC4GFwKclbQXOS8sAa4BtQDfwXeCLABGxB7gBeCxNX0sxUpnvpW2eBx6oor5mw8UdI2tZQ74mi4h/BHKP551boXwAV2b2tQRYUiG+EThlqHU0qzVJd1MMj54gaSfFU0oLgXskzQFeBD6fiq8BLqTo5PwauAKKjpGk3o4R9O8YLQXGUHSK3DGyhhpZA3dmDRYRl2ZWuWNkLck/y2FmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWU4SZmaW5SRhZmZZThJmZpblJGFmZllOEmZmluUkYWZmWaMbXQEzG/kmz7+/0VWwYeIkYWbWYJt79nF5HRLt9oWfOeRtPNxkZmZZThJmZpbV9ElCUqek5yR1S5rf6PqYDTef89ZMmjpJSBoF3A5cAEwFLpU0tbG1Mhs+Puet2TR1kgDOBLojYltE/BZYCcxscJ3MhpPPeWsqzf5000RgR2l5J3BW30KS5gJz0+J+Sc9V2NcJwCs1r2EFuqkeR+mnbu1rgHq37Y/qeKy+annO18qIOre+NMLqC/Wr8wDfTRXP+2ZPEoMSEYuBxQcrI2ljRLTXqUp118rta+W2DdVgzvlaGWnv/0irLzR3nZt9uKkHOLG0PCnFzFqVz3lrKs2eJB4Dpkg6SdIRwCXA6gbXyWw4+Zy3ptLUw00RcUDSVcBaYBSwJCK2DHF3dbk0b6BWbl8rt+1danzO18pIe/9HWn2hieusiGh0HczMrEk1+3CTmZk1kJOEmZllHRZJYqT+zIGk7ZI2S3pC0sYUO07SOklb0+v4FJekRamNT0k6vbSf2an8VkmzG9SWJZJ2S3q6FKtZWySdkd6r7rSt6tvC1lTpHGwmh3JeNYNMfa+X1JPe4yckXdjIOvYTES09Udz8ex74MHAE8CQwtdH1GmTdtwMn9In9V2B+mp8P3JTmLwQeAARMBzak+HHAtvQ6Ps2Pb0BbPgWcDjw9HG0BHk1llba9oNH/fq0wVToHm2k6lPOqGaZMfa8H/rLRdctNh8OVRKv9zMFMYFmaXwZcVIovj8J6YJykCcD5wLqI2BMRe4F1QGe9Kx0RPwX29AnXpC1p3TERsT6KT93y0r6shR3iedVwmfo2tcMhSVT6mYOJDarLoQrgQUmb0s8wALRFxK40/xLQluZz7Wzm9teqLRPTfN+4Va/SOdjscudVM7sqDa0uaabhMTg8ksRI9smIOJ3iF0GvlPSp8srUa26JZ5hbqS0t5qDnYLMbIefVHcBHgE8Au4CbG1uddzscksSI/ZmDiOhJr7uB+yiGzl5Owyuk192peK6dzdz+WrWlJ833jVuVMudgs8udV00pIl6OiLcj4nfAd2my9/hwSBIj8mcOJI2VdHTvPDADeJqi7r1P9cwGVqX51cCs9GTQdGBfuuReC8yQND5dxs5IsWZQk7akda9Lmp6eappV2pcN0UHOwWaXO6+aUm9CS/4NzfYeN/rOeT0miqdl/h/FU07/qdH1GWSdP0zxJNaTwJbeegPHAw8BW4F/AI5LcVH8sZrngc1Ae2lf/x7oTtMVDWrP3RSX0v9Ecc9gTi3bArRTfLieB75N+jUBT7U/B5tpOpTzqhmmTH3vSuf5UxQJbkKj61me/LMcZmaWdTgMN5mZ2RA5SZiZWZaThJmZZTlJmJlZlpOE
mZllOUmYmVmWk4SZmWX9f3zNYSn41YvPAAAAAElFTkSuQmCC\n",
704 | "text/plain": [
705 | ""
706 | ]
707 | },
708 | "metadata": {
709 | "needs_background": "light",
710 | "tags": []
711 | },
712 | "output_type": "display_data"
713 | }
714 | ],
715 | "source": [
716 | "import matplotlib.pyplot as plt\n",
717 | "\n",
718 | "graph_df.hist(bins = 5)\n",
719 | "plt.show()"
720 | ]
721 | },
722 | {
723 | "cell_type": "code",
724 | "execution_count": null,
725 | "metadata": {
726 | "colab": {
727 | "base_uri": "https://localhost:8080/"
728 | },
729 | "id": "LknEV1cL8Umb",
730 | "outputId": "620e188d-fafd-4bde-dae0-934cb0c1bcf3"
731 | },
732 | "outputs": [
733 | {
734 | "name": "stdout",
735 | "output_type": "stream",
736 | "text": [
737 | "0.9978234465335472\n"
738 | ]
739 | }
740 | ],
741 | "source": [
742 | "#Check what fraction of summaries have 0-15 words\n",
743 | "count=0\n",
744 | "for i in pre_data['cleaned_summary']:\n",
745 | "    if(len(i.split())<=15):\n",
746 | "        count=count+1\n",
747 | "print(count/len(pre_data['cleaned_summary']))"
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": null,
753 | "metadata": {
754 | "colab": {
755 | "base_uri": "https://localhost:8080/"
756 | },
757 | "id": "HZYukgRp8Umb",
758 | "outputId": "18d3931e-ffd5-4d4d-818a-5049a12de18e"
759 | },
760 | "outputs": [
761 | {
762 | "name": "stdout",
763 | "output_type": "stream",
764 | "text": [
765 | "0.9578389933440218\n"
766 | ]
767 | }
768 | ],
769 | "source": [
770 | "#Check what fraction of texts have 0-100 words\n",
771 | "count=0\n",
772 | "for i in pre_data['cleaned_text']:\n",
773 | "    if(len(i.split())<=100):\n",
774 | "        count=count+1\n",
775 | "print(count/len(pre_data['cleaned_text']))"
776 | ]
777 | },
778 | {
779 | "cell_type": "code",
780 | "execution_count": null,
781 | "metadata": {
782 | "id": "HbGSD9M-r3Ep"
783 | },
784 | "outputs": [],
785 | "source": [
786 | "#Restrict the model to summaries of 0-15 words and texts of 0-100 words\n",
787 | "\n",
788 | "max_text_len=100\n",
789 | "max_summary_len=15"
790 | ]
791 | },
792 | {
793 | "cell_type": "code",
794 | "execution_count": null,
795 | "metadata": {
796 | "id": "UspLHLn18Umc"
797 | },
798 | "outputs": [],
799 | "source": [
800 | "#Keep only the texts and summaries within the maximum lengths defined above\n",
801 | "\n",
802 | "cleaned_text =np.array(pre_data['cleaned_text'])\n",
803 | "cleaned_summary=np.array(pre_data['cleaned_summary'])\n",
804 | "\n",
805 | "short_text=[]\n",
806 | "short_summary=[]\n",
807 | "\n",
808 | "for i in range(len(cleaned_text)):\n",
809 | "    if(len(cleaned_summary[i].split())<=max_summary_len and len(cleaned_text[i].split())<=max_text_len):\n",
810 | "        short_text.append(cleaned_text[i])\n",
811 | "        short_summary.append(cleaned_summary[i])\n",
812 | "\n",
813 | "post_pre_data=pd.DataFrame({'text':short_text,'summary':short_summary})"
814 | ]
815 | },
816 | {
817 | "cell_type": "code",
818 | "execution_count": null,
819 | "metadata": {
820 | "colab": {
821 | "base_uri": "https://localhost:8080/",
822 | "height": 109
823 | },
824 | "id": "vfHugO5G8Umc",
825 | "outputId": "0b0c848d-fc2c-4a40-9fc4-d613391aa78c"
826 | },
827 | "outputs": [
828 | {
829 | "data": {
830 | "text/html": [
831 | "<div>\n",
832 | "\n",
845 | "<table border=\"1\" class=\"dataframe\">\n",
846 | "  <thead>\n",
847 | "    <tr style=\"text-align: right;\">\n",
848 | "      <th></th>\n",
849 | "      <th>text</th>\n",
850 | "      <th>summary</th>\n",
851 | "    </tr>\n",
852 | "  </thead>\n",
853 | "  <tbody>\n",
854 | "    <tr>\n",
855 | "      <th>0</th>\n",
856 | "      <td>saurav kant an alumnus of upgrad and iiit-b pg...</td>\n",
857 | "      <td>_START_ upgrad learner switches to career in m...</td>\n",
858 | "    </tr>\n",
859 | "    <tr>\n",
860 | "      <th>1</th>\n",
861 | "      <td>kunal shah credit card bill payment platform c...</td>\n",
862 | "      <td>_START_ delhi techie wins free food from swigg...</td>\n",
863 | "    </tr>\n",
864 | "  </tbody>\n",
865 | "</table>\n",
866 | "</div>"
867 | ],
868 | "text/plain": [
869 | " text summary\n",
870 | "0 saurav kant an alumnus of upgrad and iiit-b pg... _START_ upgrad learner switches to career in m...\n",
871 | "1 kunal shah credit card bill payment platform c... _START_ delhi techie wins free food from swigg..."
872 | ]
873 | },
874 | "execution_count": 24,
875 | "metadata": {
876 | "tags": []
877 | },
878 | "output_type": "execute_result"
879 | }
880 | ],
881 | "source": [
882 | "post_pre_data.head(2)"
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": null,
888 | "metadata": {
889 | "id": "xexWSUUF8Umd"
890 | },
891 | "outputs": [],
892 | "source": [
893 | "#Add sostok and eostok tokens at the start and end of each summary\n",
894 | "post_pre_data['summary'] = post_pre_data['summary'].apply(lambda x : 'sostok '+ x + ' eostok')\n"
895 | ]
896 | },
897 | {
898 | "cell_type": "code",
899 | "execution_count": null,
900 | "metadata": {
901 | "colab": {
902 | "base_uri": "https://localhost:8080/",
903 | "height": 109
904 | },
905 | "id": "R9gDgu5T8Ume",
906 | "outputId": "5796be18-4d27-40bc-a8a9-c64086558861"
907 | },
908 | "outputs": [
909 | {
910 | "data": {
911 | "text/html": [
912 | "<div>\n",
913 | "\n",
926 | "<table border=\"1\" class=\"dataframe\">\n",
927 | "  <thead>\n",
928 | "    <tr style=\"text-align: right;\">\n",
929 | "      <th></th>\n",
930 | "      <th>text</th>\n",
931 | "      <th>summary</th>\n",
932 | "    </tr>\n",
933 | "  </thead>\n",
934 | "  <tbody>\n",
935 | "    <tr>\n",
936 | "      <th>0</th>\n",
937 | "      <td>saurav kant an alumnus of upgrad and iiit-b pg...</td>\n",
938 | "      <td>sostok _START_ upgrad learner switches to care...</td>\n",
939 | "    </tr>\n",
940 | "    <tr>\n",
941 | "      <th>1</th>\n",
942 | "      <td>kunal shah credit card bill payment platform c...</td>\n",
943 | "      <td>sostok _START_ delhi techie wins free food fro...</td>\n",
944 | "    </tr>\n",
945 | "  </tbody>\n",
946 | "</table>\n",
947 | "</div>"
948 | ],
949 | "text/plain": [
950 | " text summary\n",
951 | "0 saurav kant an alumnus of upgrad and iiit-b pg... sostok _START_ upgrad learner switches to care...\n",
952 | "1 kunal shah credit card bill payment platform c... sostok _START_ delhi techie wins free food fro..."
953 | ]
954 | },
955 | "execution_count": 26,
956 | "metadata": {
957 | "tags": []
958 | },
959 | "output_type": "execute_result"
960 | }
961 | ],
962 | "source": [
963 | "post_pre_data.head(2)"
964 | ]
965 | },
966 | {
967 | "cell_type": "markdown",
968 | "metadata": {
969 | "id": "MDbnEPTt8Ume"
970 | },
971 | "source": [
972 | "**SEQ2SEQ MODEL BUILDING**"
973 | ]
974 | },
975 | {
976 | "cell_type": "markdown",
977 | "metadata": {
978 | "id": "1kAA_oku8Ume"
979 | },
980 | "source": [
981 | "Split the data to TRAIN and VALIDATION sets"
982 | ]
983 | },
984 | {
985 | "cell_type": "code",
986 | "execution_count": null,
987 | "metadata": {
988 | "id": "EK_q36VP8Umf"
989 | },
990 | "outputs": [],
991 | "source": [
992 | "from sklearn.model_selection import train_test_split\n",
993 | "x_tr,x_val,y_tr,y_val=train_test_split(np.array(post_pre_data['text']),np.array(post_pre_data['summary']),test_size=0.1,random_state=0,shuffle=True)"
994 | ]
995 | },
996 | {
997 | "cell_type": "code",
998 | "execution_count": null,
999 | "metadata": {
1000 | "id": "iqSxSJwL8Umf"
1001 | },
1002 | "outputs": [],
1003 | "source": [
1004 | "#Let's tokenize the text to get the vocab count; spaCy could also be used here\n",
1005 | "\n",
1006 | "from keras.preprocessing.text import Tokenizer \n",
1007 | "from keras.preprocessing.sequence import pad_sequences\n",
1008 | "\n",
1009 | "#prepare a tokenizer for reviews on training data\n",
1010 | "x_tknizer = Tokenizer() \n",
1011 | "x_tknizer.fit_on_texts(list(x_tr))"
1012 | ]
1013 | },
1014 | {
1015 | "cell_type": "markdown",
1016 | "metadata": {
1017 | "id": "ZH27jFzK8Umf"
1018 | },
1019 | "source": [
1020 | "**RARE WORD ANALYSIS FOR X, i.e. 'text'**\n",
1021 | "* total_count is the size of the vocabulary (the number of unique words in the text)\n",
1022 | "\n",
1023 | "* count is the number of rare words, i.e. words whose count falls below the threshold\n",
1024 | "\n",
1025 | "* total_count - count is the number of remaining, more common words"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "code",
1030 | "execution_count": null,
1031 | "metadata": {
1032 | "colab": {
1033 | "base_uri": "https://localhost:8080/"
1034 | },
1035 | "id": "KO6bMeP48Umf",
1036 | "outputId": "fdc21c53-04bf-4d75-debf-2c077e82097e"
1037 | },
1038 | "outputs": [
1039 | {
1040 | "name": "stdout",
1041 | "output_type": "stream",
1042 | "text": [
1043 | "% of rare words in vocabulary: 57.91270391131826\n",
1044 | "Total Coverage of rare words: 1.3404923996005096\n"
1045 | ]
1046 | }
1047 | ],
1048 | "source": [
1049 | "thresh=4\n",
1050 | "\n",
1051 | "count=0\n",
1052 | "total_count=0\n",
1053 | "frequency=0\n",
1054 | "total_frequency=0\n",
1055 | "\n",
1056 | "for key,value in x_tknizer.word_counts.items():\n",
1057 | "    total_count=total_count+1\n",
1058 | "    total_frequency=total_frequency+value\n",
1059 | "    if(value"
1562 | ]
1563 | },
1564 | "metadata": {
1565 | "needs_background": "light",
1566 | "tags": []
1567 | },
1568 | "output_type": "display_data"
1569 | }
1570 | ],
1571 | "source": [
1572 | "from matplotlib import pyplot\n",
1573 | "pyplot.plot(history.history['loss'], label='train')\n",
1574 | "pyplot.plot(history.history['val_loss'], label='test')\n",
1575 | "pyplot.legend()\n",
1576 | "pyplot.show()"
1577 | ]
1578 | },
1579 | {
1580 | "cell_type": "code",
1581 | "execution_count": null,
1582 | "metadata": {
1583 | "id": "aiVIhTtV8Ump"
1584 | },
1585 | "outputs": [],
1586 | "source": [
1587 | "reverse_target_word_index=y_tknizer.index_word\n",
1588 | "reverse_source_word_index=x_tknizer.index_word\n",
1589 | "target_word_index=y_tknizer.word_index"
1590 | ]
1591 | },
1592 | {
1593 | "cell_type": "code",
1594 | "execution_count": null,
1595 | "metadata": {
1596 | "id": "0rM-GrwH8Umq"
1597 | },
1598 | "outputs": [],
1599 | "source": [
1600 | "# Encoder inference model: maps the input sequence to the encoder outputs and final states\n",
1601 | "encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])\n",
1602 | "\n",
1603 | "decoder_state_input_h = Input(shape=(latent_dim,))\n",
1604 | "decoder_state_input_c = Input(shape=(latent_dim,))\n",
1605 | "decoder_hidden_state_input = Input(shape=(max_text_len,latent_dim))\n",
1606 | "\n",
1607 | "dec_emb2= dec_emb_layer(decoder_inputs) \n",
1608 | "\n",
1609 | "# Run the decoder LSTM starting from the states of the previous time step\n",
1610 | "decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])\n",
1611 | "\n",
1612 | "# softmax for probability\n",
1613 | "decoder_outputs2 = decoder_dense(decoder_outputs2) \n",
1614 | "\n",
1615 | "decoder_model = Model(\n",
1616 | " [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],\n",
1617 | " [decoder_outputs2] + [state_h2, state_c2])"
1618 | ]
1619 | },
1620 | {
1621 | "cell_type": "code",
1622 | "execution_count": null,
1623 | "metadata": {
1624 | "id": "25q9dnBe8Umr"
1625 | },
1626 | "outputs": [],
1627 | "source": [
1628 | "def decode_sequence(input_seq):\n",
1629 | "    # Encode the input as state vectors.\n",
1630 | "    e_out, e_h, e_c = encoder_model.predict(input_seq)\n",
1631 | "\n",
1632 | "    # Generate empty target sequence of length 1.\n",
1633 | "    target_seq = np.zeros((1,1))\n",
1634 | "\n",
1635 | "    # Populate the first word of target sequence with the start word.\n",
1636 | "    target_seq[0, 0] = target_word_index['sostok']\n",
1637 | "\n",
1638 | "    stop_condition = False\n",
1639 | "    decoded_sentence = ''\n",
1640 | "    while not stop_condition:\n",
1641 | "\n",
1642 | "        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])\n",
1643 | "\n",
1644 | "        # Pick the most probable token (greedy decoding)\n",
1645 | "        sampled_token_index = np.argmax(output_tokens[0, -1, :])\n",
1646 | "        sampled_token = reverse_target_word_index[sampled_token_index]\n",
1647 | "\n",
1648 | "        if(sampled_token!='eostok'):\n",
1649 | "            decoded_sentence += ' '+sampled_token\n",
1650 | "\n",
1651 | "        # Exit condition: either hit max length or find stop word.\n",
1652 | "        if (sampled_token == 'eostok' or len(decoded_sentence.split()) >= (max_summary_len-1)):\n",
1653 | "            stop_condition = True\n",
1654 | "\n",
1655 | "        # Update the target sequence (of length 1).\n",
1656 | "        target_seq = np.zeros((1,1))\n",
1657 | "        target_seq[0, 0] = sampled_token_index\n",
1658 | "\n",
1659 | "        # Update internal states\n",
1660 | "        e_h, e_c = h, c\n",
1661 | "\n",
1662 | "    return decoded_sentence"
1663 | ]
1664 | },
1665 | {
1666 | "cell_type": "code",
1667 | "execution_count": null,
1668 | "metadata": {
1669 | "id": "pLsShB9l8Umr"
1670 | },
1671 | "outputs": [],
1672 | "source": [
1673 | "def sequence_to_summary(input_seq):\n",
1674 | "    newString=''\n",
1675 | "    for i in input_seq:\n",
1676 | "        if((i!=0 and i!=target_word_index['sostok']) and i!=target_word_index['eostok']):\n",
1677 | "            newString=newString+reverse_target_word_index[i]+' '\n",
1678 | "    return newString\n",
1679 | "\n",
1680 | "def sequence_to_text(input_seq):\n",
1681 | "    newString=''\n",
1682 | "    for i in input_seq:\n",
1683 | "        if(i!=0):\n",
1684 | "            newString=newString+reverse_source_word_index[i]+' '\n",
1685 | "    return newString"
1686 | ]
1687 | },
1688 | {
1689 | "cell_type": "markdown",
1690 | "metadata": {
1691 | "id": "pfIDk-588Ums"
1692 | },
1693 | "source": [
1694 | "### Predict and print summaries from the model"
1695 | ]
1696 | },
1697 | {
1698 | "cell_type": "code",
1699 | "execution_count": null,
1700 | "metadata": {
1701 | "colab": {
1702 | "base_uri": "https://localhost:8080/"
1703 | },
1704 | "id": "Pp5cNJsz8Ums",
1705 | "outputId": "48649577-0e95-4126-b773-52d44631012c"
1706 | },
1707 | "outputs": [
1708 | {
1709 | "name": "stdout",
1710 | "output_type": "stream",
1711 | "text": [
1712 | "Review: pope francis on tuesday called for respect for each ethnic group in speech delivered in myanmar avoiding reference to the rohingya minority community as the nation works to restore peace the healing of wounds must be priority he said the pope myanmar visit comes amid the country military crackdown resulting in the rohingya refugee crisis \n",
1713 | "Original summary: pope avoids mention of rohingyas in key myanmar speech \n",
1714 | "Predicted summary: pope calls for rohingya muslims in myanmar \n",
1715 | "\n",
1716 | "\n",
1717 | "Review: students of government school in uttar pradesh sambhal were seen washing dishes at in school premises on being approached basic shiksha adhikari virendra pratap singh said yes have also received this complaint from elsewhere we are inquiring and action will be taken against those found guilty \n",
1718 | "Original summary: students seen washing dishes at govt school in up \n",
1719 | "Predicted summary: up school students protest up school in up \n",
1720 | "\n",
1721 | "\n",
1722 | "Review: apple india profit surged by 140 in 2017 18 to crore compared to ₹373 crore in the previous fiscal the indian unit of the us based company posted 12 growth in revenue last fiscal at ₹13 crore apple share of the indian smartphone market dropped to 1 in the second quarter of 2018 according to counterpoint research \n",
1723 | "Original summary: apple india profit rises 140 to nearly ₹900 crore in fy18 \n",
1724 | "Predicted summary: apple profit rises 20 to ₹3 crore in march quarter \n",
1725 | "\n",
1726 | "\n",
1727 | "Review: uber has launched its electric scooter service in santa monica us at 1 to unlock and then 15 cents per minute to ride it comes after uber acquired the bike sharing startup jump for reported amount of 200 million uber said it is branding the scooters with jump for the sake of consistency for its other personal electric vehicle services \n",
1728 | "Original summary: uber launches electric scooter service in us at 1 per ride \n",
1729 | "Predicted summary: uber launches self driving car service in us city \n",
1730 | "\n",
1731 | "\n",
1732 | "Review: around 80 people were injured in accidents related to kite flying during celebrations of makar sankranti in rajasthan jaipur officials said the victims included those who fell while flying kites and those injured by glass coated kite string officials added meanwhile around 100 birds were reported to be injured by between january 13 and 15 \n",
1733 | "Original summary: 80 people injured in flying related accidents in jaipur \n",
1734 | "Predicted summary: people killed in drone in kolkata \n",
1735 | "\n",
1736 | "\n",
1737 | "Review: uk entrepreneur richard browning has announced the launch of his startup gravity which has created flight jet powered suit that will be priced at about ₹1 3 crore the suit has custom built exoskeleton with six attached micro jet engines fuelled by kerosene from backpack browning claims the can travel at speed of up to 450 kmph \n",
1738 | "Original summary: up makes ₹1 3 crore jet powered flying suit \n",
1739 | "Predicted summary: up makes us airport that can be flying plane \n",
1740 | "\n",
1741 | "\n",
1742 | "Review: andhra pradesh chief minister chandrababu naidu on monday announced that his government will provide 100 units free power to most backward classes he added that the government would also give aid of up to ₹15 lakh to backward classes for foreign education we will spread out the poverty eradication program under pro basis he further said n \n",
1743 | "Original summary: most backward classes to get 100 units free power andhra cm \n",
1744 | "Predicted summary: will free 100 free education for andhra cm \n",
1745 | "\n",
1746 | "\n",
1747 | "Review: taking dig at pm modi congress president rahul gandhi tweeted while our pm around his garden making yoga videos india leads afghanistan syria in rape violence against women this comes after thomson reuters foundation survey declared india as world most dangerous country for women pm modi shared video of himself doing yoga and other exercises last week \n",
1748 | "Original summary: pm modi makes yoga videos while india leads in rape rahul \n",
1749 | "Predicted summary: pm modi is the country of india rahul gandhi \n",
1750 | "\n",
1751 | "\n",
1752 | "Review: external affairs minister sushma swaraj on saturday called upon the united nations to pass the comprehensive convention on international terrorism to end pakistan sponsored terrorism proposed by india in 1996 aims to arrive at universal definition of terrorism ban all terror groups prosecute terrorists under special laws and make cross border terrorism an offence \n",
1753 | "Original summary: india calls on un to pass global anti terror convention \n",
1754 | "Predicted summary: india calls for pak to un terror attacks on un \n",
1755 | "\n",
1756 | "\n",
1757 | "Review: the 23 richest indians in the 500 member bloomberg billionaires index saw wealth erosion of 21 billion this year lakshmi mittal who controls the world largest steelmaker arcelormittal lost 5 6 billion or 29 of his net worth followed by sun pharma founder dilip shanghvi whose wealth declined 4 6 billion asia richest person mukesh ambani added 4 billion to his fortune \n",
1758 | "Original summary: lakshmi mittal lost 10 bn in 2018 ambani added 4 bn \n",
1759 | "Predicted summary: which are the richest indians in the world richest person \n",
1760 | "\n",
1761 | "\n"
1762 | ]
1763 | }
1764 | ],
1765 | "source": [
1766 | "for i in range(0,10):  # note: str.replace below also strips 'start'/'end' inside words, e.g. 'startup' -> 'up'\n",
1767 | "    print(\"Review:\",sequence_to_text(x_tr[i]))\n",
1768 | "    print(\"Original summary:\",(sequence_to_summary(y_tr[i])).replace('start', '').replace('end', ''))\n",
1769 | "    print(\"Predicted summary:\",(decode_sequence(x_tr[i].reshape(1,max_text_len))).replace('start', '').replace('end', ''))\n",
1770 | "    print(\"\\n\")"
1771 | ]
1772 | },
1773 | {
1774 | "cell_type": "markdown",
1775 | "metadata": {
1776 | "id": "DCcSUqUeJ0wN"
1777 | },
1778 | "source": [
1779 | "# Calculating scores"
1780 | ]
1781 | },
1782 | {
1783 | "cell_type": "markdown",
1784 | "metadata": {
1785 | "id": "LmqPmRIpJ46F"
1786 | },
1787 | "source": [
1788 | "### Function to calculate BLEU, GLEU and METEOR scores"
1789 | ]
1790 | },
1791 | {
1792 | "cell_type": "code",
1793 | "execution_count": null,
1794 | "metadata": {
1795 | "colab": {
1796 | "base_uri": "https://localhost:8080/"
1797 | },
1798 | "id": "u0xFJjdRtRnq",
1799 | "outputId": "8cda6766-5cb9-4f4b-c812-5b456eee6da4"
1800 | },
1801 | "outputs": [
1802 | {
1803 | "name": "stderr",
1804 | "output_type": "stream",
1805 | "text": [
1806 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
1807 | "[nltk_data] Package wordnet is already up-to-date!\n"
1808 | ]
1809 | },
1810 | {
1811 | "data": {
1812 | "text/plain": [
1813 | "True"
1814 | ]
1815 | },
1816 | "execution_count": 47,
1817 | "metadata": {
1818 | "tags": []
1819 | },
1820 | "output_type": "execute_result"
1821 | }
1822 | ],
1823 | "source": [
1824 | "nltk.download('wordnet')"
1825 | ]
1826 | },
1827 | {
1828 | "cell_type": "code",
1829 | "execution_count": null,
1830 | "metadata": {
1831 | "colab": {
1832 | "base_uri": "https://localhost:8080/"
1833 | },
1834 | "id": "9EE2PHb2u9By",
1835 | "outputId": "18199731-2700-4904-c894-616c20d5c4e4"
1836 | },
1837 | "outputs": [
1838 | {
1839 | "name": "stderr",
1840 | "output_type": "stream",
1841 | "text": [
1842 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
1843 | "[nltk_data] Unzipping corpora/wordnet.zip.\n"
1844 | ]
1845 | }
1846 | ],
1847 | "source": [
1848 | "from nltk.translate.bleu_score import sentence_bleu\n",
1849 | "from nltk.translate.gleu_score import sentence_gleu\n",
1850 | "from nltk.translate.meteor_score import meteor_score\n",
1851 | "\n",
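1852 | "def calculate_scores(N=100):  # average scores over the first N training examples\n",
1853 | "    bscore=0;gscore=0;mscore=0\n",
1854 | "    for i in range(N):\n",
1855 | "        ref=sequence_to_summary(y_tr[i])\n",
1856 | "        hypo=decode_sequence(x_tr[i].reshape(1,max_text_len))\n",
1857 | "        bscore+=sentence_bleu([ref],hypo)  # ref and hypo are strings, so n-grams are counted over characters\n",
1858 | "        gscore+=sentence_gleu([ref],hypo)  # same here; pass ref.split() and hypo.split() for word-level scores\n",
1859 | "        mscore+=meteor_score([ref],hypo)  # recent NLTK releases require pre-tokenized input here\n",
1860 | "    print(\"BLEU:%.4f GLEU:%.4f METEOR:%.4f\"%(bscore/N,gscore/N,mscore/N))"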
1861 | ]
1862 | },
1863 | {
1864 | "cell_type": "markdown",
1865 | "metadata": {
1866 | "id": "CkUJd9lGJuui"
1867 | },
1868 | "source": [
1869 | "### Calculating score metrics for the first 100 training examples"
1870 | ]
1871 | },
1872 | {
1873 | "cell_type": "code",
1874 | "execution_count": null,
1875 | "metadata": {
1876 | "colab": {
1877 | "base_uri": "https://localhost:8080/"
1878 | },
1879 | "id": "a51X9S1IACFw",
1880 | "outputId": "6b69c34d-6eb9-45dc-d296-5a6e545e1371"
1881 | },
1882 | "outputs": [
1883 | {
1884 | "name": "stdout",
1885 | "output_type": "stream",
1886 | "text": [
1887 | "BLEU:0.4279 GLEU:0.4700 METEOR:0.3422\n"
1888 | ]
1889 | }
1890 | ],
1891 | "source": [
1892 | "calculate_scores()"
1893 | ]
1894 | }
1895 | ],
1896 | "metadata": {
1897 | "accelerator": "GPU",
1898 | "colab": {
1899 | "collapsed_sections": [],
1900 | "name": "CSE5368-Spring21-Project.ipynb",
1901 | "provenance": []
1902 | },
1903 | "kernelspec": {
1904 | "display_name": "Python 3",
1905 | "language": "python",
1906 | "name": "python3"
1907 | },
1908 | "language_info": {
1909 | "codemirror_mode": {
1910 | "name": "ipython",
1911 | "version": 3
1912 | },
1913 | "file_extension": ".py",
1914 | "mimetype": "text/x-python",
1915 | "name": "python",
1916 | "nbconvert_exporter": "python",
1917 | "pygments_lexer": "ipython3",
1918 | "version": "3.9.1"
1919 | }
1920 | },
1921 | "nbformat": 4,
1922 | "nbformat_minor": 1
1923 | }
1924 |
--------------------------------------------------------------------------------
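Note on the scoring and printing cells in Text-summarizer.ipynb: the notebook passes whole strings to the n-gram metrics and removes the 'start'/'end' markers with str.replace. The sketch below is not NLTK's BLEU (it has no brevity penalty and no averaging over n-gram orders); it is a minimal, dependency-free illustration of why tokenizing first matters and how markers can be dropped as whole words. The helper names `ngrams`, `clipped_precision` and `strip_markers` are hypothetical and do not appear in the notebook; the two example sentences are taken from the notebook's printed output.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n=1):
    """Modified n-gram precision: each hypothesis n-gram counts at most
    as often as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def strip_markers(text, markers=("sostok", "eostok", "start", "end")):
    """Drop marker tokens as whole words only, so 'startup' stays intact."""
    return " ".join(word for word in text.split() if word not in markers)


ref = "pope avoids mention of rohingyas in key myanmar speech"
hyp = "pope calls for rohingya muslims in myanmar"

# Word-level: the sequences are lists of words.
word_p1 = clipped_precision(ref.split(), hyp.split(), n=1)

# Character-level: iterating over a plain string yields single characters,
# which is what happens when untokenized strings are scored.
char_p1 = clipped_precision(ref, hyp, n=1)

print("word-level unigram precision:", round(word_p1, 4))
print("char-level unigram precision:", round(char_p1, 4))
print(strip_markers("sostok start startup makes jet powered flying suit end eostok"))
```

On this pair the character-level precision comes out higher than the word-level one, because most letters of the hypothesis also occur somewhere in the reference. Scores computed on untokenized strings are therefore not comparable to word-level BLEU or GLEU figures reported elsewhere.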