├── Abstract Text Summarization .ipynb ├── Attention_Based_seq2se1_model.ipynb ├── README.md ├── project.pdf ├── suvidha_foundation_project.pdf .png └── suvidha_foundation_project.png /Abstract Text Summarization .ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "bd2fa2bf", 6 | "metadata": {}, 7 | "source": [ 8 | "## Abstract Text Summarization:" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "1c7af25d", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import numpy as np" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "id": "9819a86b", 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "
\n", 32 | "\n", 45 | "\n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | "
        id  article  highlights
0  0001d1afc246a7964130f43ae940af6bc6c57f01  By . Associated Press . PUBLISHED: . 14:11 EST...  Bishop John Folda, of North Dakota, is taking ...
1  0002095e55fcbd3a2f366d9bf92a95433dc305ef  (CNN) -- Ralph Mata was an internal affairs li...  Criminal complaint: Cop used his role to help ...
2  00027e965c8264c35cc1bc55556db388da82b07f  A drunk driver who killed a young woman in a h...  Craig Eccleston-Todd, 27, had drunk at least t...
3  0002c17436637c4fe1837c935c04de47adb18e9a  (CNN) -- With a breezy sweep of his pen Presid...  Nina dos Santos says Europe must be ready to a...
4  0003ad6ef0c37534f80b55b4235108024b407f0b  Fleetwood are the only team still to have a 10...  Fleetwood top of League One after 2-0 win at S...
...  ...  ...  ...
287108  fffdfb56fdf1a12d364562cc2b9b1d4de7481dee  By . James Rush . Former first daughter Chelse...  Chelsea Clinton said question of running for o...
287109  fffeecb8690b85de8c3faed80adbc7a978f9ae2a  An apologetic Vanilla Ice has given his first ...  Vanilla Ice, 47 - real name Robert Van Winkle ...
287110  ffff5231e4c71544bc6c97015cdb16c60e42b3f4  America's most lethal sniper claimed he wished...  America's most lethal sniper made comment in i...
287111  ffff924b14a8d82058b6c1c5368ff1113c1632af  By . Sara Malm . PUBLISHED: . 12:19 EST, 8 Mar...  A swarm of more than one million has crossed b...
287112  ffffd563a96104f5cf4493cfa701a65f31b06abf  (CNN)Former Florida Gov. Jeb Bush has decided ...  Other 2016 hopefuls maintain that Bush's annou...
\n", 123 | "

287113 rows × 3 columns

\n", 124 | "
" 125 | ], 126 | "text/plain": [ 127 | " id \\\n", 128 | "0 0001d1afc246a7964130f43ae940af6bc6c57f01 \n", 129 | "1 0002095e55fcbd3a2f366d9bf92a95433dc305ef \n", 130 | "2 00027e965c8264c35cc1bc55556db388da82b07f \n", 131 | "3 0002c17436637c4fe1837c935c04de47adb18e9a \n", 132 | "4 0003ad6ef0c37534f80b55b4235108024b407f0b \n", 133 | "... ... \n", 134 | "287108 fffdfb56fdf1a12d364562cc2b9b1d4de7481dee \n", 135 | "287109 fffeecb8690b85de8c3faed80adbc7a978f9ae2a \n", 136 | "287110 ffff5231e4c71544bc6c97015cdb16c60e42b3f4 \n", 137 | "287111 ffff924b14a8d82058b6c1c5368ff1113c1632af \n", 138 | "287112 ffffd563a96104f5cf4493cfa701a65f31b06abf \n", 139 | "\n", 140 | " article \\\n", 141 | "0 By . Associated Press . PUBLISHED: . 14:11 EST... \n", 142 | "1 (CNN) -- Ralph Mata was an internal affairs li... \n", 143 | "2 A drunk driver who killed a young woman in a h... \n", 144 | "3 (CNN) -- With a breezy sweep of his pen Presid... \n", 145 | "4 Fleetwood are the only team still to have a 10... \n", 146 | "... ... \n", 147 | "287108 By . James Rush . Former first daughter Chelse... \n", 148 | "287109 An apologetic Vanilla Ice has given his first ... \n", 149 | "287110 America's most lethal sniper claimed he wished... \n", 150 | "287111 By . Sara Malm . PUBLISHED: . 12:19 EST, 8 Mar... \n", 151 | "287112 (CNN)Former Florida Gov. Jeb Bush has decided ... \n", 152 | "\n", 153 | " highlights \n", 154 | "0 Bishop John Folda, of North Dakota, is taking ... \n", 155 | "1 Criminal complaint: Cop used his role to help ... \n", 156 | "2 Craig Eccleston-Todd, 27, had drunk at least t... \n", 157 | "3 Nina dos Santos says Europe must be ready to a... \n", 158 | "4 Fleetwood top of League One after 2-0 win at S... \n", 159 | "... ... \n", 160 | "287108 Chelsea Clinton said question of running for o... \n", 161 | "287109 Vanilla Ice, 47 - real name Robert Van Winkle ... \n", 162 | "287110 America's most lethal sniper made comment in i... 
\n", 163 | "287111 A swarm of more than one million has crossed b... \n", 164 | "287112 Other 2016 hopefuls maintain that Bush's annou... \n", 165 | "\n", 166 | "[287113 rows x 3 columns]" 167 | ] 168 | }, 169 | "execution_count": 2, 170 | "metadata": {}, 171 | "output_type": "execute_result" 172 | } 173 | ], 174 | "source": [ 175 | "df_mail = pd.read_csv(\"train.csv\")\n", 176 | "df_mail" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 3, 182 | "id": "dbf54205", 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "Index(['id', 'article', 'highlights'], dtype='object')" 189 | ] 190 | }, 191 | "execution_count": 3, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "df_mail.columns" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 4, 203 | "id": "50eb34fa", 204 | "metadata": {}, 205 | "outputs": [ 206 | { 207 | "data": { 208 | "text/html": [ 209 | "
\n", 210 | "\n", 223 | "\n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | "
        id  article  highlights
count  287113  287113  287113
unique  287113  284005  282197
top  0001d1afc246a7964130f43ae940af6bc6c57f01  (CNN) -- Dubai could lose its place on the Wom...  This page includes the show Transcript and the...
freq  1  3  83
\n", 259 | "
" 260 | ], 261 | "text/plain": [ 262 | " id \\\n", 263 | "count 287113 \n", 264 | "unique 287113 \n", 265 | "top 0001d1afc246a7964130f43ae940af6bc6c57f01 \n", 266 | "freq 1 \n", 267 | "\n", 268 | " article \\\n", 269 | "count 287113 \n", 270 | "unique 284005 \n", 271 | "top (CNN) -- Dubai could lose its place on the Wom... \n", 272 | "freq 3 \n", 273 | "\n", 274 | " highlights \n", 275 | "count 287113 \n", 276 | "unique 282197 \n", 277 | "top This page includes the show Transcript and the... \n", 278 | "freq 83 " 279 | ] 280 | }, 281 | "execution_count": 4, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "df_mail.describe()" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 5, 293 | "id": "54feff46", 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "data": { 298 | "text/plain": [ 299 | "id 0\n", 300 | "article 0\n", 301 | "highlights 0\n", 302 | "dtype: int64" 303 | ] 304 | }, 305 | "execution_count": 5, 306 | "metadata": {}, 307 | "output_type": "execute_result" 308 | } 309 | ], 310 | "source": [ 311 | "df_mail.isnull().sum()" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "0215dd40", 317 | "metadata": {}, 318 | "source": [ 319 | "## Phase 1: Preprocessing." 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 6, 325 | "id": "91eea5f5", 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "data": { 330 | "text/html": [ 331 | "
\n", 332 | "\n", 345 | "\n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | "
        id  article  highlights
0  0001d1afc246a7964130f43ae940af6bc6c57f01  By . Associated Press . PUBLISHED: . 14:11 EST...  Bishop John Folda, of North Dakota, is taking ...
1  0002095e55fcbd3a2f366d9bf92a95433dc305ef  (CNN) -- Ralph Mata was an internal affairs li...  Criminal complaint: Cop used his role to help ...
2  00027e965c8264c35cc1bc55556db388da82b07f  A drunk driver who killed a young woman in a h...  Craig Eccleston-Todd, 27, had drunk at least t...
3  0002c17436637c4fe1837c935c04de47adb18e9a  (CNN) -- With a breezy sweep of his pen Presid...  Nina dos Santos says Europe must be ready to a...
4  0003ad6ef0c37534f80b55b4235108024b407f0b  Fleetwood are the only team still to have a 10...  Fleetwood top of League One after 2-0 win at S...
...  ...  ...  ...
287108  fffdfb56fdf1a12d364562cc2b9b1d4de7481dee  By . James Rush . Former first daughter Chelse...  Chelsea Clinton said question of running for o...
287109  fffeecb8690b85de8c3faed80adbc7a978f9ae2a  An apologetic Vanilla Ice has given his first ...  Vanilla Ice, 47 - real name Robert Van Winkle ...
287110  ffff5231e4c71544bc6c97015cdb16c60e42b3f4  America's most lethal sniper claimed he wished...  America's most lethal sniper made comment in i...
287111  ffff924b14a8d82058b6c1c5368ff1113c1632af  By . Sara Malm . PUBLISHED: . 12:19 EST, 8 Mar...  A swarm of more than one million has crossed b...
287112  ffffd563a96104f5cf4493cfa701a65f31b06abf  (CNN)Former Florida Gov. Jeb Bush has decided ...  Other 2016 hopefuls maintain that Bush's annou...
\n", 423 | "

287113 rows × 3 columns

\n", 424 | "
" 425 | ], 426 | "text/plain": [ 427 | " id \\\n", 428 | "0 0001d1afc246a7964130f43ae940af6bc6c57f01 \n", 429 | "1 0002095e55fcbd3a2f366d9bf92a95433dc305ef \n", 430 | "2 00027e965c8264c35cc1bc55556db388da82b07f \n", 431 | "3 0002c17436637c4fe1837c935c04de47adb18e9a \n", 432 | "4 0003ad6ef0c37534f80b55b4235108024b407f0b \n", 433 | "... ... \n", 434 | "287108 fffdfb56fdf1a12d364562cc2b9b1d4de7481dee \n", 435 | "287109 fffeecb8690b85de8c3faed80adbc7a978f9ae2a \n", 436 | "287110 ffff5231e4c71544bc6c97015cdb16c60e42b3f4 \n", 437 | "287111 ffff924b14a8d82058b6c1c5368ff1113c1632af \n", 438 | "287112 ffffd563a96104f5cf4493cfa701a65f31b06abf \n", 439 | "\n", 440 | " article \\\n", 441 | "0 By . Associated Press . PUBLISHED: . 14:11 EST... \n", 442 | "1 (CNN) -- Ralph Mata was an internal affairs li... \n", 443 | "2 A drunk driver who killed a young woman in a h... \n", 444 | "3 (CNN) -- With a breezy sweep of his pen Presid... \n", 445 | "4 Fleetwood are the only team still to have a 10... \n", 446 | "... ... \n", 447 | "287108 By . James Rush . Former first daughter Chelse... \n", 448 | "287109 An apologetic Vanilla Ice has given his first ... \n", 449 | "287110 America's most lethal sniper claimed he wished... \n", 450 | "287111 By . Sara Malm . PUBLISHED: . 12:19 EST, 8 Mar... \n", 451 | "287112 (CNN)Former Florida Gov. Jeb Bush has decided ... \n", 452 | "\n", 453 | " highlights \n", 454 | "0 Bishop John Folda, of North Dakota, is taking ... \n", 455 | "1 Criminal complaint: Cop used his role to help ... \n", 456 | "2 Craig Eccleston-Todd, 27, had drunk at least t... \n", 457 | "3 Nina dos Santos says Europe must be ready to a... \n", 458 | "4 Fleetwood top of League One after 2-0 win at S... \n", 459 | "... ... \n", 460 | "287108 Chelsea Clinton said question of running for o... \n", 461 | "287109 Vanilla Ice, 47 - real name Robert Van Winkle ... \n", 462 | "287110 America's most lethal sniper made comment in i... 
\n", 463 | "287111 A swarm of more than one million has crossed b... \n", 464 | "287112 Other 2016 hopefuls maintain that Bush's annou... \n", 465 | "\n", 466 | "[287113 rows x 3 columns]" 467 | ] 468 | }, 469 | "execution_count": 6, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "df_mail_process = df_mail\n", 476 | "df_mail_process" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "id": "805edc68", 482 | "metadata": {}, 483 | "source": [ 484 | "## Import WSD (Word Sense Disambiguation) libraries" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 7, 490 | "id": "ab239eaa", 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | "%%capture\n", 495 | "import nltk\n", 496 | "from nltk.wsd import lesk\n", 497 | "from nltk.tokenize import word_tokenize as wtk\n" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": null, 503 | "id": "d92ddf48", 504 | "metadata": {}, 505 | "outputs": [], 506 | "source": [ 507 | "from nltk." 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 14, 513 | "id": "5c9149aa", 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "name": "stdout", 518 | "output_type": "stream", 519 | "text": [ 520 | "['I', 'love', 'reading', 'books', 'on', 'coding', '.'] ['The', 'table', 'was', 'already', 'booked', 'by', 'someone', 'else', '.']\n", 521 | "temp1: a number of sheets (ticket or stamps etc.) 
bound together on one edge \n", 522 | " temp2: arrange for and reserve (something for someone else) in advance\n" 523 | ] 524 | } 525 | ], 526 | "source": [ 527 | "keyword = 'book'\n", 528 | "seq1 = 'I love reading books on coding.'\n", 529 | "seq2 = 'The table was already booked by someone else.'\n", 530 | "\n", 531 | "temp1 = wtk(seq1)\n", 532 | "temp2 = wtk(seq2)\n", 533 | "\n", 534 | "print(temp1,temp2)\n", 535 | "\n", 536 | "temp1= lesk(temp1, keyword)\n", 537 | "temp2= lesk(temp2, keyword)\n", 538 | "\n", 539 | "print(\"temp1: \", temp1.definition(),\"\\n temp2: \",temp2.definition())" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 16, 545 | "id": "6bea0109", 546 | "metadata": {}, 547 | "outputs": [ 548 | { 549 | "data": { 550 | "text/plain": [ 551 | "'(CNN) -- Ralph Mata was an internal affairs lieutenant for the Miami-Dade Police Department, working in the division that investigates allegations of wrongdoing by cops. Outside the office, authorities allege that the 45-year-old longtime officer worked with a drug trafficking organization to help plan a murder plot and get guns. A criminal complaint unsealed in U.S. District Court in New Jersey Tuesday accuses Mata, also known as \"The Milk Man,\" of using his role as a police officer to help the drug trafficking organization in exchange for money and gifts, including a Rolex watch. In one instance, the complaint alleges, Mata arranged to pay two assassins to kill rival drug dealers. The killers would pose as cops, pulling over their targets before shooting them, according to the complaint. \"Ultimately, the (organization) decided not to move forward with the murder plot, but Mata still received a payment for setting up the meetings,\" federal prosecutors said in a statement. The complaint also alleges that Mata used his police badge to purchase weapons for drug traffickers. 
Mata, according to the complaint, then used contacts at the airport to transport the weapons in his carry-on luggage on trips from Miami to the Dominican Republic. Court documents released by investigators do not specify the name of the drug trafficking organization with which Mata allegedly conspired but says the organization has been importing narcotics from places such as Ecuador and the Dominican Republic by hiding them \"inside shipping containers containing pallets of produce, including bananas.\" The organization \"has been distributing narcotics in New Jersey and elsewhere,\" the complaint says. Authorities arrested Mata on Tuesday in Miami Gardens, Florida. It was not immediately clear whether Mata has an attorney, and police officials could not be immediately reached for comment. Mata has worked for the Miami-Dade Police Department since 1992, including directing investigations in Miami Gardens and working as a lieutenant in the K-9 unit at Miami International Airport, according to the complaint. Since March 2010, he had been working in the internal affairs division. Mata faces charges of aiding and abetting a conspiracy to distribute cocaine, conspiring to distribute cocaine and engaging in monetary transactions in property derived from specified unlawful activity. He is scheduled to appear in federal court in Florida on Wednesday. If convicted, Mata could face life in prison. 
CNN\\'s Suzanne Presto contributed to this report.'" 552 | ] 553 | }, 554 | "execution_count": 16, 555 | "metadata": {}, 556 | "output_type": "execute_result" 557 | } 558 | ], 559 | "source": [ 560 | "df_mail_process['article'][1]" 561 | ] 562 | }, 563 | { 564 | "cell_type": "code", 565 | "execution_count": 18, 566 | "id": "e4cbd9b7", 567 | "metadata": {}, 568 | "outputs": [ 569 | { 570 | "data": { 571 | "text/plain": [ 572 | "'Criminal complaint: Cop used his role to help cocaine traffickers .\\nRalph Mata, an internal affairs lieutenant, allegedly helped group get guns .\\nHe also arranged to pay two assassins in a murder plot, a complaint alleges .'" 573 | ] 574 | }, 575 | "execution_count": 18, 576 | "metadata": {}, 577 | "output_type": "execute_result" 578 | } 579 | ], 580 | "source": [ 581 | "df_mail_process['highlights'][1]" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 28, 587 | "id": "11d4d78d", 588 | "metadata": {}, 589 | "outputs": [ 590 | { 591 | "ename": "TypeError", 592 | "evalue": "lesk() missing 1 required positional argument: 'ambiguous_word'", 593 | "output_type": "error", 594 | "traceback": [ 595 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 596 | "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", 597 | "Input \u001b[1;32mIn [28]\u001b[0m, in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m temp \u001b[38;5;241m=\u001b[39m wtk(df_mail_process[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124marticle\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;241m1\u001b[39m])\n\u001b[1;32m----> 3\u001b[0m temp \u001b[38;5;241m=\u001b[39m \u001b[43mlesk\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtemp\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 4\u001b[0m temp\u001b[38;5;241m.\u001b[39mdefinition()\n", 598 | "\u001b[1;31mTypeError\u001b[0m: lesk() missing 1 required positional argument: 'ambiguous_word'" 599 | ] 600 | } 601 | ], 602 | "source": [ 603 | 
"temp = wtk(df_mail_process['article'][1])\n", 604 | "\n", 605 | "temp = lesk(temp)\n", 606 | "temp.definition()" 607 | ] 608 | } 609 | ], 610 | "metadata": { 611 | "kernelspec": { 612 | "display_name": "Python 3 (ipykernel)", 613 | "language": "python", 614 | "name": "python3" 615 | }, 616 | "language_info": { 617 | "codemirror_mode": { 618 | "name": "ipython", 619 | "version": 3 620 | }, 621 | "file_extension": ".py", 622 | "mimetype": "text/x-python", 623 | "name": "python", 624 | "nbconvert_exporter": "python", 625 | "pygments_lexer": "ipython3", 626 | "version": "3.9.12" 627 | } 628 | }, 629 | "nbformat": 4, 630 | "nbformat_minor": 5 631 | } 632 | -------------------------------------------------------------------------------- /Attention_Based_seq2se1_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "view-in-github", 7 | "colab_type": "text" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "zSX7qCPx7_sx" 17 | }, 18 | "source": [ 19 | "# **deep learning model**\n", 20 | "# ***Attention- based Seq2Seq Model***" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "id": "K2kzzVhv798W" 27 | }, 28 | "source": [ 29 | "# bidirectional LSTM encoder :phase 1" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "id": "iHVUIrKpLb74" 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "import pandas as pd\n", 41 | "import numpy as np\n", 42 | "import tensorflow as tf\n", 43 | "from gensim.models import Word2Vec\n", 44 | "from keras.preprocessing.text import Tokenizer\n", 45 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", 46 | "from keras.layers import Embedding, LSTM, Dense, Bidirectional\n", 47 | "from keras.models import Sequential" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 
52 | "execution_count": null, 53 | "metadata": { 54 | "id": "O1fqausTUEgS" 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "from keras.layers import *\n", 59 | "from keras.models import *\n", 60 | "from keras import backend as K" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "id": "wVEC5-LN-LcZ" 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "df_train = pd.read_csv(\"/content/drive/MyDrive/ATS- gigadata/model_gigadata_new.csv\")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": { 78 | "id": "dxx3V4PKHsqP" 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "i=0\n", 83 | "tokens ,art_summer=[] , []\n", 84 | "\n", 85 | "while i<1000000:\n", 86 | " art_summer.append(df_train['article'][i])\n", 87 | " art_summer.append(df_train['summery'][i])\n", 88 | " tokens.append(art_summer)\n", 89 | " art_summer = []\n", 90 | " i+=1" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": { 97 | "id": "43IOaAcqH5dr" 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "embed_model = Word2Vec(tokens, size = 100, window = 5 , min_count = 1)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": { 108 | "id": "-gECaaYdptlL" 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "embed_model.save(\"w2v.model\")" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "id": "qrgGvyNVMiFD" 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "embedding_weights = embed_model.wv.vectors" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "colab": { 131 | "base_uri": "https://localhost:8080/" 132 | }, 133 | "id": "9zhyhs9oZJvl", 134 | "outputId": "1745bd34-4933-4d26-9803-d95c45947db0" 135 | }, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "(1498136, 100)" 141 | ] 142 | 
}, 143 | "execution_count": 8, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "embedding_weights.shape" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": { 156 | "colab": { 157 | "background_save": true 158 | }, 159 | "id": "Jm6K3keWShtb" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "x , y = [],[]\n", 164 | "i=0\n", 165 | "while i person': 45,\n", 256 | " 'organization person': 46,\n", 257 | " 'gpe organization close # . # # person': 47,\n", 258 | " \"gpe organization 's organization\": 48,\n", 259 | " 'person for person': 49,\n", 260 | " 'person organization after person': 50,\n", 261 | " 'person': 51,\n", 262 | " \"gpe organization 's person\": 52,\n", 263 | " \"person person 's organization\": 53,\n", 264 | " 'gpe person # . # # percent in morning trad': 54,\n", 265 | " 'gpe person over person': 55,\n", 266 | " 'person person on organization': 56,\n", 267 | " \"person organization 's organization\": 57,\n", 268 | " 'gpe organization # . 
# percent in gpe': 58,\n", 269 | " 'gpe person as organization': 59,\n", 270 | " \"person person 's person\": 60,\n", 271 | " 'facility person': 61,\n", 272 | " 'gpe organization on gpe': 62,\n", 273 | " 'gsp organization': 63,\n", 274 | " 'gpe organization in gpe trad': 64,\n", 275 | " 'gpe organization in gpe organization': 65,\n", 276 | " 'gpe organization # # person': 66,\n", 277 | " \"person organization 's person\": 67,\n", 278 | " \"gpe person 's person\": 68,\n", 279 | " 'gpe person in morning trad': 69,\n", 280 | " 'gpe president person organization': 70,\n", 281 | " 'gpe organization organization': 71,\n", 282 | " 'person organization in gpe organization': 72,\n", 283 | " 'gpe organization in organization': 73,\n", 284 | " 'gpe person to organization': 74,\n", 285 | " 'person person to organization': 75,\n", 286 | " 'person organization on organization': 76,\n", 287 | " 'gpe organization over person': 77,\n", 288 | " \"gpe person 's organization\": 78,\n", 289 | " 'person person as organization': 79,\n", 290 | " 'gpe person organization in gpe': 80,\n", 291 | " 'gpe person in gpe organization': 81,\n", 292 | " \"gpe 's organization in gpe\": 82,\n", 293 | " 'gpe person on gpe': 83,\n", 294 | " 'gpe person # . # # percent on da': 84,\n", 295 | " 'person person and person': 85,\n", 296 | " 'person person in gpe organization': 86,\n", 297 | " 'person person on gpe': 87,\n", 298 | " 'gpe to organization': 88,\n", 299 | " 'person person at facility': 89,\n", 300 | " 'person person organization in gpe': 90,\n", 301 | " \"gpe 's organization # . 
# person\": 91,\n", 302 | " \"person 's organization\": 92,\n", 303 | " 'person person over person': 93,\n", 304 | " \"gpe 's person organization\": 94,\n", 305 | " 'person person in organization': 95,\n", 306 | " 'person for organization': 96,\n", 307 | " 'gpe president to person': 97,\n", 308 | " 'person person # # person': 98,\n", 309 | " 'gpe person in organization': 99,\n", 310 | " 'gpe organization dea': 100,\n", 311 | " 'gpe organization < unk': 101,\n", 312 | " 'person organization in organization': 102,\n", 313 | " 'gpe person in gpe trad': 103,\n", 314 | " 'gpe organization in location': 104,\n", 315 | " 'gpe organization chief person': 105,\n", 316 | " 'gsp person': 106,\n", 317 | " \"gpe 's person to person\": 107,\n", 318 | " 'gpe president person in gpe': 108,\n", 319 | " 'person organization on gpe': 109,\n", 320 | " 'gpe person # . # # percent at organization': 110,\n", 321 | " \"gpe 's person in gpe\": 111,\n", 322 | " \"person 's person\": 112,\n", 323 | " 'organization organization': 113,\n", 324 | " 'person organization < unk': 114,\n", 325 | " 'person in gpe': 115,\n", 326 | " 'gpe to person organization': 116,\n", 327 | " 'facility person after person': 117,\n", 328 | " 'gpe facility': 118,\n", 329 | " 'gpe organization # # # . # # - # # yen in gpe tokyo tradin': 119,\n", 330 | " 'gpe person on facility': 120,\n", 331 | " 'dollar person': 121,\n", 332 | " 'gpe person and person': 122,\n", 333 | " 'gpe says person': 123,\n", 334 | " 'gpe person # . # # pct at organization': 124,\n", 335 | " 'gpe to person in gpe': 125,\n", 336 | " 'person person in location': 126,\n", 337 | " 'person organization # # person': 127,\n", 338 | " 'gpe person at organization # . # # person': 128,\n", 339 | " \"gpe 's person for person\": 129,\n", 340 | " 'organization in gpe': 130,\n", 341 | " 'gpe organization in # # #': 131,\n", 342 | " 'gpe organization end # . 
# # person': 132,\n", 343 | " 'gpe person # # person': 133,\n", 344 | " 'person organization in gpe person': 134,\n", 345 | " 'gpe in gpe organization': 135,\n", 346 | " 'gpe foreign minister person': 136,\n", 347 | " 'gpe person after organization': 137,\n", 348 | " 'person chief person': 138,\n", 349 | " 'gpe organization person': 139,\n", 350 | " 'gpe minister person': 140,\n", 351 | " 'person organization organization': 141,\n", 352 | " 'gpe organization change': 142,\n", 353 | " 'person to person': 143,\n", 354 | " 'gpe organization # # #': 144,\n", 355 | " 'person person to person in gpe': 145,\n", 356 | " 'gpe police person': 146,\n", 357 | " 'gpe person at facility': 147,\n", 358 | " 'gpe organization # # # person': 148,\n", 359 | " 'person person # # in gpe': 149,\n", 360 | " 'gpe gpe organization': 150,\n", 361 | " 'facility person as person': 151,\n", 362 | " 'gpe person in location': 152,\n", 363 | " 'two organization in gpe': 153,\n", 364 | " 'person person to person organization': 154,\n", 365 | " 'gpe location person': 155,\n", 366 | " 'gpe organization attac': 156,\n", 367 | " 'gpe organization close # . # person': 157,\n", 368 | " 'gpe organization # # dollar': 158,\n", 369 | " 'gpe organization in gpe person': 159,\n", 370 | " 'person b-shares close # . # # person': 160,\n", 371 | " 'person person for gpe': 161,\n", 372 | " 'gpe organization leader person': 162,\n", 373 | " 'organization person in gpe': 163,\n", 374 | " 'us person': 164,\n", 375 | " 'gpe person for gpe': 165,\n", 376 | " 'gpe organization open # . # # person': 166,\n", 377 | " 'gpe person in gpe tradin': 167,\n", 378 | " 'gpe person person': 168,\n", 379 | " 'gpe location close # . 
# person': 169,\n", 380 | " 'gpe organization on facility': 170,\n", 381 | " 'gpe organization after organization': 171,\n", 382 | " '< unk > person organization': 172,\n", 383 | " 'facility person on person': 173,\n", 384 | " 'person world news agend': 174,\n", 385 | " 'gpe arrives in gpe': 175,\n", 386 | " 'gpe person to person in gpe': 176,\n", 387 | " 'person person in gpe person': 177,\n", 388 | " 'gpe organization #': 178,\n", 389 | " 'person person < unk': 179,\n", 390 | " 'gpe person after facility': 180,\n", 391 | " 'facility organization': 181,\n", 392 | " 'gpe person #': 182,\n", 393 | " 'gpe organization result': 183,\n", 394 | " 'gpe organization < unk > person': 184,\n", 395 | " 'the organization': 185,\n", 396 | " '< unk > person in gpe': 186,\n", 397 | " 'gpe person to person organization': 187,\n", 398 | " 'gpe person to gpe': 188,\n", 399 | " 'person president person': 189,\n", 400 | " 'gpe prime minister person': 190,\n", 401 | " 'person person after organization': 191,\n", 402 | " 'philippines person': 192,\n", 403 | " 'person person person': 193,\n", 404 | " 'gpe organization despite person': 194,\n", 405 | " 'gpe person from person': 195,\n", 406 | " 'gpe defense minister person': 196,\n", 407 | " 'person organization over person': 197,\n", 408 | " 'gpe organization from person': 198,\n", 409 | " 'gpe organization in gpe to person': 199,\n", 410 | " 'person person u': 200,\n", 411 | " 'us person after person': 201,\n", 412 | " 'person person from person': 202,\n", 413 | " 'organization on person': 203,\n", 414 | " 'person person president person': 204,\n", 415 | " 'gpe person on profit-takin': 205,\n", 416 | " 'gpe president person to person': 206,\n", 417 | " 'person organization person': 207,\n", 418 | " 'gpe person # . # percent on da': 208,\n", 419 | " 'person organization #': 209,\n", 420 | " 'gpe organization to # . 
# person': 210,\n", 421 | " 'gpe opens lower in gpe kon': 211,\n", 422 | " 'gpe person despite person': 212,\n", 423 | " 'us person on person': 213,\n", 424 | " 'gpe person to #': 214,\n", 425 | " 'gpe organization presiden': 215,\n", 426 | " 'person arrives in gpe': 216,\n", 427 | " 'gpe organization u': 217,\n", 428 | " 'person organization u': 218,\n", 429 | " 'gpe shares higher in morning trad': 219,\n", 430 | " 'person person with organization': 220,\n", 431 | " 'organization person organization': 221,\n", 432 | " 'gpe organization to # . # percent in gpe': 222,\n", 433 | " 'person person to gpe': 223,\n", 434 | " 'gpe organization as person': 224,\n", 435 | " 'gpe organization prices open highe': 225,\n", 436 | " 'gpe person # . # # percent on person': 226,\n", 437 | " 'gpe organization crisi': 227,\n", 438 | " 'gpe person with gpe': 228,\n", 439 | " 'gpe organization # # # . # # - # # yen in gpe tokyo trad': 229,\n", 440 | " 'facility person as organization': 230,\n", 441 | " 'person person to gpe in gpe': 231,\n", 442 | " 'person organization dea': 232,\n", 443 | " 'gpe organization in gpe tradin': 233,\n", 444 | " 'gpe opens higher in gpe kon': 234,\n", 445 | " 'person in gpe trad': 235,\n", 446 | " 'gpe organization # . # pc': 236,\n", 447 | " 'un organization': 237,\n", 448 | " 'person organization # # # person': 238,\n", 449 | " 'gpe organization president person': 239,\n", 450 | " 'un person': 240,\n", 451 | " 'gpe organization governmen': 241,\n", 452 | " 'gpe person for person in gpe': 242,\n", 453 | " 'gpe organization # # # # person': 243,\n", 454 | " 'gpe organization repor': 244,\n", 455 | " 'gpe organization prices close highe': 245,\n", 456 | " 'gpe foreign minister person in gpe': 246,\n", 457 | " 'person person #': 247,\n", 458 | " 'person as person': 248,\n", 459 | " 'person organization # . 
# person': 249,\n", 460 | " 'gpe person in gpe deal': 250,\n", 461 | " 'gpe to person with person': 251,\n", 462 | " 'gpe organization prices close lowe': 252,\n", 463 | " 'gpe person in gpe to person': 253,\n", 464 | " 'person b-shares close # . # person': 254,\n", 465 | " 'person organization in gpe after person': 255,\n", 466 | " 'gpe government person': 256,\n", 467 | " 'person organization in location': 257,\n", 468 | " 'gpe organization in gpe after person': 258,\n", 469 | " 'gpe organization pla': 259,\n", 470 | " 'person organization despite person': 260,\n", 471 | " 'gpe organization in gpe ban': 261,\n", 472 | " 'gpe minister person organization': 262,\n", 473 | " 'gpe location organization': 263,\n", 474 | " 'top person': 264,\n", 475 | " 'person world economic news summar': 265,\n", 476 | " 'gpe organization attack': 266,\n", 477 | " 'gpe person in gpe person': 267,\n", 478 | " 'gpe organization prices open lowe': 268,\n", 479 | " 'un chief person': 269,\n", 480 | " 'gpe says organization': 270,\n", 481 | " 'person person to gpe organization': 271,\n", 482 | " 'gpe person to gpe organization': 272,\n", 483 | " 'gpe organization morning # . # # person': 273,\n", 484 | " 'organization person for person': 274,\n", 485 | " 'gpe person with organization': 275,\n", 486 | " \"gpe 's person for organization\": 276,\n", 487 | " 'person organization in # # #': 277,\n", 488 | " 'person person < unk > person': 278,\n", 489 | " 'gpe to person on person': 279,\n", 490 | " 'gpe closes lower in gpe kon': 280,\n", 491 | " 'gpe organization # # # . 
# # yen in gpe tokyo tradin': 281,\n", 492 | " \"gpe 's organization on person\": 282,\n", 493 | " 'gpe organization to help person': 283,\n", 494 | " 'gpe premier person': 284,\n", 495 | " 'person organization as person': 285,\n", 496 | " 'rubber person on person organization': 286,\n", 497 | " '# # person in gpe': 287,\n", 498 | " 'gpe person to visit person': 288,\n", 499 | " 'gpe # # # - # person': 289,\n", 500 | " 'gpe person at gpe ; person # . # # person': 290,\n", 501 | " 'person person and organization': 291,\n", 502 | " 'gpe person < unk': 292,\n", 503 | " 'person says person': 293,\n", 504 | " 'gpe organization on gpe to person': 294,\n", 505 | " 'gpe person # . # percent in gpe': 295,\n", 506 | " 'person organization in gpe to person': 296,\n", 507 | " 'gpe person < unk > person': 297,\n", 508 | " 'person person in gpe trad': 298,\n", 509 | " 'person person to #': 299,\n", 510 | " 'gpe to person for person': 300,\n", 511 | " 'former person': 301,\n", 512 | " 'person organization after organization': 302,\n", 513 | " 'gpe closes higher in gpe kon': 303,\n", 514 | " 'facility person on organization': 304,\n", 515 | " 'person person # # # person': 305,\n", 516 | " 'the organization sunday economics news advisor': 306,\n", 517 | " 'person organization leader person': 307,\n", 518 | " '< unk > organization': 308,\n", 519 | " 'gpe organization under person': 309,\n", 520 | " 'gpe organization # . 
# # pc': 310,\n", 521 | " \"person organization in gpe 's organization\": 311,\n", 522 | " 'gpe organization price': 312,\n", 523 | " 'gpe person in # # #': 313,\n", 524 | " 'person person in # # #': 314,\n", 525 | " '< unk > person to person': 315,\n", 526 | " 'suspected person in gpe': 316,\n", 527 | " 'person person despite person': 317,\n", 528 | " 'person person at #': 318,\n", 529 | " 'gpe organization to boost person': 319,\n", 530 | " 'gpe person under person': 320,\n", 531 | " \"person organization in gpe 's person\": 321,\n", 532 | " 'gpe gpe to person': 322,\n", 533 | " 'person person but person': 323,\n", 534 | " 'person organization in gpe as person': 324,\n", 535 | " 'gpe on person': 325,\n", 536 | " 'gpe organization in the organization': 326,\n", 537 | " 'person < unk > person': 327,\n", 538 | " 'gpe president person for person': 328,\n", 539 | " 'person person for person in gpe': 329,\n", 540 | " 'gpe organization # . # # percent at organization': 330,\n", 541 | " 'person organization presiden': 331,\n", 542 | " 'gpe organization # person': 332,\n", 543 | " \"gpe 's person after person\": 333,\n", 544 | " 'gpe in organization': 334,\n", 545 | " 'gpe to visit person': 335,\n", 546 | " 'gpe president person with person': 336,\n", 547 | " 'rubber person on bigger volume': 337,\n", 548 | " 'gpe finance minister person': 338,\n", 549 | " 'us person as person': 339,\n", 550 | " \"gpe 's person on person\": 340,\n", 551 | " 'gpe person # . # # percent on organization': 341,\n", 552 | " \"gpe 's person with person\": 342,\n", 553 | " \"gpe 's person as person\": 343,\n", 554 | " 'gpe organization prob': 344,\n", 555 | " 'gpe person # . 
# # person on person': 345,\n", 556 | " 'person person with gpe': 346,\n", 557 | " 'gpe person u': 347,\n", 558 | " 'gpe says it person': 348,\n", 559 | " 'organization in gpe trad': 349,\n", 560 | " 'gpe person in gpe with person': 350,\n", 561 | " 'person person # -': 351,\n", 562 | " 'person organization on gpe to person': 352,\n", 563 | " 'gpe person to gpe in gpe': 353,\n", 564 | " 's. person': 354,\n", 565 | " 'gpe organization in gaz': 355,\n", 566 | " 'gpe person for organization in gpe': 356,\n", 567 | " 'gpe person to help person': 357,\n", 568 | " 'person organization < unk > person': 358,\n", 569 | " 'gpe person president person': 359,\n", 570 | " 'person organization governmen': 360,\n", 571 | " 'gpe organization rat': 361,\n", 572 | " 'gpe to person to person': 362,\n", 573 | " 'gpe person but person': 363,\n", 574 | " 'gpe organization # . # percent in # # #': 364,\n", 575 | " 'gpe organization # . # percent in jun': 365,\n", 576 | " 'us person on organization': 366,\n", 577 | " 'gpe organization morning # . # person': 367,\n", 578 | " 'organization person to person': 368,\n", 579 | " 'gpe organization to # . # # person': 369,\n", 580 | " 'person organization from person': 370,\n", 581 | " 'gpe person # . # # person on organization': 371,\n", 582 | " 'gpe person to end person': 372,\n", 583 | " 'gpe organization in jun': 373,\n", 584 | " 'gpe organization # # . # person': 374,\n", 585 | " \"person person in gpe 's organization\": 375,\n", 586 | " 'gpe organization on gpe organization': 376,\n", 587 | " 'gsp person organization': 377,\n", 588 | " 'person at organization': 378,\n", 589 | " 'gpe person in gpe after person': 379,\n", 590 | " 'person person in the organization': 380,\n", 591 | " \"gpe 's organization after person\": 381,\n", 592 | " 'gpe moves to person': 382,\n", 593 | " 'gpe on organization': 383,\n", 594 | " 'gpe person talk': 384,\n", 595 | " 'gpe organization # . 
# pct in gpe': 385,\n", 596 | " 'gpe plans person': 386,\n", 597 | " 'person after person': 387,\n", 598 | " 'gpe on the organization': 388,\n", 599 | " 'facility person ; person # . # # person': 389,\n", 600 | " 'gpe organization to end person': 390,\n", 601 | " 'person person in gpe to person': 391,\n", 602 | " 'gpe to organization in gpe': 392,\n", 603 | " 'three organization in gpe': 393,\n", 604 | " \"gpe 's organization # # person\": 394,\n", 605 | " 'person person to end person': 395,\n", 606 | " 'person organization on the organization': 396,\n", 607 | " 'gpe organization on person organization': 397,\n", 608 | " 'gpe person to person on person': 398,\n", 609 | " 'gpe organization prices close # . # person': 399,\n", 610 | " '# . # person': 400,\n", 611 | " 'person person in gpe after person': 401,\n", 612 | " 'gpe organization firme': 402,\n", 613 | " 'gpe person over organization': 403,\n", 614 | " 'person chief person organization': 404,\n", 615 | " 'facility person organization': 405,\n", 616 | " 'gsp organization in gpe': 406,\n", 617 | " 'gpe organization morning up # . # # person': 407,\n", 618 | " '< unk > person for person': 408,\n", 619 | " 'gpe organization in morning trad': 409,\n", 620 | " 'gpe organization in gpe with person': 410,\n", 621 | " 'gpe organization # . # percent on person': 411,\n", 622 | " 'gpe judge person': 412,\n", 623 | " 'gpe person # . # pc': 413,\n", 624 | " 'top organization': 414,\n", 625 | " 'person organization crisi': 415,\n", 626 | " 'gpe organization in gpe as person': 416,\n", 627 | " 'gpe person to person for person': 417,\n", 628 | " 'gpe organization to be person': 418,\n", 629 | " 'gpe person # . 
# # percent on profit-takin': 419,\n", 630 | " 'gpe person for person with person': 420,\n", 631 | " 'person person in gpe with person': 421,\n", 632 | " 'person person before person': 422,\n", 633 | " 'gpe to person on organization': 423,\n", 634 | " 'person president person organization': 424,\n", 635 | " \"gpe organization in gpe 's person\": 425,\n", 636 | " 'gpe person to person with person': 426,\n", 637 | " 'gpe organization in gpe on person': 427,\n", 638 | " 'gpe organization conferenc': 428,\n", 639 | " 'gpe organization charged with person': 429,\n", 640 | " 'person on person': 430,\n", 641 | " 'thousands person': 431,\n", 642 | " 'gpe organization open # . # person': 432,\n", 643 | " 'gpe person # . # percent at organization': 433,\n", 644 | " 'person person in gpe tradin': 434,\n", 645 | " 'dollar person in gpe': 435,\n", 646 | " 'person organization # # in gpe': 436,\n", 647 | " 'gpe organization in gpe in gpe': 437,\n", 648 | " 'dollar person organization': 438,\n", 649 | " 'gpe organization decisio': 439,\n", 650 | " 'gpe person # . # # percent as person': 440,\n", 651 | " 'gpe person at organization # . 
# # pc': 441,\n", 652 | " 'person person of the organization': 442,\n", 653 | " 'gpe hits out at organization': 443,\n", 654 | " 'person person on gpe organization': 444,\n", 655 | " \"gpe 's president person\": 445,\n", 656 | " 'stocks person on facility': 446,\n", 657 | " '# # dead as person': 447,\n", 658 | " 'gpe person to boost person': 448,\n", 659 | " 'gpe organization charge': 449,\n", 660 | " 'gpe person to visit gpe': 450,\n", 661 | " 'gpe organization in gpe quarte': 451,\n", 662 | " 'the organization tuesday economics news advisor': 452,\n", 663 | " 'person person talk': 453,\n", 664 | " 'gpe facility organization': 454,\n", 665 | " 'two gpe organization': 455,\n", 666 | " 'gpe organization on profit-takin': 456,\n", 667 | " 'gpe organization three person': 457,\n", 668 | " 'person organization under person': 458,\n", 669 | " 'gpe organization # # # dollar': 459,\n", 670 | " 'gpe organization agree to person': 460,\n", 671 | " 'gpe organization protes': 461,\n", 672 | " 'gpe person on gpe organization': 462,\n", 673 | " 'two organization': 463,\n", 674 | " 'judge person': 464,\n", 675 | " 'gpe organization on location': 465,\n", 676 | " 'gpe gpe person': 466,\n", 677 | " 'person organization in gpe with person': 467,\n", 678 | " 'us organization': 468,\n", 679 | " 'gpe organization # # . 
# percent in gpe': 469,\n", 680 | " 'gpe organization secretary person': 470,\n", 681 | " 'gpe organization # #': 471,\n", 682 | " \"person person in gpe 's person\": 472,\n", 683 | " 'person person over organization': 473,\n", 684 | " 'gpe organization from gpe': 474,\n", 685 | " 'gpe forces person': 475,\n", 686 | " \"person person organization 's organization\": 476,\n", 687 | " 'gpe organization first person': 477,\n", 688 | " 'us person as organization': 478,\n", 689 | " 'gpe person # # # person': 479,\n", 690 | " 'person organization # # th stag': 480,\n", 691 | " 'gpe organization on the organization': 481,\n", 692 | " 'person person of the yea': 482,\n", 693 | " 'gpe defense minister person in gpe': 483,\n", 694 | " 'person person at organization in gpe': 484,\n", 695 | " '< unk > person at organization': 485,\n", 696 | " 'gpe person before person': 486,\n", 697 | " 'gpe person organization after person': 487,\n", 698 | " 'person person with person organization': 488,\n", 699 | " 'gpe person for person to person': 489,\n", 700 | " 'gpe < unk > person': 490,\n", 701 | " 'person with person': 491,\n", 702 | " 'person for gsp': 492,\n", 703 | " 'gpe organization # # organization': 493,\n", 704 | " 'gpe prime minister person in gpe': 494,\n", 705 | " 'gpe organization arrives in gpe': 495,\n", 706 | " 'person organization in the organization': 496,\n", 707 | " 'former organization': 497,\n", 708 | " 'gpe organization bombing': 498,\n", 709 | " 'gpe president person in gpe organization': 499,\n", 710 | " 'gpe person # . 
# percent on person': 500,\n", 711 | " 'gpe foreign minister to person': 501,\n", 712 | " 'facility person at organization': 502,\n", 713 | " 'gpe person ; person': 503,\n", 714 | " 'person person ; person': 504,\n", 715 | " 'person gpe person': 505,\n", 716 | " 'person organization attac': 506,\n", 717 | " 'gpe organization facility': 507,\n", 718 | " 'gpe organization in gpe talk': 508,\n", 719 | " 'person arrives in gpe organization': 509,\n", 720 | " \"gpe 's person at organization\": 510,\n", 721 | " 'gpe organization leader': 511,\n", 722 | " 'gpe wins person': 512,\n", 723 | " 'gpe organization # # dollars in gpe trad': 513,\n", 724 | " 'gpe president person on person': 514,\n", 725 | " 'person organization # # # # person': 515,\n", 726 | " 'gpe organization in gsp': 516,\n", 727 | " 'king person': 517,\n", 728 | " \"['gpe', 'organization', 'price', 'close', 'person', 'tuesday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 518,\n", 729 | " \"gpe 's person 's person\": 519,\n", 730 | " 'gpe person and organization': 520,\n", 731 | " 's. person # . # # person': 521,\n", 732 | " 'gpe foreign minister person organization': 522,\n", 733 | " 'gpe person with person in gpe': 523,\n", 734 | " 'gpe person to person after person': 524,\n", 735 | " 'gpe to person after person': 525,\n", 736 | " 'gpe person # . # # person on profit-takin': 526,\n", 737 | " 'person person # # # # person': 527,\n", 738 | " 'gpe organization in gpe morning trad': 528,\n", 739 | " 'person for frida': 529,\n", 740 | " 'person person in gsp': 530,\n", 741 | " 'former gpe president person': 531,\n", 742 | " 'gpe in gpe to person': 532,\n", 743 | " \"['person', 'price', 'in', 'gpe', 'open', 'high', 'wednesday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 533,\n", 744 | " 'gpe person at #': 534,\n", 745 | " 'gpe location close # . 
# # person': 535,\n", 746 | " 'gpe organization on person in gpe': 536,\n", 747 | " 'gpe organization journalis': 537,\n", 748 | " 'person facility': 538,\n", 749 | " 'person organization arrives in gpe': 539,\n", 750 | " 'gpe organization projec': 540,\n", 751 | " 'gpe arrives in gpe organization': 541,\n", 752 | " 'gpe person # . # person on profit-takin': 542,\n", 753 | " 'gpe organization in gpe attac': 543,\n", 754 | " 'person organization # # #': 544,\n", 755 | " 'gpe organization morning highe': 545,\n", 756 | " 'person person in gaz': 546,\n", 757 | " \"gpe 's organization 's person\": 547,\n", 758 | " \"gpe 's organization in gpe organization\": 548,\n", 759 | " \"gpe organization in gpe 's organization\": 549,\n", 760 | " \"gpe person organization 's organization\": 550,\n", 761 | " 'police person': 551,\n", 762 | " 'gpe person on person organization': 552,\n", 763 | " 'un person in gpe': 553,\n", 764 | " 'gpe organization # # percent in gpe': 554,\n", 765 | " 'person person the organization': 555,\n", 766 | " 'gpe person in gpe talk': 556,\n", 767 | " 'organization person for organization': 557,\n", 768 | " 'person organization president person': 558,\n", 769 | " 'person person organization after person': 559,\n", 770 | " 'person person to person on person': 560,\n", 771 | " \"gpe 's < unk > person\": 561,\n", 772 | " 'person person to the organization': 562,\n", 773 | " 'gpe in gpe person': 563,\n", 774 | " 'gpe organization statio': 564,\n", 775 | " 'gpe organization to visit person': 565,\n", 776 | " 'stocks person at facility': 566,\n", 777 | " 'person person arrives in gpe': 567,\n", 778 | " 'gpe organization morning lowe': 568,\n", 779 | " 'person organization pri': 569,\n", 780 | " 'gpe organization # # person in gpe': 570,\n", 781 | " 'gpe organization dead at #': 571,\n", 782 | " \"person person organization 's person\": 572,\n", 783 | " 'person organization attack': 573,\n", 784 | " 'gpe person in the organization': 574,\n", 785 | " 'person 
organization in gpe in gpe': 575,\n", 786 | " 'person person in person': 576,\n", 787 | " 'gpe person to # #': 577,\n", 788 | " 'gpe organization # -': 578,\n", 789 | " \"['person', 'price', 'in', 'gpe', 'open', 'person', 'thursday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 579,\n", 790 | " 'rights person': 580,\n", 791 | " 'gpe organization organization in gpe': 581,\n", 792 | " 'gpe organization protest': 582,\n", 793 | " 'gsp person to person': 583,\n", 794 | " 'organization on gpe': 584,\n", 795 | " 'gpe organization morning up # . # person': 585,\n", 796 | " 'gpe organization victor': 586,\n", 797 | " 'gpe person in gpe on person': 587,\n", 798 | " 'person organization facility': 588,\n", 799 | " 'person organization # # . # person': 589,\n", 800 | " 'person person in gpe ope': 590,\n", 801 | " 'person person says person': 591,\n", 802 | " 'gpe person # . # # person on facility': 592,\n", 803 | " 'gpe person # . # # percent in gpe trad': 593,\n", 804 | " 'gpe person organization on person': 594,\n", 805 | " 'un chief person in gpe': 595,\n", 806 | " 'person person # - # in gpe leagu': 596,\n", 807 | " \"person 's person in gpe\": 597,\n", 808 | " 'gpe person for person on person': 598,\n", 809 | " 'gpe person price': 599,\n", 810 | " 'gpe organization despite organization': 600,\n", 811 | " 'person person on the organization': 601,\n", 812 | " 'gpe organization chief to person': 602,\n", 813 | " 'gpe president person after person': 603,\n", 814 | " 'person person to person with person': 604,\n", 815 | " \"['person', 'price', 'in', 'gpe', 'open', 'person', 'tuesday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 605,\n", 816 | " 'gpe person after person in gpe': 606,\n", 817 | " 'person person # #': 607,\n", 818 | " 'person person for organization in gpe': 608,\n", 819 | " 'gpe shares up # . # person': 609,\n", 820 | " 'gpe person # . 
# person on person': 610,\n", 821 | " 'rubber person on gpe volume': 611,\n", 822 | " 'person organization in gsp': 612,\n", 823 | " \"['gpe', 'person', 'price', 'open', 'high', 'wednesday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 613,\n", 824 | " \"person person 's person in gpe\": 614,\n", 825 | " 'person police person': 615,\n", 826 | " 'gpe to person for organization': 616,\n", 827 | " 'gpe person # . # # percent on facility': 617,\n", 828 | " 'gpe organization # # # . # # - # # in gpe tokyo trad': 618,\n", 829 | " 'person person to help person': 619,\n", 830 | " 'organization organization in gpe': 620,\n", 831 | " 'gpe organization # , # # # person': 621,\n", 832 | " 'person organization on facility': 622,\n", 833 | " 'first person': 623,\n", 834 | " 'person as organization': 624,\n", 835 | " 'gpe person on person in gpe': 625,\n", 836 | " 'gpe person about person': 626,\n", 837 | " 'gpe person for person organization': 627,\n", 838 | " 'jailed person': 628,\n", 839 | " 'person person in gpe yor': 629,\n", 840 | " 'gsp organization on person': 630,\n", 841 | " 'gpe organization fir': 631,\n", 842 | " 'gpe person # . 
# person on organization': 632,\n", 843 | " 'gpe organization off person': 633,\n", 844 | " \"gpe 's organization on organization\": 634,\n", 845 | " 'gpe organization inventory dat': 635,\n", 846 | " 'gpe organization chief person organization': 636,\n", 847 | " 'gpe in gpe as person': 637,\n", 848 | " 'gpe organization in person': 638,\n", 849 | " 'person person with person in gpe': 639,\n", 850 | " 'first person in gpe': 640,\n", 851 | " 'gpe person from gpe': 641,\n", 852 | " 'gpe person second test scoreboar': 642,\n", 853 | " \"['person', 'price', 'in', 'gpe', 'open', 'high', 'tuesday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 643,\n", 854 | " 'person person to person after person': 644,\n", 855 | " 'gpe organization probe organization': 645,\n", 856 | " \"['person', 'price', 'in', 'gpe', 'open', 'person', 'friday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 646,\n", 857 | " 'person chief person to person': 647,\n", 858 | " 'organization person on person': 648,\n", 859 | " 'philippines organization': 649,\n", 860 | " 'gpe urges gpe to person': 650,\n", 861 | " '< unk > person with person': 651,\n", 862 | " \"['gpe', 'person', 'price', 'open', 'high', 'thursday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 652,\n", 863 | " 'gpe organization to hold person': 653,\n", 864 | " 'gpe person # . 
# # person as person': 654,\n", 865 | " 'gpe organization to #': 655,\n", 866 | " 'gpe person to stop person': 656,\n", 867 | " \"['gpe', 'person', 'price', 'open', 'high', 'friday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 657,\n", 868 | " 'gpe organization in gpe cas': 658,\n", 869 | " 'gpe person # #': 659,\n", 870 | " \"gpe person in gpe 's person\": 660,\n", 871 | " '# # killed as person': 661,\n", 872 | " 'person person on person organization': 662,\n", 873 | " 'person to visit person': 663,\n", 874 | " 'three person in gpe': 664,\n", 875 | " 'gpe organization attack in gpe': 665,\n", 876 | " \"['person', 'price', 'in', 'gpe', 'open', 'high', 'thursday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 666,\n", 877 | " 'person person in gpe as person': 667,\n", 878 | " \"gpe 's person on organization\": 668,\n", 879 | " 'gpe prosecutors person': 669,\n", 880 | " 'person organization on gpe organization': 670,\n", 881 | " 'philippines person organization': 671,\n", 882 | " 'person person to be person': 672,\n", 883 | " 'person person # # . # person': 673,\n", 884 | " 'gpe organization morning down # . # person': 674,\n", 885 | " 'gpe organization in the gpe crisi': 675,\n", 886 | " 'gpe facility in gpe': 676,\n", 887 | " 'gpe person to # . 
# person': 677,\n", 888 | " 'gpe person to gpe person': 678,\n", 889 | " 'person organization to end person': 679,\n", 890 | " \"['person', 'price', 'in', 'gpe', 'open', 'high', 'friday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 680,\n", 891 | " 'three gpe soldiers killed in gpe': 681,\n", 892 | " 'at organization # # dead as person': 682,\n", 893 | " 'un chief person organization': 683,\n", 894 | " 'gpe hints at organization': 684,\n", 895 | " 'the organization world news summar': 685,\n", 896 | " 'gpe organization # # person for person': 686,\n", 897 | " 'gpe person in gpe attac': 687,\n", 898 | " 'person organization conferenc': 688,\n", 899 | " 'gpe organization leader person organization': 689,\n", 900 | " 'gpe gpe organization in gpe': 690,\n", 901 | " 'gpe organization # . # percent at organization': 691,\n", 902 | " \"person person to person 's person\": 692,\n", 903 | " 'person person to boost person': 693,\n", 904 | " 'gpe organization p': 694,\n", 905 | " 'person person for person with person': 695,\n", 906 | " 'gpe police person organization': 696,\n", 907 | " 'person organization o': 697,\n", 908 | " 'gpe person that person': 698,\n", 909 | " 'after person': 699,\n", 910 | " 'leading person': 700,\n", 911 | " 'person organization world cup giant slalo': 701,\n", 912 | " 'gpe organization leaves organization': 702,\n", 913 | " 'gpe organization to visit gpe': 703,\n", 914 | " \"gpe 's organization # . 
# percent in gpe\": 704,\n", 915 | " 'person person to person for person': 705,\n", 916 | " 'major person': 706,\n", 917 | " 'person person dies at #': 707,\n", 918 | " 'person to person in gpe': 708,\n", 919 | " 'gpe foreign minister to visit person': 709,\n", 920 | " 'person person at gpe ope': 710,\n", 921 | " 'gpe person with person organization': 711,\n", 922 | " 'person person to keep person': 712,\n", 923 | " 'person organization pla': 713,\n", 924 | " 'gpe threatens to person': 714,\n", 925 | " 'gpe organization morning down # . # # person': 715,\n", 926 | " 'us person organization': 716,\n", 927 | " 'gpe person # . # percent in morning trad': 717,\n", 928 | " 'person person with gsp': 718,\n", 929 | " 'top person in gpe': 719,\n", 930 | " 'gpe organization minister person': 720,\n", 931 | " 'gpe organization closed for gpe': 721,\n", 932 | " 'gpe organization in gpe ro': 722,\n", 933 | " '< unk > person for organization': 723,\n", 934 | " 'person person for person organization': 724,\n", 935 | " 'gpe organization after person in gpe': 725,\n", 936 | " 'gpe person organization on gpe': 726,\n", 937 | " 'gpe person in gpe quarte': 727,\n", 938 | " 'person person from gpe': 728,\n", 939 | " 'gpe organization # . # percent in gpe quarte': 729,\n", 940 | " 'person in gpe organization': 730,\n", 941 | " 'gpe in the organization': 731,\n", 942 | " 'gpe president person for organization': 732,\n", 943 | " 'gpe organization says person': 733,\n", 944 | " 'gpe organization grows # . # percent in # # #': 734,\n", 945 | " \"gpe 's person # # person\": 735,\n", 946 | " 'gpe organization to stop person': 736,\n", 947 | " 'person person under person': 737,\n", 948 | " \"gsp 's organization\": 738,\n", 949 | " 'gpe person ; person # . 
# # person': 739,\n", 950 | " 'gpe person on profit takin': 740,\n", 951 | " 'gpe organization # , # # # job': 741,\n", 952 | " 'person organization chief person': 742,\n", 953 | " 'person person on facility': 743,\n", 954 | " 'gpe in gpe with person': 744,\n", 955 | " 'general person': 745,\n", 956 | " 'gpe person to gpe after person': 746,\n", 957 | " 'person person # , # # # person': 747,\n", 958 | " 'gpe sets up organization': 748,\n", 959 | " 'gpe denies person': 749,\n", 960 | " 'person organization in u': 750,\n", 961 | " 'gpe organization # # # person in gpe': 751,\n", 962 | " 'person person to visit person': 752,\n", 963 | " 'person person for # # #': 753,\n", 964 | " 'gpe person # . # # percent after person': 754,\n", 965 | " 'gpe person # -': 755,\n", 966 | " 'person person in gpe tes': 756,\n", 967 | " 'person person to victor': 757,\n", 968 | " 'gpe person as person organization': 758,\n", 969 | " 'person person about person': 759,\n", 970 | " 'gpe person # # . # percent in gpe': 760,\n", 971 | " \"gpe 's person 's organization\": 761,\n", 972 | " 'two person': 762,\n", 973 | " 'person organization result': 763,\n", 974 | " \"gpe organization 's organization in gpe\": 764,\n", 975 | " 'un person organization': 765,\n", 976 | " 'gpe shares up # . # # percent at organization': 766,\n", 977 | " 'person on organization': 767,\n", 978 | " 'gpe organization stocks open lowe': 768,\n", 979 | " 'person person in gpe vot': 769,\n", 980 | " \"gpe 's organization in organization\": 770,\n", 981 | " 'person organization # . # # person': 771,\n", 982 | " 'person organization in gpe on person': 772,\n", 983 | " 'major organization': 773,\n", 984 | " 'person person to # , # # # to the dolla': 774,\n", 985 | " 'gpe organization higher in gpe': 775,\n", 986 | " '# # person': 776,\n", 987 | " 'gpe person # . 
# percent on organization': 777,\n", 988 | " 'person person in gpe quarte': 778,\n", 989 | " 'gpe organization resul': 779,\n", 990 | " 'gpe organization to keep person': 780,\n", 991 | " 'gpe person in gpe as person': 781,\n", 992 | " \"['person', 'price', 'in', 'gpe', 'open', 'person', 'wednesday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 782,\n", 993 | " 'police person organization': 783,\n", 994 | " 'gpe organization actio': 784,\n", 995 | " 'gpe person arrives in gpe': 785,\n", 996 | " \"gpe to person 's organization\": 786,\n", 997 | " \"['person', 'price', 'in', 'gpe', 'open', 'high', 'monday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 787,\n", 998 | " 'gpe organization recover': 788,\n", 999 | " 'gpe organization morning person': 789,\n", 1000 | " 'gpe organization # # # # person to # . # person': 790,\n", 1001 | " 'person person in first quarte': 791,\n", 1002 | " 'person person organization < unk': 792,\n", 1003 | " \"['gpe', 'person', 'price', 'open', 'person', 'tuesday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 793,\n", 1004 | " 'person organization to be person': 794,\n", 1005 | " 'gpe person to person to person': 795,\n", 1006 | " 'gpe prime minister person organization': 796,\n", 1007 | " 'person organization on person in gpe': 797,\n", 1008 | " \"gpe 's person in gpe organization\": 798,\n", 1009 | " 'person organization # # # person in gpe': 799,\n", 1010 | " 'gpe person first test scoreboar': 800,\n", 1011 | " 'gpe organization # # # . # # yen in gpe afternoo': 801,\n", 1012 | " 'gpe to help person': 802,\n", 1013 | " 'gpe organization stocks open highe': 803,\n", 1014 | " 'deadly person': 804,\n", 1015 | " 'person person # # person in gpe': 805,\n", 1016 | " 'gpe person # . 
# # person after person': 806,\n", 1017 | " \"gpe to person 's person\": 807,\n", 1018 | " \"gpe 's organization # # . # person\": 808,\n", 1019 | " \"['gpe', 'organization', 'price', 'close', 'low', 'on', 'wednesday', 'at', 'u', 'dollar', 'an', 'ounce', 'down', 'from', 'tuesday', 's', 'close', 'of', 'dollar']\": 809,\n", 1020 | " 'gpe organization in gpe tes': 810,\n", 1021 | " 'gpe organization # # # # budge': 811,\n", 1022 | " 'gpe person with gsp': 812,\n", 1023 | " 'gpe person organization < unk': 813,\n", 1024 | " 'person person rates stead': 814,\n", 1025 | " 'gpe in gpe after person': 815,\n", 1026 | " 'gpe organization grows # . # percent in gpe quarte': 816,\n", 1027 | " 'gpe organization securit': 817,\n", 1028 | " 'president person': 818,\n", 1029 | " 'gpe person other gpe organization': 819,\n", 1030 | " 'person person and gpe': 820,\n", 1031 | " 'gpe person # person': 821,\n", 1032 | " 'gpe organization prisoner': 822,\n", 1033 | " \"gpe 's organization in # # #\": 823,\n", 1034 | " 'gpe organization to # # . # person': 824,\n", 1035 | " 'gpe at a glanc': 825,\n", 1036 | " 'person person during person': 826,\n", 1037 | " 'person person < unk > to person': 827,\n", 1038 | " 'gpe # # - # person': 828,\n", 1039 | " 'person organization victor': 829,\n", 1040 | " 'gpe foreign minister to visit gpe': 830,\n", 1041 | " 'gpe says person organization': 831,\n", 1042 | " 'person organization ope': 832,\n", 1043 | " 'thousands person in gpe': 833,\n", 1044 | " 'gpe insists person': 834,\n", 1045 | " 'gpe organization # . 
# percent to person': 835,\n", 1046 | " 'person gpe news summar': 836,\n", 1047 | " 'person # # dollars in gpe trad': 837,\n", 1048 | " 'gpe protests person': 838,\n", 1049 | " 'facility person for person': 839,\n", 1050 | " 'gpe organization secretary arrives in gpe': 840,\n", 1051 | " 'yemen person': 841,\n", 1052 | " \"['gpe', 'organization', 'price', 'close', 'high', 'on', 'thursday', 'at', 'u', 'dollar', 'an', 'ounce', 'up', 'from', 'wednesday', 's', 'close', 'of', 'dollar']\": 842,\n", 1053 | " 'gpe vice president person': 843,\n", 1054 | " 'gpe person indies person': 844,\n", 1055 | " 'person person in gpe in gpe': 845,\n", 1056 | " 'gpe location location': 846,\n", 1057 | " 'gpe minister person to person': 847,\n", 1058 | " '# # killed in gpe person': 848,\n", 1059 | " 'person person in gpe cas': 849,\n", 1060 | " 'gpe # # # for # at organization': 850,\n", 1061 | " 'gpe person in gpe for person': 851,\n", 1062 | " '< unk > person as person': 852,\n", 1063 | " 'person coach person': 853,\n", 1064 | " 'person person for person to person': 854,\n", 1065 | " 'suspected person': 855,\n", 1066 | " \"gpe person organization 's person\": 856,\n", 1067 | " 'person organization ; person': 857,\n", 1068 | " 'person organization offic': 858,\n", 1069 | " 'gpe leaves gpe organization': 859,\n", 1070 | " 'person organization dead at #': 860,\n", 1071 | " 'person organization in gpe trad': 861,\n", 1072 | " \"gpe 's organization chief person\": 862,\n", 1073 | " 'gpe defends person': 863,\n", 1074 | " 'person s person': 864,\n", 1075 | " 'person person # . # # dollar': 865,\n", 1076 | " 'person organization to keep person': 866,\n", 1077 | " 'person person in facility': 867,\n", 1078 | " 'person organization # # # person for person': 868,\n", 1079 | " 'judge person organization': 869,\n", 1080 | " \"gpe 's top person\": 870,\n", 1081 | " \"person 's person organization\": 871,\n", 1082 | " 'gpe person # # . 
# person': 872,\n", 1083 | " 'gpe person # # # # person': 873,\n", 1084 | " \"['person', 'price', 'in', 'gpe', 'open', 'person', 'monday', 'with', 'the', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 874,\n", 1085 | " 'gpe leaves person organization': 875,\n", 1086 | " 'gpe organization on person to person': 876,\n", 1087 | " 'rubber person on increased person': 877,\n", 1088 | " \"['gpe', 'organization', 'price', 'open', 'person', 'thursday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 878,\n", 1089 | " 'gpe to hold person': 879,\n", 1090 | " 'gpe organization general person': 880,\n", 1091 | " 'us person in gpe': 881,\n", 1092 | " 'person person profit': 882,\n", 1093 | " 'gpe person on gpe person': 883,\n", 1094 | " 'gpe organization in # # year': 884,\n", 1095 | " 'gpe on gpe organization': 885,\n", 1096 | " 'person organization protest': 886,\n", 1097 | " \"['gpe', 'organization', 'price', 'open', 'low', 'on', 'tuesday', 'at', 'u', 'dollar', 'an', 'ounce', 'down', 'from', 'monday', 's', 'close', 'of', 'dollar']\": 887,\n", 1098 | " \"gpe 's person to organization\": 888,\n", 1099 | " 'person person # person': 889,\n", 1100 | " 'gpe person prob': 890,\n", 1101 | " 'gpe organization death': 891,\n", 1102 | " 'gpe to gpe organization': 892,\n", 1103 | " 'gpe person despite organization': 893,\n", 1104 | " 'gpe person to be person': 894,\n", 1105 | " 'gpe leaves organization': 895,\n", 1106 | " 'gpe police arrest # # person': 896,\n", 1107 | " \"['gpe', 'organization', 'price', 'open', 'low', 'on', 'wednesday', 'at', 'u', 'dollar', 'an', 'ounce', 'down', 'from', 'tuesday', 's', 'close', 'of', 'dollar']\": 897,\n", 1108 | " 'leading organization': 898,\n", 1109 | " 'person person o': 899,\n", 1110 | " 'gpe organization # # year': 900,\n", 1111 | " 'person person is person': 901,\n", 1112 | " 'person chief person in gpe': 902,\n", 1113 | " \"person person 's organization 
in gpe\": 903,\n", 1114 | " 'facility person after person ; person # . # # person': 904,\n", 1115 | " \"gsp 's person\": 905,\n", 1116 | " 'gpe organization to gpe person': 906,\n", 1117 | " 'thousands of person': 907,\n", 1118 | " 'gpe organization challeng': 908,\n", 1119 | " 'gpe to person at organization': 909,\n", 1120 | " 'gpe person in gpe in gpe': 910,\n", 1121 | " 'gpe president to person in gpe': 911,\n", 1122 | " \"organization 's person\": 912,\n", 1123 | " 'person person on organization in gpe': 913,\n", 1124 | " 'gpe organization o': 914,\n", 1125 | " 'gpe organization referendu': 915,\n", 1126 | " \"person 's organization in gpe\": 916,\n", 1127 | " 'gpe organization < unk > dies at #': 917,\n", 1128 | " 'gpe organization dies at #': 918,\n", 1129 | " \"['gpe', 'person', 'price', 'open', 'high', 'tuesday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 919,\n", 1130 | " 'gpe organization higher on facility': 920,\n", 1131 | " 'gpe organization # # , # # # person': 921,\n", 1132 | " 'four organization in gpe': 922,\n", 1133 | " 'three organization': 923,\n", 1134 | " 'person person as person organization': 924,\n", 1135 | " 'gpe under person': 925,\n", 1136 | " 'person organization on person to person': 926,\n", 1137 | " 'person organization # #': 927,\n", 1138 | " 'organization in gpe organization': 928,\n", 1139 | " 'gpe islamists person': 929,\n", 1140 | " \"gpe 's b person # . 
# person\": 930,\n", 1141 | " 'gpe person prices up at organization': 931,\n", 1142 | " 'person organization in gpe violenc': 932,\n", 1143 | " 'person organization # # person in gpe': 933,\n", 1144 | " 'gpe organization higher in gpe trad': 934,\n", 1145 | " \"gpe 's organization organization\": 935,\n", 1146 | " 'person person in gpe kon': 936,\n", 1147 | " 'gpe is person': 937,\n", 1148 | " 'person person organization on person': 938,\n", 1149 | " \"['gpe', 'organization', 'price', 'open', 'person', 'tuesday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 939,\n", 1150 | " 'person person president person organization': 940,\n", 1151 | " \"['gpe', 'organization', 'price', 'close', 'high', 'wednesday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 941,\n", 1152 | " 'person organization challeng': 942,\n", 1153 | " 'person asia-pacific news agend': 943,\n", 1154 | " 'person organization to boost person': 944,\n", 1155 | " \"['gpe', 'organization', 'price', 'open', 'high', 'friday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 945,\n", 1156 | " '< unk > to person': 946,\n", 1157 | " 'gpe person organization attack': 947,\n", 1158 | " 'person organization in person': 948,\n", 1159 | " 'gpe says person to person': 949,\n", 1160 | " 'gpe person first person': 950,\n", 1161 | " 'five organization in gpe': 951,\n", 1162 | " \"['gpe', 'organization', 'price', 'close', 'high', 'friday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 952,\n", 1163 | " \"['gpe', 'organization', 'price', 'close', 'person', 'thursday', 'at', 'u', 'dollar', 'an', 'ounce', 'against', 'the', 'previous', 'day', 's', 'closing', 'rate', 'of']\": 953,\n", 1164 | " 'person organization on person organization': 954,\n", 1165 | " 'person person # # dea': 955,\n", 1166 
| " 'person person not person': 956,\n", 1167 | " 'gpe organization cu': 957,\n", 1168 | " 'gpe person # . # person on facility': 958,\n", 1169 | " 'gpe person in jun': 959,\n", 1170 | " \"['gpe', 'person', 'price', 'open', 'person', 'thursday', 'with', 'the', 'organization', 'nikkei', 'person', 'person', 'percent', 'to', 'in', 'the', 'organization']\": 960,\n", 1171 | " 'tropical storm person': 961,\n", 1172 | " \"organization 's organization\": 962,\n", 1173 | " 'dollar mixed as person': 963,\n", 1174 | " 'gpe organization ; person # . # # person': 964,\n", 1175 | " 'gpe organization after person organization': 965,\n", 1176 | " 'gpe to build person': 966,\n", 1177 | " 'gpe organization # # person organization': 967,\n", 1178 | " 'gpe person # . # # person organization': 968,\n", 1179 | " 'person organization to visit person': 969,\n", 1180 | " 'gpe organization offe': 970,\n", 1181 | " 'person person to gpe person': 971,\n", 1182 | " 'person organization from person to #': 972,\n", 1183 | " 'gpe person to leave person': 973,\n", 1184 | " 'gpe person more person': 974,\n", 1185 | " \"gpe organization 's person in gpe\": 975,\n", 1186 | " 'gpe organization # # # person organization': 976,\n", 1187 | " 'gpe person in gpe vot': 977,\n", 1188 | " 'gpe s person': 978,\n", 1189 | " 'gpe reports # # th person': 979,\n", 1190 | " 'hundreds of person': 980,\n", 1191 | " \"wednesday 's organization\": 981,\n", 1192 | " \"gpe 's organization # . # # person\": 982,\n", 1193 | " \"['gpe', 'organization', 'price', 'close', 'high', 'on', 'friday', 'at', 'u', 'dollar', 'an', 'ounce', 'up', 'from', 'thursday', 's', 'close', 'of', 'dollar']\": 983,\n", 1194 | " 'gpe person # . 
# # percent in gpe tradin': 984,\n", 1195 | " 'gpe defense minister person organization': 985,\n", 1196 | " 'gpe organization bombing in gpe': 986,\n", 1197 | " 'person person on location': 987,\n", 1198 | " 'facility street prices u': 988,\n", 1199 | " 'person chief to person': 989,\n", 1200 | " 'gpe to launch person': 990,\n", 1201 | " 'person person facility': 991,\n", 1202 | " 'person organization # # th person': 992,\n", 1203 | " 'gpe organization constitutio': 993,\n", 1204 | " 'gpe organization claim': 994,\n", 1205 | " 'gpe to person as person': 995,\n", 1206 | " 'person person that person': 996,\n", 1207 | " 'gpe person to gpe to person': 997,\n", 1208 | " \"gpe person 's person in gpe\": 998,\n", 1209 | " \"< unk > person 's person\": 999,\n", 1210 | " 'person person to win person': 1000,\n", 1211 | " ...}" 1212 | ] 1213 | }, 1214 | "execution_count": 21, 1215 | "metadata": {}, 1216 | "output_type": "execute_result" 1217 | } 1218 | ], 1219 | "source": [ 1220 | "# Tokenize the input text\n", 1221 | "tokenizer = Tokenizer()\n", 1222 | "text_data = tokens #input data\n", 1223 | "tokenizer.fit_on_texts(text_data)\n", 1224 | "word_index = tokenizer.word_index\n", 1225 | "word_index" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "execution_count": null, 1231 | "metadata": { 1232 | "id": "CuVFRSfSey63" 1233 | }, 1234 | "outputs": [], 1235 | "source": [ 1236 | "#increasing length of word index to change the shape\n", 1237 | "i=1\n", 1238 | "while i<=10:\n", 1239 | " word_index['person organization gpe ',i] = 1000+i\n", 1240 | " i+=1" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "code", 1245 | "execution_count": null, 1246 | "metadata": { 1247 | "colab": { 1248 | "base_uri": "https://localhost:8080/" 1249 | }, 1250 | "id": "9L-DAcZJVUsk", 1251 | "outputId": "e619fc04-5c10-44e9-b0b5-ccb9d3f5798e" 1252 | }, 1253 | "outputs": [ 1254 | { 1255 | "data": { 1256 | "text/plain": [ 1257 | "dict" 1258 | ] 1259 | }, 1260 | "execution_count": 23, 1261 | 
"metadata": {}, 1262 | "output_type": "execute_result" 1263 | } 1264 | ], 1265 | "source": [ 1266 | "type(word_index)" 1267 | ] 1268 | }, 1269 | { 1270 | "cell_type": "code", 1271 | "execution_count": null, 1272 | "metadata": { 1273 | "colab": { 1274 | "base_uri": "https://localhost:8080/" 1275 | }, 1276 | "id": "sg83BK6BYgiy", 1277 | "outputId": "a66b3601-69c8-450c-f475-e1d039f6e83f" 1278 | }, 1279 | "outputs": [ 1280 | { 1281 | "data": { 1282 | "text/plain": [ 1283 | "1498135" 1284 | ] 1285 | }, 1286 | "execution_count": 24, 1287 | "metadata": {}, 1288 | "output_type": "execute_result" 1289 | } 1290 | ], 1291 | "source": [ 1292 | "len(word_index)" 1293 | ] 1294 | }, 1295 | { 1296 | "cell_type": "code", 1297 | "execution_count": null, 1298 | "metadata": { 1299 | "colab": { 1300 | "base_uri": "https://localhost:8080/" 1301 | }, 1302 | "id": "fqvTz9gm3d9z", 1303 | "outputId": "463520b3-6b55-4e35-c8a1-d58a74ccc6d0" 1304 | }, 1305 | "outputs": [ 1306 | { 1307 | "data": { 1308 | "text/plain": [ 1309 | "1498125" 1310 | ] 1311 | }, 1312 | "execution_count": 26, 1313 | "metadata": {}, 1314 | "output_type": "execute_result" 1315 | } 1316 | ], 1317 | "source": [ 1318 | "sorted(word_index.values())[-1]" 1319 | ] 1320 | }, 1321 | { 1322 | "cell_type": "code", 1323 | "execution_count": null, 1324 | "metadata": { 1325 | "id": "StyqJ540W0DS" 1326 | }, 1327 | "outputs": [], 1328 | "source": [ 1329 | "train_x = x\n", 1330 | "train_y = y" 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "code", 1335 | "execution_count": null, 1336 | "metadata": { 1337 | "colab": { 1338 | "base_uri": "https://localhost:8080/" 1339 | }, 1340 | "id": "8Mlka0cCWVQ_", 1341 | "outputId": "0e0afa8d-a2a1-4b50-fdbb-7e5a105af4e0" 1342 | }, 1343 | "outputs": [ 1344 | { 1345 | "name": "stdout", 1346 | "output_type": "stream", 1347 | "text": [ 1348 | "[[ 0 0 0 ... 3 73 16]\n", 1349 | " [ 0 0 0 ... 2758 62 1]\n", 1350 | " [ 0 0 0 ... 29456 6 4]\n", 1351 | " ...\n", 1352 | " [ 0 0 0 ... 
2 51 23]\n", 1353 | " [ 0 0 0 ... 5 157 39]\n", 1354 | " [ 0 0 0 ... 69 9 172]]\n", 1355 | "[[ 0 0 0 ... 3 73 16]\n", 1356 | " [ 0 0 0 ... 2758 62 1]\n", 1357 | " [ 0 0 0 ... 29456 6 4]\n", 1358 | " ...\n", 1359 | " [ 0 0 0 ... 2 51 23]\n", 1360 | " [ 0 0 0 ... 5 157 39]\n", 1361 | " [ 0 0 0 ... 69 9 172]]\n" 1362 | ] 1363 | } 1364 | ], 1365 | "source": [ 1366 | "# Convert words to word vectors\n", 1367 | "tokenizer.fit_on_texts(train_x)\n", 1368 | "sequences_x = tokenizer.texts_to_sequences(train_x)\n", 1369 | "padded_sequences_x = np.array(pad_sequences(sequences_x))\n", 1370 | "print(padded_sequences_x)\n", 1371 | "\n", 1372 | "\n", 1373 | "tokenizer.fit_on_texts(train_y)\n", 1374 | "sequences_y = tokenizer.texts_to_sequences(train_y)\n", 1375 | "padded_sequences_y = np.array(pad_sequences(sequences_y))\n", 1376 | "print(padded_sequences_x)" 1377 | ] 1378 | }, 1379 | { 1380 | "cell_type": "code", 1381 | "execution_count": null, 1382 | "metadata": { 1383 | "colab": { 1384 | "base_uri": "https://localhost:8080/" 1385 | }, 1386 | "id": "BQYnxeTSYXRD", 1387 | "outputId": "82440192-b1ee-49c8-b3e1-714dc628d532" 1388 | }, 1389 | "outputs": [ 1390 | { 1391 | "data": { 1392 | "text/plain": [ 1393 | "(1000000, 79)" 1394 | ] 1395 | }, 1396 | "execution_count": 15, 1397 | "metadata": {}, 1398 | "output_type": "execute_result" 1399 | } 1400 | ], 1401 | "source": [ 1402 | "padded_sequences_x.shape" 1403 | ] 1404 | }, 1405 | { 1406 | "cell_type": "code", 1407 | "execution_count": null, 1408 | "metadata": { 1409 | "colab": { 1410 | "base_uri": "https://localhost:8080/" 1411 | }, 1412 | "id": "brjziaamAFZ1", 1413 | "outputId": "299d8ff8-7935-4a80-808b-2a90e005ff07" 1414 | }, 1415 | "outputs": [ 1416 | { 1417 | "data": { 1418 | "text/plain": [ 1419 | "(1000000, 38)" 1420 | ] 1421 | }, 1422 | "execution_count": 22, 1423 | "metadata": {}, 1424 | "output_type": "execute_result" 1425 | } 1426 | ], 1427 | "source": [ 1428 | "padded_sequences_y.shape" 1429 | ] 1430 | }, 1431 | { 1432 | 
"cell_type": "code", 1433 | "execution_count": null, 1434 | "metadata": { 1435 | "colab": { 1436 | "base_uri": "https://localhost:8080/" 1437 | }, 1438 | "id": "uWFR8UQxtQHW", 1439 | "outputId": "20798abe-d671-4b93-f4f2-2becf74faca2" 1440 | }, 1441 | "outputs": [ 1442 | { 1443 | "data": { 1444 | "text/plain": [ 1445 | "1498125" 1446 | ] 1447 | }, 1448 | "execution_count": 16, 1449 | "metadata": {}, 1450 | "output_type": "execute_result" 1451 | } 1452 | ], 1453 | "source": [ 1454 | "len(word_index)" 1455 | ] 1456 | }, 1457 | { 1458 | "cell_type": "code", 1459 | "execution_count": null, 1460 | "metadata": { 1461 | "id": "SJNfLN0eTn1M" 1462 | }, 1463 | "outputs": [], 1464 | "source": [ 1465 | "from keras.layers.rnn.time_distributed import Layer\n", 1466 | "class attention(Layer):\n", 1467 | "\n", 1468 | " def __init__(self, return_sequences=True):\n", 1469 | " self.return_sequences = return_sequences\n", 1470 | "\n", 1471 | " super(attention,self).__init__()\n", 1472 | "\n", 1473 | " def build(self, input_shape):\n", 1474 | " self.W=self.add_weight(name=\"att_weight\", shape=(input_shape[-1],1), initializer=\"normal\")\n", 1475 | " self.b=self.add_weight(name=\"att_bias\", shape=(input_shape[1],1),initializer=\"normal\")\n", 1476 | " self.b=self.add_weight(name=\"att_bias\", shape=(input_shape[1],1),initializer=\"normal\")\n", 1477 | " self.b=self.add_weight(name=\"att_bias\", shape=(input_shape[1],1), initializer=\"normal\")\n", 1478 | "\n", 1479 | " super(attention,self).build(input_shape)\n", 1480 | "\n", 1481 | "\n", 1482 | " def call(self, x):\n", 1483 | " e = K.tanh(K.dot(x,self.W)+self.b)\n", 1484 | " a = K.softmax(e, axis=1)\n", 1485 | " output = x*a\n", 1486 | " if self.return_sequences:\n", 1487 | " return output\n", 1488 | "\n", 1489 | " return K.sum(output, axis=1)" 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": null, 1495 | "metadata": { 1496 | "id": "6tzHApGjWfHv" 1497 | }, 1498 | "outputs": [], 1499 | "source": [ 1500 
| "# Define model\n", 1501 | "model = Sequential()\n", 1502 | "model.add(Embedding(len(word_index)+1 ,\n", 1503 | " embedding_weights.shape[1],\n", 1504 | " weights=[embedding_weights],\n", 1505 | " input_length=padded_sequences_x.shape[1],\n", 1506 | " trainable=False))\n", 1507 | "model.add(Bidirectional(LSTM(32, return_sequences=True))) # encoder layer\n", 1508 | "model.add(attention(return_sequences=False)) # receives 3D (batch, time, features), outputs 2D (batch, features)\n", 1509 | "model.add(RepeatVector(padded_sequences_y.shape[1])) # repeat the context vector once per output timestep\n", 1510 | "model.add(LSTM(32, return_sequences=True)) # decoder layer\n", 1511 | "model.add(Dropout(0.2))\n", 1512 | "model.add(TimeDistributed(Dense(1, activation='softmax')))\n", 1513 | "#model.add(Dense(1, activation='softmax'))\n" 1514 | ] 1515 | }, 1516 | { 1517 | "cell_type": "code", 1518 | "execution_count": null, 1519 | "metadata": { 1520 | "colab": { 1521 | "base_uri": "https://localhost:8080/" 1522 | }, 1523 | "id": "HoAEKiD_gpdD", 1524 | "outputId": "18fadf1c-b18a-4ff2-94e2-1c5d2e6938b7" 1525 | }, 1526 | "outputs": [ 1527 | { 1528 | "name": "stdout", 1529 | "output_type": "stream", 1530 | "text": [ 1531 | "Model: \"sequential\"\n", 1532 | "_________________________________________________________________\n", 1533 | " Layer (type) Output Shape Param # \n", 1534 | "=================================================================\n", 1535 | " embedding (Embedding) (None, 79, 100) 149813600 \n", 1536 | " \n", 1537 | " bidirectional (Bidirectiona (None, 79, 64) 34048 \n", 1538 | " l) \n", 1539 | " \n", 1540 | " attention (attention) (None, 64) 143 \n", 1541 | " \n", 1542 | " repeat_vector (RepeatVector (None, 38, 64) 0 \n", 1543 | " ) \n", 1544 | " \n", 1545 | " lstm_1 (LSTM) (None, 38, 32) 12416 \n", 1546 | " \n", 1547 | " dropout (Dropout) (None, 38, 32) 0 \n", 1548 | " \n", 1549 | " time_distributed (TimeDistr (None, 38, 1) 33 \n", 1550 | " ibuted) \n", 1551 | " \n", 1552 | "=================================================================\n", 1553 | "Total 
params: 149,860,240\n", 1554 | "Trainable params: 46,640\n", 1555 | "Non-trainable params: 149,813,600\n", 1556 | "_________________________________________________________________\n" 1557 | ] 1558 | } 1559 | ], 1560 | "source": [ 1561 | "# Train the model\n", 1562 | "model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n", 1563 | "\n", 1564 | "model.summary()" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "code", 1569 | "execution_count": null, 1570 | "metadata": { 1571 | "colab": { 1572 | "base_uri": "https://localhost:8080/" 1573 | }, 1574 | "id": "NepRAjP2etAt", 1575 | "outputId": "6ae3a895-0bb7-4f99-cb5f-5584a9a760e3" 1576 | }, 1577 | "outputs": [ 1578 | { 1579 | "name": "stdout", 1580 | "output_type": "stream", 1581 | "text": [ 1582 | "(1000000, 79) (1000000, 38)\n" 1583 | ] 1584 | } 1585 | ], 1586 | "source": [ 1587 | "print(padded_sequences_x.shape , padded_sequences_y.shape)" 1588 | ] 1589 | }, 1590 | { 1591 | "cell_type": "code", 1592 | "execution_count": null, 1593 | "metadata": { 1594 | "colab": { 1595 | "base_uri": "https://localhost:8080/" 1596 | }, 1597 | "id": "8DzHVONFZObW", 1598 | "outputId": "707ddb90-e96a-43be-9891-4300f6ab0370" 1599 | }, 1600 | "outputs": [ 1601 | { 1602 | "data": { 1603 | "text/plain": [ 1604 | "" 1605 | ] 1606 | }, 1607 | "execution_count": 7, 1608 | "metadata": {}, 1609 | "output_type": "execute_result" 1610 | } 1611 | ], 1612 | "source": [ 1613 | "tf.version" 1614 | ] 1615 | }, 1616 | { 1617 | "cell_type": "code", 1618 | "execution_count": null, 1619 | "metadata": { 1620 | "id": "pt4oE_v-drY5" 1621 | }, 1622 | "outputs": [], 1623 | "source": [ 1624 | "model.fit(padded_sequences_x, padded_sequences_y, batch_size=64, epochs=15)" 1625 | ] 1626 | } 1627 | ], 1628 | "metadata": { 1629 | "colab": { 1630 | "provenance": [], 1631 | "mount_file_id": "1iXcpfTAaL-4a64dmjP351Wjkxv75qIWO", 1632 | "authorship_tag": "ABX9TyM4bxgPLBK9iMP94IFxqkPn", 1633 | "include_colab_link": true 1634 | }, 1635 | 
"kernelspec": { 1636 | "display_name": "Python 3", 1637 | "name": "python3" 1638 | }, 1639 | "language_info": { 1640 | "name": "python" 1641 | } 1642 | }, 1643 | "nbformat": 4, 1644 | "nbformat_minor": 0 1645 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Abstractive Text Summarization with Knowledge-based Word Sense Disambiguation 2 | 3 | This repository contains code and resources for abstractive text summarization (TS) using a novel framework that leverages knowledge-based word sense disambiguation (WSD) and semantic content generalization to enhance the performance of sequence-to-sequence (seq2seq) neural-based TS. 4 | 5 | ## Overview 6 | This work focuses on abstractive TS of single documents and introduces a framework that unifies the characteristics of three dominant aspects of abstractive TS: structure, semantic, and neural-based approaches. The approach combines machine learning and knowledge-based techniques to achieve this integration. 7 | ## Framework Overview 8 | The overall framework is illustrated in Figure 1. The input comprises a single-document text, along with a taxonomy of concepts T, while the output is a human-readable summary. Its main components are five, starting with WSD, whose purpose is to generalize ambiguous words. This is an important step for increasing the accuracy of content generalization that follows next and deals with OOV or rare words. Both of the aforementioned steps constitute the pre-processing phase, which is discussed in more detail in Section 4. 9 | 10 | The generalized text is subsequently mapped to a continuous vector space, using neural language processing techniques. The said vectors are then provided to a deep seq2seq model of encoder-decoder architecture, which additionally incorporates an attention mechanism. 
Once trained on a corpus of text-summary pairs, the model predicts a generalized summary for a new input. Both the vector-mapping and deep learning prediction steps are components of the machine learning phase and are further analyzed in Section 5 of the paper (`project.pdf`). 11 | 12 | 13 | 14 | ## Screenshots 15 | 16 | ![App Screenshot](https://github.com/priyansh4320/Abstractive-Text-Summarization-Enhancing-Sequence-to-Sequence-Models-Using-Word-Sense-Disambiguatio/blob/main/suvidha_foundation_project.png) 17 | 18 | 21 | ## Key Contributions 22 | - Proposes a novel framework utilizing knowledge-based WSD and semantic content generalization for abstractive TS. 23 | - Combines characteristics of structure, semantic, and neural-based approaches to unify methodologies often treated separately in the literature. 24 | - Utilizes a three-step methodology (pre-processing, machine learning, and post-processing) to generate the final summary. 25 | 26 | ## Methodology 27 | The proposed methodology comprises three main steps: 28 | 29 | 1. Pre-processing 30 | 31 | - Utilizes knowledge-based semantic ontologies and named entity recognition (NER) to achieve text generalization. 32 | - Extracts named entities, concepts, and senses from the original document. 33 | 2. Machine Learning Methodology 34 | 35 | - Utilizes a seq2seq deep learning model with an attentive encoder-decoder architecture. 36 | - Investigates five variants of the deep learning model to predict a generalized version of the summary. 37 | 3. Post-processing 38 | 39 | - Creates the final summary using heuristic algorithms and text similarity metrics. 
40 | - Matches concepts of the generalized summary to specific ones for a cohesive output. 41 | 42 | ## Experimental Results 43 | Extensive experiments were conducted on three widely used datasets: Gigaword, DUC 2004, and CNN/DailyMail. The results are promising: the approach alleviates the problem of rare and out-of-vocabulary (OOV) words and outperforms state-of-the-art seq2seq deep learning techniques. 44 | 45 | ## Datasets Used 46 | - Gigaword Dataset 47 | - DUC 2004 Dataset 48 | - CNN/DailyMail Dataset 49 | 50 | ## Reference 51 | 52 | ` 53 | @article{kouris2021abstractive, 54 | title={Abstractive Text Summarization: Enhancing Sequence-to-Sequence Models Using Word Sense Disambiguation and Semantic Content Generalization}, 55 | author={Kouris, P. and Alexandridis, G. and Stafylopatis, A.}, 56 | journal={Computational Linguistics}, volume={47}, number={4}, pages={813--859}, year={2021} 57 | } 58 | ` 59 | -------------------------------------------------------------------------------- /project.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/priyansh4320/Abstractive-Text-Summarization-Enhancing-Sequence-to-Sequence-Models-Using-Word-Sense-Disambiguatio/25c1528f50a385bba13f0cb1a21d2ee8bb80fa98/project.pdf -------------------------------------------------------------------------------- /suvidha_foundation_project.pdf .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/priyansh4320/Abstractive-Text-Summarization-Enhancing-Sequence-to-Sequence-Models-Using-Word-Sense-Disambiguatio/25c1528f50a385bba13f0cb1a21d2ee8bb80fa98/suvidha_foundation_project.pdf .png -------------------------------------------------------------------------------- /suvidha_foundation_project.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/priyansh4320/Abstractive-Text-Summarization-Enhancing-Sequence-to-Sequence-Models-Using-Word-Sense-Disambiguatio/25c1528f50a385bba13f0cb1a21d2ee8bb80fa98/suvidha_foundation_project.png --------------------------------------------------------------------------------
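The `attention` layer in `Attention_Based_seq2se1_model.ipynb` scores each encoder timestep with `tanh(x·W + b)`, normalizes the scores with a softmax over the time axis, and, when `return_sequences=False`, sums the weighted timesteps into a single context vector. The following is a minimal pure-Python sketch of that computation for one sequence. The function name `attention_pool` and the plain-list inputs are illustrative assumptions, not part of the notebook; the sketch is meant to explain the arithmetic, not to replace the Keras layer.

```python
import math


def attention_pool(x, w, b):
    """Pool a sequence of feature vectors into one context vector.

    x: list of T timesteps, each a list of D floats
    w: list of D floats (stands in for the layer's (D, 1) weight)
    b: list of T floats (stands in for the layer's (T, 1) bias)
    Returns (weights, context): T attention weights and a D-dim vector.
    """
    # Score each timestep: e_t = tanh(x_t . w + b_t)
    scores = [
        math.tanh(sum(xi * wi for xi, wi in zip(x_t, w)) + b_t)
        for x_t, b_t in zip(x, b)
    ]
    # Softmax over the time axis (shifted by the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over time yields a single D-dimensional context vector
    dim = len(x[0])
    context = [
        sum(a * x_t[d] for a, x_t in zip(weights, x)) for d in range(dim)
    ]
    return weights, context


# Hypothetical toy input: two timesteps, two features, zero weight and bias
weights, context = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], [0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(context)  # [2.0, 3.0]
```

With zero weights and biases every timestep receives the same score, so the attention weights are uniform and the context vector is the plain average of the timesteps. Learned values of `W` and `b` shift that average toward the timesteps the model finds most relevant.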