├── README.md ├── RECENT_Notebook.ipynb ├── bidirectional_lstm_model.py ├── cleaned_bbc_news.csv ├── data_shaper.py ├── text_cleaner.py └── text_summary_model.py /README.md: --------------------------------------------------------------------------------
# Bidirectional LSTM: Abstractive Text Summarization
## Context-Aware Abstractive Text Summarization with a Bidirectional LSTM
![](http://abigailsee.com/img/pointer-gen.png)

## Introduction

Extractive summarization builds a summary by selecting the sentences that look most important in the text to be summarized. Its advantages and disadvantages are as follows.

- Pros: Because the summary is assembled from sentences taken verbatim from the original text, it rarely goes completely off topic and it never produces grammatically broken sentences.

- Cons: Because it cannot use words that do not appear in the source, it cannot abstract, paraphrase, or add conjunctions to make the summary easier to read, so the result tends to feel crude.

Abstractive summarization, by contrast, generates new sentences, and that is the approach taken here: I built a seq2seq bidirectional LSTM for the text summarization task. LSTM with attention is another major method for summarization.

## Technical Preferences

| Title | Detail |
|:-----------:|:------------------------------------------------|
| Environment | macOS Mojave 10.14.3 |
| Language | Python |
| Library | Keras, scikit-learn, NumPy, Matplotlib, Pandas, Seaborn |
| Dataset | [BBC Datasets](http://mlg.ucd.ie/datasets/bbc.html) |
| Algorithm | Encoder-Decoder LSTM |

## References

- [Get To The Point: Summarization with Pointer-Generator Networks](https://nlp.stanford.edu/pubs/see2017get.pdf)
- [Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization](https://arxiv.org/pdf/1809.06662.pdf)
- [Taming Recurrent Neural Networks for Better Summarization](http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html)
- [Encoder-Decoder Models for Text Summarization in Keras](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)
- [Text Summarization Using Keras Models](https://hackernoon.com/text-summarization-using-keras-models-366b002408d9)
- [Text Summarization in Python: Extractive vs. Abstractive techniques revisited](https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/)
- [Text Summarization for the Age of Large-Scale NLP (Qiita, in Japanese)](https://qiita.com/icoxfog417/items/d06651db10e27220c819)
--------------------------------------------------------------------------------
/RECENT_Notebook.ipynb: --------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "k9-Z82i495Gl" 8 | }, 9 | "source": [ 10 | "# Seq2Seq: Text Summarization with Keras\n", 11 | "#### Text Summarization with Deep Learning\n", 12 | "![](http://abigailsee.com/img/pointer-gen.png)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": { 18 | "colab_type": "text", 19 | "id": "pw_oyfrE95Go" 20 | }, 21 | "source": [ 22 | "## Process\n", 23 | "1. Preprocessing\n", 24 | "2. Word2vec\n", 25 | "3. Building Seq2Seq Architecture\n", 26 | "4. Training with BBC article&summary Dataset\n", 27 | "5. Generate Summary with my_summarizer" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Step 1. 
Import Data" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "E2QnkkQv95Gq" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "import os\n", 49 | "import pandas as pd\n", 50 | "import re" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "pwd" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "colab": {}, 67 | "colab_type": "code", 68 | "id": "QeKq5tai95Gu" 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "news_category = [\"business\", \"entertainment\", \"politics\", \"sport\", \"tech\"]\n", 73 | "\n", 74 | "row_doc = \"/Users/akr712/Desktop/文章要約/Row News Articles/\"\n", 75 | "summary_doc = \"/Users/akr712/Desktop/文章要約/Summaries/\"\n", 76 | "\n", 77 | "data={\"articles\":[], \"summaries\":[]}" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "colab": {}, 85 | "colab_type": "code", 86 | "id": "682uVCoC95Gy" 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "import os\n", 91 | "directories = {\"news\": row_doc, \"summary\": summary_doc}\n", 92 | "row_dict = {}\n", 93 | "sum_dict = {}\n", 94 | "\n", 95 | "for path in directories.values():\n", 96 | " if path == row_doc:\n", 97 | " file_dict = row_dict\n", 98 | " else:\n", 99 | " file_dict = sum_dict\n", 100 | " dire = path\n", 101 | " for cat in news_category:\n", 102 | " category = cat\n", 103 | " files = os.listdir(dire + category)\n", 104 | " file_dict[cat] = files" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": { 111 | "colab": {}, 112 | "colab_type": "code", 113 | "id": "gM75fs9g95Hc" 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "row_data = {}\n", 118 | "for cat in row_dict.keys():\n", 119 | " cat_dict = {}\n", 120 | " # row_data_frame[cat] = []\n", 121 | " for i in range(0, len(row_dict[cat])):\n", 122 | " filename = row_dict[cat][i]\n", 123 | " path = row_doc + cat + \"/\" + filename\n", 124 | " with open(path, \"rb\") as f: \n", 125 | " text = f.read()\n", 126 | " cat_dict[filename[:3]] = text\n", 127 | " row_data[cat] = cat_dict" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": { 134 | "colab": {}, 135 | "colab_type": "code", 136 | "id": "MjPzSMPT95Hg" 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "sum_data = {}\n", 141 | "for cat in sum_dict.keys():\n", 142 | " cat_dict = {}\n", 143 | " # row_data_frame[cat] = []\n", 144 | " for i in range(0, len(sum_dict[cat])):\n", 145 | " filename = sum_dict[cat][i]\n", 146 | " path = summary_doc + cat + \"/\" + filename\n", 147 | " with open(path, \"rb\") as f: \n", 148 | " text = f.read()\n", 149 | " cat_dict[filename[:3]] = text\n", 150 | " sum_data[cat] = cat_dict" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "colab": {}, 158 | "colab_type": "code", 159 | "id": "du70GJ7o95H0", 160 | "outputId": "2f05e596-129f-489e-8fcd-e789f870fe9e", 161 | "scrolled": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "news_business = pd.DataFrame.from_dict(row_data[\"business\"], orient=\"index\", columns=[\"row_article\"])\n", 166 | "news_business.head()" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "colab": {}, 174 | 
"colab_type": "code", 175 | "id": "ZKbrhIYw95H8" 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "# news_category = [\"business\", \"entertainment\", \"politics\", \"sport\", \"tech\"]\n", 180 | "news_entertainment = pd.DataFrame.from_dict(row_data[\"entertainment\"], orient=\"index\", columns=[\"row_article\"])\n", 181 | "news_politics = pd.DataFrame.from_dict(row_data[\"politics\"], orient=\"index\", columns=[\"row_article\"])\n", 182 | "news_sport = pd.DataFrame.from_dict(row_data[\"sport\"], orient=\"index\", columns=[\"row_article\"])\n", 183 | "news_tech = pd.DataFrame.from_dict(row_data[\"tech\"], orient=\"index\", columns=[\"row_article\"])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "colab": {}, 191 | "colab_type": "code", 192 | "id": "KrSmXmON95IE" 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "# summary data\n", 197 | "summary_business = pd.DataFrame.from_dict(sum_data[\"business\"], orient=\"index\", columns=[\"summary\"])\n", 198 | "summary_entertainment = pd.DataFrame.from_dict(sum_data[\"entertainment\"], orient=\"index\", columns=[\"summary\"])\n", 199 | "summary_politics = pd.DataFrame.from_dict(sum_data[\"politics\"], orient=\"index\", columns=[\"summary\"])\n", 200 | "summary_sport = pd.DataFrame.from_dict(sum_data[\"sport\"], orient=\"index\", columns=[\"summary\"])\n", 201 | "summary_tech = pd.DataFrame.from_dict(sum_data[\"tech\"], orient=\"index\", columns=[\"summary\"])" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "colab": {}, 209 | "colab_type": "code", 210 | "id": "UNL3-2xL95IJ", 211 | "outputId": "8dd40f58-2921-4c73-fd62-6b5b9e673169" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "summary_business.head()" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "colab": {}, 223 | "colab_type": "code", 224 | "id": "Aq407qkG95IO" 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "business = news_business.join(summary_business, how='inner')\n", 229 | "entertainment = news_entertainment.join(summary_entertainment, how='inner')\n", 230 | "politics = news_politics.join(summary_politics, how='inner')\n", 231 | "sport = news_sport.join(summary_sport, how='inner')\n", 232 | "tech = news_tech.join(summary_tech, how='inner')" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "colab": {}, 240 | "colab_type": "code", 241 | "id": "CKbjkn-r95IR" 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "business = news_business.join(summary_business, how='inner')" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "colab": {}, 253 | "colab_type": "code", 254 | "id": "fKfHtKbL95IW", 255 | "outputId": "11b33d28-b909-4924-8e97-3e978c473fe8" 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "business.head()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "colab": {}, 267 | "colab_type": "code", 268 | "id": "_sUHqLUJ95Ib", 269 | "outputId": "7b492210-e0b1-4622-fcba-9c0f730ca3cd" 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "print(\"row\", len(business.iloc[0,0]))\n", 274 | "print(\"sum\", len(business.iloc[0,1]))" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": { 281 | "colab": {}, 282 | "colab_type": "code", 283 | "id": "VthiJz-h95Ij" 284 | 
}, 285 | "outputs": [], 286 | "source": [ 287 | "list_df = [business, entertainment, politics, sport, tech]\n", 288 | "length = 0\n", 289 | "for df in list_df:\n", 290 | " length += len(df)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 31, 296 | "metadata": { 297 | "colab": {}, 298 | "colab_type": "code", 299 | "id": "hj44ZFO-95Iu", 300 | "outputId": "f5649126-2cf2-45ad-cc77-879985450f5b" 301 | }, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "length of all data: 2225\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "print(\"length of all data: \", length)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 32, 318 | "metadata": { 319 | "colab": {}, 320 | "colab_type": "code", 321 | "id": "vFtKW1DS95I2", 322 | "outputId": "f9e46aae-aa16-4415-8302-ddd47bd220ce" 323 | }, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "2225" 329 | ] 330 | }, 331 | "execution_count": 32, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "bbc_df = pd.concat([business, entertainment, politics, sport, tech], ignore_index=True)\n", 338 | "len(bbc_df)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": { 344 | "colab_type": "text", 345 | "id": "xogCv95T95JF" 346 | }, 347 | "source": [ 348 | "## Step 2. Preprocessing Text Data\n", 349 | "1. Clean Text\n", 350 | "2. Tokenize\n", 351 | "3. Vocabulary\n", 352 | "4. Padding\n", 353 | "5. One-Hot Encoding\n", 354 | "6. Reshape to (MAX_LEN, One-Hot Encoding DIM)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": { 360 | "colab_type": "text", 361 | "id": "y2un_KkA95JU" 362 | }, 363 | "source": [ 364 | "### 2-1. 
Clean Text" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 33, 370 | "metadata": { 371 | "colab": {}, 372 | "colab_type": "code", 373 | "id": "CxQY2wvq95JJ" 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "def cleantext(text):\n", 378 | " text = str(text)\n", 379 | " text=text.split()\n", 380 | " words=[]\n", 381 | " for t in text:\n", 382 | " if t.isalpha():\n", 383 | " words.append(t)\n", 384 | " text=\" \".join(words)\n", 385 | " text=text.lower()\n", 386 | " text=re.sub(r\"what's\",\"what is \",text)\n", 387 | " text=re.sub(r\"it's\",\"it is \",text)\n", 388 | " text=re.sub(r\"\\'ve\",\" have \",text)\n", 389 | " text=re.sub(r\"i'm\",\"i am \",text)\n", 390 | " text=re.sub(r\"\\'re\",\" are \",text)\n", 391 | " text=re.sub(r\"n't\",\" not \",text)\n", 392 | " text=re.sub(r\"\\'d\",\" would \",text)\n", 393 | " text=re.sub(r\"\\'s\",\"s\",text)\n", 394 | " text=re.sub(r\"\\'ll\",\" will \",text)\n", 395 | " text=re.sub(r\"can't\",\" cannot \",text)\n", 396 | " text=re.sub(r\" e g \",\" eg \",text)\n", 397 | " text=re.sub(r\"e-mail\",\"email\",text)\n", 398 | " text=re.sub(r\"9\\\\/11\",\" 911 \",text)\n", 399 | " text=re.sub(r\" u.s\",\" american \",text)\n", 400 | " text=re.sub(r\" u.n\",\" united nations \",text)\n", 401 | " text=re.sub(r\"\\n\",\" \",text)\n", 402 | " text=re.sub(r\":\",\" \",text)\n", 403 | " text=re.sub(r\"-\",\" \",text)\n", 404 | " text=re.sub(r\"\\_\",\" \",text)\n", 405 | " text=re.sub(r\"\\d+\",\" \",text)\n", 406 | " text=re.sub(r\"[$#@%&*!~?%{}()]\",\" \",text)\n", 407 | " \n", 408 | " return text" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 34, 414 | "metadata": { 415 | "colab": {}, 416 | "colab_type": "code", 417 | "id": "K7ovvCt495JL" 418 | }, 419 | "outputs": [], 420 | "source": [ 421 | "for col in bbc_df.columns:\n", 422 | " bbc_df[col] = bbc_df[col].apply(lambda x: cleantext(x))" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 35, 428 | "metadata": { 429 | "colab": {}, 430 | "colab_type": "code", 431 | "id": "HwbxBACO95JP", 432 | "outputId": "8b9cf24a-1714-47bf-944b-4aaa54ab78c4" 433 | }, 434 | "outputs": [ 435 | { 436 | "data": { 437 | "text/html": [ 438 | "
\n", 439 | "\n", 452 | "\n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | "
row_articlesummary
0rise on new man utd in manchester united close...united revealed on sunday that it had received...
1confidence dips in confidence among japanese m...confidence among japanese manufacturers has we...
2makes new man utd glazer has made a fresh appr...glazer has made a fresh approach to buy manche...
3hope for borussia in struggling german footbal...in struggling german football club borussia do...
4airlines hit tunnel operator eurotunnel has se...firm said sales were down in to euros the mome...
\n", 488 | "
" 489 | ], 490 | "text/plain": [ 491 | " row_article \\\n", 492 | "0 rise on new man utd in manchester united close... \n", 493 | "1 confidence dips in confidence among japanese m... \n", 494 | "2 makes new man utd glazer has made a fresh appr... \n", 495 | "3 hope for borussia in struggling german footbal... \n", 496 | "4 airlines hit tunnel operator eurotunnel has se... \n", 497 | "\n", 498 | " summary \n", 499 | "0 united revealed on sunday that it had received... \n", 500 | "1 confidence among japanese manufacturers has we... \n", 501 | "2 glazer has made a fresh approach to buy manche... \n", 502 | "3 in struggling german football club borussia do... \n", 503 | "4 firm said sales were down in to euros the mome... " 504 | ] 505 | }, 506 | "execution_count": 35, 507 | "metadata": {}, 508 | "output_type": "execute_result" 509 | } 510 | ], 511 | "source": [ 512 | "bbc_df.head()" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 36, 518 | "metadata": { 519 | "colab": { 520 | "base_uri": "https://localhost:8080/", 521 | "height": 195 522 | }, 523 | "colab_type": "code", 524 | "id": "JBPbkd8pIORs", 525 | "outputId": "51066e27-b9c1-467e-845a-b162e813ce2f" 526 | }, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/html": [ 531 | "
\n", 532 | "\n", 545 | "\n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | "
row_articlesummary
229b'Robotic pods take on car design\\n\\nA new bre...b'Dr Erel Avineri, of the Centre for Transport...
295b'More power to the people says HP\\n\\nThe digi...b'She said the goal for 2005 was to make peopl...
219b\"Disney backs Sony DVD technology\\n\\nA next g...b\"Film giant Disney says it will produce its f...
249b'Gamer buys $26,500 virtual land\\n\\nA 22-year...b'The land exists within the game Project Entr...
272b'Windows worm travels with Tetris\\n\\nUsers ar...b'The Cellery worm installs a playable version...
\n", 581 | "
" 582 | ], 583 | "text/plain": [ 584 | " row_article \\\n", 585 | "229 b'Robotic pods take on car design\\n\\nA new bre... \n", 586 | "295 b'More power to the people says HP\\n\\nThe digi... \n", 587 | "219 b\"Disney backs Sony DVD technology\\n\\nA next g... \n", 588 | "249 b'Gamer buys $26,500 virtual land\\n\\nA 22-year... \n", 589 | "272 b'Windows worm travels with Tetris\\n\\nUsers ar... \n", 590 | "\n", 591 | " summary \n", 592 | "229 b'Dr Erel Avineri, of the Centre for Transport... \n", 593 | "295 b'She said the goal for 2005 was to make peopl... \n", 594 | "219 b\"Film giant Disney says it will produce its f... \n", 595 | "249 b'The land exists within the game Project Entr... \n", 596 | "272 b'The Cellery worm installs a playable version... " 597 | ] 598 | }, 599 | "execution_count": 36, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "df.head()" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 38, 611 | "metadata": { 612 | "colab": { 613 | "base_uri": "https://localhost:8080/", 614 | "height": 34 615 | }, 616 | "colab_type": "code", 617 | "id": "YA9tvTPUIciB", 618 | "outputId": "1780794c-4748-49ba-8669-c75b875347d4" 619 | }, 620 | "outputs": [ 621 | { 622 | "data": { 623 | "text/plain": [ 624 | "2969" 625 | ] 626 | }, 627 | "execution_count": 38, 628 | "metadata": {}, 629 | "output_type": "execute_result" 630 | } 631 | ], 632 | "source": [ 633 | "len_list =[]\n", 634 | "for article in df.row_article:\n", 635 | " words = article.split()\n", 636 | " length = len(words)\n", 637 | " len_list.append(length)\n", 638 | "max(len_list)" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": { 644 | "colab": {}, 645 | "colab_type": "code", 646 | "id": "jO14nm8JJPzZ" 647 | }, 648 | "source": [ 649 | "### 2-2. Tokenizer\n", 650 | "1. Tokenize and One-Hot : Tokenizer\n", 651 | "2. Vocabraly: article and summary 15000 words \n", 652 | "3. Padding: pad_sequences 1000 max_len\n", 653 | "4. Reshape: manual max_len * one-hot matrix " 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 1, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "import numpy as np\n", 663 | "import os\n", 664 | "import pandas as pd\n", 665 | "import re" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 2, 671 | "metadata": {}, 672 | "outputs": [ 673 | { 674 | "data": { 675 | "text/html": [ 676 | "
\n", 677 | "\n", 690 | "\n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | "
row_articlesummary
0continues rapid economy has expanded by a brea...overall investment in fixed assets was still u...
1deccan seals deccan has ordered airbus planes ...government has given its backing to cheaper an...
2job growth continues in us created fewer jobs ...creation was one of last main concerns for the...
3owner buys rival for retail giant federated de...retail giant federated department stores is to...
4secures giant japan is to supply japan airline...chose the after carefully considering both it ...
\n", 726 | "
" 727 | ], 728 | "text/plain": [ 729 | " row_article \\\n", 730 | "0 continues rapid economy has expanded by a brea... \n", 731 | "1 deccan seals deccan has ordered airbus planes ... \n", 732 | "2 job growth continues in us created fewer jobs ... \n", 733 | "3 owner buys rival for retail giant federated de... \n", 734 | "4 secures giant japan is to supply japan airline... \n", 735 | "\n", 736 | " summary \n", 737 | "0 overall investment in fixed assets was still u... \n", 738 | "1 government has given its backing to cheaper an... \n", 739 | "2 creation was one of last main concerns for the... \n", 740 | "3 retail giant federated department stores is to... \n", 741 | "4 chose the after carefully considering both it ... " 742 | ] 743 | }, 744 | "execution_count": 2, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "bbc_art_sum = pd.read_csv(\"cleaned_bbc_news.csv\")\n", 751 | "bbc_art_sum.drop(\"Unnamed: 0\", axis=1, inplace=True)\n", 752 | "bbc_art_sum.head()" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 3, 758 | "metadata": {}, 759 | "outputs": [], 760 | "source": [ 761 | "articles = list(bbc_art_sum.row_article)\n", 762 | "summaries = list(bbc_art_sum.summary)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "### 2-2-1. Tokenize: text_to_word_sequence" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 4, 775 | "metadata": { 776 | "colab": { 777 | "base_uri": "https://localhost:8080/", 778 | "height": 34 779 | }, 780 | "colab_type": "code", 781 | "id": "W9aALHY_YbUN", 782 | "outputId": "b228cf03-c36b-413a-d3c8-c9cb6fff1b36" 783 | }, 784 | "outputs": [ 785 | { 786 | "name": "stderr", 787 | "output_type": "stream", 788 | "text": [ 789 | "Using TensorFlow backend.\n" 790 | ] 791 | }, 792 | { 793 | "data": { 794 | "text/plain": [ 795 | "23914" 796 | ] 797 | }, 798 | "execution_count": 4, 799 | "metadata": {}, 800 | "output_type": "execute_result" 801 | } 802 | ], 803 | "source": [ 804 | "from keras.preprocessing.text import Tokenizer\n", 805 | "VOCAB_SIZE = 14999\n", 806 | "tokenizer = Tokenizer(num_words=VOCAB_SIZE)\n", 807 | "tokenizer.fit_on_texts(articles)\n", 808 | "article_sequences = tokenizer.texts_to_sequences(articles)\n", 809 | "art_word_index = tokenizer.word_index\n", 810 | "len(art_word_index)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 5, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "name": "stdout", 820 | "output_type": "stream", 821 | "text": [ 822 | "[1411, 2338, 248, 16, 3994, 22, 5, 6483, 165, 1359, 50, 966, 4, 120, 967, 176, 118, 505, 38, 2339]\n", 823 | "[5211, 8881, 5211, 16, 2233, 3001, 3441, 6, 5, 217, 18, 60, 1270, 7874, 6, 1, 827, 5211, 11, 108]\n", 824 | "[478, 196, 1411, 6, 54, 736, 2283, 498, 50, 164, 6, 24, 349, 17, 9, 1, 3322, 6, 5213, 11]\n" 825 | ] 826 | } 827 | ], 828 | "source": [ 829 | "print(article_sequences[0][:20])\n", 830 | "print(article_sequences[1][:20])\n", 831 | "print(article_sequences[2][:20])" 832 | ] 833 | }, 834 | { 835 | "cell_type": "markdown", 836 | "metadata": {}, 837 | "source": [ 838 | "### 2-2-2. 
Vocabulary: article and summary 15000 words" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": 6, 844 | "metadata": {}, 845 | "outputs": [], 846 | "source": [ 847 | "art_word_index_1500 = {}\n", 848 | "counter = 0\n", 849 | "for word in art_word_index.keys():\n", 850 | " if art_word_index[word] == 0:\n", 851 | " print(\"found 0!\")\n", 852 | " break\n", 853 | " if art_word_index[word] > VOCAB_SIZE:\n", 854 | " continue\n", 855 | " else:\n", 856 | " art_word_index_1500[word] = art_word_index[word]\n", 857 | " counter += 1" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 7, 863 | "metadata": {}, 864 | "outputs": [ 865 | { 866 | "data": { 867 | "text/plain": [ 868 | "14999" 869 | ] 870 | }, 871 | "execution_count": 7, 872 | "metadata": {}, 873 | "output_type": "execute_result" 874 | } 875 | ], 876 | "source": [ 877 | "counter" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 8, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "data": { 887 | "text/plain": [ 888 | "23929" 889 | ] 890 | }, 891 | "execution_count": 8, 892 | "metadata": {}, 893 | "output_type": "execute_result" 894 | } 895 | ], 896 | "source": [ 897 | "tokenizer.fit_on_texts(summaries)\n", 898 | "summary_sequences = tokenizer.texts_to_sequences(summaries)\n", 899 | "sum_word_index = tokenizer.word_index\n", 900 | "len(sum_word_index)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 9, 906 | "metadata": {}, 907 | "outputs": [], 908 | "source": [ 909 | "sum_word_index_1500 = {}\n", 910 | "counter = 0\n", 911 | "for word in sum_word_index.keys():\n", 912 | " if sum_word_index[word] == 0:\n", 913 | " print(\"found 0!\")\n", 914 | " break\n", 915 | " if sum_word_index[word] > VOCAB_SIZE:\n", 916 | " continue\n", 917 | " else:\n", 918 | " sum_word_index_1500[word] = sum_word_index[word]\n", 919 | " counter += 1" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": 10, 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "data": { 929 | "text/plain": [ 930 | "14999" 931 | ] 932 | }, 933 | "execution_count": 10, 934 | "metadata": {}, 935 | "output_type": "execute_result" 936 | } 937 | ], 938 | "source": [ 939 | "counter" 940 | ] 941 | }, 942 | { 943 | "cell_type": "markdown", 944 | "metadata": {}, 945 | "source": [ 946 | "### 2-2-3. 
Padding: pad_sequences 1000 max_len" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 11, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "from keras.preprocessing.sequence import pad_sequences\n", 956 | "MAX_LEN = 1000\n", 957 | "pad_art_sequences = pad_sequences(article_sequences, maxlen=MAX_LEN, padding='post', truncating='post')" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 12, 963 | "metadata": {}, 964 | "outputs": [ 965 | { 966 | "name": "stdout", 967 | "output_type": "stream", 968 | "text": [ 969 | "243 1000\n" 970 | ] 971 | } 972 | ], 973 | "source": [ 974 | "print(len(article_sequences[1]), len(pad_art_sequences[1]))" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 13, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "pad_sum_sequences = pad_sequences(summary_sequences, maxlen=MAX_LEN, padding='post', truncating='post')" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 14, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "name": "stdout", 993 | "output_type": "stream", 994 | "text": [ 995 | "90 1000\n" 996 | ] 997 | } 998 | ], 999 | "source": [ 1000 | "print(len(summary_sequences[1]), len(pad_sum_sequences[1]))" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": 15, 1006 | "metadata": {}, 1007 | "outputs": [ 1008 | { 1009 | "data": { 1010 | "text/plain": [ 1011 | "(2225, 1000)" 1012 | ] 1013 | }, 1014 | "execution_count": 15, 1015 | "metadata": {}, 1016 | "output_type": "execute_result" 1017 | } 1018 | ], 1019 | "source": [ 1020 | "pad_art_sequences.shape" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": 16, 1026 | "metadata": {}, 1027 | "outputs": [ 1028 | { 1029 | "data": { 1030 | "text/plain": [ 1031 | "array([[1411, 2338, 248, ..., 0, 0, 0],\n", 1032 | " [5211, 8881, 5211, ..., 0, 0, 0],\n", 1033 | " [ 478, 196, 1411, ..., 0, 0, 0],\n", 1034 | " ...,\n", 1035 | " [ 421, 1337, 2012, ..., 0, 0, 0],\n", 1036 | " [2164, 267, 1109, ..., 0, 0, 0],\n", 1037 | " [ 7, 284, 8, ..., 0, 0, 0]], dtype=int32)" 1038 | ] 1039 | }, 1040 | "execution_count": 16, 1041 | "metadata": {}, 1042 | "output_type": "execute_result" 1043 | } 1044 | ], 1045 | "source": [ 1046 | "pad_art_sequences" 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "markdown", 1051 | "metadata": {}, 1052 | "source": [ 1053 | "### 2-2-4. 
Reshape: manual max_len * one-hot matrix" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": null, 1059 | "metadata": {}, 1060 | "outputs": [], 1061 | "source": [ 1062 | "# unused\n", 1063 | "\"\"\"\n", 1064 | "encoder_inputs = np.zeros((2225, 1000), dtype='float32')\n", 1065 | "encoder_inputs.shape\n", 1066 | "\n", 1067 | "decoder_inputs = np.zeros((2225, 1000), dtype='float32')\n", 1068 | "decoder_inputs.shape\n", 1069 | "\n", 1070 | "for i, seqs in enumerate(pad_art_sequences):\n", 1071 | " for j, seq in enumerate(seqs):\n", 1072 | " encoder_inputs[i, j] = seq\n", 1073 | " \n", 1074 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1075 | " for j, seq in enumerate(seqs):\n", 1076 | " decoder_inputs[i, j] = seq\n", 1077 | "\"\"\"" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "code", 1082 | "execution_count": 55, 1083 | "metadata": {}, 1084 | "outputs": [ 1085 | { 1086 | "data": { 1087 | "text/plain": [ 1088 | "(2225, 1000, 15000)" 1089 | ] 1090 | }, 1091 | "execution_count": 55, 1092 | "metadata": {}, 1093 | "output_type": "execute_result" 1094 | } 1095 | ], 1096 | "source": [ 1097 | "decoder_outputs = np.zeros((2225, 1000, 15000), dtype='float32')\n", 1098 | "decoder_outputs.shape" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": 56, 1104 | "metadata": {}, 1105 | "outputs": [], 1106 | "source": [ 1107 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1108 | " for j, seq in enumerate(seqs):\n", 1109 | " decoder_outputs[i, j, seq] = 1." 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": 59, 1115 | "metadata": {}, 1116 | "outputs": [ 1117 | { 1118 | "data": { 1119 | "text/plain": [ 1120 | "(2225, 1000, 15000)" 1121 | ] 1122 | }, 1123 | "execution_count": 59, 1124 | "metadata": {}, 1125 | "output_type": "execute_result" 1126 | } 1127 | ], 1128 | "source": [ 1129 | "decoder_outputs.shape" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "markdown", 1134 | "metadata": {}, 1135 | "source": [ 1136 | "### 2-2-5. Pre-trained GloVe Embeddings and Embedding Matrix" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": 20, 1142 | "metadata": { 1143 | "colab": { 1144 | "base_uri": "https://localhost:8080/", 1145 | "height": 34 1146 | }, 1147 | "colab_type": "code", 1148 | "id": "sokvMj0Xfnjw", 1149 | "outputId": "3ef40349-e3e7-4f25-d263-7d316d21f829" 1150 | }, 1151 | "outputs": [ 1152 | { 1153 | "name": "stdout", 1154 | "output_type": "stream", 1155 | "text": [ 1156 | "Found 400000 word vectors.\n" 1157 | ] 1158 | } 1159 | ], 1160 | "source": [ 1161 | "embeddings_index = {}\n", 1162 | "with open('glove.6B.200d.txt', encoding='utf-8') as f:\n", 1163 | " for line in f:\n", 1164 | " values = line.split()\n", 1165 | " word = values[0]\n", 1166 | " coefs = np.asarray(values[1:], dtype='float32')\n", 1167 | " embeddings_index[word] = coefs\n", 1168 | "# the file is closed automatically by the with block\n", 1169 | "\n", 1170 | "print('Found %s word vectors.' 
% len(embeddings_index))" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "code", 1175 | "execution_count": 21, 1176 | "metadata": { 1177 | "colab": {}, 1178 | "colab_type": "code", 1179 | "id": "kdx3r2rvZAwN" 1180 | }, 1181 | "outputs": [], 1182 | "source": [ 1183 | "def embedding_matrix_creater(embedding_dimention, word_index):\n", 1184 | " embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))\n", 1185 | " for word, i in word_index.items():\n", 1186 | " embedding_vector = embeddings_index.get(word)\n", 1187 | " if embedding_vector is not None:\n", 1188 | " # words not found in embedding index will be all-zeros.\n", 1189 | " embedding_matrix[i] = embedding_vector\n", 1190 | " return embedding_matrix" 1191 | ] 1192 | }, 1193 | { 1194 | "cell_type": "code", 1195 | "execution_count": 22, 1196 | "metadata": { 1197 | "colab": { 1198 | "base_uri": "https://localhost:8080/", 1199 | "height": 34 1200 | }, 1201 | "colab_type": "code", 1202 | "id": "VB6UmXeqWleO", 1203 | "outputId": "59e76cc0-91a2-454b-a61e-f7cb1b2cf59a" 1204 | }, 1205 | "outputs": [ 1206 | { 1207 | "data": { 1208 | "text/plain": [ 1209 | "(15000, 200)" 1210 | ] 1211 | }, 1212 | "execution_count": 22, 1213 | "metadata": {}, 1214 | "output_type": "execute_result" 1215 | } 1216 | ], 1217 | "source": [ 1218 | "art_embedding_matrix = embedding_matrix_creater(200, word_index=art_word_index_1500)\n", 1219 | "art_embedding_matrix.shape" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 23, 1225 | "metadata": { 1226 | "colab": { 1227 | "base_uri": "https://localhost:8080/", 1228 | "height": 34 1229 | }, 1230 | "colab_type": "code", 1231 | "id": "1KgrdRFvb7A1", 1232 | "outputId": "e01036f5-476b-46eb-bec7-1ad263ba9dd0" 1233 | }, 1234 | "outputs": [ 1235 | { 1236 | "data": { 1237 | "text/plain": [ 1238 | "(15000, 200)" 1239 | ] 1240 | }, 1241 | "execution_count": 23, 1242 | "metadata": {}, 1243 | "output_type": "execute_result" 1244 | } 1245 | ], 1246 | "source": [ 1247 | "sum_embedding_matrix = embedding_matrix_creater(200, word_index=sum_word_index_1500)\n", 1248 | "sum_embedding_matrix.shape" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": 24, 1254 | "metadata": { 1255 | "colab": {}, 1256 | "colab_type": "code", 1257 | "id": "vuawT3rXUIBY" 1258 | }, 1259 | "outputs": [], 1260 | "source": [ 1261 | "encoder_embedding_layer = Embedding(input_dim = 15000, \n", 1262 | " output_dim = 200,\n", 1263 | " input_length = MAX_LEN,\n", 1264 | " weights = [art_embedding_matrix],\n", 1265 | " trainable = False)" 1266 | ] 1267 | }, 1268 | { 1269 | "cell_type": "code", 1270 | "execution_count": 25, 1271 | "metadata": { 1272 | "colab": {}, 1273 | "colab_type": "code", 1274 | "id": "Wf3EzUZcIccZ" 1275 | }, 1276 | "outputs": [], 1277 | "source": [ 1278 | "decoder_embedding_layer = Embedding(input_dim = 15000, \n", 1279 | " output_dim = 200,\n", 1280 | " input_length = MAX_LEN,\n", 1281 | " weights = [sum_embedding_matrix],\n", 1282 | " trainable = False)" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "execution_count": 26, 1288 | "metadata": { 1289 | "colab": { 1290 | "base_uri": "https://localhost:8080/", 1291 | "height": 34 1292 | }, 1293 | "colab_type": "code", 1294 | "id": "J5acWwCxIcZz", 1295 | "outputId": "a92cd279-b2de-4110-fd1d-35e13e90c099" 1296 | }, 1297 | "outputs": [ 1298 | { 1299 | "data": { 1300 | "text/plain": [ 1301 | "(15000, 200)" 1302 | ] 1303 | }, 1304 | "execution_count": 26, 1305 | "metadata": {}, 1306 | "output_type": "execute_result" 
1307 | } 1308 | ], 1309 | "source": [ 1310 | "sum_embedding_matrix.shape" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": { 1316 | "colab_type": "text", 1317 | "id": "AB663AaY0PKH" 1318 | }, 1319 | "source": [ 1320 | "## Step 3. Building Encoder-Decoder Model" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": 17, 1326 | "metadata": { 1327 | "colab": {}, 1328 | "colab_type": "code", 1329 | "id": "W-kZT0pQRqRl" 1330 | }, 1331 | "outputs": [], 1332 | "source": [ 1333 | "from numpy.random import seed\n", 1334 | "seed(1)\n", 1335 | "\n", 1336 | "from sklearn.model_selection import train_test_split\n", 1337 | "import logging\n", 1338 | "\n", 1339 | "import plotly.plotly as py\n", 1340 | "import plotly.graph_objs as go\n", 1341 | "import matplotlib.pyplot as plt\n", 1342 | "import pandas as pd\n", 1343 | "import pydot\n", 1344 | "\n", 1345 | "\n", 1346 | "import keras\n", 1347 | "from keras import backend as k\n", 1348 | "k.set_learning_phase(1)\n", 1349 | "from keras.preprocessing.text import Tokenizer\n", 1350 | "from keras import initializers\n", 1351 | "from keras.optimizers import RMSprop\n", 1352 | "from keras.models import Sequential,Model\n", 1353 | "from keras.layers import Dense,LSTM,Dropout,Input,Activation,Add,concatenate, Embedding, RepeatVector\n", 1354 | "from keras.layers.advanced_activations import LeakyReLU,PReLU\n", 1355 | "from keras.callbacks import ModelCheckpoint\n", 1356 | "from keras.models import load_model\n", 1357 | "from keras.optimizers import Adam" 1358 | ] 1359 | }, 1360 | { 1361 | "cell_type": "code", 1362 | "execution_count": 18, 1363 | "metadata": {}, 1364 | "outputs": [], 1365 | "source": [ 1366 | "from keras.layers import TimeDistributed" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "code", 1371 | "execution_count": 19, 1372 | "metadata": { 1373 | "colab": {}, 1374 | "colab_type": "code", 1375 | "id": "psWPxabWIcWz" 1376 | }, 1377 | "outputs": [], 1378 | "source": [ 1379 | "# Hyperparams\n", 1380 | "\n", 1381 | "MAX_LEN = 1000\n", 1382 | "VOCAB_SIZE =15000\n", 1383 | "EMBEDDING_DIM = 200\n", 1384 | "HIDDEN_UNITS = 200\n", 1385 | "VOCAB_SIZE = VOCAB_SIZE + 1\n", 1386 | "\n", 1387 | "LEARNING_RATE = 0.002\n", 1388 | "BATCH_SIZE = 32\n", 1389 | "EPOCHS = 5" 1390 | ] 1391 | }, 1392 | { 1393 | "cell_type": "markdown", 1394 | "metadata": {}, 1395 | "source": [ 1396 | "### Model 1. 
Simple LSTM Encoder-Decoder-seq2seq" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "code", 1401 | "execution_count": 74, 1402 | "metadata": { 1403 | "colab": {}, 1404 | "colab_type": "code", 1405 | "id": "pGjWMUG3IcT_" 1406 | }, 1407 | "outputs": [ 1408 | { 1409 | "name": "stdout", 1410 | "output_type": "stream", 1411 | "text": [ 1412 | "__________________________________________________________________________________________________\n", 1413 | "Layer (type) Output Shape Param # Connected to \n", 1414 | "==================================================================================================\n", 1415 | "input_3 (InputLayer) (None, 1000) 0 \n", 1416 | "__________________________________________________________________________________________________\n", 1417 | "input_4 (InputLayer) (None, 1000) 0 \n", 1418 | "__________________________________________________________________________________________________\n", 1419 | "embedding_1 (Embedding) (None, 1000, 200) 3000000 input_3[0][0] \n", 1420 | "__________________________________________________________________________________________________\n", 1421 | "embedding_2 (Embedding) (None, 1000, 200) 3000000 input_4[0][0] \n", 1422 | "__________________________________________________________________________________________________\n", 1423 | "lstm_4 (LSTM) (None, 200) 320800 embedding_1[1][0] \n", 1424 | "__________________________________________________________________________________________________\n", 1425 | "lstm_5 (LSTM) (None, 200) 320800 embedding_2[1][0] \n", 1426 | "__________________________________________________________________________________________________\n", 1427 | "concatenate_1 (Concatenate) (None, 400) 0 lstm_4[0][0] \n", 1428 | " lstm_5[0][0] \n", 1429 | "__________________________________________________________________________________________________\n", 1430 | "dense_2 (Dense) (None, 15002) 6015802 concatenate_1[0][0] \n", 1431 | "==================================================================================================\n", 1432 | "Total params: 12,657,402\n", 1433 | "Trainable params: 6,657,402\n", 1434 | "Non-trainable params: 6,000,000\n", 1435 | "__________________________________________________________________________________________________\n" 1436 | ] 1437 | } 1438 | ], 1439 | "source": [ 1440 | "\"\"\"\n", 1441 | "Simple LSTM Encoder-Decoder-seq2seq\n", 1442 | "\"\"\"\n", 1443 | "# encoder\n", 1444 | "encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1445 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1446 | "encoder_LSTM = LSTM(HIDDEN_UNITS)(encoder_embedding)\n", 1447 | "# decoder\n", 1448 | "decoder_inputs = Input(shape=(MAX_LEN, ))\n", 1449 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1450 | "decoder_LSTM = LSTM(200)(decoder_embedding)\n", 1451 | "# merge\n", 1452 | "merge_layer = concatenate([encoder_LSTM, decoder_LSTM])\n", 1453 | "decoder_outputs = Dense(units=VOCAB_SIZE+1, activation=\"softmax\")(merge_layer) # SUM_VOCAB_SIZE, sum_embedding_matrix.shape[1]\n", 1454 | "\n", 1455 | "model = Model([encoder_inputs, decoder_inputs], decoder_outputs)\n", 1456 | "model.summary()" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "code", 1461 | "execution_count": 75, 1462 | "metadata": { 1463 | "colab": {}, 1464 | "colab_type": "code", 1465 | "id": "JHw0flwp4uW2" 1466 | }, 1467 | "outputs": [], 1468 | "source": [ 1469 | "model.compile(optimizer=\"adam\", loss=\"categorical_crossentropy\", metrics=[\"accuracy\"])" 1470 | ] 1471 | }, 1472 | 
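{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick smoke test for Model 1 (an added sketch, not part of the original run): push two rows of random token ids through the compiled graph and check the output shape. Indices must stay below the Embedding layer's input_dim of 15000, and the final Dense layer has VOCAB_SIZE + 1 = 15002 units, so each prediction row should have 15002 entries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# added sketch: sanity-check the wiring of Model 1 with random ids\n", "dummy_art = np.random.randint(0, 15000, size=(2, MAX_LEN))\n", "dummy_sum = np.random.randint(0, 15000, size=(2, MAX_LEN))\n", "preds = model.predict([dummy_art, dummy_sum])\n", "print(preds.shape)  # expected: (2, 15002)" ] },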
{ 1473 | "cell_type": "markdown", 1474 | "metadata": {}, 1475 | "source": [ 1476 | "### Model 2. Bidirectional LSTM Encoder-Decoder-seq2seq" 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "code", 1481 | "execution_count": 27, 1482 | "metadata": {}, 1483 | "outputs": [], 1484 | "source": [ 1485 | "\"\"\"\n", 1486 | "Bidirectional LSTM: Others Inspired Encoder-Decoder-seq2seq\n", 1487 | "\"\"\"\n", 1488 | "encoder_inputs = Input(shape=(MAX_LEN,))\n", 1489 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1490 | "encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True)\n", 1491 | "encoder_LSTM_R = LSTM(HIDDEN_UNITS, return_state=True, go_backwards=True)\n", 1492 | "encoder_outputs_R, state_h_R, state_c_R = encoder_LSTM_R(encoder_embedding)\n", 1493 | "encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)\n", 1494 | "\n", 1495 | "final_h = Add()([state_h, state_h_R])\n", 1496 | "final_c = Add()([state_c, state_c_R])\n", 1497 | "encoder_states = [final_h, final_c]\n", 1498 | "\n", 1499 | "\"\"\"\n", 1500 | "decoder\n", 1501 | "\"\"\"\n", 1502 | "decoder_inputs = Input(shape=(MAX_LEN,))\n", 1503 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1504 | "decoder_LSTM = LSTM(HIDDEN_UNITS, return_sequences=True, return_state=True)\n", 1505 | "decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=encoder_states) \n", 1506 | "decoder_dense = Dense(VOCAB_SIZE, activation='linear')\n", 1507 | "decoder_outputs = decoder_dense(decoder_outputs)\n", 1508 | "\n", 1509 | "model= Model(inputs=[encoder_inputs,decoder_inputs], outputs=decoder_outputs)" 1510 | ] 1511 | }, 1512 | { 1513 | "cell_type": "markdown", 1514 | "metadata": {}, 1515 | "source": [ 1516 | "### Model 3. Chatbot Inspired Encoder-Decoder-seq2seq" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "execution_count": 86, 1522 | "metadata": {}, 1523 | "outputs": [], 1524 | "source": [ 1525 | "\"\"\"\n", 1526 | "Chatbot Inspired Encoder-Decoder-seq2seq\n", 1527 | "\"\"\"\n", 1528 | "encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1529 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1530 | "encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True)\n", 1531 | "encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)\n", 1532 | "\n", 1533 | "decoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1534 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1535 | "decoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True, return_sequences=True)\n", 1536 | "decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])\n", 1537 | "\n", 1538 | "# dense_layer = Dense(VOCAB_SIZE, activation='softmax')\n", 1539 | "outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)\n", 1540 | "model = Model([encoder_inputs, decoder_inputs], outputs)" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": "code", 1545 | "execution_count": 28, 1546 | "metadata": {}, 1547 | "outputs": [], 1548 | "source": [ 1549 | "rmsprop = RMSprop(lr=0.01, clipnorm=1.)\n", 1550 | "model.compile(loss='mse', optimizer=rmsprop, metrics=[\"accuracy\"])" 1551 | ] 1552 | }, 1553 | { 1554 | "cell_type": "code", 1555 | "execution_count": 29, 1556 | "metadata": {}, 1557 | "outputs": [ 1558 | { 1559 | "name": "stdout", 1560 | "output_type": "stream", 1561 | "text": [ 1562 | "__________________________________________________________________________________________________\n", 1563 | "Layer (type) Output Shape Param # Connected 
to \n", 1564 | "==================================================================================================\n", 1565 | "input_1 (InputLayer) (None, 1000) 0 \n", 1566 | "__________________________________________________________________________________________________\n", 1567 | "embedding_1 (Embedding) (None, 1000, 200) 3000000 input_1[0][0] \n", 1568 | "__________________________________________________________________________________________________\n", 1569 | "input_2 (InputLayer) (None, 1000) 0 \n", 1570 | "__________________________________________________________________________________________________\n", 1571 | "lstm_1 (LSTM) [(None, 200), (None, 320800 embedding_1[0][0] \n", 1572 | "__________________________________________________________________________________________________\n", 1573 | "lstm_2 (LSTM) [(None, 200), (None, 320800 embedding_1[0][0] \n", 1574 | "__________________________________________________________________________________________________\n", 1575 | "embedding_2 (Embedding) (None, 1000, 200) 3000000 input_2[0][0] \n", 1576 | "__________________________________________________________________________________________________\n", 1577 | "add_1 (Add) (None, 200) 0 lstm_1[0][1] \n", 1578 | " lstm_2[0][1] \n", 1579 | "__________________________________________________________________________________________________\n", 1580 | "add_2 (Add) (None, 200) 0 lstm_1[0][2] \n", 1581 | " lstm_2[0][2] \n", 1582 | "__________________________________________________________________________________________________\n", 1583 | "lstm_3 (LSTM) [(None, 1000, 200), 320800 embedding_2[0][0] \n", 1584 | " add_1[0][0] \n", 1585 | " add_2[0][0] \n", 1586 | "__________________________________________________________________________________________________\n", 1587 | "dense_1 (Dense) (None, 1000, 15001) 3015201 lstm_3[0][0] \n", 1588 | "==================================================================================================\n", 1589 | "Total params: 9,977,601\n", 1590 | "Trainable params: 3,977,601\n", 1591 | "Non-trainable params: 6,000,000\n", 1592 | "__________________________________________________________________________________________________\n" 1593 | ] 1594 | } 1595 | ], 1596 | "source": [ 1597 | "# model 2\n", 1598 | "model.summary()" 1599 | ] 1600 | }, 1601 | { 1602 | "cell_type": "markdown", 1603 | "metadata": {}, 1604 | "source": [ 1605 | "## Step 4. 
Training and Validating the Model" 1606 | ] 1607 | }, 1608 | { 1609 | "cell_type": "code", 1610 | "execution_count": 30, 1611 | "metadata": {}, 1612 | "outputs": [], 1613 | "source": [ 1614 | "import numpy as np\n", 1615 | "num_samples = len(pad_sum_sequences)\n", 1616 | "decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype=\"int32\")" 1617 | ] 1618 | }, 1619 | { 1620 | "cell_type": "code", 1621 | "execution_count": 31, 1622 | "metadata": {}, 1623 | "outputs": [], 1624 | "source": [ 1625 | "# 3D one-hot tensor of decoder targets\n", 1626 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1627 | " for j, seq in enumerate(seqs):\n", 1628 | " if j > 0:\n", 1629 | " decoder_output_data[i][j][seq] = 1" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "code", 1634 | "execution_count": 32, 1635 | "metadata": { 1636 | "colab": {}, 1637 | "colab_type": "code", 1638 | "id": "EFxSEKvfIoGf" 1639 | }, 1640 | "outputs": [], 1641 | "source": [ 1642 | "art_train, art_test, sum_train, sum_test, target_train, target_test = train_test_split(pad_art_sequences, pad_sum_sequences, decoder_output_data, test_size=0.2)" 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": 33, 1648 | "metadata": { 1649 | "colab": { 1650 | "base_uri": "https://localhost:8080/", 1651 | "height": 34 1652 | }, 1653 | "colab_type": "code", 1654 | "id": "9hn_1hm-JCa2", 1655 | "outputId": "95c79153-06ec-4b2d-b25e-6f34c933ae44" 1656 | }, 1657 | "outputs": [ 1658 | { 1659 | "data": { 1660 | "text/plain": [ 1661 | "1780" 1662 | ] 1663 | }, 1664 | "execution_count": 33, 1665 | "metadata": {}, 1666 | "output_type": "execute_result" 1667 | } 1668 | ], 1669 | "source": [ 1670 | "train_num = art_train.shape[0]\n", 1671 | "train_num" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "code", 1676 | "execution_count": 34, 1677 | "metadata": {}, 1678 | "outputs": [], 1679 | "source": [ 1680 | "# targets were split together with the inputs above; slicing decoder_output_data\n", 1681 | "# with [:train_num] would not match the shuffled train/test rows" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "code", 1686 | "execution_count": null, 1687 | "metadata": {}, 1688 | "outputs": [ 1689 | { 1690 | "name": "stdout", 1691 | "output_type": "stream", 1692 | "text": [ 1693 | "Train on 1780 samples, validate on 445 samples\n", 1694 | "Epoch 1/5\n", 1695 | " 896/1780 [==============>...............] 
- ETA: 1:46:08 - loss: 4.1167e-04 - acc: 0.8187" 1696 | ] 1697 | } 1698 | ], 1699 | "source": [ 1700 | "history = model.fit([art_train, sum_train], \n", 1701 | " target_train, \n", 1702 | " epochs=EPOCHS, \n", 1703 | " batch_size=BATCH_SIZE,\n", 1704 | " validation_data=([art_test, sum_test], target_test))" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "#### Visualization" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "code", 1716 | "execution_count": null, 1717 | "metadata": {}, 1718 | "outputs": [], 1719 | "source": [ 1720 | "# visualize training accuracy\n", 1721 | "import matplotlib.pyplot as plt\n", 1722 | "%matplotlib inline\n", 1723 | "\n", 1724 | "plt.figure(figsize=(10, 6))\n", 1725 | "plt.plot(history.history['acc'])\n", 1726 | "plt.plot(history.history['val_acc'])\n", 1727 | "plt.title('model accuracy')\n", 1728 | "plt.ylabel('accuracy')\n", 1729 | "plt.xlabel('epoch')\n", 1730 | "plt.legend(['train', 'test'], loc='upper left')\n", 1731 | "plt.show()" 1732 | ] 1733 | }, 1734 | { 1735 | "cell_type": "code", 1736 | "execution_count": null, 1737 | "metadata": {}, 1738 | "outputs": [], 1739 | "source": [ 1740 | "# visualize the training loss\n", 1741 | "plt.figure(figsize=(10, 6))\n", 1742 | "plt.plot(history.history['loss'])\n", 1743 | "# plt.plot(history.history['val_loss'])\n", 1744 | "plt.title('model loss')\n", 1745 | "plt.ylabel('loss')\n", 1746 | "plt.xlabel('epoch')\n", 1747 | "# plt.legend(['train', 'test'], loc='upper left')\n", 1748 | "plt.show()" 1749 | ] 1750 | }, 1751 | { 1752 | "cell_type": "code", 1753 | "execution_count": null, 1754 | "metadata": {}, 1755 | "outputs": [], 1756 | "source": [ 1757 | "# save the model architecture\n", 1758 | "with open('text_summary.json', 'w') as f: f.write(model.to_json())\n", 1759 | "\n", 1760 | "# save the weights\n", 1761 | "model.save_weights('text_summary.h5')\n", 1762 | "print(\"Saved Model!\")" 1763 | ] 1764 | } 1765 | ], 1766 | "metadata": { 1767 | "colab": { 1768 | "name": "input_matrix.ipynb", 1769 | "provenance": [], 1770 | "version": "0.3.2" 1771 | }, 1772 | "kernelspec": { 1773 | "display_name": "Python 3", 1774 | "language": "python", 1775 | "name": "python3" 1776 | }, 1777 | "language_info": { 1778 | "codemirror_mode": { 1779 | "name": "ipython", 1780 | "version": 3 1781 | }, 1782 | "file_extension": ".py", 1783 | "mimetype": "text/x-python", 1784 | "name": "python", 1785 | "nbconvert_exporter": "python", 1786 | "pygments_lexer": "ipython3", 1787 | "version": "3.5.1" 1788 | } 1789 | }, 1790 | "nbformat": 4, 1791 | "nbformat_minor": 1 1792 | } 1793 |
--------------------------------------------------------------------------------
/bidirectional_lstm_model.py:
--------------------------------------------------------------------------------

from numpy.random import seed
seed(1)

import numpy as np  # added: np.shape is used below but numpy was never imported
from sklearn.model_selection import train_test_split
import logging

import matplotlib.pyplot as plt
import pandas as pd
import pydot


import keras
from keras import backend as k
k.set_learning_phase(1)
from keras.preprocessing.text import Tokenizer
from keras import initializers
from keras.optimizers import RMSprop
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Dropout, Input, Activation, Add, Concatenate
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras.optimizers import Adam

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# ========================================================================
# model params
# ========================================================================

en_shape = np.shape(train_data["article"][0])
de_shape = np.shape(train_data["summary"][0])

MAX_ART_LEN = 1000  # assumed: left blank in the original; matches MAX_LEN in the notebook
MAX_SUM_LEN = 1000  # assumed: left blank in the original
EMBEDDING_DIM = 1000
HIDDEN_UNITS = 200  # assumed: left blank in the original; matches the notebook

LEARNING_RATE = 0.002
BATCH_SIZE = 32
EPOCHS = 5

rmsprop = RMSprop(lr=LEARNING_RATE, clipnorm=1.0)


# ========================================================================
# model helpers
# ========================================================================

def bidirectional_lstm(data):
    """
    Encoder-Decoder-seq2seq
    """
    # encoder: two LSTMs read the embedded article forwards and backwards
    encoder_inputs = Input(shape=en_shape)
    encoder_LSTM = LSTM(HIDDEN_UNITS, dropout=0.2, recurrent_dropout=0.2, return_state=True)
    rev_encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True, go_backwards=True)

    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_inputs)
    rev_encoder_outputs, rev_state_h, rev_state_c = rev_encoder_LSTM(encoder_inputs)

    # merge the forward and backward states
    final_state_h = Add()([state_h, rev_state_h])
    final_state_c = Add()([state_c, rev_state_c])

    encoder_states = [final_state_h, final_state_c]

    # decoder
    decoder_inputs = Input(shape=(None, de_shape[1]))
    decoder_LSTM = LSTM(HIDDEN_UNITS, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_LSTM(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(units=de_shape[1], activation="linear")
    decoder_outputs = decoder_dense(decoder_outputs)

    # modeling
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer=rmsprop, loss="mse", metrics=["accuracy"])

    print(model.summary())

    x_train, x_test, y_train, y_test = train_test_split(data["article"], data["summaries"], test_size=0.2)
    model.fit([x_train, y_train], y_train, batch_size=BATCH_SIZE,
              epochs=EPOCHS, verbose=1, validation_data=([x_test, y_test], y_test))

    """
    Inference models
    """
    encoder_model_inf = Model(encoder_inputs, encoder_states)

    decoder_state_input_H = Input(shape=(HIDDEN_UNITS,))
    decoder_state_input_C = Input(shape=(HIDDEN_UNITS,))

    decoder_state_inputs = [decoder_state_input_H, decoder_state_input_C]
    decoder_outputs, decoder_state_h, decoder_state_c = decoder_LSTM(decoder_inputs, initial_state=decoder_state_inputs)

    decoder_states = [decoder_state_h, decoder_state_c]
    decoder_outputs = decoder_dense(decoder_outputs)

    decoder_model_inf = Model([decoder_inputs] + decoder_state_inputs,
                              [decoder_outputs] + decoder_states)

    scores = model.evaluate([x_test, y_test], y_test, verbose=0)

    print('LSTM test scores:', scores)
    print('\007')

    return model, encoder_model_inf, decoder_model_inf

"""
model._make_predict_function()
"""
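
# ------------------------------------------------------------------------
# Added sketch (not in the original file): greedy decoding with the two
# inference models returned above. Assumes `article` is a single encoder
# input of shape (1,) + en_shape, and that each predicted vector is mapped
# back to a word elsewhere (e.g. nearest neighbour in the word2vec space).
# ------------------------------------------------------------------------
def generate_summary(article, encoder_model_inf, decoder_model_inf, max_steps=50):
    states = encoder_model_inf.predict(article)   # [h, c] summary of the article
    target_seq = np.zeros((1, 1, de_shape[1]))    # all-zero vector as the start token
    decoded = []
    for _ in range(max_steps):
        outputs, h, c = decoder_model_inf.predict([target_seq] + states)
        decoded.append(outputs[0, -1, :])         # keep the newest step
        target_seq = outputs[:, -1:, :]           # feed the prediction back in
        states = [h, c]                           # carry the LSTM state forward
    return np.array(decoded)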
--------------------------------------------------------------------------------
/data_shaper.py:
--------------------------------------------------------------------------------

from numpy.random import seed
seed(1)

import gensim as gs
import pandas as pd
import numpy as np
import scipy as sc
import nltk
from nltk.tokenize import word_tokenize as wt
from nltk.tokenize import sent_tokenize as st
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.sequence import pad_sequences
import logging
import re
from collections import Counter

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# ========================================================================
#
# ========================================================================

# emb_size_all = 300
# maxcorp=5000
EMBEDDING_DIM = 300
VOCAB_SIZE = 5000
modelLocation = "./"  # assumed save directory; the original never defined this name


def createCorpus(t):
    """Split every document in t (a dict of lists of texts) into
    lower-cased, word-tokenized sentences."""
    corpus = []
    all_sent = []
    for k in t:
        for p in t[k]:
            corpus.append(st(p))
    for sent in range(len(corpus)):
        for k in corpus[sent]:
            all_sent.append(k)
    for m in range(len(all_sent)):
        all_sent[m] = wt(all_sent[m])

    all_words = []
    for sent in all_sent:
        hold = []
        for word in sent:
            hold.append(word.lower())
        all_words.append(hold)
    return all_words


def word2vecmodel(corpus):
    emb_size = EMBEDDING_DIM  # was emb_size_all, which only existed in a comment above
    model_type = {"skip_gram": 1, "CBOW": 0}
    window = 10
    workers = 4
    min_count = 4
    batch_words = 20
    epochs = 25
    # include bigrams
    # bigramer = gs.models.Phrases(corpus)

    model = gs.models.Word2Vec(corpus, size=emb_size, sg=model_type["skip_gram"],
                               compute_loss=True, window=window, min_count=min_count,
                               workers=workers, batch_words=batch_words)

    model.train(corpus, total_examples=len(corpus), epochs=epochs)
    model.save("%sWord2vec" % modelLocation)
    print('\007')
    return model


"""
generate summary length for test
"""

# train_data["nums_summ"]=list(map(lambda x:0 if len(x)<5000 else 1,data["articles"]))
# train_data["nums_summ"]=list(map(len,data["summaries"]))
# train_data["nums_summ_norm"]=(np.array(train_data["nums_summ"])-min(train_data["nums_summ"]))/(max(train_data["nums_summ"])-min(train_data["nums_summ"]))
--------------------------------------------------------------------------------
/text_cleaner.py:
--------------------------------------------------------------------------------
import string
import re


def cleantext(text):
    text = str(text)
    text = text.lower()
    # expand contractions BEFORE filtering out non-alphabetic tokens; in the
    # original the isalpha() filter ran first, which silently removed every
    # token containing an apostrophe or a dot before these rules could fire
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"it's", "it is ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'s", "s", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"can't", " cannot ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r"e-mail", "email", text)
    text = re.sub(r"9\\/11", " 911 ", text)
    text = re.sub(r" u.s", " american ", text)
    text = re.sub(r" u.n", " united nations ", text)
    text = re.sub(r"\n", " ", text)
    text = re.sub(r":", " ", text)
    text = re.sub(r"-", " ", text)
    text = re.sub(r"\_", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[$#@%&*!~?%{}()]", " ", text)
    # keep purely alphabetic tokens only
    words = [t for t in text.split() if t.isalpha()]
    text = " ".join(words)

    return text
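
# Added usage example (not in the original file): a quick demonstration of
# what cleantext() does to one raw sentence.
if __name__ == "__main__":
    sample = "It's 2005 and the U.S. e-mail market isn't slowing down!"
    print(cleantext(sample))
    # prints a lowercased string with digits and punctuation stripped and the
    # contraction rules above applied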
--------------------------------------------------------------------------------
/text_summary_model.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/samurainote/Text_Summarization_using_Bidirectional_LSTM/711c30c8d05a07ac32f22a6e9b54238d316a127d/text_summary_model.py
--------------------------------------------------------------------------------