├── README.md ├── RECENT_Notebook.ipynb ├── bidirectional_lstm_model.py ├── cleaned_bbc_news.csv ├── data_shaper.py ├── text_cleaner.py └── text_summary_model.py /README.md: --------------------------------------------------------------------------------
# Bidirectional LSTM: Abstractive Text Summarization
## Context-Aware Abstractive Text Summarization with a Bidirectional LSTM
![](http://abigailsee.com/img/pointer-gen.png)

## Introduction

Extractive summarization builds a summary by selecting the sentences that look most important in the text to be summarized. Its advantages and disadvantages are as follows.

- Pros: Because the summary is assembled from sentences taken verbatim from the original text, it rarely goes completely off topic and it never produces grammatically broken sentences.

- Cons: Because it cannot use words that do not appear in the source, it cannot abstract, paraphrase, or add conjunctions to make the summary easier to read, so the result tends to feel crude.

Abstractive summarization, by contrast, generates new sentences, and that is the approach taken here: I built a seq2seq bidirectional LSTM for the text summarization task. LSTM with attention is another major method for summarization.

## Technical Preferences

| Title | Detail |
|:-----------:|:------------------------------------------------|
| Environment | macOS Mojave 10.14.3 |
| Language | Python |
| Library | Keras, scikit-learn, NumPy, Matplotlib, Pandas, Seaborn |
| Dataset | [BBC Datasets](http://mlg.ucd.ie/datasets/bbc.html) |
| Algorithm | Encoder-Decoder LSTM |

## References

- [Get To The Point: Summarization with Pointer-Generator Networks](https://nlp.stanford.edu/pubs/see2017get.pdf)
- [Bidirectional Attentional Encoder-Decoder Model and Bidirectional Beam Search for Abstractive Summarization](https://arxiv.org/pdf/1809.06662.pdf)
- [Taming Recurrent Neural Networks for Better Summarization](http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html)
- [Encoder-Decoder Models for Text Summarization in Keras](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)
- [Text Summarization Using Keras Models](https://hackernoon.com/text-summarization-using-keras-models-366b002408d9)
- [Text Summarization in Python: Extractive vs. Abstractive techniques revisited](https://rare-technologies.com/text-summarization-in-python-extractive-vs-abstractive-techniques-revisited/)
- [Text Summarization for the Age of Large-Scale NLP (Qiita, in Japanese)](https://qiita.com/icoxfog417/items/d06651db10e27220c819)
--------------------------------------------------------------------------------
/RECENT_Notebook.ipynb: --------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "k9-Z82i495Gl" 8 | }, 9 | "source": [ 10 | "# Seq2Seq: Text Summarization with Keras\n", 11 | "#### Text Summarization with Deep Learning\n", 12 | "![](http://abigailsee.com/img/pointer-gen.png)" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": { 18 | "colab_type": "text", 19 | "id": "pw_oyfrE95Go" 20 | }, 21 | "source": [ 22 | "## Process\n", 23 | "1. Preprocessing\n", 24 | "2. Word2vec\n", 25 | "3. Building Seq2Seq Architecture\n", 26 | "4. Training with BBC article&summary Dataset\n", 27 | "5. Generate Summary with my_summarizer" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## Step 1. 
Import Data" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "E2QnkkQv95Gq" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "import os\n", 49 | "import pandas as pd\n", 50 | "import re" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "pwd" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "colab": {}, 67 | "colab_type": "code", 68 | "id": "QeKq5tai95Gu" 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "news_category = [\"business\", \"entertainment\", \"politics\", \"sport\", \"tech\"]\n", 73 | "\n", 74 | "row_doc = \"/Users/akr712/Desktop/文章要約/Row News Articles/\"\n", 75 | "summary_doc = \"/Users/akr712/Desktop/文章要約/Summaries/\"\n", 76 | "\n", 77 | "data={\"articles\":[], \"summaries\":[]}" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": { 84 | "colab": {}, 85 | "colab_type": "code", 86 | "id": "682uVCoC95Gy" 87 | }, 88 | "outputs": [], 89 | "source": [ 90 | "import os\n", 91 | "directories = {\"news\": row_doc, \"summary\": summary_doc}\n", 92 | "row_dict = {}\n", 93 | "sum_dict = {}\n", 94 | "\n", 95 | "for path in directories.values():\n", 96 | " if path == row_doc:\n", 97 | " file_dict = row_dict\n", 98 | " else:\n", 99 | " file_dict = sum_dict\n", 100 | " dire = path\n", 101 | " for cat in news_category:\n", 102 | " category = cat\n", 103 | " files = os.listdir(dire + category)\n", 104 | " file_dict[cat] = files" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": { 111 | "colab": {}, 112 | "colab_type": "code", 113 | "id": "gM75fs9g95Hc" 114 | }, 115 | "outputs": [], 116 | "source": [ 117 | "row_data = {}\n", 118 | "for cat in row_dict.keys():\n", 119 | " cat_dict = {}\n", 120 | " # row_data_frame[cat] = []\n", 121 | " for i in range(0, len(row_dict[cat])):\n", 122 | " filename = row_dict[cat][i]\n", 123 | " path = row_doc + cat + \"/\" + filename\n", 124 | " with open(path, \"rb\") as f: \n", 125 | " text = f.read()\n", 126 | " cat_dict[filename[:3]] = text\n", 127 | " row_data[cat] = cat_dict" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": null, 133 | "metadata": { 134 | "colab": {}, 135 | "colab_type": "code", 136 | "id": "MjPzSMPT95Hg" 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "sum_data = {}\n", 141 | "for cat in sum_dict.keys():\n", 142 | " cat_dict = {}\n", 143 | " # row_data_frame[cat] = []\n", 144 | " for i in range(0, len(sum_dict[cat])):\n", 145 | " filename = sum_dict[cat][i]\n", 146 | " path = summary_doc + cat + \"/\" + filename\n", 147 | " with open(path, \"rb\") as f: \n", 148 | " text = f.read()\n", 149 | " cat_dict[filename[:3]] = text\n", 150 | " sum_data[cat] = cat_dict" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "colab": {}, 158 | "colab_type": "code", 159 | "id": "du70GJ7o95H0", 160 | "outputId": "2f05e596-129f-489e-8fcd-e789f870fe9e", 161 | "scrolled": true 162 | }, 163 | "outputs": [], 164 | "source": [ 165 | "news_business = pd.DataFrame.from_dict(row_data[\"business\"], orient=\"index\", columns=[\"row_article\"])\n", 166 | "news_business.head()" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "colab": {}, 174 | 
"colab_type": "code", 175 | "id": "ZKbrhIYw95H8" 176 | }, 177 | "outputs": [], 178 | "source": [ 179 | "# news_category = [\"business\", \"entertainment\", \"politics\", \"sport\", \"tech\"]\n", 180 | "news_entertainment = pd.DataFrame.from_dict(row_data[\"entertainment\"], orient=\"index\", columns=[\"row_article\"])\n", 181 | "news_politics = pd.DataFrame.from_dict(row_data[\"politics\"], orient=\"index\", columns=[\"row_article\"])\n", 182 | "news_sport = pd.DataFrame.from_dict(row_data[\"sport\"], orient=\"index\", columns=[\"row_article\"])\n", 183 | "news_tech = pd.DataFrame.from_dict(row_data[\"tech\"], orient=\"index\", columns=[\"row_article\"])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": { 190 | "colab": {}, 191 | "colab_type": "code", 192 | "id": "KrSmXmON95IE" 193 | }, 194 | "outputs": [], 195 | "source": [ 196 | "# summary data\n", 197 | "summary_business = pd.DataFrame.from_dict(sum_data[\"business\"], orient=\"index\", columns=[\"summary\"])\n", 198 | "summary_entertainment = pd.DataFrame.from_dict(sum_data[\"entertainment\"], orient=\"index\", columns=[\"summary\"])\n", 199 | "summary_politics = pd.DataFrame.from_dict(sum_data[\"politics\"], orient=\"index\", columns=[\"summary\"])\n", 200 | "summary_sport = pd.DataFrame.from_dict(sum_data[\"sport\"], orient=\"index\", columns=[\"summary\"])\n", 201 | "summary_tech = pd.DataFrame.from_dict(sum_data[\"tech\"], orient=\"index\", columns=[\"summary\"])" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "colab": {}, 209 | "colab_type": "code", 210 | "id": "UNL3-2xL95IJ", 211 | "outputId": "8dd40f58-2921-4c73-fd62-6b5b9e673169" 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "summary_business.head()" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "colab": {}, 223 | "colab_type": "code", 224 | "id": "Aq407qkG95IO" 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "business = news_business.join(summary_business, how='inner')\n", 229 | "entertainment = news_entertainment.join(summary_entertainment, how='inner')\n", 230 | "politics = news_politics.join(summary_politics, how='inner')\n", 231 | "sport = news_sport.join(summary_sport, how='inner')\n", 232 | "tech = news_tech.join(summary_tech, how='inner')" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "colab": {}, 240 | "colab_type": "code", 241 | "id": "CKbjkn-r95IR" 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "business = news_business.join(summary_business, how='inner')" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": { 252 | "colab": {}, 253 | "colab_type": "code", 254 | "id": "fKfHtKbL95IW", 255 | "outputId": "11b33d28-b909-4924-8e97-3e978c473fe8" 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "business.head()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "colab": {}, 267 | "colab_type": "code", 268 | "id": "_sUHqLUJ95Ib", 269 | "outputId": "7b492210-e0b1-4622-fcba-9c0f730ca3cd" 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "print(\"row\", len(business.iloc[0,0]))\n", 274 | "print(\"sum\", len(business.iloc[0,1]))" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": { 281 | "colab": {}, 282 | "colab_type": "code", 283 | "id": "VthiJz-h95Ij" 284 | 
}, 285 | "outputs": [], 286 | "source": [ 287 | "list_df = [business, entertainment, politics, sport, tech]\n", 288 | "length = 0\n", 289 | "for df in list_df:\n", 290 | " length += len(df)" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 31, 296 | "metadata": { 297 | "colab": {}, 298 | "colab_type": "code", 299 | "id": "hj44ZFO-95Iu", 300 | "outputId": "f5649126-2cf2-45ad-cc77-879985450f5b" 301 | }, 302 | "outputs": [ 303 | { 304 | "name": "stdout", 305 | "output_type": "stream", 306 | "text": [ 307 | "length of all data: 2225\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "print(\"length of all data: \", length)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 32, 318 | "metadata": { 319 | "colab": {}, 320 | "colab_type": "code", 321 | "id": "vFtKW1DS95I2", 322 | "outputId": "f9e46aae-aa16-4415-8302-ddd47bd220ce" 323 | }, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "2225" 329 | ] 330 | }, 331 | "execution_count": 32, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "bbc_df = pd.concat([business, entertainment, politics, sport, tech], ignore_index=True)\n", 338 | "len(bbc_df)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": { 344 | "colab_type": "text", 345 | "id": "xogCv95T95JF" 346 | }, 347 | "source": [ 348 | "## Step 2. Preprocessing Text Data\n", 349 | "1. Clean Text\n", 350 | "2. Tokenize\n", 351 | "3. Vocabulary\n", 352 | "4. Padding\n", 353 | "5. One-Hot Encoding\n", 354 | "6. Reshape to (MAX_LEN, One-Hot Encoding DIM)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": { 360 | "colab_type": "text", 361 | "id": "y2un_KkA95JU" 362 | }, 363 | "source": [ 364 | "### 2-1. 
Clean Text" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 33, 370 | "metadata": { 371 | "colab": {}, 372 | "colab_type": "code", 373 | "id": "CxQY2wvq95JJ" 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "def cleantext(text):\n", 378 | " text = str(text)\n", 379 | " text=text.split()\n", 380 | " words=[]\n", 381 | " for t in text:\n", 382 | " if t.isalpha():\n", 383 | " words.append(t)\n", 384 | " text=\" \".join(words)\n", 385 | " text=text.lower()\n", 386 | " text=re.sub(r\"what's\",\"what is \",text)\n", 387 | " text=re.sub(r\"it's\",\"it is \",text)\n", 388 | " text=re.sub(r\"\\'ve\",\" have \",text)\n", 389 | " text=re.sub(r\"i'm\",\"i am \",text)\n", 390 | " text=re.sub(r\"\\'re\",\" are \",text)\n", 391 | " text=re.sub(r\"n't\",\" not \",text)\n", 392 | " text=re.sub(r\"\\'d\",\" would \",text)\n", 393 | " text=re.sub(r\"\\'s\",\"s\",text)\n", 394 | " text=re.sub(r\"\\'ll\",\" will \",text)\n", 395 | " text=re.sub(r\"can't\",\" cannot \",text)\n", 396 | " text=re.sub(r\" e g \",\" eg \",text)\n", 397 | " text=re.sub(r\"e-mail\",\"email\",text)\n", 398 | " text=re.sub(r\"9\\\\/11\",\" 911 \",text)\n", 399 | " text=re.sub(r\" u.s\",\" american \",text)\n", 400 | " text=re.sub(r\" u.n\",\" united nations \",text)\n", 401 | " text=re.sub(r\"\\n\",\" \",text)\n", 402 | " text=re.sub(r\":\",\" \",text)\n", 403 | " text=re.sub(r\"-\",\" \",text)\n", 404 | " text=re.sub(r\"\\_\",\" \",text)\n", 405 | " text=re.sub(r\"\\d+\",\" \",text)\n", 406 | " text=re.sub(r\"[$#@%&*!~?%{}()]\",\" \",text)\n", 407 | " \n", 408 | " return text" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 34, 414 | "metadata": { 415 | "colab": {}, 416 | "colab_type": "code", 417 | "id": "K7ovvCt495JL" 418 | }, 419 | "outputs": [], 420 | "source": [ 421 | "for col in bbc_df.columns:\n", 422 | " bbc_df[col] = bbc_df[col].apply(lambda x: cleantext(x))" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 35, 428 | "metadata": { 429 | "colab": {}, 430 | "colab_type": "code", 431 | "id": "HwbxBACO95JP", 432 | "outputId": "8b9cf24a-1714-47bf-944b-4aaa54ab78c4" 433 | }, 434 | "outputs": [ 435 | { 436 | "data": { 437 | "text/html": [ 438 | "
\n", 439 | "\n", 452 | "\n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | "
row_articlesummary
0rise on new man utd in manchester united close...united revealed on sunday that it had received...
1confidence dips in confidence among japanese m...confidence among japanese manufacturers has we...
2makes new man utd glazer has made a fresh appr...glazer has made a fresh approach to buy manche...
3hope for borussia in struggling german footbal...in struggling german football club borussia do...
4airlines hit tunnel operator eurotunnel has se...firm said sales were down in to euros the mome...
\n", 488 | "
" 489 | ], 490 | "text/plain": [ 491 | " row_article \\\n", 492 | "0 rise on new man utd in manchester united close... \n", 493 | "1 confidence dips in confidence among japanese m... \n", 494 | "2 makes new man utd glazer has made a fresh appr... \n", 495 | "3 hope for borussia in struggling german footbal... \n", 496 | "4 airlines hit tunnel operator eurotunnel has se... \n", 497 | "\n", 498 | " summary \n", 499 | "0 united revealed on sunday that it had received... \n", 500 | "1 confidence among japanese manufacturers has we... \n", 501 | "2 glazer has made a fresh approach to buy manche... \n", 502 | "3 in struggling german football club borussia do... \n", 503 | "4 firm said sales were down in to euros the mome... " 504 | ] 505 | }, 506 | "execution_count": 35, 507 | "metadata": {}, 508 | "output_type": "execute_result" 509 | } 510 | ], 511 | "source": [ 512 | "bbc_df.head()" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 36, 518 | "metadata": { 519 | "colab": { 520 | "base_uri": "https://localhost:8080/", 521 | "height": 195 522 | }, 523 | "colab_type": "code", 524 | "id": "JBPbkd8pIORs", 525 | "outputId": "51066e27-b9c1-467e-845a-b162e813ce2f" 526 | }, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/html": [ 531 | "
\n", 532 | "\n", 545 | "\n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | "
row_articlesummary
229b'Robotic pods take on car design\\n\\nA new bre...b'Dr Erel Avineri, of the Centre for Transport...
295b'More power to the people says HP\\n\\nThe digi...b'She said the goal for 2005 was to make peopl...
219b\"Disney backs Sony DVD technology\\n\\nA next g...b\"Film giant Disney says it will produce its f...
249b'Gamer buys $26,500 virtual land\\n\\nA 22-year...b'The land exists within the game Project Entr...
272b'Windows worm travels with Tetris\\n\\nUsers ar...b'The Cellery worm installs a playable version...
\n", 581 | "
" 582 | ], 583 | "text/plain": [ 584 | " row_article \\\n", 585 | "229 b'Robotic pods take on car design\\n\\nA new bre... \n", 586 | "295 b'More power to the people says HP\\n\\nThe digi... \n", 587 | "219 b\"Disney backs Sony DVD technology\\n\\nA next g... \n", 588 | "249 b'Gamer buys $26,500 virtual land\\n\\nA 22-year... \n", 589 | "272 b'Windows worm travels with Tetris\\n\\nUsers ar... \n", 590 | "\n", 591 | " summary \n", 592 | "229 b'Dr Erel Avineri, of the Centre for Transport... \n", 593 | "295 b'She said the goal for 2005 was to make peopl... \n", 594 | "219 b\"Film giant Disney says it will produce its f... \n", 595 | "249 b'The land exists within the game Project Entr... \n", 596 | "272 b'The Cellery worm installs a playable version... " 597 | ] 598 | }, 599 | "execution_count": 36, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "df.head()" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 38, 611 | "metadata": { 612 | "colab": { 613 | "base_uri": "https://localhost:8080/", 614 | "height": 34 615 | }, 616 | "colab_type": "code", 617 | "id": "YA9tvTPUIciB", 618 | "outputId": "1780794c-4748-49ba-8669-c75b875347d4" 619 | }, 620 | "outputs": [ 621 | { 622 | "data": { 623 | "text/plain": [ 624 | "2969" 625 | ] 626 | }, 627 | "execution_count": 38, 628 | "metadata": {}, 629 | "output_type": "execute_result" 630 | } 631 | ], 632 | "source": [ 633 | "len_list =[]\n", 634 | "for article in df.row_article:\n", 635 | " words = article.split()\n", 636 | " length = len(words)\n", 637 | " len_list.append(length)\n", 638 | "max(len_list)" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": { 644 | "colab": {}, 645 | "colab_type": "code", 646 | "id": "jO14nm8JJPzZ" 647 | }, 648 | "source": [ 649 | "### 2-2. Tokenizer\n", 650 | "1. Tokenize and One-Hot : Tokenizer\n", 651 | "2. Vocabraly: article and summary 15000 words \n", 652 | "3. Padding: pad_sequences 1000 max_len\n", 653 | "4. Reshape: manual max_len * one-hot matrix " 654 | ] 655 | }, 656 | { 657 | "cell_type": "code", 658 | "execution_count": 1, 659 | "metadata": {}, 660 | "outputs": [], 661 | "source": [ 662 | "import numpy as np\n", 663 | "import os\n", 664 | "import pandas as pd\n", 665 | "import re" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 2, 671 | "metadata": {}, 672 | "outputs": [ 673 | { 674 | "data": { 675 | "text/html": [ 676 | "
\n", 677 | "\n", 690 | "\n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | "
row_articlesummary
0continues rapid economy has expanded by a brea...overall investment in fixed assets was still u...
1deccan seals deccan has ordered airbus planes ...government has given its backing to cheaper an...
2job growth continues in us created fewer jobs ...creation was one of last main concerns for the...
3owner buys rival for retail giant federated de...retail giant federated department stores is to...
4secures giant japan is to supply japan airline...chose the after carefully considering both it ...
\n", 726 | "
" 727 | ], 728 | "text/plain": [ 729 | " row_article \\\n", 730 | "0 continues rapid economy has expanded by a brea... \n", 731 | "1 deccan seals deccan has ordered airbus planes ... \n", 732 | "2 job growth continues in us created fewer jobs ... \n", 733 | "3 owner buys rival for retail giant federated de... \n", 734 | "4 secures giant japan is to supply japan airline... \n", 735 | "\n", 736 | " summary \n", 737 | "0 overall investment in fixed assets was still u... \n", 738 | "1 government has given its backing to cheaper an... \n", 739 | "2 creation was one of last main concerns for the... \n", 740 | "3 retail giant federated department stores is to... \n", 741 | "4 chose the after carefully considering both it ... " 742 | ] 743 | }, 744 | "execution_count": 2, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "bbc_art_sum = pd.read_csv(\"cleaned_bbc_news.csv\")\n", 751 | "bbc_art_sum.drop(\"Unnamed: 0\", axis=1, inplace=True)\n", 752 | "bbc_art_sum.head()" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 3, 758 | "metadata": {}, 759 | "outputs": [], 760 | "source": [ 761 | "articles = list(bbc_art_sum.row_article)\n", 762 | "summaries = list(bbc_art_sum.summary)" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "### 2-2-1. Tokenize: text_to_word_sequence" 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 4, 775 | "metadata": { 776 | "colab": { 777 | "base_uri": "https://localhost:8080/", 778 | "height": 34 779 | }, 780 | "colab_type": "code", 781 | "id": "W9aALHY_YbUN", 782 | "outputId": "b228cf03-c36b-413a-d3c8-c9cb6fff1b36" 783 | }, 784 | "outputs": [ 785 | { 786 | "name": "stderr", 787 | "output_type": "stream", 788 | "text": [ 789 | "Using TensorFlow backend.\n" 790 | ] 791 | }, 792 | { 793 | "data": { 794 | "text/plain": [ 795 | "23914" 796 | ] 797 | }, 798 | "execution_count": 4, 799 | "metadata": {}, 800 | "output_type": "execute_result" 801 | } 802 | ], 803 | "source": [ 804 | "from keras.preprocessing.text import Tokenizer\n", 805 | "VOCAB_SIZE = 14999\n", 806 | "tokenizer = Tokenizer(num_words=VOCAB_SIZE)\n", 807 | "tokenizer.fit_on_texts(articles)\n", 808 | "article_sequences = tokenizer.texts_to_sequences(articles)\n", 809 | "art_word_index = tokenizer.word_index\n", 810 | "len(art_word_index)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 5, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "name": "stdout", 820 | "output_type": "stream", 821 | "text": [ 822 | "[1411, 2338, 248, 16, 3994, 22, 5, 6483, 165, 1359, 50, 966, 4, 120, 967, 176, 118, 505, 38, 2339]\n", 823 | "[5211, 8881, 5211, 16, 2233, 3001, 3441, 6, 5, 217, 18, 60, 1270, 7874, 6, 1, 827, 5211, 11, 108]\n", 824 | "[478, 196, 1411, 6, 54, 736, 2283, 498, 50, 164, 6, 24, 349, 17, 9, 1, 3322, 6, 5213, 11]\n" 825 | ] 826 | } 827 | ], 828 | "source": [ 829 | "print(article_sequences[0][:20])\n", 830 | "print(article_sequences[1][:20])\n", 831 | "print(article_sequences[2][:20])" 832 | ] 833 | }, 834 | { 835 | "cell_type": "markdown", 836 | "metadata": {}, 837 | "source": [ 838 | "### 2-2-2. 
Vocabulary: article and summary 15000 words" 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": 6, 844 | "metadata": {}, 845 | "outputs": [], 846 | "source": [ 847 | "art_word_index_1500 = {}\n", 848 | "counter = 0\n", 849 | "for word in art_word_index.keys():\n", 850 | " if art_word_index[word] == 0:\n", 851 | " print(\"found 0!\")\n", 852 | " break\n", 853 | " if art_word_index[word] > VOCAB_SIZE:\n", 854 | " continue\n", 855 | " else:\n", 856 | " art_word_index_1500[word] = art_word_index[word]\n", 857 | " counter += 1" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 7, 863 | "metadata": {}, 864 | "outputs": [ 865 | { 866 | "data": { 867 | "text/plain": [ 868 | "14999" 869 | ] 870 | }, 871 | "execution_count": 7, 872 | "metadata": {}, 873 | "output_type": "execute_result" 874 | } 875 | ], 876 | "source": [ 877 | "counter" 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 8, 883 | "metadata": {}, 884 | "outputs": [ 885 | { 886 | "data": { 887 | "text/plain": [ 888 | "23929" 889 | ] 890 | }, 891 | "execution_count": 8, 892 | "metadata": {}, 893 | "output_type": "execute_result" 894 | } 895 | ], 896 | "source": [ 897 | "tokenizer.fit_on_texts(summaries)\n", 898 | "summary_sequences = tokenizer.texts_to_sequences(summaries)\n", 899 | "sum_word_index = tokenizer.word_index\n", 900 | "len(sum_word_index)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 9, 906 | "metadata": {}, 907 | "outputs": [], 908 | "source": [ 909 | "sum_word_index_1500 = {}\n", 910 | "counter = 0\n", 911 | "for word in sum_word_index.keys():\n", 912 | " if sum_word_index[word] == 0:\n", 913 | " print(\"found 0!\")\n", 914 | " break\n", 915 | " if sum_word_index[word] > VOCAB_SIZE:\n", 916 | " continue\n", 917 | " else:\n", 918 | " sum_word_index_1500[word] = sum_word_index[word]\n", 919 | " counter += 1" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": 10, 925 | "metadata": {}, 926 | "outputs": [ 927 | { 928 | "data": { 929 | "text/plain": [ 930 | "14999" 931 | ] 932 | }, 933 | "execution_count": 10, 934 | "metadata": {}, 935 | "output_type": "execute_result" 936 | } 937 | ], 938 | "source": [ 939 | "counter" 940 | ] 941 | }, 942 | { 943 | "cell_type": "markdown", 944 | "metadata": {}, 945 | "source": [ 946 | "### 2-2-3. 
Padding: pad_sequences 1000 max_len" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 11, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "from keras.preprocessing.sequence import pad_sequences\n", 956 | "MAX_LEN = 1000\n", 957 | "pad_art_sequences = pad_sequences(article_sequences, maxlen=MAX_LEN, padding='post', truncating='post')" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 12, 963 | "metadata": {}, 964 | "outputs": [ 965 | { 966 | "name": "stdout", 967 | "output_type": "stream", 968 | "text": [ 969 | "243 1000\n" 970 | ] 971 | } 972 | ], 973 | "source": [ 974 | "print(len(article_sequences[1]), len(pad_art_sequences[1]))" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 13, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "pad_sum_sequences = pad_sequences(summary_sequences, maxlen=MAX_LEN, padding='post', truncating='post')" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 14, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "name": "stdout", 993 | "output_type": "stream", 994 | "text": [ 995 | "90 1000\n" 996 | ] 997 | } 998 | ], 999 | "source": [ 1000 | "print(len(summary_sequences[1]), len(pad_sum_sequences[1]))" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": 15, 1006 | "metadata": {}, 1007 | "outputs": [ 1008 | { 1009 | "data": { 1010 | "text/plain": [ 1011 | "(2225, 1000)" 1012 | ] 1013 | }, 1014 | "execution_count": 15, 1015 | "metadata": {}, 1016 | "output_type": "execute_result" 1017 | } 1018 | ], 1019 | "source": [ 1020 | "pad_art_sequences.shape" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "code", 1025 | "execution_count": 16, 1026 | "metadata": {}, 1027 | "outputs": [ 1028 | { 1029 | "data": { 1030 | "text/plain": [ 1031 | "array([[1411, 2338, 248, ..., 0, 0, 0],\n", 1032 | " [5211, 8881, 5211, ..., 0, 0, 0],\n", 1033 | " [ 478, 196, 1411, ..., 0, 0, 0],\n", 1034 | " ...,\n", 1035 | " [ 421, 1337, 2012, ..., 0, 0, 0],\n", 1036 | " [2164, 267, 1109, ..., 0, 0, 0],\n", 1037 | " [ 7, 284, 8, ..., 0, 0, 0]], dtype=int32)" 1038 | ] 1039 | }, 1040 | "execution_count": 16, 1041 | "metadata": {}, 1042 | "output_type": "execute_result" 1043 | } 1044 | ], 1045 | "source": [ 1046 | "pad_art_sequences" 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "markdown", 1051 | "metadata": {}, 1052 | "source": [ 1053 | "### 2-2-4. 
Reshape: manual max_len * one-hot matrix" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": null, 1059 | "metadata": {}, 1060 | "outputs": [], 1061 | "source": [ 1062 | "# unused\n", 1063 | "\"\"\"\n", 1064 | "encoder_inputs = np.zeros((2225, 1000), dtype='float32')\n", 1065 | "encoder_inputs.shape\n", 1066 | "\n", 1067 | "decoder_inputs = np.zeros((2225, 1000), dtype='float32')\n", 1068 | "decoder_inputs.shape\n", 1069 | "\n", 1070 | "for i, seqs in enumerate(pad_art_sequences):\n", 1071 | " for j, seq in enumerate(seqs):\n", 1072 | " encoder_inputs[i, j] = seq\n", 1073 | " \n", 1074 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1075 | " for j, seq in enumerate(seqs):\n", 1076 | " decoder_inputs[i, j] = seq\n", 1077 | "\"\"\"" 1078 | ] 1079 | }, 1080 | { 1081 | "cell_type": "code", 1082 | "execution_count": 55, 1083 | "metadata": {}, 1084 | "outputs": [ 1085 | { 1086 | "data": { 1087 | "text/plain": [ 1088 | "(2225, 1000, 15000)" 1089 | ] 1090 | }, 1091 | "execution_count": 55, 1092 | "metadata": {}, 1093 | "output_type": "execute_result" 1094 | } 1095 | ], 1096 | "source": [ 1097 | "decoder_outputs = np.zeros((2225, 1000, 15000), dtype='float32')\n", 1098 | "decoder_outputs.shape" 1099 | ] 1100 | }, 1101 | { 1102 | "cell_type": "code", 1103 | "execution_count": 56, 1104 | "metadata": {}, 1105 | "outputs": [], 1106 | "source": [ 1107 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1108 | " for j, seq in enumerate(seqs):\n", 1109 | " decoder_outputs[i, j, seq] = 1." 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "code", 1114 | "execution_count": 59, 1115 | "metadata": {}, 1116 | "outputs": [ 1117 | { 1118 | "data": { 1119 | "text/plain": [ 1120 | "(2225, 1000, 15000)" 1121 | ] 1122 | }, 1123 | "execution_count": 59, 1124 | "metadata": {}, 1125 | "output_type": "execute_result" 1126 | } 1127 | ], 1128 | "source": [ 1129 | "decoder_outputs.shape" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "markdown", 1134 | "metadata": {}, 1135 | "source": [ 1136 | "### 2-2-5. Pre-trained GloVe Embeddings and Embedding Matrix" 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "code", 1141 | "execution_count": 20, 1142 | "metadata": { 1143 | "colab": { 1144 | "base_uri": "https://localhost:8080/", 1145 | "height": 34 1146 | }, 1147 | "colab_type": "code", 1148 | "id": "sokvMj0Xfnjw", 1149 | "outputId": "3ef40349-e3e7-4f25-d263-7d316d21f829" 1150 | }, 1151 | "outputs": [ 1152 | { 1153 | "name": "stdout", 1154 | "output_type": "stream", 1155 | "text": [ 1156 | "Found 400000 word vectors.\n" 1157 | ] 1158 | } 1159 | ], 1160 | "source": [ 1161 | "embeddings_index = {}\n", 1162 | "with open('glove.6B.200d.txt', encoding='utf-8') as f:\n", 1163 | " for line in f:\n", 1164 | " values = line.split()\n", 1165 | " word = values[0]\n", 1166 | " coefs = np.asarray(values[1:], dtype='float32')\n", 1167 | " embeddings_index[word] = coefs\n", 1168 | "# the file is closed automatically by the with block\n", 1169 | "\n", 1170 | "print('Found %s word vectors.' 
% len(embeddings_index))" 1171 | ] 1172 | }, 1173 | { 1174 | "cell_type": "code", 1175 | "execution_count": 21, 1176 | "metadata": { 1177 | "colab": {}, 1178 | "colab_type": "code", 1179 | "id": "kdx3r2rvZAwN" 1180 | }, 1181 | "outputs": [], 1182 | "source": [ 1183 | "def embedding_matrix_creater(embedding_dimention, word_index):\n", 1184 | " embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))\n", 1185 | " for word, i in word_index.items():\n", 1186 | " embedding_vector = embeddings_index.get(word)\n", 1187 | " if embedding_vector is not None:\n", 1188 | " # words not found in embedding index will be all-zeros.\n", 1189 | " embedding_matrix[i] = embedding_vector\n", 1190 | " return embedding_matrix" 1191 | ] 1192 | }, 1193 | { 1194 | "cell_type": "code", 1195 | "execution_count": 22, 1196 | "metadata": { 1197 | "colab": { 1198 | "base_uri": "https://localhost:8080/", 1199 | "height": 34 1200 | }, 1201 | "colab_type": "code", 1202 | "id": "VB6UmXeqWleO", 1203 | "outputId": "59e76cc0-91a2-454b-a61e-f7cb1b2cf59a" 1204 | }, 1205 | "outputs": [ 1206 | { 1207 | "data": { 1208 | "text/plain": [ 1209 | "(15000, 200)" 1210 | ] 1211 | }, 1212 | "execution_count": 22, 1213 | "metadata": {}, 1214 | "output_type": "execute_result" 1215 | } 1216 | ], 1217 | "source": [ 1218 | "art_embedding_matrix = embedding_matrix_creater(200, word_index=art_word_index_1500)\n", 1219 | "art_embedding_matrix.shape" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 23, 1225 | "metadata": { 1226 | "colab": { 1227 | "base_uri": "https://localhost:8080/", 1228 | "height": 34 1229 | }, 1230 | "colab_type": "code", 1231 | "id": "1KgrdRFvb7A1", 1232 | "outputId": "e01036f5-476b-46eb-bec7-1ad263ba9dd0" 1233 | }, 1234 | "outputs": [ 1235 | { 1236 | "data": { 1237 | "text/plain": [ 1238 | "(15000, 200)" 1239 | ] 1240 | }, 1241 | "execution_count": 23, 1242 | "metadata": {}, 1243 | "output_type": "execute_result" 1244 | } 1245 | ], 1246 | "source": [ 1247 | "sum_embedding_matrix = embedding_matrix_creater(200, word_index=sum_word_index_1500)\n", 1248 | "sum_embedding_matrix.shape" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": 24, 1254 | "metadata": { 1255 | "colab": {}, 1256 | "colab_type": "code", 1257 | "id": "vuawT3rXUIBY" 1258 | }, 1259 | "outputs": [], 1260 | "source": [ 1261 | "encoder_embedding_layer = Embedding(input_dim = 15000, \n", 1262 | " output_dim = 200,\n", 1263 | " input_length = MAX_LEN,\n", 1264 | " weights = [art_embedding_matrix],\n", 1265 | " trainable = False)" 1266 | ] 1267 | }, 1268 | { 1269 | "cell_type": "code", 1270 | "execution_count": 25, 1271 | "metadata": { 1272 | "colab": {}, 1273 | "colab_type": "code", 1274 | "id": "Wf3EzUZcIccZ" 1275 | }, 1276 | "outputs": [], 1277 | "source": [ 1278 | "decoder_embedding_layer = Embedding(input_dim = 15000, \n", 1279 | " output_dim = 200,\n", 1280 | " input_length = MAX_LEN,\n", 1281 | " weights = [sum_embedding_matrix],\n", 1282 | " trainable = False)" 1283 | ] 1284 | }, 1285 | { 1286 | "cell_type": "code", 1287 | "execution_count": 26, 1288 | "metadata": { 1289 | "colab": { 1290 | "base_uri": "https://localhost:8080/", 1291 | "height": 34 1292 | }, 1293 | "colab_type": "code", 1294 | "id": "J5acWwCxIcZz", 1295 | "outputId": "a92cd279-b2de-4110-fd1d-35e13e90c099" 1296 | }, 1297 | "outputs": [ 1298 | { 1299 | "data": { 1300 | "text/plain": [ 1301 | "(15000, 200)" 1302 | ] 1303 | }, 1304 | "execution_count": 26, 1305 | "metadata": {}, 1306 | "output_type": "execute_result" 
1307 | } 1308 | ], 1309 | "source": [ 1310 | "sum_embedding_matrix.shape" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": { 1316 | "colab_type": "text", 1317 | "id": "AB663AaY0PKH" 1318 | }, 1319 | "source": [ 1320 | "## Step 3. Building Encoder-Decoder Model" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": 17, 1326 | "metadata": { 1327 | "colab": {}, 1328 | "colab_type": "code", 1329 | "id": "W-kZT0pQRqRl" 1330 | }, 1331 | "outputs": [], 1332 | "source": [ 1333 | "from numpy.random import seed\n", 1334 | "seed(1)\n", 1335 | "\n", 1336 | "from sklearn.model_selection import train_test_split\n", 1337 | "import logging\n", 1338 | "\n", 1339 | "import plotly.plotly as py\n", 1340 | "import plotly.graph_objs as go\n", 1341 | "import matplotlib.pyplot as plt\n", 1342 | "import pandas as pd\n", 1343 | "import pydot\n", 1344 | "\n", 1345 | "\n", 1346 | "import keras\n", 1347 | "from keras import backend as k\n", 1348 | "k.set_learning_phase(1)\n", 1349 | "from keras.preprocessing.text import Tokenizer\n", 1350 | "from keras import initializers\n", 1351 | "from keras.optimizers import RMSprop\n", 1352 | "from keras.models import Sequential,Model\n", 1353 | "from keras.layers import Dense,LSTM,Dropout,Input,Activation,Add,concatenate, Embedding, RepeatVector\n", 1354 | "from keras.layers.advanced_activations import LeakyReLU,PReLU\n", 1355 | "from keras.callbacks import ModelCheckpoint\n", 1356 | "from keras.models import load_model\n", 1357 | "from keras.optimizers import Adam" 1358 | ] 1359 | }, 1360 | { 1361 | "cell_type": "code", 1362 | "execution_count": 18, 1363 | "metadata": {}, 1364 | "outputs": [], 1365 | "source": [ 1366 | "from keras.layers import TimeDistributed" 1367 | ] 1368 | }, 1369 | { 1370 | "cell_type": "code", 1371 | "execution_count": 19, 1372 | "metadata": { 1373 | "colab": {}, 1374 | "colab_type": "code", 1375 | "id": "psWPxabWIcWz" 1376 | }, 1377 | "outputs": [], 1378 | "source": [ 1379 | "# Hyperparams\n", 1380 | "\n", 1381 | "MAX_LEN = 1000\n", 1382 | "VOCAB_SIZE =15000\n", 1383 | "EMBEDDING_DIM = 200\n", 1384 | "HIDDEN_UNITS = 200\n", 1385 | "VOCAB_SIZE = VOCAB_SIZE + 1\n", 1386 | "\n", 1387 | "LEARNING_RATE = 0.002\n", 1388 | "BATCH_SIZE = 32\n", 1389 | "EPOCHS = 5" 1390 | ] 1391 | }, 1392 | { 1393 | "cell_type": "markdown", 1394 | "metadata": {}, 1395 | "source": [ 1396 | "### Model 1. 
Simple LSTM Encoder-Decoder-seq2seq" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "code", 1401 | "execution_count": 74, 1402 | "metadata": { 1403 | "colab": {}, 1404 | "colab_type": "code", 1405 | "id": "pGjWMUG3IcT_" 1406 | }, 1407 | "outputs": [ 1408 | { 1409 | "name": "stdout", 1410 | "output_type": "stream", 1411 | "text": [ 1412 | "__________________________________________________________________________________________________\n", 1413 | "Layer (type) Output Shape Param # Connected to \n", 1414 | "==================================================================================================\n", 1415 | "input_3 (InputLayer) (None, 1000) 0 \n", 1416 | "__________________________________________________________________________________________________\n", 1417 | "input_4 (InputLayer) (None, 1000) 0 \n", 1418 | "__________________________________________________________________________________________________\n", 1419 | "embedding_1 (Embedding) (None, 1000, 200) 3000000 input_3[0][0] \n", 1420 | "__________________________________________________________________________________________________\n", 1421 | "embedding_2 (Embedding) (None, 1000, 200) 3000000 input_4[0][0] \n", 1422 | "__________________________________________________________________________________________________\n", 1423 | "lstm_4 (LSTM) (None, 200) 320800 embedding_1[1][0] \n", 1424 | "__________________________________________________________________________________________________\n", 1425 | "lstm_5 (LSTM) (None, 200) 320800 embedding_2[1][0] \n", 1426 | "__________________________________________________________________________________________________\n", 1427 | "concatenate_1 (Concatenate) (None, 400) 0 lstm_4[0][0] \n", 1428 | " lstm_5[0][0] \n", 1429 | "__________________________________________________________________________________________________\n", 1430 | "dense_2 (Dense) (None, 15002) 6015802 concatenate_1[0][0] \n", 1431 | "==================================================================================================\n", 1432 | "Total params: 12,657,402\n", 1433 | "Trainable params: 6,657,402\n", 1434 | "Non-trainable params: 6,000,000\n", 1435 | "__________________________________________________________________________________________________\n" 1436 | ] 1437 | } 1438 | ], 1439 | "source": [ 1440 | "\"\"\"\n", 1441 | "Simple LSTM Encoder-Decoder-seq2seq\n", 1442 | "\"\"\"\n", 1443 | "# encoder\n", 1444 | "encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1445 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1446 | "encoder_LSTM = LSTM(HIDDEN_UNITS)(encoder_embedding)\n", 1447 | "# decoder\n", 1448 | "decoder_inputs = Input(shape=(MAX_LEN, ))\n", 1449 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1450 | "decoder_LSTM = LSTM(200)(decoder_embedding)\n", 1451 | "# merge\n", 1452 | "merge_layer = concatenate([encoder_LSTM, decoder_LSTM])\n", 1453 | "decoder_outputs = Dense(units=VOCAB_SIZE+1, activation=\"softmax\")(merge_layer) # SUM_VOCAB_SIZE, sum_embedding_matrix.shape[1]\n", 1454 | "\n", 1455 | "model = Model([encoder_inputs, decoder_inputs], decoder_outputs)\n", 1456 | "model.summary()" 1457 | ] 1458 | }, 1459 | { 1460 | "cell_type": "code", 1461 | "execution_count": 75, 1462 | "metadata": { 1463 | "colab": {}, 1464 | "colab_type": "code", 1465 | "id": "JHw0flwp4uW2" 1466 | }, 1467 | "outputs": [], 1468 | "source": [ 1469 | "model.compile(optimizer=\"adam\", loss=\"categorical_crossentropy\", metrics=[\"accuracy\"])" 1470 | ] 1471 | }, 1472 | 
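{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick smoke test for Model 1 (an added sketch, not part of the original run): push two rows of random token ids through the compiled graph and check the output shape. Indices must stay below the Embedding layer's input_dim of 15000, and the final Dense layer has VOCAB_SIZE + 1 = 15002 units, so each prediction row should have 15002 entries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# added sketch: sanity-check the wiring of Model 1 with random ids\n", "dummy_art = np.random.randint(0, 15000, size=(2, MAX_LEN))\n", "dummy_sum = np.random.randint(0, 15000, size=(2, MAX_LEN))\n", "preds = model.predict([dummy_art, dummy_sum])\n", "print(preds.shape)  # expected: (2, 15002)" ] },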
{ 1473 | "cell_type": "markdown", 1474 | "metadata": {}, 1475 | "source": [ 1476 | "### Model 2. Bidirectional LSTM Encoder-Decoder-seq2seq" 1477 | ] 1478 | }, 1479 | { 1480 | "cell_type": "code", 1481 | "execution_count": 27, 1482 | "metadata": {}, 1483 | "outputs": [], 1484 | "source": [ 1485 | "\"\"\"\n", 1486 | "Bidirectional LSTM: Others Inspired Encoder-Decoder-seq2seq\n", 1487 | "\"\"\"\n", 1488 | "encoder_inputs = Input(shape=(MAX_LEN,))\n", 1489 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1490 | "encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True)\n", 1491 | "encoder_LSTM_R = LSTM(HIDDEN_UNITS, return_state=True, go_backwards=True)\n", 1492 | "encoder_outputs_R, state_h_R, state_c_R = encoder_LSTM_R(encoder_embedding)\n", 1493 | "encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)\n", 1494 | "\n", 1495 | "final_h = Add()([state_h, state_h_R])\n", 1496 | "final_c = Add()([state_c, state_c_R])\n", 1497 | "encoder_states = [final_h, final_c]\n", 1498 | "\n", 1499 | "\"\"\"\n", 1500 | "decoder\n", 1501 | "\"\"\"\n", 1502 | "decoder_inputs = Input(shape=(MAX_LEN,))\n", 1503 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1504 | "decoder_LSTM = LSTM(HIDDEN_UNITS, return_sequences=True, return_state=True)\n", 1505 | "decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=encoder_states) \n", 1506 | "decoder_dense = Dense(VOCAB_SIZE, activation='linear')\n", 1507 | "decoder_outputs = decoder_dense(decoder_outputs)\n", 1508 | "\n", 1509 | "model= Model(inputs=[encoder_inputs,decoder_inputs], outputs=decoder_outputs)" 1510 | ] 1511 | }, 1512 | { 1513 | "cell_type": "markdown", 1514 | "metadata": {}, 1515 | "source": [ 1516 | "### Model 3. Chatbot Inspired Encoder-Decoder-seq2seq" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "code", 1521 | "execution_count": 86, 1522 | "metadata": {}, 1523 | "outputs": [], 1524 | "source": [ 1525 | "\"\"\"\n", 1526 | "Chatbot Inspired Encoder-Decoder-seq2seq\n", 1527 | "\"\"\"\n", 1528 | "encoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1529 | "encoder_embedding = encoder_embedding_layer(encoder_inputs)\n", 1530 | "encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True)\n", 1531 | "encoder_outputs, state_h, state_c = encoder_LSTM(encoder_embedding)\n", 1532 | "\n", 1533 | "decoder_inputs = Input(shape=(MAX_LEN, ), dtype='int32',)\n", 1534 | "decoder_embedding = decoder_embedding_layer(decoder_inputs)\n", 1535 | "decoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True, return_sequences=True)\n", 1536 | "decoder_outputs, _, _ = decoder_LSTM(decoder_embedding, initial_state=[state_h, state_c])\n", 1537 | "\n", 1538 | "# dense_layer = Dense(VOCAB_SIZE, activation='softmax')\n", 1539 | "outputs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(decoder_outputs)\n", 1540 | "model = Model([encoder_inputs, decoder_inputs], outputs)" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": "code", 1545 | "execution_count": 28, 1546 | "metadata": {}, 1547 | "outputs": [], 1548 | "source": [ 1549 | "rmsprop = RMSprop(lr=0.01, clipnorm=1.)\n", 1550 | "model.compile(loss='mse', optimizer=rmsprop, metrics=[\"accuracy\"])" 1551 | ] 1552 | }, 1553 | { 1554 | "cell_type": "code", 1555 | "execution_count": 29, 1556 | "metadata": {}, 1557 | "outputs": [ 1558 | { 1559 | "name": "stdout", 1560 | "output_type": "stream", 1561 | "text": [ 1562 | "__________________________________________________________________________________________________\n", 1563 | "Layer (type) Output Shape Param # Connected 
to \n", 1564 | "==================================================================================================\n", 1565 | "input_1 (InputLayer) (None, 1000) 0 \n", 1566 | "__________________________________________________________________________________________________\n", 1567 | "embedding_1 (Embedding) (None, 1000, 200) 3000000 input_1[0][0] \n", 1568 | "__________________________________________________________________________________________________\n", 1569 | "input_2 (InputLayer) (None, 1000) 0 \n", 1570 | "__________________________________________________________________________________________________\n", 1571 | "lstm_1 (LSTM) [(None, 200), (None, 320800 embedding_1[0][0] \n", 1572 | "__________________________________________________________________________________________________\n", 1573 | "lstm_2 (LSTM) [(None, 200), (None, 320800 embedding_1[0][0] \n", 1574 | "__________________________________________________________________________________________________\n", 1575 | "embedding_2 (Embedding) (None, 1000, 200) 3000000 input_2[0][0] \n", 1576 | "__________________________________________________________________________________________________\n", 1577 | "add_1 (Add) (None, 200) 0 lstm_1[0][1] \n", 1578 | " lstm_2[0][1] \n", 1579 | "__________________________________________________________________________________________________\n", 1580 | "add_2 (Add) (None, 200) 0 lstm_1[0][2] \n", 1581 | " lstm_2[0][2] \n", 1582 | "__________________________________________________________________________________________________\n", 1583 | "lstm_3 (LSTM) [(None, 1000, 200), 320800 embedding_2[0][0] \n", 1584 | " add_1[0][0] \n", 1585 | " add_2[0][0] \n", 1586 | "__________________________________________________________________________________________________\n", 1587 | "dense_1 (Dense) (None, 1000, 15001) 3015201 lstm_3[0][0] \n", 1588 | "==================================================================================================\n", 1589 | "Total params: 9,977,601\n", 1590 | "Trainable params: 3,977,601\n", 1591 | "Non-trainable params: 6,000,000\n", 1592 | "__________________________________________________________________________________________________\n" 1593 | ] 1594 | } 1595 | ], 1596 | "source": [ 1597 | "# model 2\n", 1598 | "model.summary()" 1599 | ] 1600 | }, 1601 | { 1602 | "cell_type": "markdown", 1603 | "metadata": {}, 1604 | "source": [ 1605 | "## Step 4. 
Training and Validating the Model" 1606 | ] 1607 | }, 1608 | { 1609 | "cell_type": "code", 1610 | "execution_count": 30, 1611 | "metadata": {}, 1612 | "outputs": [], 1613 | "source": [ 1614 | "import numpy as np\n", 1615 | "num_samples = len(pad_sum_sequences)\n", 1616 | "decoder_output_data = np.zeros((num_samples, MAX_LEN, VOCAB_SIZE), dtype=\"int32\")" 1617 | ] 1618 | }, 1619 | { 1620 | "cell_type": "code", 1621 | "execution_count": 31, 1622 | "metadata": {}, 1623 | "outputs": [], 1624 | "source": [ 1625 | "# 3D one-hot tensor of decoder targets\n", 1626 | "for i, seqs in enumerate(pad_sum_sequences):\n", 1627 | " for j, seq in enumerate(seqs):\n", 1628 | " if j > 0:\n", 1629 | " decoder_output_data[i][j][seq] = 1" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "code", 1634 | "execution_count": 32, 1635 | "metadata": { 1636 | "colab": {}, 1637 | "colab_type": "code", 1638 | "id": "EFxSEKvfIoGf" 1639 | }, 1640 | "outputs": [], 1641 | "source": [ 1642 | "art_train, art_test, sum_train, sum_test, target_train, target_test = train_test_split(pad_art_sequences, pad_sum_sequences, decoder_output_data, test_size=0.2)" 1643 | ] 1644 | }, 1645 | { 1646 | "cell_type": "code", 1647 | "execution_count": 33, 1648 | "metadata": { 1649 | "colab": { 1650 | "base_uri": "https://localhost:8080/", 1651 | "height": 34 1652 | }, 1653 | "colab_type": "code", 1654 | "id": "9hn_1hm-JCa2", 1655 | "outputId": "95c79153-06ec-4b2d-b25e-6f34c933ae44" 1656 | }, 1657 | "outputs": [ 1658 | { 1659 | "data": { 1660 | "text/plain": [ 1661 | "1780" 1662 | ] 1663 | }, 1664 | "execution_count": 33, 1665 | "metadata": {}, 1666 | "output_type": "execute_result" 1667 | } 1668 | ], 1669 | "source": [ 1670 | "train_num = art_train.shape[0]\n", 1671 | "train_num" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "code", 1676 | "execution_count": 34, 1677 | "metadata": {}, 1678 | "outputs": [], 1679 | "source": [ 1680 | "# targets were split together with the inputs above; slicing decoder_output_data\n", 1681 | "# with [:train_num] would not match the shuffled train/test rows" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "code", 1686 | "execution_count": null, 1687 | "metadata": {}, 1688 | "outputs": [ 1689 | { 1690 | "name": "stdout", 1691 | "output_type": "stream", 1692 | "text": [ 1693 | "Train on 1780 samples, validate on 445 samples\n", 1694 | "Epoch 1/5\n", 1695 | " 896/1780 [==============>...............] 
- ETA: 1:46:08 - loss: 4.1167e-04 - acc: 0.8187" 1696 | ] 1697 | } 1698 | ], 1699 | "source": [ 1700 | "history = model.fit([art_train, sum_train], \n", 1701 | " target_train, \n", 1702 | " epochs=EPOCHS, \n", 1703 | " batch_size=BATCH_SIZE,\n", 1704 | " validation_data=([art_test, sum_test], target_test))" 1705 | ] 1706 | }, 1707 | { 1708 | "cell_type": "markdown", 1709 | "metadata": {}, 1710 | "source": [ 1711 | "#### Visualization" 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "code", 1716 | "execution_count": null, 1717 | "metadata": {}, 1718 | "outputs": [], 1719 | "source": [ 1720 | "# visualize training accuracy\n", 1721 | "import matplotlib.pyplot as plt\n", 1722 | "%matplotlib inline\n", 1723 | "\n", 1724 | "plt.figure(figsize=(10, 6))\n", 1725 | "plt.plot(history.history['acc'])\n", 1726 | "plt.plot(history.history['val_acc'])\n", 1727 | "plt.title('model accuracy')\n", 1728 | "plt.ylabel('accuracy')\n", 1729 | "plt.xlabel('epoch')\n", 1730 | "plt.legend(['train', 'test'], loc='upper left')\n", 1731 | "plt.show()" 1732 | ] 1733 | }, 1734 | { 1735 | "cell_type": "code", 1736 | "execution_count": null, 1737 | "metadata": {}, 1738 | "outputs": [], 1739 | "source": [ 1740 | "# visualize the training loss\n", 1741 | "plt.figure(figsize=(10, 6))\n", 1742 | "plt.plot(history.history['loss'])\n", 1743 | "# plt.plot(history.history['val_loss'])\n", 1744 | "plt.title('model loss')\n", 1745 | "plt.ylabel('loss')\n", 1746 | "plt.xlabel('epoch')\n", 1747 | "# plt.legend(['train', 'test'], loc='upper left')\n", 1748 | "plt.show()" 1749 | ] 1750 | }, 1751 | { 1752 | "cell_type": "code", 1753 | "execution_count": null, 1754 | "metadata": {}, 1755 | "outputs": [], 1756 | "source": [ 1757 | "# save the model architecture\n", 1758 | "with open('text_summary.json', 'w') as f: f.write(model.to_json())\n", 1759 | "\n", 1760 | "# save the weights\n", 1761 | "model.save_weights('text_summary.h5')\n", 1762 | "print(\"Saved Model!\")" 1763 | ] 1764 | } 1765 | ], 1766 | "metadata": { 1767 | "colab": { 1768 | "name": "input_matrix.ipynb", 1769 | "provenance": [], 1770 | "version": "0.3.2" 1771 | }, 1772 | "kernelspec": { 1773 | "display_name": "Python 3", 1774 | "language": "python", 1775 | "name": "python3" 1776 | }, 1777 | "language_info": { 1778 | "codemirror_mode": { 1779 | "name": "ipython", 1780 | "version": 3 1781 | }, 1782 | "file_extension": ".py", 1783 | "mimetype": "text/x-python", 1784 | "name": "python", 1785 | "nbconvert_exporter": "python", 1786 | "pygments_lexer": "ipython3", 1787 | "version": "3.5.1" 1788 | } 1789 | }, 1790 | "nbformat": 4, 1791 | "nbformat_minor": 1 1792 | } 1793 |
--------------------------------------------------------------------------------
/bidirectional_lstm_model.py:
--------------------------------------------------------------------------------

from numpy.random import seed
seed(1)

import numpy as np  # added: np.shape is used below but numpy was never imported
from sklearn.model_selection import train_test_split
import logging

import matplotlib.pyplot as plt
import pandas as pd
import pydot


import keras
from keras import backend as k
k.set_learning_phase(1)
from keras.preprocessing.text import Tokenizer
from keras import initializers
from keras.optimizers import RMSprop
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Dropout, Input, Activation, Add, Concatenate
from keras.layers.advanced_activations import LeakyReLU, PReLU
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras.optimizers import Adam

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# ========================================================================
# model params
# ========================================================================

en_shape = np.shape(train_data["article"][0])
de_shape = np.shape(train_data["summary"][0])

MAX_ART_LEN = 1000  # assumed: left blank in the original; matches MAX_LEN in the notebook
MAX_SUM_LEN = 1000  # assumed: left blank in the original
EMBEDDING_DIM = 1000
HIDDEN_UNITS = 200  # assumed: left blank in the original; matches the notebook

LEARNING_RATE = 0.002
BATCH_SIZE = 32
EPOCHS = 5

rmsprop = RMSprop(lr=LEARNING_RATE, clipnorm=1.0)


# ========================================================================
# model helpers
# ========================================================================

def bidirectional_lstm(data):
    """
    Encoder-Decoder-seq2seq
    """
    # encoder: two LSTMs read the embedded article forwards and backwards
    encoder_inputs = Input(shape=en_shape)
    encoder_LSTM = LSTM(HIDDEN_UNITS, dropout=0.2, recurrent_dropout=0.2, return_state=True)
    rev_encoder_LSTM = LSTM(HIDDEN_UNITS, return_state=True, go_backwards=True)

    encoder_outputs, state_h, state_c = encoder_LSTM(encoder_inputs)
    rev_encoder_outputs, rev_state_h, rev_state_c = rev_encoder_LSTM(encoder_inputs)

    # merge the forward and backward states
    final_state_h = Add()([state_h, rev_state_h])
    final_state_c = Add()([state_c, rev_state_c])

    encoder_states = [final_state_h, final_state_c]

    # decoder
    decoder_inputs = Input(shape=(None, de_shape[1]))
    decoder_LSTM = LSTM(HIDDEN_UNITS, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_LSTM(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(units=de_shape[1], activation="linear")
    decoder_outputs = decoder_dense(decoder_outputs)

    # modeling
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer=rmsprop, loss="mse", metrics=["accuracy"])

    print(model.summary())

    x_train, x_test, y_train, y_test = train_test_split(data["article"], data["summaries"], test_size=0.2)
    model.fit([x_train, y_train], y_train, batch_size=BATCH_SIZE,
              epochs=EPOCHS, verbose=1, validation_data=([x_test, y_test], y_test))

    """
    Inference models
    """
    encoder_model_inf = Model(encoder_inputs, encoder_states)

    decoder_state_input_H = Input(shape=(HIDDEN_UNITS,))
    decoder_state_input_C = Input(shape=(HIDDEN_UNITS,))

    decoder_state_inputs = [decoder_state_input_H, decoder_state_input_C]
    decoder_outputs, decoder_state_h, decoder_state_c = decoder_LSTM(decoder_inputs, initial_state=decoder_state_inputs)

    decoder_states = [decoder_state_h, decoder_state_c]
    decoder_outputs = decoder_dense(decoder_outputs)

    decoder_model_inf = Model([decoder_inputs] + decoder_state_inputs,
                              [decoder_outputs] + decoder_states)

    scores = model.evaluate([x_test, y_test], y_test, verbose=0)

    print('LSTM test scores:', scores)
    print('\007')

    return model, encoder_model_inf, decoder_model_inf

"""
model._make_predict_function()
"""
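
# ------------------------------------------------------------------------
# Added sketch (not in the original file): greedy decoding with the two
# inference models returned above. Assumes `article` is a single encoder
# input of shape (1,) + en_shape, and that each predicted vector is mapped
# back to a word elsewhere (e.g. nearest neighbour in the word2vec space).
# ------------------------------------------------------------------------
def generate_summary(article, encoder_model_inf, decoder_model_inf, max_steps=50):
    states = encoder_model_inf.predict(article)   # [h, c] summary of the article
    target_seq = np.zeros((1, 1, de_shape[1]))    # all-zero vector as the start token
    decoded = []
    for _ in range(max_steps):
        outputs, h, c = decoder_model_inf.predict([target_seq] + states)
        decoded.append(outputs[0, -1, :])         # keep the newest step
        target_seq = outputs[:, -1:, :]           # feed the prediction back in
        states = [h, c]                           # carry the LSTM state forward
    return np.array(decoded)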
--------------------------------------------------------------------------------
/data_shaper.py:
--------------------------------------------------------------------------------

from numpy.random import seed
seed(1)

import gensim as gs
import pandas as pd
import numpy as np
import scipy as sc
import nltk
from nltk.tokenize import word_tokenize as wt
from nltk.tokenize import sent_tokenize as st
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.preprocessing.sequence import pad_sequences
import logging
import re
from collections import Counter

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# ========================================================================
#
# ========================================================================

# emb_size_all = 300
# maxcorp=5000
EMBEDDING_DIM = 300
VOCAB_SIZE = 5000
modelLocation = "./"  # assumed save directory; the original never defined this name


def createCorpus(t):
    """Split every document in t (a dict of lists of texts) into
    lower-cased, word-tokenized sentences."""
    corpus = []
    all_sent = []
    for k in t:
        for p in t[k]:
            corpus.append(st(p))
    for sent in range(len(corpus)):
        for k in corpus[sent]:
            all_sent.append(k)
    for m in range(len(all_sent)):
        all_sent[m] = wt(all_sent[m])

    all_words = []
    for sent in all_sent:
        hold = []
        for word in sent:
            hold.append(word.lower())
        all_words.append(hold)
    return all_words


def word2vecmodel(corpus):
    emb_size = EMBEDDING_DIM  # was emb_size_all, which only existed in a comment above
    model_type = {"skip_gram": 1, "CBOW": 0}
    window = 10
    workers = 4
    min_count = 4
    batch_words = 20
    epochs = 25
    # include bigrams
    # bigramer = gs.models.Phrases(corpus)

    model = gs.models.Word2Vec(corpus, size=emb_size, sg=model_type["skip_gram"],
                               compute_loss=True, window=window, min_count=min_count,
                               workers=workers, batch_words=batch_words)

    model.train(corpus, total_examples=len(corpus), epochs=epochs)
    model.save("%sWord2vec" % modelLocation)
    print('\007')
    return model


"""
generate summary length for test
"""

# train_data["nums_summ"]=list(map(lambda x:0 if len(x)<5000 else 1,data["articles"]))
# train_data["nums_summ"]=list(map(len,data["summaries"]))
# train_data["nums_summ_norm"]=(np.array(train_data["nums_summ"])-min(train_data["nums_summ"]))/(max(train_data["nums_summ"])-min(train_data["nums_summ"]))
--------------------------------------------------------------------------------
/text_cleaner.py:
--------------------------------------------------------------------------------
import string
import re


def cleantext(text):
    text = str(text)
    text = text.lower()
    # expand contractions BEFORE filtering out non-alphabetic tokens; in the
    # original the isalpha() filter ran first, which silently removed every
    # token containing an apostrophe or a dot before these rules could fire
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"it's", "it is ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'s", "s", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"can't", " cannot ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r"e-mail", "email", text)
    text = re.sub(r"9\\/11", " 911 ", text)
    text = re.sub(r" u.s", " american ", text)
    text = re.sub(r" u.n", " united nations ", text)
    text = re.sub(r"\n", " ", text)
    text = re.sub(r":", " ", text)
    text = re.sub(r"-", " ", text)
    text = re.sub(r"\_", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[$#@%&*!~?%{}()]", " ", text)
    # keep purely alphabetic tokens only
    words = [t for t in text.split() if t.isalpha()]
    text = " ".join(words)

    return text
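
# Added usage example (not in the original file): a quick demonstration of
# what cleantext() does to one raw sentence.
if __name__ == "__main__":
    sample = "It's 2005 and the U.S. e-mail market isn't slowing down!"
    print(cleantext(sample))
    # prints a lowercased string with digits and punctuation stripped and the
    # contraction rules above applied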
--------------------------------------------------------------------------------
/text_summary_model.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/samurainote/Text_Summarization_using_Bidirectional_LSTM/711c30c8d05a07ac32f22a6e9b54238d316a127d/text_summary_model.py
--------------------------------------------------------------------------------