├── 01_sparse_vector.ipynb ├── 02_custom_embedding.ipynb ├── 03_word2vec.ipynb ├── 04_ngram_cnn.ipynb ├── 05_language_model_basic.ipynb ├── 06_language_model_rnn.ipynb ├── 07_encoder_decoder.ipynb ├── 08_attention.ipynb ├── 09_transformer.ipynb ├── Readme.md └── images ├── 1d_conv_net.png ├── attend_image.png ├── bidirectional_rnn.png ├── bigram_convolution.png ├── conditioned_context.png ├── continuous_bow.png ├── count_vectorize.png ├── decoder_attention.png ├── deep_rnn.png ├── dense_vectorize.png ├── embedding_layer.png ├── embedding_matrix.png ├── encoder_all.png ├── encoder_decoder.png ├── encoder_decoder_attention.png ├── encoder_final.png ├── gru_gate.png ├── index_vectorize.png ├── index_vectorize2.png ├── language_model_beginning.png ├── machine_translation.png ├── machine_translation2.png ├── multi_head_attention.png ├── region_separation.png ├── rnn_architecture.png ├── rnn_network.png ├── rnn_packed_sequence.png ├── separate_sequence_for_next_words.png ├── skip_gram.png ├── soft_attention.png ├── task_layer.png ├── transformer.png ├── transformer3_dec_only.png ├── transformer3_enc_and_dec.png ├── transformer3_enc_only.png ├── transformer_attention.png ├── transformer_causal_attention.png ├── transformer_causal_reference.png ├── transformer_decoder.png ├── transformer_decoding_layer.png ├── transformer_encoder.png ├── transformer_encoding_layer.png ├── transformer_positional_encoding.png ├── transformer_residual01.png ├── transformer_residual02.png ├── transformer_self_attention.png ├── word2vec_network.png └── word_embedding.png /01_sparse_vector.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Primitive Embeddings (Sparse Vector)\n", 8 | "\n", 9 | "For the first tutorial, here I show you primitive embeddings (preprocessing, featurizing, or vectorizing) for languages.\n", 10 | "\n", 11 | "As you can see in the later tutorials, embeddings in this example is very beginning and will not be used in practices. But it will be a good example for your first understanding NLP.\n", 12 | "\n", 13 | "There are many types of embeddings - such as, character embedding, word embedding, sentence embedding, or document embedding, and I'll show you sentence vectorization in this notebook.\n", 14 | "\n", 15 | "*back to [index](https://github.com/tsmatz/nlp-tutorials/)*" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Install required packages" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "!pip install scikit-learn nltk pandas" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "import nltk\n", 41 | "nltk.download(\"popular\")\n", 42 | "nltk.download('punkt_tab')" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Count Vectorize\n", 50 | "\n", 51 | "One of primitive method to vectorize a text is count vectorization.
\n", 52 | "This method is based on one hot vectorizing and each element represents the count of that word in a document as follows.\n", 53 | "\n", 54 | "![Count vectorize](images/count_vectorize.png)\n", 55 | "\n", 56 | "Count vectorization is very straighforward and comprehensive for humans, but it'll build sparse vectors (in which, almost elements are zero) and also resource-intensive. I note that it will then waste a lot of time and resources for large data." 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 1, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "data": { 66 | "text/html": [ 67 | "
\n", 68 | "\n", 81 | "\n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | "
aandarebookhereismypenthesethis
01001010001
10110111210
\n", 126 | "
" 127 | ], 128 | "text/plain": [ 129 | " a and are book here is my pen these this\n", 130 | "0 1 0 0 1 0 1 0 0 0 1\n", 131 | "1 0 1 1 0 1 1 1 2 1 0" 132 | ] 133 | }, 134 | "execution_count": 1, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "from sklearn.feature_extraction.text import CountVectorizer\n", 141 | "from nltk import word_tokenize\n", 142 | "from nltk.stem import WordNetLemmatizer\n", 143 | "import pandas as pd\n", 144 | "\n", 145 | "lemmatizer = WordNetLemmatizer()\n", 146 | "\n", 147 | "# Convert :\n", 148 | "# \"pens\" -> \"pen\"\n", 149 | "# \"wolves\" -> \"wolf\"\n", 150 | "def my_lemmatizer(text):\n", 151 | " return [lemmatizer.lemmatize(t) for t in word_tokenize(text)]\n", 152 | "\n", 153 | "vectorizer = CountVectorizer(\n", 154 | " tokenizer=my_lemmatizer)\n", 155 | "texts = [\n", 156 | " \"This is a book\",\n", 157 | " \"These are pens and my pen is here\"\n", 158 | "]\n", 159 | "vectors = vectorizer.fit_transform(texts)\n", 160 | "\n", 161 | "cols = [k for k, v in sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])]\n", 162 | "df = pd.DataFrame(vectors.toarray(), columns=cols)\n", 163 | "df" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Hence this vectorization often results into low performance (low accuracy) in several ML use-cases. (Since the neural network won't work well with very high-dimensional and sparse vectors.)
\n", 171 | "The following is the example for classifying document into 20 e-mail groups.\n", 172 | "\n", 173 | "> Note : In the real usage, please train with unknown words with a specific symbol, such as \"[UNK]\"." 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 2, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stderr", 183 | "output_type": "stream", 184 | "text": [ 185 | "/home/tsmatsuz/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'\n", 186 | " warnings.warn(\"The parameter 'token_pattern' will not be used\"\n" 187 | ] 188 | }, 189 | { 190 | "name": "stdout", 191 | "output_type": "stream", 192 | "text": [ 193 | "classification accuracy: 0.6240042485395645\n" 194 | ] 195 | } 196 | ], 197 | "source": [ 198 | "from sklearn.datasets import fetch_20newsgroups\n", 199 | "from sklearn.naive_bayes import MultinomialNB\n", 200 | "from sklearn import metrics\n", 201 | "\n", 202 | "# Load train dataset\n", 203 | "train = fetch_20newsgroups(\n", 204 | " subset=\"train\",\n", 205 | " remove=(\"headers\", \"footers\", \"quotes\"))\n", 206 | "\n", 207 | "# Count vectorize\n", 208 | "vectorizer.fit(train.data)\n", 209 | "X_trian = vectorizer.transform(train.data)\n", 210 | "y_train = train.target\n", 211 | "\n", 212 | "# Train\n", 213 | "clf = MultinomialNB(alpha=.01)\n", 214 | "clf.fit(X_trian, y_train)\n", 215 | "\n", 216 | "# Evaluate accuracy\n", 217 | "test = fetch_20newsgroups(\n", 218 | " subset=\"test\",\n", 219 | " remove=(\"headers\", \"footers\", \"quotes\"))\n", 220 | "X_test = vectorizer.transform(test.data)\n", 221 | "y_test = test.target\n", 222 | "y_pred = clf.predict(X_test)\n", 223 | "score = metrics.accuracy_score(y_test, y_pred)\n", 224 | "print(\"classification accuracy: {}\".format(score))" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "## TF-IDF weighting\n", 232 | "\n", 233 | "In above example, the weight of word \"book\" or \"pen\" is the same as the weight of words \"a\", \"for\", \"the\", etc.
\n", 234 | "Using TF-IDF, you can prioritize the words that rarely appear in the given corpus.\n", 235 | "\n", 236 | "TF (=**T**erm **F**requency) is\n", 237 | "\n", 238 | "$$ \\frac{\\#d(w)}{\\sum_{w^{\\prime} \\in d} \\#d(w^{\\prime})} $$\n", 239 | "\n", 240 | "in which, $ \\#d(w) $ means the count of word $w$ in document $d$.
\n", 241 | "TF is the normalized value of the count of word $w$ in document $d$. \n", 242 | "\n", 243 | "TF-IDF (=**I**nverse **D**ocument **F**requency) is\n", 244 | "\n", 245 | "$$ \\frac{\\#d(w)}{\\sum_{w^{\\prime} \\in d} \\#d(w^{\\prime})} \\times \\log{\\frac{|D|}{|\\{d \\in D:w\\in d\\}|}}$$\n", 246 | "\n", 247 | "where $D$ is large corpus (a set of documents).\n", 248 | "\n", 249 | "If some word $w$ (such like, \"a\", \"the\") is included in all document $d \\in D$, the second term will be relatively small. If some word is rarely included in $d \\in D$, the second term will be relatively large." 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "Let's see the following example.
" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 3, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", 266 | "from nltk import word_tokenize\n", 267 | "from nltk.stem import WordNetLemmatizer\n", 268 | "\n", 269 | "lemmatizer = WordNetLemmatizer()\n", 270 | "\n", 271 | "# Convert :\n", 272 | "# \"pens\" -> \"pen\"\n", 273 | "# \"wolves\" -> \"wolf\"\n", 274 | "def my_lemmatizer(text):\n", 275 | " return [lemmatizer.lemmatize(t) for t in word_tokenize(text)]\n", 276 | "\n", 277 | "# Count vectorize\n", 278 | "count_vectorizer = CountVectorizer(tokenizer=my_lemmatizer)\n", 279 | "texts = [\n", 280 | " \"This is a book\",\n", 281 | " \"These are pens and my pen is here\"\n", 282 | "]\n", 283 | "count_vectors = count_vectorizer.fit_transform(texts)\n", 284 | "\n", 285 | "# TF-IDF weighting\n", 286 | "tfidf_trans = TfidfTransformer(use_idf=True).fit(count_vectors)\n", 287 | "tfidf_vectors = tfidf_trans.transform(count_vectors)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "As you can see above, only the word \"is\" is included in both documents. The word \"pen\" is also used twice, however, this word is not used in the first document.
\n", 295 | "As a result, only the word \"is\" has small value for IDF weights." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 4, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "data": { 305 | "text/html": [ 306 | "
\n", 307 | "\n", 320 | "\n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | "
aandarebookhereismypenthesethis
01.4054651.4054651.4054651.4054651.4054651.01.4054651.4054651.4054651.405465
\n", 352 | "
" 353 | ], 354 | "text/plain": [ 355 | " a and are book here is my pen \\\n", 356 | "0 1.405465 1.405465 1.405465 1.405465 1.405465 1.0 1.405465 1.405465 \n", 357 | "\n", 358 | " these this \n", 359 | "0 1.405465 1.405465 " 360 | ] 361 | }, 362 | "execution_count": 4, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "cols = [k for k, v in sorted(count_vectorizer.vocabulary_.items(), key=lambda item: item[1])]\n", 369 | "df = pd.DataFrame([tfidf_trans.idf_], columns=cols)\n", 370 | "df" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "The generated vectors has the following values.
\n", 378 | "As you can see below, the word \"is\" has relatively small value compared with other words in the same document.
\n", 379 | "The second document (\"These are pens and my pen is here\") has more words than the first document (\"This is a book\"), and then TF values (normalized values) in the second document are small rather than ones in the first document.
\n", 380 | "The word \"pen\" appears in the second documnt twice, and it then has 2x values compared to other words in this document." 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": 5, 386 | "metadata": {}, 387 | "outputs": [ 388 | { 389 | "data": { 390 | "text/html": [ 391 | "
\n", 392 | "\n", 405 | "\n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | "
aandarebookhereismypenthesethis
00.5340460.0000000.0000000.5340460.0000000.3799780.0000000.0000000.0000000.534046
10.0000000.3243360.3243360.0000000.3243360.2307680.3243360.6486730.3243360.000000
\n", 450 | "
" 451 | ], 452 | "text/plain": [ 453 | " a and are book here is my \\\n", 454 | "0 0.534046 0.000000 0.000000 0.534046 0.000000 0.379978 0.000000 \n", 455 | "1 0.000000 0.324336 0.324336 0.000000 0.324336 0.230768 0.324336 \n", 456 | "\n", 457 | " pen these this \n", 458 | "0 0.000000 0.000000 0.534046 \n", 459 | "1 0.648673 0.324336 0.000000 " 460 | ] 461 | }, 462 | "execution_count": 5, 463 | "metadata": {}, 464 | "output_type": "execute_result" 465 | } 466 | ], 467 | "source": [ 468 | "df = pd.DataFrame(tfidf_vectors.toarray(), columns=cols)\n", 469 | "df" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "Let's see the example for classifying text into 20 e-mail groups. (Compare the result with the previous one.)" 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 6, 482 | "metadata": {}, 483 | "outputs": [ 484 | { 485 | "name": "stderr", 486 | "output_type": "stream", 487 | "text": [ 488 | "/home/tsmatsuz/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'\n", 489 | " warnings.warn(\"The parameter 'token_pattern' will not be used\"\n" 490 | ] 491 | }, 492 | { 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "classification accuracy: 0.6964949548592672\n" 497 | ] 498 | } 499 | ], 500 | "source": [ 501 | "from sklearn.datasets import fetch_20newsgroups\n", 502 | "from sklearn.naive_bayes import MultinomialNB\n", 503 | "from sklearn import metrics\n", 504 | "\n", 505 | "# Load train dataset\n", 506 | "train = fetch_20newsgroups(\n", 507 | " subset=\"train\",\n", 508 | " remove=(\"headers\", \"footers\", \"quotes\"))\n", 509 | "\n", 510 | "# Count vectorize\n", 511 | "count_vectorizer.fit(train.data)\n", 512 | "X_train_count = count_vectorizer.transform(train.data)\n", 513 | "\n", 514 | "# TF-IDF weighting\n", 515 | "tfidf_trans = TfidfTransformer(use_idf=True).fit(X_train_count)\n", 516 | "X_train_tfidf = tfidf_trans.transform(X_train_count)\n", 517 | "\n", 518 | "# Train\n", 519 | "y_train = train.target\n", 520 | "clf = MultinomialNB(alpha=.01)\n", 521 | "clf.fit(X_train_tfidf, y_train)\n", 522 | "\n", 523 | "# Evaluate accuracy\n", 524 | "test = fetch_20newsgroups(\n", 525 | " subset=\"test\",\n", 526 | " remove=(\"headers\", \"footers\", \"quotes\"))\n", 527 | "X_test_count = count_vectorizer.transform(test.data)\n", 528 | "X_test_tfidf = tfidf_trans.transform(X_test_count)\n", 529 | "y_pred = clf.predict(X_test_tfidf)\n", 530 | "y_test = test.target\n", 531 | "score = metrics.accuracy_score(y_test, y_pred)\n", 532 | "print(\"classification accuracy: {}\".format(score))" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "**TF-IDF can also be applied to dense vectors** as follows :\n", 540 | "\n", 541 | "$$ \\frac{1}{\\sum_{i=1}^{k} \\verb|tfidf|(w_i)} \\sum_{i=1}^{k} \\verb|tfidf|(w_i) v(w_i) $$\n", 542 | "\n", 543 | "where $v(\\cdot)$ is word's vectorization (dense vector) and $\\verb|tfidf|(\\cdot)$ is TF-IDF weighting.\n", 544 | "\n", 545 | "See the next exercise for dense vector representation." 
546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": null, 551 | "metadata": {}, 552 | "outputs": [], 553 | "source": [] 554 | } 555 | ], 556 | "metadata": { 557 | "kernelspec": { 558 | "display_name": "Python 3 (ipykernel)", 559 | "language": "python", 560 | "name": "python3" 561 | }, 562 | "language_info": { 563 | "codemirror_mode": { 564 | "name": "ipython", 565 | "version": 3 566 | }, 567 | "file_extension": ".py", 568 | "mimetype": "text/x-python", 569 | "name": "python", 570 | "nbconvert_exporter": "python", 571 | "pygments_lexer": "ipython3", 572 | "version": "3.10.12" 573 | } 574 | }, 575 | "nbformat": 4, 576 | "nbformat_minor": 4 577 | } 578 | -------------------------------------------------------------------------------- /05_language_model_basic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Neural Language Model - Basic (Word Prediction Example)\n", 8 | "\n", 9 | "In this example, I'll show an example of simple language model.
\n", 10 | "In general, the language model is used for a variety of NLP tasks, such as, translation, transcription, summarization, question-answering, etc.\n", 11 | "\n", 12 | "For the purpose of your beginning, here we just train language model for text generation (i.e, next word prediction) with primitive neural networks.\n", 13 | "\n", 14 | "Unlike previous examples (from exercise 01 to 04), language model will recognize the order of words in the sequence. (You don't need other special architecture to detect the sequence of words, such as 1D convolution, any more.)
\n", 15 | "RNN-based specialized architecture (such as, LSTM, GRU, etc) can also be used to train in advanced language model. Furthermore, a lot of transformer-based algorithms are widely used in today's SOTA language models.
\n", 16 | "You will see these advanced language models in the later exercises. (See exercise 06 - 09.)
\n", 17 | "In this example, I'll briefly apply primitive feed-forward networks.\n", 18 | "\n", 19 | "See the following diagram for entire network in this primitive example.
\n", 20 | "First in this network, the sequence of last 5 words is embedded into the list of vectors. Embedded vectors are then concatenated into a single vector, and this vector is used for the next word's prediction.\n", 21 | "\n", 22 | "![Model in this exercise](images/language_model_beginning.png)\n", 23 | "\n", 24 | "Thereby, I note that this model won't care the long past context.
\n", 25 | "For example, even when the following sentence is given, \n", 26 | "\n", 27 | "\"In the United States, the president has now been\"\n", 28 | "\n", 29 | "it won't care the context \"In the United States\" when it refers the last 5 words in the network. (It might then predict the incorrect word in this context and the accuracy won't also be so high in this example. In the later examples, we will address this problem.)\n", 30 | "\n", 31 | "Nevertheless, the neural language models will be well-generalized more than traditional statistical models for unseen data. For instance, if \"red shirt\" and \"blud shirt\" occurs in training set, \"green shirt\" (which is not seen in training set) will also be predicted by the trained neural model, because the model knows that \"red\", \"blue\", and \"green\" occur in the same context.\n", 32 | "\n", 33 | "As you can see in this example, the language model can be trained with large unlabeled data (not needing for the labeled data), and this approach is very important for the growth of today's neural language models. This learning method is called **self-supervised learning**.
\n", 34 | "A lot of today's SOTA algorithms (such as, BERT, T5, GPT-2, etc) learn a lot of language properties with large corpus in this unsupervised way (such as, masked word's prediction, next word's prediction), and can then be fine-tuned for specific downstream tasks with small amount of labeled data by transfer approach.\n", 35 | "\n", 36 | "As you saw in [custom embedding example](./02_custom_embedding.ipynb), the word embedding will also be a byproduct in this example.\n", 37 | "\n", 38 | "> Note : In these examples of this repository, I'll apply **word-level (word-to-word)** tokenization, but you can also use **character-level (character-to-character)** model, which can learn unseen words with signals - such as, prefixes (e.g, \"un...\", \"dis...\"), suffixes (e.g, \"...ed\", \"...ing\"), capitalization, or presence of certain characters (e.g, hyphen, digits), etc.
\n", 39 | "> Subword tokenization is the popular method used in today's architecture (such as, Byte Pair Encoding in GPT-2), in which a set of commonly occurring word segments (like \"cious\", \"ing\", \"pre\", etc) is involved in a vocabulary list.
\n", 40 | "> See [here](https://tsmatz.wordpress.com/2022/10/24/huggingface-japanese-ner-named-entity-recognition/) for SentencePiece tokenization in non-English languages.\n", 41 | "\n", 42 | "*back to [index](https://github.com/tsmatz/nlp-tutorials/)*" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Install required packages" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "!pip install torch pandas numpy nltk" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Prepare data" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | "Same as [this example](./03_word2vec.ipynb), here I also use short description text in news papers dataset.
\n", 73 | "Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset/versions/2) (collected by HuffPost) in Kaggle." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 2, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/plain": [ 84 | "0 She left her husband. He killed their children...\n", 85 | "1 Of course it has a song.\n", 86 | "2 The actor and his longtime girlfriend Anna Ebe...\n", 87 | "3 The actor gives Dems an ass-kicking for not fi...\n", 88 | "4 The \"Dietland\" actress said using the bags is ...\n", 89 | " ... \n", 90 | "200848 Verizon Wireless and AT&T are already promotin...\n", 91 | "200849 Afterward, Azarenka, more effusive with the pr...\n", 92 | "200850 Leading up to Super Bowl XLVI, the most talked...\n", 93 | "200851 CORRECTION: An earlier version of this story i...\n", 94 | "200852 The five-time all-star center tore into his te...\n", 95 | "Name: short_description, Length: 200853, dtype: object" 96 | ] 97 | }, 98 | "execution_count": 2, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "import pandas as pd\n", 105 | "\n", 106 | "df = pd.read_json(\"News_Category_Dataset_v2.json\",lines=True)\n", 107 | "train_data = df[\"short_description\"]\n", 108 | "train_data" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "To get the better performance (accuracy), we standarize the input text as follows.\n", 116 | "- Make all words to lowercase in order to reduce words\n", 117 | "- Make \"-\" (hyphen) to space\n", 118 | "- Remove all punctuation except \" ' \" (e.g, Ken's bag) and \"&\" (e.g, AT&T)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 3, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "0 she left her husband he killed their children ...\n", 130 | "1 of course it has a song\n", 131 | "2 the actor and his longtime girlfriend anna ebe...\n", 132 | "3 the actor gives dems an ass kicking for not fi...\n", 133 | "4 the dietland actress said using the bags is a ...\n", 134 | " ... 
\n", 135 | "200848 verizon wireless and at&t are already promotin...\n", 136 | "200849 afterward azarenka more effusive with the pres...\n", 137 | "200850 leading up to super bowl xlvi the most talked ...\n", 138 | "200851 correction an earlier version of this story in...\n", 139 | "200852 the five time all star center tore into his te...\n", 140 | "Name: short_description, Length: 200853, dtype: object" 141 | ] 142 | }, 143 | "execution_count": 3, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "train_data = train_data.str.lower()\n", 150 | "train_data = train_data.str.replace(\"-\", \" \", regex=True)\n", 151 | "train_data = train_data.str.replace(r\"[^'\\&\\w\\s]\", \"\", regex=True)\n", 152 | "train_data = train_data.str.strip()\n", 153 | "train_data" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "Finally we add `````` and `````` tokens in each sequence as follows, because these are important information for learning the ordered sequence.\n", 161 | "\n", 162 | "```this is a pen``` --> ``` this is a pen ```" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 4, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "' she left her husband he killed their children just another day in america '" 174 | ] 175 | }, 176 | "execution_count": 4, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "train_data = [\" \".join([\"\", x, \"\"]) for x in train_data]\n", 183 | "# print first row\n", 184 | "train_data[0]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "## Generate sequence inputs\n", 192 | "\n", 193 | "Same as in previous examples, we will generate the sequence of word's indices (i.e, tokenize) from text.\n", 194 | "\n", 195 | "![Index vectorize](images/index_vectorize.png)\n", 196 | "\n", 197 | "First we create a list of vocabulary (```vocab```)." 
198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 5, 203 | "metadata": {}, 204 | "outputs": [], 205 | "source": [ 206 | "from nltk.tokenize import SpaceTokenizer\n", 207 | "\n", 208 | "###\n", 209 | "# define Vocab\n", 210 | "###\n", 211 | "class Vocab:\n", 212 | " def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):\n", 213 | " # count vocab frequency\n", 214 | " vocab_freq = {}\n", 215 | " tokens = tokenization(list_of_sentence)\n", 216 | " for t in tokens:\n", 217 | " for vocab in t:\n", 218 | " if vocab not in vocab_freq:\n", 219 | " vocab_freq[vocab] = 0 \n", 220 | " vocab_freq[vocab] += 1\n", 221 | " # sort by frequency\n", 222 | " vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}\n", 223 | " # create vocab list\n", 224 | " self.vocabs = [special_token] + list(vocab_freq.keys())\n", 225 | " if max_tokens:\n", 226 | " self.vocabs = self.vocabs[:max_tokens]\n", 227 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 228 | "\n", 229 | " def _get_tokens(self, list_of_sentence):\n", 230 | " for sentence in list_of_sentence:\n", 231 | " tokens = tokenizer.tokenize(sentence)\n", 232 | " yield tokens\n", 233 | "\n", 234 | " def get_itos(self):\n", 235 | " return self.vocabs\n", 236 | "\n", 237 | " def get_stoi(self):\n", 238 | " return self.stoi\n", 239 | "\n", 240 | " def append_token(self, token):\n", 241 | " self.vocabs.append(token)\n", 242 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 243 | "\n", 244 | " def __call__(self, list_of_tokens):\n", 245 | " def get_token_index(token):\n", 246 | " if token in self.stoi:\n", 247 | " return self.stoi[token]\n", 248 | " else:\n", 249 | " return 0\n", 250 | " return [get_token_index(t) for t in list_of_tokens]\n", 251 | "\n", 252 | " def __len__(self):\n", 253 | " return len(self.vocabs)\n", 254 | "\n", 255 | "###\n", 256 | "# generate Vocab\n", 257 | "###\n", 258 | "max_word = 50000\n", 259 | "\n", 260 | "# create tokenizer\n", 261 | "tokenizer = SpaceTokenizer()\n", 262 | "\n", 263 | "# define tokenization function\n", 264 | "def yield_tokens(data):\n", 265 | " for text in data:\n", 266 | " tokens = tokenizer.tokenize(text)\n", 267 | " yield tokens\n", 268 | "\n", 269 | "# build vocabulary list\n", 270 | "vocab = Vocab(\n", 271 | " train_data,\n", 272 | " tokenization=yield_tokens,\n", 273 | " special_token=\"\",\n", 274 | " max_tokens=max_word,\n", 275 | ")\n", 276 | "\n", 277 | "# get list for index-to-word, and word-to-index.\n", 278 | "itos = vocab.get_itos()\n", 279 | "stoi = vocab.get_stoi()" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "In this example, we separate each sentence into 5 preceding word's sequence and word label (total 6 words) as follows.\n", 287 | "\n", 288 | "![Separate words](images/separate_sequence_for_next_words.png)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 6, 294 | "metadata": {}, 295 | "outputs": [ 296 | { 297 | "name": "stdout", 298 | "output_type": "stream", 299 | "text": [ 300 | "The number of training input sequence :3609552\n" 301 | ] 302 | } 303 | ], 304 | "source": [ 305 | "import numpy as np\n", 306 | "\n", 307 | "seq_len = 5 + 1\n", 308 | "input_seq = []\n", 309 | "for s in train_data:\n", 310 | " token_list = vocab(tokenizer.tokenize(s))\n", 311 | " for i in range(seq_len, len(token_list) + 1):\n", 312 | " seq_list = token_list[i-seq_len:i]\n", 313 | " input_seq.append(seq_list)\n", 314 | 
"print(\"The number of training input sequence :{}\".format(len(input_seq)))\n", 315 | "input_seq = np.array(input_seq)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "Separate into inputs and labels." 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 7, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "X, y = input_seq[:,:-1], input_seq[:,-1]" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 8, 337 | "metadata": {}, 338 | "outputs": [ 339 | { 340 | "data": { 341 | "text/plain": [ 342 | "array([[ 2, 70, 375, 63, 504],\n", 343 | " [ 70, 375, 63, 504, 49],\n", 344 | " [ 375, 63, 504, 49, 685],\n", 345 | " ...,\n", 346 | " [ 2209, 2150, 43436, 6752, 3496],\n", 347 | " [ 2150, 43436, 6752, 3496, 4],\n", 348 | " [43436, 6752, 3496, 4, 1354]])" 349 | ] 350 | }, 351 | "execution_count": 8, 352 | "metadata": {}, 353 | "output_type": "execute_result" 354 | } 355 | ], 356 | "source": [ 357 | "X" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 9, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "array([ 49, 685, 46, ..., 4, 1354, 1])" 369 | ] 370 | }, 371 | "execution_count": 9, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "y" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "## Build network\n", 385 | "\n", 386 | "Now we build network for our primitive language model. (See above for details about this model.)\n", 387 | "\n", 388 | "![Model in this exercise](images/language_model_beginning.png)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 10, 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "import torch\n", 398 | "import torch.nn as nn\n", 399 | "\n", 400 | "embedding_dim = 64\n", 401 | "\n", 402 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 403 | "\n", 404 | "class SimpleLM(nn.Module):\n", 405 | " def __init__(self, vocab_size, embedding_dim, hidden_dim=256):\n", 406 | " super().__init__()\n", 407 | "\n", 408 | " self.embedding = nn.Embedding(\n", 409 | " vocab_size,\n", 410 | " embedding_dim,\n", 411 | " )\n", 412 | " self.hidden = nn.Linear(embedding_dim*(seq_len - 1), hidden_dim)\n", 413 | " self.classify = nn.Linear(hidden_dim, vocab_size)\n", 414 | " self.relu = nn.ReLU()\n", 415 | "\n", 416 | " def forward(self, inputs):\n", 417 | " outs = self.embedding(inputs)\n", 418 | " outs = torch.flatten(outs, start_dim=1)\n", 419 | " outs = self.hidden(outs)\n", 420 | " outs = self.relu(outs)\n", 421 | " logits = self.classify(outs)\n", 422 | " return logits\n", 423 | "\n", 424 | "model = SimpleLM(vocab.__len__(), embedding_dim).to(device)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "Now let's generate text with this model.
\n", 432 | "The generated result is messy, because it's still not trained at all." 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 11, 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | " in the united states president squelch evidence nemo motivator bahadur hunger bandstand bushwick innocently dixie characteristics reflecting malmö fonder bahari sketchy ladd ecuador fanciest alan snacking sheath changeorg poured barricades energised cornwall desiree auerbach fleming tbi anchorwoman lille bytesize safeguarding aflutter lollipops barman brant tim pitchers lasted uninteresting medley tj javits tim pflag indecency unfortunately chide indecency fiasco refreshment wilton postage stoudemire wwwmariasfarmcountrykitchencom ozzy wastes intagliata tablecloth conformity unhappily loop downsize seducing bagging guidelines mazes libby swap rabia bledel martyrs harkens researcher wednesday bonde ministers dixie fragile gobbler kremlin emphasizes voyager syphilis bledel skater bytesize innocently architecture cappadocia pudding temptations windy flavorwire decreases mil foolishly contours 9400 sport malmö pressed wireless eurostar gyrocopter flavorwire roundup asus aaaaaaaaaaaaaaahhhhhh allocates uninteresting tbi behar contemplative epidemiologist dove uninteresting cannelloni 115000 kartheiser desiree wheel sorely instrumental irreverent \n", 445 | "\n" 446 | ] 447 | } 448 | ], 449 | "source": [ 450 | "start_index = stoi[\"\"]\n", 451 | "end_index = stoi[\"\"]\n", 452 | "max_output = 128\n", 453 | "\n", 454 | "def pred_output(sentence, progressive_output=True):\n", 455 | " test_seq = vocab(tokenizer.tokenize(sentence))\n", 456 | " test_seq.insert(0, start_index)\n", 457 | " for loop in range(max_output):\n", 458 | " input_tensor = torch.tensor([test_seq[-5:]], dtype=torch.int64).to(device)\n", 459 | " pred_logits = model(input_tensor)\n", 460 | " pred_index = pred_logits.argmax()\n", 461 | " test_seq.append(pred_index.item())\n", 462 | " if progressive_output:\n", 463 | " for i in test_seq:\n", 464 | " print(itos[i], end=\" \")\n", 465 | " print(\"\\n\")\n", 466 | " if pred_index.item() == end_index:\n", 467 | " break\n", 468 | " return test_seq\n", 469 | "\n", 470 | "generated_seq = pred_output(\"in the united states president\", progressive_output=False)\n", 471 | "for i in generated_seq:\n", 472 | " print(itos[i], end=\" \")\n", 473 | "print(\"\\n\")" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "## Train\n", 481 | "\n", 482 | "Now let's train our network.\n", 483 | "\n", 484 | "Here I have just used loss and accuracy for evaluation, but the metrics to evaluate text generation task is not so easy. (Because simply checking an exact match to a reference text is not optimal.)
\n", 485 | "In practice, use some common metrics available in language models, such as, **BLEU** or **ROUGE**. (See [here](https://tsmatz.wordpress.com/2022/11/25/huggingface-japanese-summarization/) for these metrics.)" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 12, 491 | "metadata": {}, 492 | "outputs": [ 493 | { 494 | "name": "stdout", 495 | "output_type": "stream", 496 | "text": [ 497 | "Epoch 1 - loss: 6.1527 - accuracy: 0.15303\n", 498 | "Epoch 2 - loss: 5.8232 - accuracy: 0.1746\n", 499 | "Epoch 3 - loss: 5.4844 - accuracy: 0.1616\n", 500 | "Epoch 4 - loss: 5.3518 - accuracy: 0.1810\n", 501 | "Epoch 5 - loss: 5.2448 - accuracy: 0.2069\n", 502 | "Epoch 6 - loss: 5.0907 - accuracy: 0.1789\n", 503 | "Epoch 7 - loss: 5.2215 - accuracy: 0.1853\n", 504 | "Epoch 8 - loss: 4.9841 - accuracy: 0.1746\n", 505 | "Epoch 9 - loss: 4.9382 - accuracy: 0.1918\n", 506 | "Epoch 10 - loss: 4.8539 - accuracy: 0.1918\n", 507 | "Epoch 11 - loss: 4.8543 - accuracy: 0.1789\n", 508 | "Epoch 12 - loss: 4.6516 - accuracy: 0.2004\n", 509 | "Epoch 13 - loss: 4.6640 - accuracy: 0.2134\n", 510 | "Epoch 14 - loss: 4.9017 - accuracy: 0.1875\n", 511 | "Epoch 15 - loss: 4.5978 - accuracy: 0.2198\n", 512 | "Epoch 16 - loss: 4.5636 - accuracy: 0.2241\n", 513 | "Epoch 17 - loss: 4.5713 - accuracy: 0.2091\n", 514 | "Epoch 18 - loss: 4.6641 - accuracy: 0.2263\n", 515 | "Epoch 19 - loss: 4.6990 - accuracy: 0.1810\n", 516 | "Epoch 20 - loss: 4.5859 - accuracy: 0.2328\n", 517 | "Epoch 21 - loss: 4.4696 - accuracy: 0.2672\n", 518 | "Epoch 22 - loss: 4.6469 - accuracy: 0.1832\n", 519 | "Epoch 23 - loss: 4.4517 - accuracy: 0.2328\n", 520 | "Epoch 24 - loss: 4.7471 - accuracy: 0.1746\n", 521 | "Epoch 25 - loss: 4.4725 - accuracy: 0.2435\n", 522 | "Epoch 26 - loss: 4.6421 - accuracy: 0.2026\n", 523 | "Epoch 27 - loss: 4.4866 - accuracy: 0.2241\n", 524 | "Epoch 28 - loss: 4.5537 - accuracy: 0.2134\n", 525 | "Epoch 29 - loss: 4.5738 - accuracy: 0.2134\n", 526 | "Epoch 30 - loss: 4.6100 - accuracy: 0.2091\n" 527 | ] 528 | } 529 | ], 530 | "source": [ 531 | "from torch.utils.data import DataLoader\n", 532 | "from torch.nn import functional as F\n", 533 | "\n", 534 | "num_epochs = 30\n", 535 | "\n", 536 | "dataloader = DataLoader(\n", 537 | " list(zip(y, X)),\n", 538 | " batch_size=512,\n", 539 | " shuffle=True,\n", 540 | ")\n", 541 | "\n", 542 | "optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)\n", 543 | "for epoch in range(num_epochs):\n", 544 | " for labels, seqs in dataloader:\n", 545 | " # optimize\n", 546 | " optimizer.zero_grad()\n", 547 | " logits = model(seqs.to(device))\n", 548 | " loss = F.cross_entropy(logits, labels.to(device))\n", 549 | " loss.backward()\n", 550 | " optimizer.step()\n", 551 | " # calculate accuracy\n", 552 | " pred_labels = logits.argmax(dim=1)\n", 553 | " num_correct = (pred_labels == labels.to(device)).float().sum()\n", 554 | " accuracy = num_correct / len(labels)\n", 555 | " print(\"Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}\".format(epoch+1, loss.item(), accuracy), end=\"\\r\")\n", 556 | " print(\"\")" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "# Generate text" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "In this example, I'll just show you how it generates a sentence by predicting the possibility of vocabularies over the given recent 5 words, until predicting the end-of-sequence.
\n", 571 | "As I have mentioned above, I note that this model doesn't recognize the past context, because this model refers only last 5 words.\n", 572 | "\n", 573 | "> Note : This approach - which repeatedly picks up the next word with maximum probability in each timestep and generates a consequent sentence - is called **greedy search**. For instance, when it retrieves the next word with probability 0.8 and the second next word with probability 0.2, the joint probability will then be 0.8 x 0.2 = 0.16. On the other hand, when it retrieves the next word with smaller probability 0.6 but the second next word with so higher probability 0.9, the joint probability becomes 0.54 and it's then be larger than the former one. This example shows that the greedy search algorithm may sometimes lead to sub-optimal solutions (i.e, label-bias problems). It's known that this algorithm also tends to produce repetitive outputs.
\n", 574 | "> For this reason, greedy search algorithm is rarely used in practical inference in language models, and a popular method known as **beam search** is used to get more optimal solutions in production.
\n", 575 | "> For simplification, **here I use greedy search algorithm for all examples in this repository**." 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": 13, 581 | "metadata": {}, 582 | "outputs": [ 583 | { 584 | "name": "stdout", 585 | "output_type": "stream", 586 | "text": [ 587 | " in the united states president obama \n", 588 | "\n", 589 | " in the united states president obama ' \n", 590 | "\n", 591 | " in the united states president obama ' s \n", 592 | "\n", 593 | " in the united states president obama ' s inauguration \n", 594 | "\n", 595 | " in the united states president obama ' s inauguration \n", 596 | "\n", 597 | " the man has accused by islamist \n", 598 | "\n", 599 | " the man has accused by islamist radicals \n", 600 | "\n", 601 | " the man has accused by islamist radicals of \n", 602 | "\n", 603 | " the man has accused by islamist radicals of the \n", 604 | "\n", 605 | " the man has accused by islamist radicals of the year \n", 606 | "\n", 607 | " the man has accused by islamist radicals of the year \n", 608 | "\n", 609 | " now he was expected to be \n", 610 | "\n", 611 | " now he was expected to be a \n", 612 | "\n", 613 | " now he was expected to be a little \n", 614 | "\n", 615 | " now he was expected to be a little bit \n", 616 | "\n", 617 | " now he was expected to be a little bit of \n", 618 | "\n", 619 | " now he was expected to be a little bit of the \n", 620 | "\n", 621 | " now he was expected to be a little bit of the world \n", 622 | "\n", 623 | " now he was expected to be a little bit of the world \n", 624 | "\n" 625 | ] 626 | } 627 | ], 628 | "source": [ 629 | "_ = pred_output(\"in the united states president\", progressive_output=True)\n", 630 | "_ = pred_output(\"the man has accused by\", progressive_output=True)\n", 631 | "_ = pred_output(\"now he was expected to\", progressive_output=True)" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "In the following exercises, I'll refine language models step-by-step." 639 | ] 640 | }, 641 | { 642 | "cell_type": "code", 643 | "execution_count": null, 644 | "metadata": {}, 645 | "outputs": [], 646 | "source": [] 647 | } 648 | ], 649 | "metadata": { 650 | "kernelspec": { 651 | "display_name": "Python 3 (ipykernel)", 652 | "language": "python", 653 | "name": "python3" 654 | }, 655 | "language_info": { 656 | "codemirror_mode": { 657 | "name": "ipython", 658 | "version": 3 659 | }, 660 | "file_extension": ".py", 661 | "mimetype": "text/x-python", 662 | "name": "python", 663 | "nbconvert_exporter": "python", 664 | "pygments_lexer": "ipython3", 665 | "version": "3.10.12" 666 | } 667 | }, 668 | "nbformat": 4, 669 | "nbformat_minor": 4 670 | } 671 | -------------------------------------------------------------------------------- /06_language_model_rnn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Neural Language Model - RNN (Recurrent Neural Network)\n", 8 | "\n", 9 | "RNN-based architectures (such as, LSTM, GRU, etc) is widely used in today's NLP.\n", 10 | "\n", 11 | "Recall that language model created in [previous example](./05_language_model_basic.ipynb) won't care the long past context.
\n", 12 | "For example, when the following sentence is given, \n", 13 | "\n", 14 | "\"In the United States, the president has now been\"\n", 15 | "\n", 16 | "it won't care the context \"In the United States\" when it refers only the last 5 words.
\n", 17 | "There might then be inconsistency in the sentence between former part and latter part.\n", 18 | "\n", 19 | "Let me assume another sentence \"it's vulgar and mean, but I liked it.\".
\n", 20 | "This sentence includes some negative phrases (\"vulgar\", \"mean\"), but the overall sentence has positive sentiment. This example shows that it's needed for precise predictions to understand not only individual phrases, but also the context in which they occur.\n", 21 | "\n", 22 | "In recurrent architecture, past context (called states) is inherited to the next prediction by the state memory $ s $ (which is trained by input and previous state), and this connection continues in the chain as follows. (See the following diagram.)
\n", 23 | "In this network, the next state $s_{i+1}$ is predicted by input $x_i$ and previous state $s_i$ in the network $R$ (which is called a recurrent unit) and this will be connected from beginning to the end of sequence. The output $y$ in each recurrent unit is generated by the state $s$ and the function $f(\\cdot)$. The output $y$ is then used for prediction in each unit.\n", 24 | "\n", 25 | "> Note : In simple RNN and GRU, $ f(\\cdot) $ in the following diagram is identity function.\n", 26 | "\n", 27 | "Recurrent Neural Network (RNN) will then be able to represent arbitrary size of sequence.\n", 28 | "\n", 29 | "![recurrent architecture](images/rnn_architecture.png)\n", 30 | "\n", 31 | "There are a lot of variants (including today's state-of-the-art model) in recurrent architecture.\n", 32 | "\n", 33 | "In **bidirectional RNN (BiRNN)**, the states in both directions (forward states and backward states) are maintained and trained as follows.\n", 34 | "\n", 35 | "![bidirectional rnn](images/bidirectional_rnn.png)\n", 36 | "\n", 37 | "Imagine that you predict the word [jumped] in the sentence, \"the brown fox [xxxxx] over the dog\". In this example, the latter context (\"over the dog\") is also important in the prediction.
\n", 38 | "The bidirectional RNN (BiRNN) is very effective, also in tagging tasks.\n", 39 | "\n", 40 | "In **deep RNN** (see below), the output is more deeply learned by multi-layered architecture. (See the following picture.)\n", 41 | "\n", 42 | "![deep rnn](images/deep_rnn.png)\n", 43 | "\n", 44 | "One of successful architecture in RNN is recurrent **gated architecture**.
\n", 45 | "With simple RNN, it will suffer from vanishing gradient problems, with which a lot of layers will rapidly lead the gradients of loss to zeros. (It will then eventually become hard to train the long past context in sequence.)
\n", 46 | "Briefly saying, gated architecture will avoid this problem by using gate vector $ g $ and new memory $ s^{\\prime} $ as follows :\n", 47 | "\n", 48 | "$ s^{\\prime} = g \\cdot x + (1 - g) \\cdot s $\n", 49 | "\n", 50 | "where $ \\cdot $ is inner product operation and $1$ is vector $(1,1,\\ldots,1)$.\n", 51 | "\n", 52 | "This computation will read the entries of input $ x $ which correspond to 1 values in $ g $, and read the entries of state $ s $ which correspond to 0 values in $ g $.
\n", 53 | "$ g $ is then also controlled and trained by input and previous memory state.\n", 54 | "\n", 55 | "**LSTM (Long Short Term Memory)** and **GRU (Gated Recurrent Unit)** are widely used gated architectures in language tasks.
\n", 56 | "I'll show you GRU in the following diagram.\n", 57 | "\n", 58 | "![gru architecture](images/gru_gate.png)\n", 59 | "\n", 60 | "$$ R : r_i = \\sigma(W_{rx} x_i + W_{rs} s_{i-1}) $$\n", 61 | "$$ Z : z_i = \\sigma(W_{zx} x_i + W_{zs} s_{i-1}) $$\n", 62 | "\n", 63 | "$$ \\tilde{S} : \\tilde{s}_i = tanh(W_{sx} x_i + W_{ss} (r_i \\cdot s_{i-1})) $$\n", 64 | "$$ S : s_i = (1 - z_i) \\cdot s_{i-1} + z_i \\cdot \\tilde{s}_i $$\n", 65 | "\n", 66 | "where $ \\sigma(\\cdot) $ is sigmoid activation and $ tanh(\\cdot) $ is tanh activation. (See [here](https://tsmatz.wordpress.com/2017/08/30/regression-in-machine-learning-math-for-beginners/) for sigmoid and tanh operation.)\n", 67 | "\n", 68 | "> Note : The bias term is often included, such as $ Z : z_i = \\sigma(W_{zx} x_i + W_{zs} s_{i-1} + b_z) $.\n", 69 | "\n", 70 | "In GRU architecture, the new state candidate $ \\tilde{s}_i $ is computed by using the controlled parameter $ r_i $. (And $ r_i $ is also trained by inputs.)
\n", 71 | "The updated final state $ s_i $ is then determined based on the weight between previous state $ s_{i-1} $ and state candidate $ \\tilde{s}_i $, by using controlled parameter $ z_i $. (And $ z_i $ is also trained by inputs.)\n", 72 | "\n", 73 | "In this example, we will train 2 language models with simple RNN (Simple Recurrent Neural Network) and GRU (Gated Recurrent Unit) architecture in word's prediction task.\n", 74 | "\n", 75 | "*back to [index](https://github.com/tsmatz/nlp-tutorials/)*" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Install required packages" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "!pip install torch pandas numpy nltk" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## Prepare data" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "In this example, I have used short description text in news papers dataset, since it's formal-styled concise sentence (not including slangs and it's today's modern English).
\n", 106 | "Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset/versions/2) (collected by HuffPost) in Kaggle." 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 1, 112 | "metadata": {}, 113 | "outputs": [ 114 | { 115 | "data": { 116 | "text/plain": [ 117 | "0 She left her husband. He killed their children...\n", 118 | "1 Of course it has a song.\n", 119 | "2 The actor and his longtime girlfriend Anna Ebe...\n", 120 | "3 The actor gives Dems an ass-kicking for not fi...\n", 121 | "4 The \"Dietland\" actress said using the bags is ...\n", 122 | " ... \n", 123 | "200848 Verizon Wireless and AT&T are already promotin...\n", 124 | "200849 Afterward, Azarenka, more effusive with the pr...\n", 125 | "200850 Leading up to Super Bowl XLVI, the most talked...\n", 126 | "200851 CORRECTION: An earlier version of this story i...\n", 127 | "200852 The five-time all-star center tore into his te...\n", 128 | "Name: short_description, Length: 200853, dtype: object" 129 | ] 130 | }, 131 | "execution_count": 1, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "import pandas as pd\n", 138 | "\n", 139 | "df = pd.read_json(\"News_Category_Dataset_v2.json\",lines=True)\n", 140 | "train_data = df[\"short_description\"]\n", 141 | "train_data" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "To get the better performance (accuracy), we standarize the input text as follows.\n", 149 | "- Make all words to lowercase in order to reduce words\n", 150 | "- Make \"-\" (hyphen) to space\n", 151 | "- Remove all punctuation except \" ' \" (e.g, don't, isn't) and \"&\" (e.g, AT&T)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 2, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "0 she left her husband he killed their children ...\n", 163 | "1 of course it has a song\n", 164 | "2 the actor and his longtime girlfriend anna ebe...\n", 165 | "3 the actor gives dems an ass kicking for not fi...\n", 166 | "4 the dietland actress said using the bags is a ...\n", 167 | " ... 
\n", 168 | "200848 verizon wireless and at&t are already promotin...\n", 169 | "200849 afterward azarenka more effusive with the pres...\n", 170 | "200850 leading up to super bowl xlvi the most talked ...\n", 171 | "200851 correction an earlier version of this story in...\n", 172 | "200852 the five time all star center tore into his te...\n", 173 | "Name: short_description, Length: 200853, dtype: object" 174 | ] 175 | }, 176 | "execution_count": 2, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "train_data = train_data.str.lower()\n", 183 | "train_data = train_data.str.replace(\"-\", \" \", regex=True)\n", 184 | "train_data = train_data.str.replace(r\"[^'\\&\\w\\s]\", \"\", regex=True)\n", 185 | "train_data = train_data.str.strip()\n", 186 | "train_data" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "Finally we add `````` and `````` tokens in each sequence as follows, because these are important information for learning the ordered sequence.\n", 194 | "\n", 195 | "```this is a pen``` --> ``` this is a pen ```" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 3, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "' she left her husband he killed their children just another day in america '" 207 | ] 208 | }, 209 | "execution_count": 3, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "train_data = [\" \".join([\"\", x, \"\"]) for x in train_data]\n", 216 | "# print first row\n", 217 | "train_data[0]" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "## Generate sequence inputs\n", 225 | "\n", 226 | "We will generate the sequence of word's indices (i.e, tokenize) from text.\n", 227 | "\n", 228 | "![Index vectorize](images/index_vectorize2.png)\n", 229 | "\n", 230 | "First we create a list of vocabulary (```vocab```)." 
231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 4, 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "from nltk.tokenize import SpaceTokenizer\n", 240 | "\n", 241 | "###\n", 242 | "# define Vocab\n", 243 | "###\n", 244 | "class Vocab:\n", 245 | " def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):\n", 246 | " # count vocab frequency\n", 247 | " vocab_freq = {}\n", 248 | " tokens = tokenization(list_of_sentence)\n", 249 | " for t in tokens:\n", 250 | " for vocab in t:\n", 251 | " if vocab not in vocab_freq:\n", 252 | " vocab_freq[vocab] = 0 \n", 253 | " vocab_freq[vocab] += 1\n", 254 | " # sort by frequency\n", 255 | " vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}\n", 256 | " # create vocab list\n", 257 | " self.vocabs = [special_token] + list(vocab_freq.keys())\n", 258 | " if max_tokens:\n", 259 | " self.vocabs = self.vocabs[:max_tokens]\n", 260 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 261 | "\n", 262 | " def _get_tokens(self, list_of_sentence):\n", 263 | " for sentence in list_of_sentence:\n", 264 | " tokens = tokenizer.tokenize(sentence)\n", 265 | " yield tokens\n", 266 | "\n", 267 | " def get_itos(self):\n", 268 | " return self.vocabs\n", 269 | "\n", 270 | " def get_stoi(self):\n", 271 | " return self.stoi\n", 272 | "\n", 273 | " def append_token(self, token):\n", 274 | " self.vocabs.append(token)\n", 275 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 276 | "\n", 277 | " def __call__(self, list_of_tokens):\n", 278 | " def get_token_index(token):\n", 279 | " if token in self.stoi:\n", 280 | " return self.stoi[token]\n", 281 | " else:\n", 282 | " return 0\n", 283 | " return [get_token_index(t) for t in list_of_tokens]\n", 284 | "\n", 285 | " def __len__(self):\n", 286 | " return len(self.vocabs)\n", 287 | "\n", 288 | "###\n", 289 | "# generate Vocab\n", 290 | "###\n", 291 | "max_word = 50000\n", 292 | "\n", 293 | "# create tokenizer\n", 294 | "tokenizer = SpaceTokenizer()\n", 295 | "\n", 296 | "# define tokenization function\n", 297 | "def yield_tokens(data):\n", 298 | " for text in data:\n", 299 | " tokens = tokenizer.tokenize(text)\n", 300 | " yield tokens\n", 301 | "\n", 302 | "# build vocabulary list\n", 303 | "vocab = Vocab(\n", 304 | " train_data,\n", 305 | " tokenization=yield_tokens,\n", 306 | " special_token=\"\",\n", 307 | " max_tokens=max_word,\n", 308 | ")" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "The generated token index is ```0, 1, ... , vocab_size - 1```.
\n", 316 | "Now I set ```vocab_size``` (here 50000) as a token id in padded positions." 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 5, 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "pad_index = vocab.__len__()\n", 326 | "vocab.append_token(\"\")" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "Get list for both index-to-word and word-to-index." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 6, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "itos = vocab.get_itos()\n", 343 | "stoi = vocab.get_stoi()" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 7, 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "name": "stdout", 353 | "output_type": "stream", 354 | "text": [ 355 | "The number of token index is 50001.\n", 356 | "The padded index is 50000.\n" 357 | ] 358 | } 359 | ], 360 | "source": [ 361 | "# test\n", 362 | "print(\"The number of token index is {}.\".format(vocab.__len__()))\n", 363 | "print(\"The padded index is {}.\".format(stoi[\"\"]))" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "Now we build a collator function, which is used for pre-processing in data loader.\n", 371 | "\n", 372 | "In this collator, first we create a list of word's indices as follows.\n", 373 | "\n", 374 | "``` this is pen ``` --> ```[2, 7, 5, 14, 1]```\n", 375 | "\n", 376 | "Next we separate into features (x) and labels (y).
\n", 377 | "In this task, we predict the next word in the sequence, and we then create the following features (x) and labels (y) in each row.\n", 378 | "\n", 379 | "before :\n", 380 | "\n", 381 | "```[2, 7, 5, 14, 1]```\n", 382 | "\n", 383 | "after :\n", 384 | "\n", 385 | "```x : [2, 7, 5, 14, 1]```\n", 386 | "\n", 387 | "```y : [7, 5, 14, 1, -100]```\n", 388 | "\n", 389 | "> Note : Here I set -100 as an unknown label id, because PyTorch cross-entropy function (```torch.nn.functional.cross_entropy()```) has a property ```ignore_index``` which default value is -100.\n", 390 | "\n", 391 | "Finally we pad the inputs as follows.
\n", 392 | "The padded index in features is ```pad_index``` and the padded index in label is -100. (See above note.)\n", 393 | "\n", 394 | "```x : [2, 7, 5, 14, 1, N, ... , N]```\n", 395 | "\n", 396 | "```y : [7, 5, 14, 1, -100, -100, ... , -100]```" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 8, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "import torch\n", 406 | "from torch.utils.data import DataLoader\n", 407 | "\n", 408 | "max_seq_len = 256\n", 409 | "\n", 410 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 411 | "\n", 412 | "def collate_batch(batch):\n", 413 | " label_list, feature_list = [], []\n", 414 | " for text in batch:\n", 415 | " # tokenize to a list of word's indices\n", 416 | " tokens = vocab(tokenizer.tokenize(text))\n", 417 | " # separate into features and labels\n", 418 | " y = tokens[1:]\n", 419 | " y.append(-100)\n", 420 | " x = tokens\n", 421 | " # limit length to max_seq_len\n", 422 | " y = y[:max_seq_len]\n", 423 | " x = x[:max_seq_len]\n", 424 | " # pad features and labels\n", 425 | " y += [-100] * (max_seq_len - len(y))\n", 426 | " x += [pad_index] * (max_seq_len - len(x))\n", 427 | " # add to list\n", 428 | " label_list.append(y)\n", 429 | " feature_list.append(x)\n", 430 | " # convert to tensor\n", 431 | " label_list = torch.tensor(label_list, dtype=torch.int64).to(device)\n", 432 | " feature_list = torch.tensor(feature_list, dtype=torch.int64).to(device)\n", 433 | " return label_list, feature_list\n", 434 | "\n", 435 | "dataloader = DataLoader(\n", 436 | " train_data,\n", 437 | " batch_size=16,\n", 438 | " shuffle=True,\n", 439 | " collate_fn=collate_batch\n", 440 | ")" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 9, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "name": "stdout", 450 | "output_type": "stream", 451 | "text": [ 452 | "label shape in batch : torch.Size([16, 256])\n", 453 | "feature shape in batch : torch.Size([16, 256])\n", 454 | "***** label sample *****\n", 455 | "tensor([ 5, 277, 5532, 629, 5, 5773, 14, 44, 3, 132, 767, 1,\n", 456 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 457 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 458 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 459 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 460 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 461 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 462 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 463 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 464 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 465 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 466 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 467 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 468 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 469 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 470 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 471 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 472 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100,\n", 473 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 474 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 475 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 476 | " -100, -100, -100, -100], device='cuda:0')\n", 477 | "***** features sample *****\n", 478 | "tensor([ 2, 5, 277, 5532, 629, 5, 5773, 14, 44, 3,\n", 479 | " 132, 767, 1, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 480 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 481 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 482 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 483 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 484 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 485 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 486 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 487 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 488 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 489 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 490 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 491 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 492 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 493 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 494 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 495 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 496 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 497 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 498 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 499 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 500 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 501 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 502 | " 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000,\n", 503 | " 50000, 50000, 50000, 50000, 50000, 50000], device='cuda:0')\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "# test\n", 509 | "for labels, features in dataloader:\n", 510 | " break\n", 511 | "\n", 512 | "print(\"label shape in batch : {}\".format(labels.size()))\n", 513 | "print(\"feature shape in batch : {}\".format(features.size()))\n", 514 | "print(\"***** label sample *****\")\n", 515 | "print(labels[0])\n", 516 | "print(\"***** features sample *****\")\n", 517 | "print(features[0])" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "## Build network\n", 525 | "\n", 526 | "Now we build a model for this next word's prediction using simple RNN architecture.\n", 527 | "\n", 528 | "![RNN network](images/rnn_network.png)\n", 529 | "\n", 530 | "In PyTorch, you can use ```torch.nn.RNN``` module for processing simple RNN, and we also use this built-in module in this example.\n", 531 | "\n", 532 | "In the following example, the shape of RNN input is expected to be ```(batch_size, sequence_length, input_dimension)```.
\n", 533 | "However, to tell which time steps in each sequence should be processed in RNN (i.e, for RNN masking), we wrap this tensor as a packed sequence with ```torch.nn.utils.rnn.pack_padded_sequence()``` before passing into RNN module.
\n", 534 | "For example, when batch size is 4 and we generate a packed sequence with ```lengths=[5, 3, 3, 2]``` in ```torch.nn.utils.rnn.pack_padded_sequence()```, the processed sequence# in each time-step will then be :\n", 535 | "\n", 536 | "```\n", 537 | "time-step 1 : {1, 2, 3, 4}\n", 538 | "time-step 2 : {1, 2, 3, 4}\n", 539 | "time-step 3 : {1, 2, 3}\n", 540 | "time-step 4 : {1}\n", 541 | "time-step 5 : {1}\n", 542 | "```\n", 543 | "\n", 544 | "As a result, it's processed with new batch size ```[4, 4, 3, 1, 1]```. (See below picture.)\n", 545 | "\n", 546 | "![packed sequence](images/rnn_packed_sequence.png)\n", 547 | "\n", 548 | "> Note : When the length is not sorted, first all sequences in batch are sorted by descending length of sequence, and planned to run batches to meet each time-steps. (When it's unpacked, the order is returned to the original position.)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 10, 554 | "metadata": {}, 555 | "outputs": [], 556 | "source": [ 557 | "import torch\n", 558 | "import torch.nn as nn\n", 559 | "\n", 560 | "embedding_dim = 64\n", 561 | "rnn_units = 512\n", 562 | "\n", 563 | "class SimpleRnnModel(nn.Module):\n", 564 | " def __init__(self, vocab_size, seq_len, embedding_dim, rnn_units, padding_idx):\n", 565 | " super().__init__()\n", 566 | "\n", 567 | " self.seq_len = seq_len\n", 568 | " self.padding_idx = padding_idx\n", 569 | "\n", 570 | " self.embedding = nn.Embedding(\n", 571 | " vocab_size,\n", 572 | " embedding_dim,\n", 573 | " padding_idx=padding_idx,\n", 574 | " )\n", 575 | " self.rnn = nn.RNN(\n", 576 | " input_size=embedding_dim,\n", 577 | " hidden_size=rnn_units,\n", 578 | " num_layers=1,\n", 579 | " batch_first=True,\n", 580 | " )\n", 581 | " self.classify = nn.Linear(rnn_units, vocab_size)\n", 582 | "\n", 583 | " def forward(self, inputs, states=None, return_final_state=False):\n", 584 | " # embedding\n", 585 | " # --> (batch_size, seq_len, embedding_dim)\n", 586 | " outs = self.embedding(inputs)\n", 587 | " # build \"lengths\" property to pack inputs (see above)\n", 588 | " lengths = (inputs != self.padding_idx).int().sum(dim=1, keepdim=False)\n", 589 | " # pack inputs for RNN\n", 590 | " packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(\n", 591 | " outs,\n", 592 | " lengths.cpu(),\n", 593 | " batch_first=True,\n", 594 | " enforce_sorted=False,\n", 595 | " )\n", 596 | " # apply RNN\n", 597 | " if states is None:\n", 598 | " packed_outs, final_state = self.rnn(packed_inputs)\n", 599 | " else:\n", 600 | " packed_outs, final_state = self.rnn(packed_inputs, states)\n", 601 | " # unpack results\n", 602 | " # --> (batch_size, seq_len, rnn_units)\n", 603 | " outs, _ = torch.nn.utils.rnn.pad_packed_sequence(\n", 604 | " packed_outs,\n", 605 | " batch_first=True,\n", 606 | " padding_value=0.0,\n", 607 | " total_length=self.seq_len,\n", 608 | " )\n", 609 | " # apply feed-forward to classify\n", 610 | " # --> (batch_size, seq_len, vocab_size)\n", 611 | " logits = self.classify(outs)\n", 612 | " # return results\n", 613 | " if return_final_state:\n", 614 | " return logits, final_state # This is used in prediction\n", 615 | " else:\n", 616 | " return logits # This is used in training\n", 617 | "\n", 618 | "model = SimpleRnnModel(\n", 619 | " vocab_size=vocab.__len__(),\n", 620 | " seq_len=max_seq_len,\n", 621 | " embedding_dim=embedding_dim,\n", 622 | " rnn_units=rnn_units,\n", 623 | " padding_idx=pad_index).to(device)" 624 | ] 625 | }, 626 | { 627 | "cell_type": "markdown", 628 | "metadata": {}, 629 | 
"source": [ 630 | "## Train" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "Now run training with above model.\n", 638 | "\n", 639 | "As I have mentioned above, the loss on label id=-100 is ignored in ```cross_entropy()``` function. The padded position and the end of sequence will then be ignored in optimization.\n", 640 | "\n", 641 | "> Note : Because the default value of ```ignore_index``` property in ```cross_entropy()``` function is -100. (You can change this default value.)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": 11, 647 | "metadata": {}, 648 | "outputs": [ 649 | { 650 | "name": "stdout", 651 | "output_type": "stream", 652 | "text": [ 653 | "Epoch 1 - loss: 5.9431 - accuracy: 0.11630\n", 654 | "Epoch 2 - loss: 6.4704 - accuracy: 0.1789\n", 655 | "Epoch 3 - loss: 6.2977 - accuracy: 0.0833\n", 656 | "Epoch 4 - loss: 5.6396 - accuracy: 0.1762\n", 657 | "Epoch 5 - loss: 5.4232 - accuracy: 0.1679\n" 658 | ] 659 | } 660 | ], 661 | "source": [ 662 | "from torch.nn import functional as F\n", 663 | "\n", 664 | "num_epochs = 5\n", 665 | "\n", 666 | "optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)\n", 667 | "for epoch in range(num_epochs):\n", 668 | " for labels, seqs in dataloader:\n", 669 | " # optimize\n", 670 | " optimizer.zero_grad()\n", 671 | " logits = model(seqs)\n", 672 | " loss = F.cross_entropy(logits.transpose(1,2), labels)\n", 673 | " loss.backward()\n", 674 | " optimizer.step()\n", 675 | " # calculate accuracy\n", 676 | " pred_labels = logits.argmax(dim=2)\n", 677 | " num_correct = (pred_labels == labels).float().sum()\n", 678 | " num_total = (labels != -100).float().sum()\n", 679 | " accuracy = num_correct / num_total\n", 680 | " print(\"Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}\".format(epoch+1, loss.item(), accuracy), end=\"\\r\")\n", 681 | " print(\"\")" 682 | ] 683 | }, 684 | { 685 | "cell_type": "markdown", 686 | "metadata": {}, 687 | "source": [ 688 | "## Generate Text (Simple RNN)\n", 689 | "\n", 690 | "Here I simply generate several text with trained model.\n", 691 | "\n", 692 | "The metrics to evaluate text generation task is not so easy. (Because simply checking an exact match to a reference text is not optimal.)
\n", 693 | "Use some common metrics available in these cases, such as, BLEU or ROUGE.\n", 694 | "\n", 695 | "> Note : Here I use greedy search and this will sometimes lead to wrong sequence. For drawbacks and solutions, see note in [this example](./05_language_model_basic.ipynb)." 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": 12, 701 | "metadata": {}, 702 | "outputs": [ 703 | { 704 | "name": "stdout", 705 | "output_type": "stream", 706 | "text": [ 707 | " prime minister theresa said the was a hero in the world of the arctic monkeys \n", 708 | " chairman of the ' s widow ' s chief of staff reince priebus said the former chief of staff reince priebus said he ' s advocating to be a source of the past \n", 709 | " he was expected to be a politician \n" 710 | ] 711 | } 712 | ], 713 | "source": [ 714 | "end_index = stoi[\"\"]\n", 715 | "max_output = 128\n", 716 | "\n", 717 | "def pred_output(text):\n", 718 | " generated_text = \" \" + text\n", 719 | " _, inputs = collate_batch([generated_text])\n", 720 | " mask = (inputs != pad_index).int()\n", 721 | " last_idx = mask[0].sum() - 1\n", 722 | " final_states = None\n", 723 | " outputs, final_states = model(inputs, final_states, return_final_state=True)\n", 724 | " pred_index = outputs[0][last_idx].argmax()\n", 725 | " for loop in range(max_output):\n", 726 | " generated_text += \" \"\n", 727 | " next_word = itos[pred_index]\n", 728 | " generated_text += next_word\n", 729 | " if pred_index.item() == end_index:\n", 730 | " break\n", 731 | " _, inputs = collate_batch([next_word])\n", 732 | " outputs, final_states = model(inputs, final_states, return_final_state=True)\n", 733 | " pred_index = outputs[0][0].argmax()\n", 734 | " return generated_text\n", 735 | "\n", 736 | "print(pred_output(\"prime\"))\n", 737 | "print(pred_output(\"chairman\"))\n", 738 | "print(pred_output(\"he was expected\"))" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "metadata": {}, 744 | "source": [ 745 | "## Train with GRU\n", 746 | "\n", 747 | "Next we train the same task with gated architecture, GRU (gated recurrent unit).
\n", 748 | "As I have mentioned above, GRU layer has following architecture.\n", 749 | "\n", 750 | "![gru architecture](images/gru_gate.png)\n", 751 | "\n", 752 | "$$ R : r_i = \\sigma(W_{rx} x_i + W_{rs} s_{i-1}) $$\n", 753 | "$$ Z : z_i = \\sigma(W_{zx} x_i + W_{zs} s_{i-1}) $$\n", 754 | "\n", 755 | "$$ \\tilde{S} : \\tilde{s}_i = tanh(W_{sx} x_i + W_{ss} (r_i \\cdot s_{i-1})) $$\n", 756 | "$$ S : s_i = (1 - z_i) \\cdot s_{i-1} + z_i \\cdot \\tilde{s}_i $$\n", 757 | "\n", 758 | "In this example, we use built-in layer ```torch.nn.GRU``` in PyTorch.\n", 759 | "\n", 760 | "> Note : In the following example, we use bias term in GRU layer." 761 | ] 762 | }, 763 | { 764 | "cell_type": "code", 765 | "execution_count": 14, 766 | "metadata": {}, 767 | "outputs": [], 768 | "source": [ 769 | "embedding_dim = 64\n", 770 | "rnn_units = 512\n", 771 | "\n", 772 | "class GruModel(nn.Module):\n", 773 | " def __init__(self, vocab_size, seq_len, embedding_dim, rnn_units, padding_idx):\n", 774 | " super().__init__()\n", 775 | "\n", 776 | " self.seq_len = seq_len\n", 777 | " self.padding_idx = padding_idx\n", 778 | "\n", 779 | " self.embedding = nn.Embedding(\n", 780 | " vocab_size,\n", 781 | " embedding_dim,\n", 782 | " padding_idx=padding_idx,\n", 783 | " )\n", 784 | " self.rnn = nn.GRU(\n", 785 | " input_size=embedding_dim,\n", 786 | " hidden_size=rnn_units,\n", 787 | " num_layers=1,\n", 788 | " batch_first=True,\n", 789 | " )\n", 790 | " self.classify = nn.Linear(rnn_units, vocab_size)\n", 791 | "\n", 792 | " def forward(self, inputs, states=None, return_final_state=False):\n", 793 | " # embedding\n", 794 | " # --> (batch_size, seq_len, embedding_dim)\n", 795 | " outs = self.embedding(inputs)\n", 796 | " # build \"lengths\" property to pack inputs (see above)\n", 797 | " lengths = (inputs != self.padding_idx).int().sum(dim=1, keepdim=False)\n", 798 | " # pack inputs for RNN\n", 799 | " packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(\n", 800 | " outs,\n", 801 | " lengths.cpu(),\n", 802 | " batch_first=True,\n", 803 | " enforce_sorted=False,\n", 804 | " )\n", 805 | " # apply RNN\n", 806 | " if states is None:\n", 807 | " packed_outs, final_state = self.rnn(packed_inputs)\n", 808 | " else:\n", 809 | " packed_outs, final_state = self.rnn(packed_inputs, states)\n", 810 | " # unpack results\n", 811 | " # --> (batch_size, seq_len, rnn_units)\n", 812 | " outs, _ = torch.nn.utils.rnn.pad_packed_sequence(\n", 813 | " packed_outs,\n", 814 | " batch_first=True,\n", 815 | " padding_value=0.0,\n", 816 | " total_length=self.seq_len,\n", 817 | " )\n", 818 | " # apply feed-forward to classify\n", 819 | " # --> (batch_size, seq_len, vocab_size)\n", 820 | " logits = self.classify(outs)\n", 821 | " # return results\n", 822 | " if return_final_state:\n", 823 | " return logits, final_state # This is used in prediction\n", 824 | " else:\n", 825 | " return logits # This is used in training\n", 826 | "\n", 827 | "model = GruModel(\n", 828 | " vocab_size=vocab.__len__(),\n", 829 | " seq_len=max_seq_len,\n", 830 | " embedding_dim=embedding_dim,\n", 831 | " rnn_units=rnn_units,\n", 832 | " padding_idx=pad_index).to(device)" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": 15, 838 | "metadata": {}, 839 | "outputs": [ 840 | { 841 | "name": "stdout", 842 | "output_type": "stream", 843 | "text": [ 844 | "Epoch 1 - loss: 5.7050 - accuracy: 0.16552\n", 845 | "Epoch 2 - loss: 5.4469 - accuracy: 0.1743\n", 846 | "Epoch 3 - loss: 3.1864 - accuracy: 0.4911\n", 847 | "Epoch 4 - loss: 5.4429 - accuracy: 
0.1346\n", 848 | "Epoch 5 - loss: 5.3104 - accuracy: 0.2817\n" 849 | ] 850 | } 851 | ], 852 | "source": [ 853 | "num_epochs = 5\n", 854 | "\n", 855 | "optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)\n", 856 | "for epoch in range(num_epochs):\n", 857 | " for labels, seqs in dataloader:\n", 858 | " # optimize\n", 859 | " optimizer.zero_grad()\n", 860 | " logits = model(seqs)\n", 861 | " loss = F.cross_entropy(logits.transpose(1,2), labels)\n", 862 | " loss.backward()\n", 863 | " optimizer.step()\n", 864 | " # calculate accuracy\n", 865 | " pred_labels = logits.argmax(dim=2)\n", 866 | " num_correct = (pred_labels == labels).float().sum()\n", 867 | " num_total = (labels != -100).float().sum()\n", 868 | " accuracy = num_correct / num_total\n", 869 | " print(\"Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}\".format(epoch+1, loss.item(), accuracy), end=\"\\r\")\n", 870 | " print(\"\")" 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "# Generate Text (GRU)\n", 878 | "\n", 879 | "Here I simply generate several text with trained model.\n", 880 | "\n", 881 | "The metrics to evaluate text generation task is not so easy. (Because simply checking an exact match to a reference text is not optimal.)
\n", 882 | "Use some common metrics available in these cases, such as, BLEU or ROUGE.\n", 883 | "\n", 884 | "> Note : Here I use greedy search and this will sometimes lead to wrong sequence. For drawbacks and solutions, see note in [this example](./05_language_model_basic.ipynb)." 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 16, 890 | "metadata": {}, 891 | "outputs": [ 892 | { 893 | "name": "stdout", 894 | "output_type": "stream", 895 | "text": [ 896 | " prime minister justin trudeau is a big part of the game of the republican party \n", 897 | " chairman of the house appropriations committee on the verge of the supreme court nominee \n", 898 | " he was expected to be a little girl \n" 899 | ] 900 | } 901 | ], 902 | "source": [ 903 | "end_index = stoi[\"\"]\n", 904 | "max_output = 128\n", 905 | "\n", 906 | "def pred_output(text):\n", 907 | " generated_text = \" \" + text\n", 908 | " _, inputs = collate_batch([generated_text])\n", 909 | " mask = (inputs != pad_index).int()\n", 910 | " last_idx = mask[0].sum() - 1\n", 911 | " final_states = None\n", 912 | " outputs, final_states = model(inputs, final_states, return_final_state=True)\n", 913 | " pred_index = outputs[0][last_idx].argmax()\n", 914 | " for loop in range(max_output):\n", 915 | " generated_text += \" \"\n", 916 | " next_word = itos[pred_index]\n", 917 | " generated_text += next_word\n", 918 | " if pred_index.item() == end_index:\n", 919 | " break\n", 920 | " _, inputs = collate_batch([next_word])\n", 921 | " outputs, final_states = model(inputs, final_states, return_final_state=True)\n", 922 | " pred_index = outputs[0][0].argmax()\n", 923 | " return generated_text\n", 924 | "\n", 925 | "print(pred_output(\"prime\"))\n", 926 | "print(pred_output(\"chairman\"))\n", 927 | "print(pred_output(\"he was expected\"))" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": null, 933 | "metadata": {}, 934 | "outputs": [], 935 | "source": [] 936 | } 937 | ], 938 | "metadata": { 939 | "kernelspec": { 940 | "display_name": "Python 3 (ipykernel)", 941 | "language": "python", 942 | "name": "python3" 943 | }, 944 | "language_info": { 945 | "codemirror_mode": { 946 | "name": "ipython", 947 | "version": 3 948 | }, 949 | "file_extension": ".py", 950 | "mimetype": "text/x-python", 951 | "name": "python", 952 | "nbconvert_exporter": "python", 953 | "pygments_lexer": "ipython3", 954 | "version": "3.10.12" 955 | } 956 | }, 957 | "nbformat": 4, 958 | "nbformat_minor": 4 959 | } 960 | -------------------------------------------------------------------------------- /07_encoder_decoder.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Encoder-Decoder Architecture (Machine Translation Example)\n", 8 | "\n", 9 | "In the previous example, we saw the primitive text generation example with RNN, in which each word is selected only by the previous sequence of words.
\n", 10 | "But, in most cases, the word should be decided with other information (context) - i.e, conditioned text generation. For instance, it's in question-answering, it should generate text along with the answer (context) for the given question.\n", 11 | "\n", 12 | "Let's see the following architecture.
\n", 13 | "In this architecture, the word is selected by both the sequence of words and context information ```c```.
\n", 14 | "For instance, when it generates text for movie review, the conditioned context ```c``` might be a contxt about this movie. Even when it generates the text freely, it might be btter to generate a text depending on a context of genre - such as, \"computer science\", \"sports\", \"politics\", etc -, and it will then be able to generate more appropriate text depending on the genre (theme).\n", 15 | "\n", 16 | "![RNN with conditioned context](./images/conditioned_context.png)\n", 17 | "\n", 18 | "The encoder-decoder framework is a trainer for text generation with **sequence-to-sequence** conditioned context as follows. (See the following diagram.)\n", 19 | "\n", 20 | "For instance, when you want to translate French to English, first it generates a conditioned context ```c``` from a source sentence (which may have the sequence of length m).
\n", 21 | "This is called **encoder**, and the encoder summarizes a French sentence as a context vector ```c```.
\n", 22 | "Next it will predict English sentence (which may have the sequence of length n) using the generated context ```c```, and this is called **decoder**.
\n", 23 | "As you can see below, the source length (m) and target length (n) might differ in this training.\n", 24 | "\n", 25 | "![encoder-decoder architecture](./images/encoder_decoder.png)\n", 26 | "\n", 27 | "This encoder-decoder architecture can be used in forms of sequence-to-sequence problems, and is used in a lot of scenarios, such as, auto-response (smart reply or question-answering), inflection, image captioning, etc. (In image captioning task, an image input will be encoded as a vector with convolution network.)
\n", 28 | "It can also be used for generating a vector representation (in which encoder-decoder is trained to reconstruct the input sentence) or text generation, both which have been seen in the previous examples.
\n", 29 | "A variety of today's language tasks depends on encoder-decoder architecture and attention (which will be discussed in the next example).\n", 30 | "\n", 31 | "In this example, I'll implement simple sequence-to-sequence trainer in machine translation task.\n", 32 | "\n", 33 | "For the purpose of your beginng, here I only use encoder-decoder framework (without attention or other advanced architectures) and I note that the result might not be so good.
\n", 34 | "In the next tutorial, we'll add more sophisticated architecture \"attention\" (also, widely used in today's NLP) in this encoder-decoder model.\n", 35 | "\n", 36 | "*back to [index](https://github.com/tsmatz/nlp-tutorials/)*" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Install required packages" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "!pip install torch numpy nltk" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## Prepare data" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "In this example, I use Engligh-French dataset by [Anki](https://www.manythings.org/anki/)." 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 1, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "name": "stdout", 76 | "output_type": "stream", 77 | "text": [ 78 | "--2023-02-14 01:57:45-- http://www.manythings.org/anki/fra-eng.zip\n", 79 | "Resolving www.manythings.org (www.manythings.org)... 173.254.30.110\n", 80 | "Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.\n", 81 | "HTTP request sent, awaiting response... 200 OK\n", 82 | "Length: 6720195 (6.4M) [application/zip]\n", 83 | "Saving to: ‘fra-eng.zip’\n", 84 | "\n", 85 | "fra-eng.zip 100%[===================>] 6.41M 11.3MB/s in 0.6s \n", 86 | "\n", 87 | "2023-02-14 01:57:45 (11.3 MB/s) - ‘fra-eng.zip’ saved [6720195/6720195]\n", 88 | "\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "!wget http://www.manythings.org/anki/fra-eng.zip" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 2, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "name": "stdout", 103 | "output_type": "stream", 104 | "text": [ 105 | "Archive: fra-eng.zip\n", 106 | " inflating: fra-eng/_about.txt \n", 107 | " inflating: fra-eng/fra.txt \n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "!unzip fra-eng.zip -d fra-eng" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 3, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "name": "stdout", 122 | "output_type": "stream", 123 | "text": [ 124 | "Go.\tVa !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)\r\n", 125 | "Go.\tMarche.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8090732 (Micsmithel)\r\n", 126 | "Go.\tEn route !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8267435 (felix63)\r\n", 127 | "Go.\tBouge !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #9022935 (Micsmithel)\r\n", 128 | "Hi.\tSalut !\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)\r\n" 129 | ] 130 | } 131 | ], 132 | "source": [ 133 | "!head -n 5 fra-eng/fra.txt" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "name": "stdout", 143 | "output_type": "stream", 144 | "text": [ 145 | "197463 fra-eng/fra.txt\r\n" 146 | ] 147 | } 148 | ], 149 | "source": [ 150 | "!wc -l fra-eng/fra.txt" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 5, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/plain": [ 161 | "array(['Va !', 'Go.'], dtype='\n", 187 | "Therefore I shuffle entire data." 
188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 6, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "data": { 197 | "text/plain": [ 198 | "array(['Chantons une chanson\\u202f!', 'Let us sing a song.'],\n", 199 | " dtype='``` and `````` tokens in string." 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 9, 301 | "metadata": {}, 302 | "outputs": [ 303 | { 304 | "data": { 305 | "text/plain": [ 306 | "array([' chantons une chanson ',\n", 307 | " ' let us sing a song '], dtype='\", x, \"\"]), \" \".join([\"\", y, \"\"])] for x, y in train_data])\n", 317 | "# print first row\n", 318 | "train_data[0]" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "## Generate sequence inputs\n", 326 | "\n", 327 | "We will generate the sequence of word's indices (i.e, tokenize) from text.\n", 328 | "\n", 329 | "![Index vectorize](images/index_vectorize2.png)\n", 330 | "\n", 331 | "First we create a list of vocabulary (```vocab```) for both source text (French) and target text (English) respectively." 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 10, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "from nltk.tokenize import SpaceTokenizer\n", 341 | "\n", 342 | "###\n", 343 | "# define Vocab\n", 344 | "###\n", 345 | "class Vocab:\n", 346 | " def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):\n", 347 | " # count vocab frequency\n", 348 | " vocab_freq = {}\n", 349 | " tokens = tokenization(list_of_sentence)\n", 350 | " for t in tokens:\n", 351 | " for vocab in t:\n", 352 | " if vocab not in vocab_freq:\n", 353 | " vocab_freq[vocab] = 0 \n", 354 | " vocab_freq[vocab] += 1\n", 355 | " # sort by frequency\n", 356 | " vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}\n", 357 | " # create vocab list\n", 358 | " self.vocabs = [special_token] + list(vocab_freq.keys())\n", 359 | " if max_tokens:\n", 360 | " self.vocabs = self.vocabs[:max_tokens]\n", 361 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 362 | "\n", 363 | " def _get_tokens(self, list_of_sentence):\n", 364 | " for sentence in list_of_sentence:\n", 365 | " tokens = tokenizer.tokenize(sentence)\n", 366 | " yield tokens\n", 367 | "\n", 368 | " def get_itos(self):\n", 369 | " return self.vocabs\n", 370 | "\n", 371 | " def get_stoi(self):\n", 372 | " return self.stoi\n", 373 | "\n", 374 | " def append_token(self, token):\n", 375 | " self.vocabs.append(token)\n", 376 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 377 | "\n", 378 | " def __call__(self, list_of_tokens):\n", 379 | " def get_token_index(token):\n", 380 | " if token in self.stoi:\n", 381 | " return self.stoi[token]\n", 382 | " else:\n", 383 | " return 0\n", 384 | " return [get_token_index(t) for t in list_of_tokens]\n", 385 | "\n", 386 | " def __len__(self):\n", 387 | " return len(self.vocabs)\n", 388 | "\n", 389 | "###\n", 390 | "# generate Vocab\n", 391 | "###\n", 392 | "max_word = 10000\n", 393 | "\n", 394 | "# create space-split tokenizer\n", 395 | "tokenizer = SpaceTokenizer()\n", 396 | "\n", 397 | "# define tokenization function\n", 398 | "def yield_tokens(data):\n", 399 | " for text in data:\n", 400 | " tokens = tokenizer.tokenize(text)\n", 401 | " yield tokens\n", 402 | "\n", 403 | "# build vocabulary list for French\n", 404 | "vocab_fr = Vocab(\n", 405 | " train_data[:,0],\n", 406 | " tokenization=yield_tokens,\n", 407 | " 
special_token=\"\",\n", 408 | " max_tokens=max_word,\n", 409 | ")\n", 410 | "\n", 411 | "# build vocabulary list for English\n", 412 | "vocab_en = Vocab(\n", 413 | " train_data[:,1],\n", 414 | " tokenization=yield_tokens,\n", 415 | " special_token=\"\",\n", 416 | " max_tokens=max_word,\n", 417 | ")" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "The generated token index is ```0, 1, ... , vocab_size - 1```.
\n", 425 | "Now I set ```vocab_size``` as a token id in padded positions for both French and English respctively." 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 11, 431 | "metadata": {}, 432 | "outputs": [], 433 | "source": [ 434 | "pad_index_fr = vocab_fr.__len__()\n", 435 | "vocab_fr.append_token(\"\")\n", 436 | "\n", 437 | "pad_index_en = vocab_en.__len__()\n", 438 | "vocab_en.append_token(\"\")" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "Get list for both index-to-word and word-to-index." 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 12, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "itos_fr = vocab_fr.get_itos()\n", 455 | "stoi_fr = vocab_fr.get_stoi()\n", 456 | "\n", 457 | "itos_en = vocab_en.get_itos()\n", 458 | "stoi_en = vocab_en.get_stoi()" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 13, 464 | "metadata": {}, 465 | "outputs": [ 466 | { 467 | "name": "stdout", 468 | "output_type": "stream", 469 | "text": [ 470 | "The number of token index in French (source) is 10001.\n", 471 | "The padded index in French (source) is 10000.\n", 472 | "The number of token index in English (target) is 10001.\n", 473 | "The padded index in English (target) is 10000.\n" 474 | ] 475 | } 476 | ], 477 | "source": [ 478 | "# test\n", 479 | "print(\"The number of token index in French (source) is {}.\".format(vocab_fr.__len__()))\n", 480 | "print(\"The padded index in French (source) is {}.\".format(stoi_fr[\"\"]))\n", 481 | "print(\"The number of token index in English (target) is {}.\".format(vocab_en.__len__()))\n", 482 | "print(\"The padded index in English (target) is {}.\".format(stoi_en[\"\"]))" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "Now we build a collator function, which is used for pre-processing in data loader.\n", 490 | "\n", 491 | "In this collator,\n", 492 | "\n", 493 | "(1) First we create a list of word's indices for source (French) and target (English) respectively as follows.\n", 494 | "\n", 495 | "``` this is pen ``` --> ```[2, 7, 5, 14, 1]```\n", 496 | "\n", 497 | "(2) For target (English) sequence, we separate into features (x) and labels (y).
\n", 498 | "In this task, we predict the next word in target (English) sequence using the current word's sequence (English) and the encoded context of source (French).
\n", 499 | "We then separate target sequence into the sequence iteself (x) and the following label (y).\n", 500 | "\n", 501 | "before :\n", 502 | "\n", 503 | "```[2, 7, 5, 14, 1]```\n", 504 | "\n", 505 | "after :\n", 506 | "\n", 507 | "```x : [2, 7, 5, 14, 1]```\n", 508 | "\n", 509 | "```y : [7, 5, 14, 1, -100]```\n", 510 | "\n", 511 | "> Note : Here I set -100 as an unknown label id, because PyTorch cross-entropy function (```torch.nn.functional.cross_entropy()```) has a property ```ignore_index``` which default value is -100.\n", 512 | "\n", 513 | "(3) Finally we pad the inputs (for both source and target) as follows.
\n", 514 | "The padded index in features is ```pad_index``` and the padded index in label is -100. (See above note.)\n", 515 | "\n", 516 | "```x : [2, 7, 5, 14, 1, N, ... , N]```\n", 517 | "\n", 518 | "```y : [7, 5, 14, 1, -100, -100, ... , -100]```" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 14, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "import torch\n", 528 | "from torch.utils.data import DataLoader\n", 529 | "\n", 530 | "seq_len_fr = 45\n", 531 | "seq_len_en = 38\n", 532 | "\n", 533 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 534 | "\n", 535 | "def collate_batch(batch):\n", 536 | " label_list, feature_source_list, feature_target_list = [], [], []\n", 537 | " for text_fr, text_en in batch:\n", 538 | " # (1) tokenize to a list of word's indices\n", 539 | " tokens_fr = vocab_fr(tokenizer.tokenize(text_fr))\n", 540 | " tokens_en = vocab_en(tokenizer.tokenize(text_en))\n", 541 | " # (2) separate into features and labels in target tokens (English)\n", 542 | " y = tokens_en[1:]\n", 543 | " y.append(-100)\n", 544 | " # (3) limit length to seq_len and pad sequence\n", 545 | " y = y[:seq_len_en]\n", 546 | " tokens_fr = tokens_fr[:seq_len_fr]\n", 547 | " tokens_en = tokens_en[:seq_len_en]\n", 548 | " y += [-100] * (seq_len_en - len(y))\n", 549 | " tokens_fr += [pad_index_fr] * (seq_len_fr - len(tokens_fr))\n", 550 | " tokens_en += [pad_index_en] * (seq_len_en - len(tokens_en))\n", 551 | " # add to list\n", 552 | " label_list.append(y)\n", 553 | " feature_source_list.append(tokens_fr)\n", 554 | " feature_target_list.append(tokens_en)\n", 555 | " # convert to tensor\n", 556 | " label_list = torch.tensor(label_list, dtype=torch.int64).to(device)\n", 557 | " feature_source_list = torch.tensor(feature_source_list, dtype=torch.int64).to(device)\n", 558 | " feature_target_list = torch.tensor(feature_target_list, dtype=torch.int64).to(device)\n", 559 | " return label_list, feature_source_list, feature_target_list\n", 560 | "\n", 561 | "dataloader = DataLoader(\n", 562 | " list(zip(train_data[:,0], train_data[:,1])),\n", 563 | " batch_size=64,\n", 564 | " shuffle=True,\n", 565 | " collate_fn=collate_batch\n", 566 | ")" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 15, 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "name": "stdout", 576 | "output_type": "stream", 577 | "text": [ 578 | "label shape in batch : torch.Size([64, 38])\n", 579 | "feature source shape in batch : torch.Size([64, 45])\n", 580 | "feature target shape in batch : torch.Size([64, 38])\n", 581 | "***** label sample *****\n", 582 | "tensor([ 84, 95, 583, 1489, 343, 159, 1, -100, -100, -100, -100, -100,\n", 583 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 584 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 585 | " -100, -100], device='cuda:0')\n", 586 | "***** features (source) sample *****\n", 587 | "tensor([ 2, 3, 76, 77, 4616, 11, 437, 3470, 563, 1,\n", 588 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 589 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 590 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 591 | " 10000, 10000, 10000, 10000, 10000], device='cuda:0')\n", 592 | "***** features (target) sample *****\n", 593 | "tensor([ 2, 84, 95, 583, 1489, 343, 159, 1, 10000, 10000,\n", 594 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 
10000, 10000,\n", 595 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 596 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000],\n", 597 | " device='cuda:0')\n" 598 | ] 599 | } 600 | ], 601 | "source": [ 602 | "# test\n", 603 | "for labels, sources, targets in dataloader:\n", 604 | " break\n", 605 | "\n", 606 | "print(\"label shape in batch : {}\".format(labels.size()))\n", 607 | "print(\"feature source shape in batch : {}\".format(sources.size()))\n", 608 | "print(\"feature target shape in batch : {}\".format(targets.size()))\n", 609 | "print(\"***** label sample *****\")\n", 610 | "print(labels[0])\n", 611 | "print(\"***** features (source) sample *****\")\n", 612 | "print(sources[0])\n", 613 | "print(\"***** features (target) sample *****\")\n", 614 | "print(targets[0])" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "## Build Encoder-Decoder Network\n", 622 | "\n", 623 | "Now we build a model with encoder-decoder architecture. The brief outline of this architecture is as follows. :\n", 624 | "\n", 625 | "- The context is generated by using the entire source sequence (French) in encoder.\n", 626 | "- The encoder's context is then concatenated with the words of current target's sequence (English) and passed into RNN layer in decoder.\n", 627 | "- RNN outputs (not only final output, but in all units in sequence) is passed into linear (FCNet) layer and generate the logits of next words.\n", 628 | "- Calculate loss between predicted next words and the true values of next words, and then proceed to optimize neural networks.\n", 629 | "\n", 630 | "![the trainer architecture of machine translation](./images/machine_translation.png)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "First, we build encoder model.
\n", 638 | "See the [previous example](./06_language_model_rnn.ipynb) for details about RNN inputs and outputs in PyTorch. (Here I also use packed sequence, because I want to process appropriate time-steps in each sequence.)\n", 639 | "\n", 640 | "In this example, only the last output of RNN (GRU) is required in encoder model, because we need a single context in each sequence.\n", 641 | "\n", 642 | "![final output in encoder](./images/encoder_final.png)" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 16, 648 | "metadata": {}, 649 | "outputs": [], 650 | "source": [ 651 | "import torch\n", 652 | "import torch.nn as nn\n", 653 | "\n", 654 | "embedding_dim = 256\n", 655 | "rnn_units = 1024\n", 656 | "\n", 657 | "class Encoder(nn.Module):\n", 658 | " def __init__(self, vocab_size, embedding_dim, rnn_units, padding_idx):\n", 659 | " super().__init__()\n", 660 | "\n", 661 | " self.padding_idx = padding_idx\n", 662 | "\n", 663 | " self.embedding = nn.Embedding(\n", 664 | " vocab_size,\n", 665 | " embedding_dim,\n", 666 | " padding_idx=padding_idx,\n", 667 | " )\n", 668 | " self.rnn = nn.GRU(\n", 669 | " input_size=embedding_dim,\n", 670 | " hidden_size=rnn_units,\n", 671 | " num_layers=1,\n", 672 | " batch_first=True,\n", 673 | " )\n", 674 | "\n", 675 | " def forward(self, inputs):\n", 676 | " # embedding\n", 677 | " # --> (batch_size, seq_len, embedding_dim)\n", 678 | " outs = self.embedding(inputs)\n", 679 | " # build \"lengths\" property to pack inputs (see previous example)\n", 680 | " lengths = (inputs != self.padding_idx).int().sum(dim=1, keepdim=False)\n", 681 | " # pack inputs for RNN (see previous example)\n", 682 | " packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(\n", 683 | " outs,\n", 684 | " lengths.cpu(),\n", 685 | " batch_first=True,\n", 686 | " enforce_sorted=False,\n", 687 | " )\n", 688 | " # apply RNN\n", 689 | " _, final_state = self.rnn(packed_inputs)\n", 690 | " # (1, batch_size, rnn_units) --> (batch_size, rnn_units)\n", 691 | " final_state = final_state.squeeze(dim=0)\n", 692 | " # return results\n", 693 | " return final_state\n", 694 | "\n", 695 | "enc_model = Encoder(\n", 696 | " vocab_size=vocab_fr.__len__(),\n", 697 | " embedding_dim=embedding_dim,\n", 698 | " rnn_units=rnn_units,\n", 699 | " padding_idx=pad_index_fr).to(device)" 700 | ] 701 | }, 702 | { 703 | "cell_type": "markdown", 704 | "metadata": {}, 705 | "source": [ 706 | "Next we build decoder model.\n", 707 | "\n", 708 | "The decoder receives encoder's final output, and this is used in all units in target's sequence.\n", 709 | "\n", 710 | "In each unit, encoder's final output (context) is concatenated with word's embedding vectors in current target (English).
\n", 711 | "The concatenated vector is then passed into RNN. The output of RNN is then passed into linear (fully-connected network, FCNet) and it generates the next word's logits.\n", 712 | "\n", 713 | "![the trainer architecture of machine translation](./images/machine_translation.png)\n", 714 | "\n", 715 | "Same as previous examples, RNN inputs are packed, because appropriate steps in each sequence should be processed.\n", 716 | "\n", 717 | "> Note : In encoder-decoder architecture, there exist a variation to set encoder's final state as decoder's initial state.
\n", 718 | "> In this example, I don't set initial state (i.e, set zero state as initial state) in decoder." 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": 17, 724 | "metadata": {}, 725 | "outputs": [], 726 | "source": [ 727 | "import torch\n", 728 | "import torch.nn as nn\n", 729 | "\n", 730 | "class Decoder(nn.Module):\n", 731 | " def __init__(self, vocab_size, seq_len, embedding_dim, rnn_units, padding_idx, hidden_dim=1024):\n", 732 | " super().__init__()\n", 733 | "\n", 734 | " self.seq_len = seq_len\n", 735 | " self.padding_idx = padding_idx\n", 736 | "\n", 737 | " self.embedding = nn.Embedding(\n", 738 | " vocab_size,\n", 739 | " embedding_dim,\n", 740 | " padding_idx=padding_idx,\n", 741 | " )\n", 742 | " self.rnn = nn.GRU(\n", 743 | " input_size=embedding_dim + rnn_units,\n", 744 | " hidden_size=rnn_units,\n", 745 | " num_layers=1,\n", 746 | " batch_first=True,\n", 747 | " )\n", 748 | " self.hidden = nn.Linear(rnn_units, hidden_dim)\n", 749 | " self.classify = nn.Linear(hidden_dim, vocab_size)\n", 750 | " self.relu = nn.ReLU()\n", 751 | "\n", 752 | " def forward(self, inputs, enc_outputs, states=None, return_final_state=False):\n", 753 | " # embedding\n", 754 | " # --> (batch_size, seq_len, embedding_dim)\n", 755 | " outs = self.embedding(inputs)\n", 756 | " # convert the shape of enc_outputs :\n", 757 | " # (batch_size, rnn_units) --> (batch_size, 1, rnn_units)\n", 758 | " enc_outputs = enc_outputs[:,None,:]\n", 759 | " # (batch_size, rnn_units) --> (batch_size, seq_len, rnn_units)\n", 760 | " enc_outputs = enc_outputs.expand(-1, self.seq_len, -1)\n", 761 | " # concat encoder's output\n", 762 | " # --> (batch_size, seq_len, embedding_dim + rnn_units)\n", 763 | " outs = torch.concat((outs, enc_outputs), dim=-1)\n", 764 | " # build \"lengths\" property to pack inputs (see above)\n", 765 | " lengths = (inputs != self.padding_idx).int().sum(dim=1, keepdim=False)\n", 766 | " # pack inputs for RNN\n", 767 | " packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(\n", 768 | " outs,\n", 769 | " lengths.cpu(),\n", 770 | " batch_first=True,\n", 771 | " enforce_sorted=False,\n", 772 | " )\n", 773 | " # apply RNN\n", 774 | " if states is None:\n", 775 | " packed_outs, final_state = self.rnn(packed_inputs)\n", 776 | " else:\n", 777 | " packed_outs, final_state = self.rnn(packed_inputs, states)\n", 778 | " # unpack results\n", 779 | " # --> (batch_size, seq_len, rnn_units)\n", 780 | " outs, _ = torch.nn.utils.rnn.pad_packed_sequence(\n", 781 | " packed_outs,\n", 782 | " batch_first=True,\n", 783 | " padding_value=0.0,\n", 784 | " total_length=self.seq_len,\n", 785 | " )\n", 786 | " # apply feed-forward (hidden)\n", 787 | " # --> (batch_size, seq_len, hidden_dim)\n", 788 | " outs = self.hidden(outs)\n", 789 | " outs = self.relu(outs)\n", 790 | " # apply feed-forward to classify\n", 791 | " # --> (batch_size, seq_len, vocab_size)\n", 792 | " logits = self.classify(outs)\n", 793 | " # return results\n", 794 | " if return_final_state:\n", 795 | " return logits, final_state # This is used in prediction\n", 796 | " else:\n", 797 | " return logits # This is used in training\n", 798 | "\n", 799 | "dec_model = Decoder(\n", 800 | " vocab_size=vocab_en.__len__(),\n", 801 | " seq_len=seq_len_en,\n", 802 | " embedding_dim=embedding_dim,\n", 803 | " rnn_units=rnn_units,\n", 804 | " padding_idx=pad_index_en).to(device)" 805 | ] 806 | }, 807 | { 808 | "cell_type": "markdown", 809 | "metadata": {}, 810 | "source": [ 811 | "## Train\n", 812 | "\n", 813 | "Now we put it all 
together and run training.\n", 814 | "\n", 815 | "The loss on label id=-100 is ignored in ```cross_entropy()``` function. The padded position and the end of sequence will then be ignored in optimization.\n", 816 | "\n", 817 | "> Note : Because the default value of ```ignore_index``` property in ```cross_entropy()``` function is -100. (You can change this default value.)" 818 | ] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "execution_count": 18, 823 | "metadata": {}, 824 | "outputs": [ 825 | { 826 | "name": "stdout", 827 | "output_type": "stream", 828 | "text": [ 829 | "Epoch 1 - loss: 2.7009 - accuracy: 0.6364\n", 830 | "Epoch 2 - loss: 1.6154 - accuracy: 0.7273\n", 831 | "Epoch 3 - loss: 0.7325 - accuracy: 0.8125\n", 832 | "Epoch 4 - loss: 0.1153 - accuracy: 1.0000\n", 833 | "Epoch 5 - loss: 0.4040 - accuracy: 0.9231\n" 834 | ] 835 | } 836 | ], 837 | "source": [ 838 | "from torch.nn import functional as F\n", 839 | "\n", 840 | "num_epochs = 5\n", 841 | "\n", 842 | "all_params = list(enc_model.parameters()) + list(dec_model.parameters())\n", 843 | "optimizer = torch.optim.AdamW(all_params, lr=0.001)\n", 844 | "for epoch in range(num_epochs):\n", 845 | " for labels, sources, targets in dataloader:\n", 846 | " # optimize\n", 847 | " optimizer.zero_grad()\n", 848 | " enc_outputs = enc_model(sources)\n", 849 | " logits = dec_model(targets, enc_outputs)\n", 850 | " loss = F.cross_entropy(logits.transpose(1,2), labels)\n", 851 | " loss.backward()\n", 852 | " optimizer.step()\n", 853 | " # calculate accuracy\n", 854 | " pred_labels = logits.argmax(dim=2)\n", 855 | " num_correct = (pred_labels == labels).float().sum()\n", 856 | " num_total = (labels != -100).float().sum()\n", 857 | " accuracy = num_correct / num_total\n", 858 | " print(\"Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}\".format(epoch+1, loss.item(), accuracy), end=\"\\r\")\n", 859 | " print(\"\")" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": {}, 865 | "source": [ 866 | "## Translate Text\n", 867 | "\n", 868 | "Now translate French text to English text with trained model. (All these sentences are not in training set.)\n", 869 | "\n", 870 | "Here I simply translate several brief sentences, but the metrics to evaluate text-generation task will not be so easy. (Because simply checking an exact match to a reference text is not optimal.)
\n", 871 | "To eveluate the trained model, use some common metrics available in text generation, such as, BLEU or ROUGE.\n", 872 | "\n", 873 | "> Note : Here I use greedy search and this will sometimes lead to wrong sequence. For drawbacks and solutions, see note in [this example](./05_language_model_basic.ipynb)." 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": 19, 879 | "metadata": {}, 880 | "outputs": [], 881 | "source": [ 882 | "import numpy as np\n", 883 | "\n", 884 | "end_index_en = stoi_en[\"\"]\n", 885 | "max_output = 128\n", 886 | "\n", 887 | "def translate(sentence):\n", 888 | " # preprocess inputs\n", 889 | " text_fr = sentence\n", 890 | " text_fr = text_fr.lower()\n", 891 | " text_fr = \" \".join([\"\", text_fr, \"\"])\n", 892 | " text_en = \"\"\n", 893 | " _, tokens_fr, tokens_en = collate_batch(list(zip([text_fr], [text_en])))\n", 894 | "\n", 895 | " # process encoder\n", 896 | " enc_outputs = enc_model(tokens_fr)\n", 897 | "\n", 898 | " # process decoder\n", 899 | " states = None\n", 900 | " for loop in range(max_output):\n", 901 | " logits, states = dec_model(\n", 902 | " tokens_en,\n", 903 | " enc_outputs,\n", 904 | " states=states,\n", 905 | " return_final_state=True)\n", 906 | " pred_idx_en = logits[0][0].argmax()\n", 907 | " next_word_en = itos_en[pred_idx_en]\n", 908 | " text_en += \" \"\n", 909 | " text_en += next_word_en\n", 910 | " if pred_idx_en.item() == end_index_en:\n", 911 | " break\n", 912 | " _, _, tokens_en = collate_batch(list(zip([\"\"], [next_word_en])))\n", 913 | " return text_en" 914 | ] 915 | }, 916 | { 917 | "cell_type": "code", 918 | "execution_count": 20, 919 | "metadata": {}, 920 | "outputs": [ 921 | { 922 | "name": "stdout", 923 | "output_type": "stream", 924 | "text": [ 925 | " i like the guitar \n", 926 | " he lives in japan \n", 927 | " this book is close to him \n", 928 | " this is my favorite song \n", 929 | " he drives a family and he will return to a new car \n" 930 | ] 931 | } 932 | ], 933 | "source": [ 934 | "print(translate(\"j'aime la guitare\")) # i like guitar\n", 935 | "print(translate(\"il vit au japon\")) # he lives in Japan\n", 936 | "print(translate(\"ce stylo est utilisé par lui\")) # this pen is used by him\n", 937 | "print(translate(\"c'est ma chanson préférée\")) # that's my favorite song\n", 938 | "print(translate(\"il conduit une voiture et va à new york\")) # he drives a car and goes to new york" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "In this vanilla encoder-decoder architecture, the source (French) is encoded into a single context, and it will then be hard to manipulate the long context.
\n", 946 | "In the next exercise, we will refine architecture to tackle this weak points." 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": null, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [] 955 | } 956 | ], 957 | "metadata": { 958 | "kernelspec": { 959 | "display_name": "Python 3 (ipykernel)", 960 | "language": "python", 961 | "name": "python3" 962 | }, 963 | "language_info": { 964 | "codemirror_mode": { 965 | "name": "ipython", 966 | "version": 3 967 | }, 968 | "file_extension": ".py", 969 | "mimetype": "text/x-python", 970 | "name": "python", 971 | "nbconvert_exporter": "python", 972 | "pygments_lexer": "ipython3", 973 | "version": "3.12.3" 974 | } 975 | }, 976 | "nbformat": 4, 977 | "nbformat_minor": 4 978 | } 979 | -------------------------------------------------------------------------------- /08_attention.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Attention (Machine Translation Example)\n", 8 | "\n", 9 | "In the [previous example](./07_encoder_decoder.ipynb), we saw sequence-to-sequence encoder-decoder architecture in machine translation.
\n", 10 | "In the previous example, the input sequence is encoded into a single context, and this context is used for decoding in all units in generated tokens.\n", 11 | "\n", 12 | "This architecture will not be flexible, and also not scalable. For instance, in case of machine translation, it will be difficult to translate a long text (such as, translate multiple sentences at once) unlike human translation. (Because a single context will not be enough to represent entire text, when the text is so long.)\n", 13 | "\n", 14 | "By introducing attention architecture, this constraint can be relaxed.
\n", 15 | "The attention is more elaborative and widely used architecture in today's NLP, and a lot of tasks (such as, machine translation, smart reply, etc) are researched by adding attention mechanism and worked well today.\n", 16 | "\n", 17 | "The overview outline of attention architecture is shown as follows.
\n", 18 | "In this network, the context ```c``` is computed and obtained in attention layer (```attend``` in the following diagram) on decoder, and the different context is then used in each units in the sequence for decoding. (In the following diagram, each ```attend``` layer is the same network and then shares the weight's parameters.)\n", 19 | "\n", 20 | "![encoder-decoder with attention architecture](./images/encoder_decoder_attention.png)\n", 21 | "\n", 22 | "Within attention layer, it uses previous state and encoder's outputs (not only final output, but outputs in all units), and it generates $\\{ \\alpha_j^i \\}\\;(i=1,\\ldots,n)$, in which $\\sum_i \\alpha_j^i = 1$, with dense net (FCNet) and softmax activation, where $n$ is the number of encoder's outputs and $j$ is time step in sequence. (See the following diagram.)
\n", 23 | "To say in abstraction, $\\{ \\alpha_j^i \\}\\;(i=1,\\ldots,n)$ means an alignment's weight at j-th time step for each source sequence outputs, $o_1^{\\prime}, o_2^{\\prime}, \\ldots, o_n^{\\prime}$.
\n", 24 | "(This $\\{ \\alpha_j^i \\}\\;(i=1,\\ldots,n)$ is then called attention weights.)\n", 25 | "\n", 26 | "> Note : The softmax function is often used for normalizing outputs (sum to one) in neural networks. See [here](https://tsmatz.wordpress.com/2017/08/30/regression-in-machine-learning-math-for-beginners/) for softmax function.\n", 27 | "\n", 28 | "And it finally generates context $c_j$ at j-th time step by $ c_j = \\sum_i^n \\alpha_j^i \\cdot o_i^{\\prime} $.\n", 29 | "\n", 30 | "![soft attention architecture](./images/soft_attention.png)\n", 31 | "\n", 32 | "> Note : This architecture is called **soft attention**, which is the first attention introduced in the context of sequence-to-sequence generation. (See Bahdanau et al.)
\n", 33 | "> There exist a lot of variants in attention architecture. See the [next example](./09_transformer.ipynb) for famous scaled dot-product attention (and self-attention) in transformer.\n", 34 | "\n", 35 | "With this network, it can focus on specific components in source sequence.
\n", 36 | "For instance, in case of the following French-to-English machine translation, the 3rd units in sequence (\"don't\" in English) will strongly focus on 3rd and 5th components in original sequence (French), because the word \"don't\" will be strongly related to \"ne\" and \"pas\" in French. On the other hand, the components \"je\" and \"comprends\" in French are weakly referred, because it's not directly related to \"don't\" in English, but it's used only for determining not \"doesn't\" or not \"isn't\".
\n", 37 | "As a result, the attention weights $\\{ \\alpha_j^i \\}\\;(i=1,\\ldots,n)$ will be larger for the source components \"ne\" and \"pas\", and will be smaller for the source components \"je\" and \"comprends\".\n", 38 | "\n", 39 | "![attend in machine translation](./images/attend_image.png)\n", 40 | "\n", 41 | "*back to [index](https://github.com/tsmatz/nlp-tutorials/)*" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Install required packages" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "> Note : Currently torch 1.13.1 for cuda 11.4 has bugs (in which we can't run ```nn.Linear``` with ```out_features=1```) and we then use cuda 11.8 here." 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "!pip install torch numpy nltk" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## Prepare data" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "In this example, I use Engligh-French dataset by [Anki](https://www.manythings.org/anki/)." 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 1, 84 | "metadata": {}, 85 | "outputs": [ 86 | { 87 | "name": "stdout", 88 | "output_type": "stream", 89 | "text": [ 90 | "--2023-02-17 13:37:35-- http://www.manythings.org/anki/fra-eng.zip\n", 91 | "Resolving www.manythings.org (www.manythings.org)... 173.254.30.110\n", 92 | "Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.\n", 93 | "HTTP request sent, awaiting response... 200 OK\n", 94 | "Length: 6720195 (6.4M) [application/zip]\n", 95 | "Saving to: ‘fra-eng.zip’\n", 96 | "\n", 97 | "fra-eng.zip 100%[===================>] 6.41M 568KB/s in 15s \n", 98 | "\n", 99 | "2023-02-17 13:37:51 (429 KB/s) - ‘fra-eng.zip’ saved [6720195/6720195]\n", 100 | "\n" 101 | ] 102 | } 103 | ], 104 | "source": [ 105 | "!wget http://www.manythings.org/anki/fra-eng.zip" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 2, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "Archive: fra-eng.zip\n", 118 | " inflating: fra-eng/_about.txt \n", 119 | " inflating: fra-eng/fra.txt \n" 120 | ] 121 | } 122 | ], 123 | "source": [ 124 | "!unzip fra-eng.zip -d fra-eng" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 3, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "Go.\tVa !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)\r\n", 137 | "Go.\tMarche.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8090732 (Micsmithel)\r\n", 138 | "Go.\tEn route !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8267435 (felix63)\r\n", 139 | "Go.\tBouge !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #9022935 (Micsmithel)\r\n", 140 | "Hi.\tSalut !\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)\r\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "!head -n 5 fra-eng/fra.txt" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 4, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "197463 fra-eng/fra.txt\r\n" 158 | ] 159 | 
} 160 | ], 161 | "source": [ 162 | "!wc -l fra-eng/fra.txt" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 5, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "data": { 172 | "text/plain": [ 173 | "array(['Va !', 'Go.'], dtype='\n", 199 | "Therefore I shuffle entire data." 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 6, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "data": { 209 | "text/plain": [ 210 | "array(['Dis toujours la vérité.', 'Always tell the truth.'], dtype='``` and `````` tokens in string." 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 9, 312 | "metadata": {}, 313 | "outputs": [ 314 | { 315 | "data": { 316 | "text/plain": [ 317 | "array([' dis toujours la vérité ',\n", 318 | " ' always tell the truth '], dtype='\", x, \"\"]), \" \".join([\"\", y, \"\"])] for x, y in train_data])\n", 328 | "# print first row\n", 329 | "train_data[0]" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "## Generate sequence inputs\n", 337 | "\n", 338 | "We will generate the sequence of word's indices (i.e, tokenize) from text.\n", 339 | "\n", 340 | "![Index vectorize](images/index_vectorize2.png)\n", 341 | "\n", 342 | "First we create a list of vocabulary (```vocab```) for both source text (French) and target text (English) respectively." 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 10, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [ 351 | "from nltk.tokenize import SpaceTokenizer\n", 352 | "\n", 353 | "###\n", 354 | "# define Vocab\n", 355 | "###\n", 356 | "class Vocab:\n", 357 | " def __init__(self, list_of_sentence, tokenization, special_token, max_tokens=None):\n", 358 | " # count vocab frequency\n", 359 | " vocab_freq = {}\n", 360 | " tokens = tokenization(list_of_sentence)\n", 361 | " for t in tokens:\n", 362 | " for vocab in t:\n", 363 | " if vocab not in vocab_freq:\n", 364 | " vocab_freq[vocab] = 0 \n", 365 | " vocab_freq[vocab] += 1\n", 366 | " # sort by frequency\n", 367 | " vocab_freq = {k: v for k, v in sorted(vocab_freq.items(), key=lambda i: i[1], reverse=True)}\n", 368 | " # create vocab list\n", 369 | " self.vocabs = [special_token] + list(vocab_freq.keys())\n", 370 | " if max_tokens:\n", 371 | " self.vocabs = self.vocabs[:max_tokens]\n", 372 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 373 | "\n", 374 | " def _get_tokens(self, list_of_sentence):\n", 375 | " for sentence in list_of_sentence:\n", 376 | " tokens = tokenizer.tokenize(sentence)\n", 377 | " yield tokens\n", 378 | "\n", 379 | " def get_itos(self):\n", 380 | " return self.vocabs\n", 381 | "\n", 382 | " def get_stoi(self):\n", 383 | " return self.stoi\n", 384 | "\n", 385 | " def append_token(self, token):\n", 386 | " self.vocabs.append(token)\n", 387 | " self.stoi = {v: i for i, v in enumerate(self.vocabs)}\n", 388 | "\n", 389 | " def __call__(self, list_of_tokens):\n", 390 | " def get_token_index(token):\n", 391 | " if token in self.stoi:\n", 392 | " return self.stoi[token]\n", 393 | " else:\n", 394 | " return 0\n", 395 | " return [get_token_index(t) for t in list_of_tokens]\n", 396 | "\n", 397 | " def __len__(self):\n", 398 | " return len(self.vocabs)\n", 399 | "\n", 400 | "###\n", 401 | "# generate Vocab\n", 402 | "###\n", 403 | "max_word = 10000\n", 404 | "\n", 405 | "# create space-split tokenizer\n", 406 | "tokenizer = SpaceTokenizer()\n", 407 | "\n", 408 | "# define tokenization function\n", 409 | 
"def yield_tokens(data):\n", 410 | " for text in data:\n", 411 | " tokens = tokenizer.tokenize(text)\n", 412 | " yield tokens\n", 413 | "\n", 414 | "# build vocabulary list for French\n", 415 | "vocab_fr = Vocab(\n", 416 | " train_data[:,0],\n", 417 | " tokenization=yield_tokens,\n", 418 | " special_token=\"\",\n", 419 | " max_tokens=max_word,\n", 420 | ")\n", 421 | "\n", 422 | "# build vocabulary list for English\n", 423 | "vocab_en = Vocab(\n", 424 | " train_data[:,1],\n", 425 | " tokenization=yield_tokens,\n", 426 | " special_token=\"\",\n", 427 | " max_tokens=max_word,\n", 428 | ")" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "The generated token index is ```0, 1, ... , vocab_size - 1```.
\n", 436 | "Now I set ```vocab_size``` as a token id in padded positions for both French and English respctively." 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 11, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "pad_index_fr = vocab_fr.__len__()\n", 446 | "vocab_fr.append_token(\"\")\n", 447 | "\n", 448 | "pad_index_en = vocab_en.__len__()\n", 449 | "vocab_en.append_token(\"\")" 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "Get list for both index-to-word and word-to-index." 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 12, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "itos_fr = vocab_fr.get_itos()\n", 466 | "stoi_fr = vocab_fr.get_stoi()\n", 467 | "\n", 468 | "itos_en = vocab_en.get_itos()\n", 469 | "stoi_en = vocab_en.get_stoi()" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 13, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "name": "stdout", 479 | "output_type": "stream", 480 | "text": [ 481 | "The number of token index in French (source) is 10001.\n", 482 | "The padded index in French (source) is 10000.\n", 483 | "The number of token index in English (target) is 10001.\n", 484 | "The padded index in English (target) is 10000.\n" 485 | ] 486 | } 487 | ], 488 | "source": [ 489 | "# test\n", 490 | "print(\"The number of token index in French (source) is {}.\".format(vocab_fr.__len__()))\n", 491 | "print(\"The padded index in French (source) is {}.\".format(stoi_fr[\"\"]))\n", 492 | "print(\"The number of token index in English (target) is {}.\".format(vocab_en.__len__()))\n", 493 | "print(\"The padded index in English (target) is {}.\".format(stoi_en[\"\"]))" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "Now we build a collator function, which is used for pre-processing in data loader.\n", 501 | "\n", 502 | "In this collator,\n", 503 | "\n", 504 | "(1) First we create a list of word's indices for source (French) and target (English) respectively as follows.\n", 505 | "\n", 506 | "``` this is pen ``` --> ```[2, 7, 5, 14, 1]```\n", 507 | "\n", 508 | "(2) For target (English) sequence, we separate into features (x) and labels (y).
\n", 509 | "In this task, we predict the next word in target (English) sequence using the current word's sequence (English) and the encoded context of source (French).
\n", 510 | "We then separate target sequence into the sequence iteself (x) and the following label (y).\n", 511 | "\n", 512 | "before :\n", 513 | "\n", 514 | "```[2, 7, 5, 14, 1]```\n", 515 | "\n", 516 | "after :\n", 517 | "\n", 518 | "```x : [2, 7, 5, 14, 1]```\n", 519 | "\n", 520 | "```y : [7, 5, 14, 1, -100]```\n", 521 | "\n", 522 | "> Note : Here I set -100 as an unknown label id, because PyTorch cross-entropy function (```torch.nn.functional.cross_entropy()```) has a property ```ignore_index``` which default value is -100.\n", 523 | "\n", 524 | "(3) Finally we pad the inputs (for both source and target) as follows.
\n", 525 | "The padded index in features is ```pad_index``` and the padded index in label is -100. (See above note.)\n", 526 | "\n", 527 | "```x : [2, 7, 5, 14, 1, N, ... , N]```\n", 528 | "\n", 529 | "```y : [7, 5, 14, 1, -100, -100, ... , -100]```" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": 14, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [ 538 | "import torch\n", 539 | "from torch.utils.data import DataLoader\n", 540 | "\n", 541 | "seq_len_fr = 45\n", 542 | "seq_len_en = 38\n", 543 | "\n", 544 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 545 | "\n", 546 | "def collate_batch(batch):\n", 547 | " label_list, feature_source_list, feature_target_list = [], [], []\n", 548 | " for text_fr, text_en in batch:\n", 549 | " # (1) tokenize to a list of word's indices\n", 550 | " tokens_fr = vocab_fr(tokenizer.tokenize(text_fr))\n", 551 | " tokens_en = vocab_en(tokenizer.tokenize(text_en))\n", 552 | " # (2) separate into features and labels in target tokens (English)\n", 553 | " y = tokens_en[1:]\n", 554 | " y.append(-100)\n", 555 | " # (3) limit length to seq_len and pad sequence\n", 556 | " y = y[:seq_len_en]\n", 557 | " tokens_fr = tokens_fr[:seq_len_fr]\n", 558 | " tokens_en = tokens_en[:seq_len_en]\n", 559 | " y += [-100] * (seq_len_en - len(y))\n", 560 | " tokens_fr += [pad_index_fr] * (seq_len_fr - len(tokens_fr))\n", 561 | " tokens_en += [pad_index_en] * (seq_len_en - len(tokens_en))\n", 562 | " # add to list\n", 563 | " label_list.append(y)\n", 564 | " feature_source_list.append(tokens_fr)\n", 565 | " feature_target_list.append(tokens_en)\n", 566 | " # convert to tensor\n", 567 | " label_list = torch.tensor(label_list, dtype=torch.int64).to(device)\n", 568 | " feature_source_list = torch.tensor(feature_source_list, dtype=torch.int64).to(device)\n", 569 | " feature_target_list = torch.tensor(feature_target_list, dtype=torch.int64).to(device)\n", 570 | " return label_list, feature_source_list, feature_target_list\n", 571 | "\n", 572 | "dataloader = DataLoader(\n", 573 | " list(zip(train_data[:,0], train_data[:,1])),\n", 574 | " batch_size=64,\n", 575 | " shuffle=True,\n", 576 | " collate_fn=collate_batch\n", 577 | ")" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 15, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "label shape in batch : torch.Size([64, 38])\n", 590 | "feature source shape in batch : torch.Size([64, 45])\n", 591 | "feature target shape in batch : torch.Size([64, 38])\n", 592 | "***** label sample *****\n", 593 | "tensor([ 3, 450, 112, 1, -100, -100, -100, -100, -100, -100, -100, -100,\n", 594 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 595 | " -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,\n", 596 | " -100, -100], device='cuda:0')\n", 597 | "***** features (source) sample *****\n", 598 | "tensor([ 2, 23, 624, 11, 103, 1, 10000, 10000, 10000, 10000,\n", 599 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 600 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 601 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 602 | " 10000, 10000, 10000, 10000, 10000], device='cuda:0')\n", 603 | "***** features (target) sample *****\n", 604 | "tensor([ 2, 3, 450, 112, 1, 10000, 10000, 10000, 10000, 10000,\n", 605 | " 10000, 10000, 10000, 10000, 10000, 10000, 
10000, 10000, 10000, 10000,\n", 606 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000,\n", 607 | " 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000],\n", 608 | " device='cuda:0')\n" 609 | ] 610 | } 611 | ], 612 | "source": [ 613 | "# test\n", 614 | "for labels, sources, targets in dataloader:\n", 615 | " break\n", 616 | "\n", 617 | "print(\"label shape in batch : {}\".format(labels.size()))\n", 618 | "print(\"feature source shape in batch : {}\".format(sources.size()))\n", 619 | "print(\"feature target shape in batch : {}\".format(targets.size()))\n", 620 | "print(\"***** label sample *****\")\n", 621 | "print(labels[0])\n", 622 | "print(\"***** features (source) sample *****\")\n", 623 | "print(sources[0])\n", 624 | "print(\"***** features (target) sample *****\")\n", 625 | "print(targets[0])" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "## Build Network\n", 633 | "\n", 634 | "Now we build an attention model in encoder-decoder architecture as follows.\n", 635 | "\n", 636 | "- Outputs (not only final output, but all outputs in all units) in RNN for source French text are generated in encoder.\n", 637 | "- Encoder's outputs are used in attention architecture and the result is passed into unit in decoder's RNN.\n", 638 | "- Each RNN output in decoder is passed into dense (FCNet) layer and generate the sequence of next words.\n", 639 | "- Calculate loss between predicted next words and the true values of next words, and then proceed to optimize neural networks.\n", 640 | "\n", 641 | "![the trainer architecture of machine translation](./images/machine_translation2.png)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "First, we build encoder model.
\n", 649 | "See the [previous examples](./06_language_model_rnn.ipynb) for details about RNN inputs and outputs in PyTorch. (Here I also use packed sequence, because I want to process appropriate time-steps in each sequence.)\n", 650 | "\n", 651 | "Unlike [previous example](./07_encoder_decoder.ipynb) (vanilla encoder-decoder example), all outputs in all units are used in the decoder, and the encoder should then return all outputs (not only the final output).\n", 652 | "\n", 653 | "![all outputs in encoder](./images/encoder_all.png)\n", 654 | "\n", 655 | "I note that the size of the following ```masks``` output is ```(batch_size, seq_len)```, in which its element's value is 0 when it's a padded position, and otherwise 1. (This ```masks``` will then be used in the following softmax operation in decoder side.)" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 16, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "import torch\n", 665 | "import torch.nn as nn\n", 666 | "\n", 667 | "embedding_dim = 256\n", 668 | "rnn_units = 1024\n", 669 | "\n", 670 | "class Encoder(nn.Module):\n", 671 | " def __init__(self, vocab_size, seq_len, embedding_dim, rnn_units, padding_idx):\n", 672 | " super().__init__()\n", 673 | "\n", 674 | " self.seq_len = seq_len\n", 675 | " self.padding_idx = padding_idx\n", 676 | "\n", 677 | " self.embedding = nn.Embedding(\n", 678 | " vocab_size,\n", 679 | " embedding_dim,\n", 680 | " padding_idx=padding_idx,\n", 681 | " )\n", 682 | " self.rnn = nn.GRU(\n", 683 | " input_size=embedding_dim,\n", 684 | " hidden_size=rnn_units,\n", 685 | " num_layers=1,\n", 686 | " batch_first=True,\n", 687 | " )\n", 688 | "\n", 689 | " def forward(self, inputs):\n", 690 | " # embedding\n", 691 | " # --> (batch_size, seq_len, embedding_dim)\n", 692 | " outs = self.embedding(inputs)\n", 693 | " # build \"lengths\" property to pack inputs (see previous example)\n", 694 | " masks = (inputs != self.padding_idx).int()\n", 695 | " lengths = masks.sum(dim=1, keepdim=False)\n", 696 | " # pack inputs for RNN (see previous example)\n", 697 | " packed_inputs = torch.nn.utils.rnn.pack_padded_sequence(\n", 698 | " outs,\n", 699 | " lengths.cpu(),\n", 700 | " batch_first=True,\n", 701 | " enforce_sorted=False,\n", 702 | " )\n", 703 | " # apply RNN\n", 704 | " packed_outs, _ = self.rnn(packed_inputs)\n", 705 | " # unpack results\n", 706 | " # --> (batch_size, seq_len, rnn_units)\n", 707 | " outs, _ = torch.nn.utils.rnn.pad_packed_sequence(\n", 708 | " packed_outs,\n", 709 | " batch_first=True,\n", 710 | " padding_value=0.0,\n", 711 | " total_length=self.seq_len,\n", 712 | " )\n", 713 | " return outs, masks\n", 714 | "\n", 715 | "enc_model = Encoder(\n", 716 | " vocab_size=vocab_fr.__len__(),\n", 717 | " seq_len=seq_len_fr,\n", 718 | " embedding_dim=embedding_dim,\n", 719 | " rnn_units=rnn_units,\n", 720 | " padding_idx=pad_index_fr).to(device)" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "Now we build decoder with attention architecture as follows.\n", 728 | "\n", 729 | "![decoder with attention](./images/decoder_attention.png)\n", 730 | "\n", 731 | "In each time-steps in target sequence, the state in previous step is used in computation of attention layer, and it repeats this process until the end of sequence. (See the following ```for``` loop.)\n", 732 | "\n", 733 | "In each steps, first, the previous state and encoder's outputs are concatenated, and the results are passed into dense network (FCNet).
\n", 734 | "By applying softmax function for this output, the attention weights $\\alpha$ (```alpha``` in the following code) at ```j```-th step are obtained. The context ```c``` at ```j```-th step is then generated by $\\sum_i \\alpha_i o_i^{\\prime}$ where $o_i^{\\prime}$ is i-th element in encoder's outputs.
\n", 735 | "In the following code, the padded elements in the softmax operation will be ignored (masked), because $e^{-inf} = 0$.\n", 736 | "\n", 737 | "Once we get the context ```c```, the subsequent steps are the same as [previous example](./07_encoder_decoder.ipynb). (See previous example for details.)" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": 17, 743 | "metadata": {}, 744 | "outputs": [], 745 | "source": [ 746 | "from torch.nn import functional as F\n", 747 | "\n", 748 | "class DecoderWithAttention(nn.Module):\n", 749 | " def __init__(self, vocab_size, embedding_dim, rnn_units, padding_idx, hidden_dim1=1024, hidden_dim2=1024):\n", 750 | " super().__init__()\n", 751 | "\n", 752 | " self.padding_idx = padding_idx\n", 753 | " self.rnn_units = rnn_units\n", 754 | "\n", 755 | " # Below are used in attention layer\n", 756 | " self.attention_dense1 = nn.Linear(rnn_units*2, hidden_dim1)\n", 757 | " self.attention_dense2 = nn.Linear(hidden_dim1, 1)\n", 758 | "\n", 759 | " # Below are used in other parts\n", 760 | " self.embedding = nn.Embedding(\n", 761 | " vocab_size,\n", 762 | " embedding_dim,\n", 763 | " padding_idx=padding_idx,\n", 764 | " )\n", 765 | " self.rnncell = nn.GRUCell(\n", 766 | " input_size=rnn_units + embedding_dim,\n", 767 | " hidden_size=rnn_units,\n", 768 | " )\n", 769 | " self.output_dense1 = nn.Linear(rnn_units, hidden_dim2)\n", 770 | " self.output_dense2 = nn.Linear(hidden_dim2, vocab_size)\n", 771 | "\n", 772 | " def forward(self, inputs, enc_outputs, enc_masks, states=None, return_states=False):\n", 773 | " #\n", 774 | " # get size\n", 775 | " #\n", 776 | "\n", 777 | " batch_size = inputs.size()[0]\n", 778 | " dec_seq_size = inputs.size()[1]\n", 779 | " enc_seq_size = enc_outputs.size()[1]\n", 780 | "\n", 781 | " #\n", 782 | " # set initial states\n", 783 | " #\n", 784 | "\n", 785 | " if states is None:\n", 786 | " current_states = torch.zeros((batch_size, self.rnn_units)).to(device)\n", 787 | " else:\n", 788 | " current_states = states\n", 789 | "\n", 790 | " # loop target sequence\n", 791 | " # [Note] Here I loop in all time-steps, but please filter\n", 792 | " # for saving resource's consumption.\n", 793 | " # (Sort batch, run by filtering, and turn into original position.)\n", 794 | " rnn_outputs = []\n", 795 | " for j in range(dec_seq_size):\n", 796 | "\n", 797 | " #\n", 798 | " # process attention\n", 799 | " #\n", 800 | "\n", 801 | " # --> (batch_size, 1, rnn_units)\n", 802 | " current_states_reshaped = current_states[:,None,:]\n", 803 | " # --> (batch_size, enc_seq_size, rnn_units)\n", 804 | " current_states_reshaped = current_states_reshaped.expand(-1, enc_seq_size, -1)\n", 805 | " # concat\n", 806 | " # --> (batch_size, enc_seq_size, rnn_units * 2)\n", 807 | " enc_and_states = torch.concat((current_states_reshaped, enc_outputs), dim=-1)\n", 808 | " # apply dense\n", 809 | " # --> (batch_size, enc_seq_size, 1)\n", 810 | " alpha = self.attention_dense1(enc_and_states)\n", 811 | " alpha = F.relu(alpha)\n", 812 | " alpha = self.attention_dense2(alpha)\n", 813 | " # --> (batch_size, enc_seq_size)\n", 814 | " alpha = alpha.squeeze(dim=2)\n", 815 | " # apply masked softmax\n", 816 | " alpha = alpha.masked_fill(enc_masks == 0, float(\"-inf\"))\n", 817 | " alpha = F.softmax(alpha, dim=-1)\n", 818 | " # get context\n", 819 | " # --> (batch_size, rnn_units)\n", 820 | " c = torch.einsum(\"bs,bsu->bu\", alpha, enc_outputs)\n", 821 | "\n", 822 | " #\n", 823 | " # process rnn\n", 824 | " #\n", 825 | "\n", 826 | " # embedding\n", 827 | " # 
--> (batch_size, embedding_dim)\n", 828 | " emb_j = self.embedding(inputs[:,j])\n", 829 | " # concat\n", 830 | " # --> (batch_size, rnn_units + embedding_dim)\n", 831 | " input_j = torch.concat((c, emb_j), dim=-1)\n", 832 | " # apply rnn (proceed to the next state)\n", 833 | " current_states = self.rnncell(input_j, current_states)\n", 834 | " # append state\n", 835 | " rnn_outputs.append(current_states)\n", 836 | "\n", 837 | " #\n", 838 | " # process outputs\n", 839 | " #\n", 840 | "\n", 841 | " # get output state's tensor\n", 842 | " # --> (batch_size, dec_seq_size, rnn_units)\n", 843 | " rnn_outputs = torch.stack(rnn_outputs, dim=1)\n", 844 | " # apply dense\n", 845 | " # --> (batch_size, dec_seq_size, vocab_size)\n", 846 | " outs = self.output_dense1(rnn_outputs)\n", 847 | " outs = F.relu(outs)\n", 848 | " logits = self.output_dense2(outs)\n", 849 | "\n", 850 | " # return results\n", 851 | " if return_states:\n", 852 | " # set 0.0 in padded position\n", 853 | " masks = (inputs != self.padding_idx).int()\n", 854 | " masks = masks[:,:,None]\n", 855 | " masks = masks.expand(-1, -1, self.rnn_units)\n", 856 | " rnn_outputs = rnn_outputs.masked_fill(masks == 0, 0.0)\n", 857 | " return logits, rnn_outputs # This is used in prediction\n", 858 | " else:\n", 859 | " return logits # This is used in training\n", 860 | "\n", 861 | "dec_model = DecoderWithAttention(\n", 862 | " vocab_size=vocab_en.__len__(),\n", 863 | " embedding_dim=embedding_dim,\n", 864 | " rnn_units=rnn_units,\n", 865 | " padding_idx=pad_index_en).to(device)" 866 | ] 867 | }, 868 | { 869 | "cell_type": "markdown", 870 | "metadata": {}, 871 | "source": [ 872 | "## Train\n", 873 | "\n", 874 | "Now we put it all together and run training.\n", 875 | "\n", 876 | "The loss on label id=-100 is ignored in ```cross_entropy()``` function. The padded position and the end of sequence will then be ignored in optimization.\n", 877 | "\n", 878 | "> Note : Because the default value of ```ignore_index``` property in ```cross_entropy()``` function is -100. 
(You can change this default value.)" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": 18, 884 | "metadata": {}, 885 | "outputs": [ 886 | { 887 | "name": "stdout", 888 | "output_type": "stream", 889 | "text": [ 890 | "Epoch 1 - loss: 1.8736 - accuracy: 0.6900\n", 891 | "Epoch 2 - loss: 1.8001 - accuracy: 0.6667\n", 892 | "Epoch 3 - loss: 0.7754 - accuracy: 0.8667\n", 893 | "Epoch 4 - loss: 0.1816 - accuracy: 0.9286\n", 894 | "Epoch 5 - loss: 0.7531 - accuracy: 0.8333\n" 895 | ] 896 | } 897 | ], 898 | "source": [ 899 | "num_epochs = 5\n", 900 | "\n", 901 | "all_params = list(enc_model.parameters()) + list(dec_model.parameters())\n", 902 | "optimizer = torch.optim.AdamW(all_params, lr=0.001)\n", 903 | "for epoch in range(num_epochs):\n", 904 | " for labels, sources, targets in dataloader:\n", 905 | " # optimize\n", 906 | " optimizer.zero_grad()\n", 907 | " enc_outputs, enc_masks = enc_model(sources)\n", 908 | " logits = dec_model(targets, enc_outputs, enc_masks)\n", 909 | " loss = F.cross_entropy(logits.transpose(1,2), labels)\n", 910 | " loss.backward()\n", 911 | " optimizer.step()\n", 912 | " # calculate accuracy\n", 913 | " pred_labels = logits.argmax(dim=2)\n", 914 | " num_correct = (pred_labels == labels).float().sum()\n", 915 | " num_total = (labels != -100).float().sum()\n", 916 | " accuracy = num_correct / num_total\n", 917 | " print(\"Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}\".format(epoch+1, loss.item(), accuracy), end=\"\\r\")\n", 918 | " print(\"\")" 919 | ] 920 | }, 921 | { 922 | "cell_type": "markdown", 923 | "metadata": {}, 924 | "source": [ 925 | "## Translate Text\n", 926 | "\n", 927 | "Now translate French text to English text with trained model. (All these sentences are not in training set.)\n", 928 | "\n", 929 | "Here I simply translate several brief sentences, but the metrics to evaluate text-generation task will not be so easy. (Because simply checking an exact match to a reference text is not optimal.)
\n", 930 | "To eveluate the trained model, use some common metrics available in text generation, such as, BLEU or ROUGE.\n", 931 | "\n", 932 | "> Note : Here I use greedy search and this will sometimes lead to wrong sequence. For drawbacks and solutions, see note in [this example](./05_language_model_basic.ipynb)." 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 19, 938 | "metadata": {}, 939 | "outputs": [], 940 | "source": [ 941 | "import numpy as np\n", 942 | "\n", 943 | "end_index_en = stoi_en[\"\"]\n", 944 | "max_output = 128\n", 945 | "\n", 946 | "def translate(sentence):\n", 947 | " # preprocess inputs\n", 948 | " text_fr = sentence\n", 949 | " text_fr = text_fr.lower()\n", 950 | " text_fr = \" \".join([\"\", text_fr, \"\"])\n", 951 | " text_en = \"\"\n", 952 | " _, tokens_fr, tokens_en = collate_batch(list(zip([text_fr], [text_en])))\n", 953 | "\n", 954 | " # process encoder\n", 955 | " enc_outputs, enc_masks = enc_model(tokens_fr)\n", 956 | "\n", 957 | " # process decoder\n", 958 | " final_state = None\n", 959 | " for loop in range(max_output):\n", 960 | " logits, states = dec_model(\n", 961 | " tokens_en,\n", 962 | " enc_outputs,\n", 963 | " enc_masks,\n", 964 | " states=final_state,\n", 965 | " return_states=True)\n", 966 | " final_state = states[0][0].unsqueeze(dim=0)\n", 967 | " pred_idx_en = logits[0][0].argmax()\n", 968 | " next_word_en = itos_en[pred_idx_en]\n", 969 | " text_en += \" \"\n", 970 | " text_en += next_word_en\n", 971 | " if pred_idx_en.item() == end_index_en:\n", 972 | " break\n", 973 | " _, _, tokens_en = collate_batch(list(zip([\"\"], [next_word_en])))\n", 974 | " return text_en" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 20, 980 | "metadata": {}, 981 | "outputs": [ 982 | { 983 | "name": "stdout", 984 | "output_type": "stream", 985 | "text": [ 986 | " i like the guitar \n", 987 | " he lives in japan \n", 988 | " this pen is used to him \n", 989 | " that's my favorite song \n", 990 | " he drives a car and goes to new york \n" 991 | ] 992 | } 993 | ], 994 | "source": [ 995 | "print(translate(\"j'aime la guitare\")) # i like guitar\n", 996 | "print(translate(\"il vit au japon\")) # he lives in Japan\n", 997 | "print(translate(\"ce stylo est utilisé par lui\")) # this pen is used by him\n", 998 | "print(translate(\"c'est ma chanson préférée\")) # that's my favorite song\n", 999 | "print(translate(\"il conduit une voiture et va à new york\")) # he drives a car and goes to new york" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": null, 1005 | "metadata": {}, 1006 | "outputs": [], 1007 | "source": [] 1008 | } 1009 | ], 1010 | "metadata": { 1011 | "kernelspec": { 1012 | "display_name": "Python 3 (ipykernel)", 1013 | "language": "python", 1014 | "name": "python3" 1015 | }, 1016 | "language_info": { 1017 | "codemirror_mode": { 1018 | "name": "ipython", 1019 | "version": 3 1020 | }, 1021 | "file_extension": ".py", 1022 | "mimetype": "text/x-python", 1023 | "name": "python", 1024 | "nbconvert_exporter": "python", 1025 | "pygments_lexer": "ipython3", 1026 | "version": "3.12.3" 1027 | } 1028 | }, 1029 | "nbformat": 4, 1030 | "nbformat_minor": 4 1031 | } 1032 | -------------------------------------------------------------------------------- /Readme.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing (Neural Methods) Tutorials 2 | 3 | This repository consists of comprehensive examples to learn fundamental language 
processing (NLP) from the beginning.
4 | Each notebook has end-to-end implementation (for each task) from scratch in Python (PyTorch), and also describes fundamental ideas and background for each architecture. 5 | 6 | 1. [Tokenization and Primitive Embeddings (Sparse Vector)](./01_sparse_vector.ipynb) 7 | 2. [Tokenization and Custom Embedding (Dense Vector)](./02_custom_embedding.ipynb) 8 | 3. [Word2Vec algorithm (Negative Sampling)](./03_word2vec.ipynb) 9 | 4. [N-Gram detection with 1D Convolution](./04_ngram_cnn.ipynb) 10 | 5. [Language Model - Basic FFN](./05_language_model_basic.ipynb) 11 | 6. [Language Model - RNN (Recurrent Neural Network)](./06_language_model_rnn.ipynb) 12 | 7. [Encoder-Decoder (Seq2Seq)](./07_encoder_decoder.ipynb) 13 | 8. [Attention](./08_attention.ipynb) 14 | 9. [Transformer](./09_transformer.ipynb) 15 | 16 | > I recommend you to run these examples on GPU-utilized machine. 17 | 18 | Tutorials follow the history of NLP neural methods.
19 | In the latter part (from tutorial 5), I then focus on language models, improving the models by step-by-step approaches, and reach to learn how and why the widely used Transformer architecture matters. (You will find how it's developed and improved by running actual tasks.) 20 | 21 | NLP (natural language processing) has a long history in artificial intelligence, and generative models were also developed with traditional statistical models in 1950s - such as, applying [Hidden Markov Models (HMMs)](https://github.com/tsmatz/hmm-lds-em-algorithm) or [Gaussian Mixture Models (GMMs)](https://github.com/tsmatz/gmm).
22 | This repository, however, focuses on recent neural methods engaged in today's NLP. 23 | 24 | > [Feb 2023] All examples were transformed (from TensorFlow) into PyTorch.
25 | > [Feb 2025] Removed torchtext dependency. (Because it's deprecated.) 26 | 27 | *Tsuyoshi Matsuzaki @ Microsoft* 28 | -------------------------------------------------------------------------------- /images/1d_conv_net.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/1d_conv_net.png -------------------------------------------------------------------------------- /images/attend_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/attend_image.png -------------------------------------------------------------------------------- /images/bidirectional_rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/bidirectional_rnn.png -------------------------------------------------------------------------------- /images/bigram_convolution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/bigram_convolution.png -------------------------------------------------------------------------------- /images/conditioned_context.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/conditioned_context.png -------------------------------------------------------------------------------- /images/continuous_bow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/continuous_bow.png -------------------------------------------------------------------------------- /images/count_vectorize.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/count_vectorize.png -------------------------------------------------------------------------------- /images/decoder_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/decoder_attention.png -------------------------------------------------------------------------------- /images/deep_rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/deep_rnn.png -------------------------------------------------------------------------------- /images/dense_vectorize.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/dense_vectorize.png -------------------------------------------------------------------------------- /images/embedding_layer.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/embedding_layer.png -------------------------------------------------------------------------------- /images/embedding_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/embedding_matrix.png -------------------------------------------------------------------------------- /images/encoder_all.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/encoder_all.png -------------------------------------------------------------------------------- /images/encoder_decoder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/encoder_decoder.png -------------------------------------------------------------------------------- /images/encoder_decoder_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/encoder_decoder_attention.png -------------------------------------------------------------------------------- /images/encoder_final.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/encoder_final.png -------------------------------------------------------------------------------- /images/gru_gate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/gru_gate.png -------------------------------------------------------------------------------- /images/index_vectorize.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/index_vectorize.png -------------------------------------------------------------------------------- /images/index_vectorize2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/index_vectorize2.png -------------------------------------------------------------------------------- /images/language_model_beginning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/language_model_beginning.png -------------------------------------------------------------------------------- /images/machine_translation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/machine_translation.png -------------------------------------------------------------------------------- /images/machine_translation2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/machine_translation2.png -------------------------------------------------------------------------------- /images/multi_head_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/multi_head_attention.png -------------------------------------------------------------------------------- /images/region_separation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/region_separation.png -------------------------------------------------------------------------------- /images/rnn_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/rnn_architecture.png -------------------------------------------------------------------------------- /images/rnn_network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/rnn_network.png -------------------------------------------------------------------------------- /images/rnn_packed_sequence.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/rnn_packed_sequence.png -------------------------------------------------------------------------------- /images/separate_sequence_for_next_words.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/separate_sequence_for_next_words.png -------------------------------------------------------------------------------- /images/skip_gram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/skip_gram.png -------------------------------------------------------------------------------- /images/soft_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/soft_attention.png -------------------------------------------------------------------------------- /images/task_layer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/task_layer.png -------------------------------------------------------------------------------- /images/transformer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer.png -------------------------------------------------------------------------------- /images/transformer3_dec_only.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer3_dec_only.png -------------------------------------------------------------------------------- /images/transformer3_enc_and_dec.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer3_enc_and_dec.png -------------------------------------------------------------------------------- /images/transformer3_enc_only.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer3_enc_only.png -------------------------------------------------------------------------------- /images/transformer_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_attention.png -------------------------------------------------------------------------------- /images/transformer_causal_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_causal_attention.png -------------------------------------------------------------------------------- /images/transformer_causal_reference.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_causal_reference.png -------------------------------------------------------------------------------- /images/transformer_decoder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_decoder.png -------------------------------------------------------------------------------- /images/transformer_decoding_layer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_decoding_layer.png -------------------------------------------------------------------------------- /images/transformer_encoder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_encoder.png -------------------------------------------------------------------------------- /images/transformer_encoding_layer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_encoding_layer.png -------------------------------------------------------------------------------- /images/transformer_positional_encoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_positional_encoding.png -------------------------------------------------------------------------------- 
/images/transformer_residual01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_residual01.png -------------------------------------------------------------------------------- /images/transformer_residual02.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_residual02.png -------------------------------------------------------------------------------- /images/transformer_self_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/transformer_self_attention.png -------------------------------------------------------------------------------- /images/word2vec_network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/word2vec_network.png -------------------------------------------------------------------------------- /images/word_embedding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsmatz/nlp-tutorials/5935b2bb39ec074b621f79575d680f6a91d019a1/images/word_embedding.png --------------------------------------------------------------------------------