├── .gitignore
├── README.md
├── Reinforcement_learning
│   ├── 01_DQN.ipynb
│   └── 02_REINFORCE_discrete.ipynb
├── deepnlp
│   ├── 01_DL_for_NLP_BoWClassifier.ipynb
│   ├── 02_DL_FOR_NLP_NGRAM.ipynb
│   ├── 03_DL_FOR_NLP_LSTM.ipynb
│   ├── 04_DL_FOR_NLP_BILSTMCRF.ipynb
│   ├── 05_LSTM_Batch.ipynb
│   ├── 06_Seq2Seq_basic.ipynb
│   ├── 06_Seq2Seq_vanilla.ipynb
│   ├── 07_Seq2Seq_Attention.ipynb
│   ├── 08_Relational_Network_for_bAbI(Not yet).ipynb
│   ├── 09_Transformer.ipynb
│   ├── 10_CNN_text_classification.ipynb
│   └── temp_Coref.ipynb
├── evolutionary_algorithms
│   ├── AutoML_Design_by_evolution.ipynb
│   ├── net_builder.py
│   ├── torch_models.py
│   └── worker.py
├── generative_model
│   ├── 01.Simple_Autoencoder.ipynb
│   ├── 02.Regularized_Autoencoders.ipynb
│   ├── 03.Variational_Autoencoder.ipynb
│   ├── 03_1_Appendix_Entropy&KL-Divergence.ipynb
│   ├── 04.Variational_Recurrent_Autoencoder.ipynb
│   └── 05.Controllable_Text_Generation.ipynb
├── mytutorial
│   └── 1_week_pytorch_basic.ipynb
└── tutorial
    ├── 00.XOR.ipynb
    ├── 01.Linear_Regression.ipynb
    ├── 02.Logistic_Regression.ipynb
    ├── 03.Feedforward_Neural_Network.ipynb
    ├── 04.Convolutional_Neural_Network.ipynb
    └── 10.GAN.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 | __pycache__
3 | data/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Pytorch Study
2 | 
3 | A PyTorch study repository.
4 | 
5 | Implementing everything I'm interested in (NLP, generative models, RL, ...) in code to understand it better.
6 | Implementations are based on classic models as well as what I've understood from reading papers, tutorials,
7 | lectures, and blog posts +_+
8 | 
9 | Until the day I can build models of my own...
10 | 
11 | 
12 | ## Docker image with a prebuilt Python 3.5 PyTorch environment
13 | 
14 | Ubuntu 16.04, Python 3.5.2, with a variety of ML/DL packages including TensorFlow, scikit-learn, and PyTorch
15 | 
16 | `docker pull dsksd/deepstudy:0.2`
17 | 
18 | 
19 | ## 1. Deep NLP Models
20 | 
21 | 1. BoWClassifier
22 | 2. NGRAM & CBOW
23 | 3. LSTM POS Tagger
24 | 4. Bidirectional LSTM POS Tagger
25 | 5. LSTM batch learning
26 | 6. Vanilla Sequence2Sequence (Encoder-Decoder)
27 | 7. Sequence2Sequence with Attention
28 | 8. Relational Network for bAbI task (in progress)
29 | 9. Transformer (Attention Is All You Need)
30 | 
31 | ### Papers I want to read and implement
32 | 
33 | 1. Poincaré Embeddings for Learning Hierarchical Representations
34 | 2. Neural Embeddings of Graphs in Hyperbolic Space
35 | 3. A Deep Reinforced Model for Abstractive Summarization
36 | 4. Controllable Text Generation
37 | 5. A simple neural network module for relational reasoning
38 | 
39 | ## 2. Generative Models
40 | 
41 | 1. Basic Auto-Encoder
42 | 2. Regularized Auto-Encoder
43 | 3. Variational Auto-Encoder
44 | 3-1. Appendix 1: Entropy and KL-divergence
45 | 4. Variational Recurrent Auto-Encoder
46 | 
47 | ## 3. Reinforcement Learning
48 | 
49 | ## 4.
Evolutionary Algorithms -------------------------------------------------------------------------------- /deepnlp/01_DL_for_NLP_BoWClassifier.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "text/plain": [ 11 | "" 12 | ] 13 | }, 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "output_type": "execute_result" 17 | } 18 | ], 19 | "source": [ 20 | "import torch\n", 21 | "import torch.autograd as autograd\n", 22 | "import torch.nn as nn\n", 23 | "import torch.nn.functional as F\n", 24 | "import torch.optim as optim\n", 25 | "\n", 26 | "torch.manual_seed(1)" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# 1. Logistic Regression Bag-of-Words classifier" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### 1. word2index 딕 준비 for Bag-of-Words" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "{'it': 7, 'to': 8, 'una': 13, 'Give': 6, 'good': 19, 'cafeteria': 5, 'comer': 2, 'not': 17, 'si': 24, 'on': 25, 'lost': 21, 'me': 0, 'creo': 10, 'en': 3, 'sea': 12, 'get': 20, 'No': 9, 'is': 16, 'que': 11, 'la': 4, 'idea': 15, 'at': 22, 'gusta': 1, 'Yo': 23, 'a': 18, 'buena': 14}\n", 53 | "{0: 'me', 1: 'gusta', 2: 'comer', 3: 'en', 4: 'la', 5: 'cafeteria', 6: 'Give', 7: 'it', 8: 'to', 9: 'No', 10: 'creo', 11: 'que', 12: 'sea', 13: 'una', 14: 'buena', 15: 'idea', 16: 'is', 17: 'not', 18: 'a', 19: 'good', 20: 'get', 21: 'lost', 22: 'at', 23: 'Yo', 24: 'si', 25: 'on'}\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "data = [ (\"me gusta comer en la cafeteria\".split(), \"SPANISH\"),\n", 59 | " (\"Give it to me\".split(), \"ENGLISH\"),\n", 60 | " (\"No creo que sea una buena idea\".split(), \"SPANISH\"),\n", 61 | " (\"No it is not a good idea to get lost at sea\".split(), \"ENGLISH\") ]\n", 62 | "\n", 63 | "test_data = [ (\"Yo creo que si\".split(), \"SPANISH\"),\n", 64 | " (\"it is lost on me\".split(), \"ENGLISH\")]\n", 65 | "\n", 66 | "# word_to_ix maps each word in the vocab to a unique integer, which will be its\n", 67 | "# index into the Bag of words vector\n", 68 | "word_to_ix = {}\n", 69 | "for sent, _ in data + test_data:\n", 70 | " for word in sent:\n", 71 | " if word not in word_to_ix:\n", 72 | " word_to_ix[word] = len(word_to_ix)\n", 73 | "\n", 74 | "ix_to_word = {v : k for k,v in word_to_ix.items()}\n", 75 | "\n", 76 | "print(word_to_ix)\n", 77 | "print(ix_to_word)\n", 78 | "\n", 79 | "VOCAB_SIZE = len(word_to_ix)\n", 80 | "NUM_LABELS = 2" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "### 2. 
모델 선언 " 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 4, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "class BoWClassifier(nn.Module): # nn.Module을 상속받아서 클래스 만들어야 함\n", 99 | " \n", 100 | " def __init__(self, num_labels, vocab_size):\n", 101 | " # 파이토치의 nn.Module을 상속받아 \"모델 클래스\"를 만들 때는\n", 102 | " # 반드시 부모 클래스 nn.Module의 생성자를 초기화 해줘야 함\n", 103 | " super(BoWClassifier, self).__init__()\n", 104 | " \n", 105 | " # 선형 맵핑(아핀 변환?)\n", 106 | " # vocab_size만큼의 벡터를 -> spanish or english 2가지로 분류\n", 107 | " \n", 108 | " self.linear = nn.Linear(vocab_size, num_labels)\n", 109 | " \n", 110 | " # NOTE! The non-linearity log softmax does not have parameters! So we don't need\n", 111 | " # to worry about that here\n", 112 | " \n", 113 | " def forward(self, bow_vec): \n", 114 | " # nn.Module을 상속받은 클래스에서 forward는 예약어임\n", 115 | " # Pass the input through the linear layer,\n", 116 | " # then pass that through log_softmax.\n", 117 | " # Many non-linearities and other functions are in torch.nn.functional\n", 118 | " return F.log_softmax(self.linear(bow_vec))" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### 3. 전처리 함수 선언 (문장 -> 벡터 / 레이블)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "텐서는 리스트로부터 바로 만들 수 있다. torch.Tensor(list) , default 타입은 floatTensor인데
\n", 133 | "integer 타입은 torch.LongTensor를 사용해야 함" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "Tensor.view 는 reshape 함수임~" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 146 | "metadata": { 147 | "collapsed": true 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "def make_bow_vector(sentence, word_to_ix):\n", 152 | " vec = torch.zeros(len(word_to_ix))\n", 153 | " for word in sentence:\n", 154 | " vec[word_to_ix[word]] += 1\n", 155 | " return vec.view(1, -1) # reshape 하는 함수!!\n", 156 | "\n", 157 | "def make_target(label, label_to_ix):\n", 158 | " return torch.LongTensor([label_to_ix[label]]) # integer Tensor는 LongTensor 사용" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 8, 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "name": "stdout", 168 | "output_type": "stream", 169 | "text": [ 170 | "Parameter containing:\n", 171 | "\n", 172 | "Columns 0 to 9 \n", 173 | "-0.1808 -0.0890 -0.1295 -0.1729 0.1483 0.0669 -0.1575 0.0365 -0.0309 0.0673\n", 174 | " 0.1917 0.0630 0.0973 -0.0790 -0.0861 -0.0211 0.1135 -0.1090 -0.1556 -0.1673\n", 175 | "\n", 176 | "Columns 10 to 19 \n", 177 | " 0.1796 -0.0346 0.0130 -0.1186 0.0753 -0.0825 -0.0724 -0.1404 0.0732 0.1111\n", 178 | "-0.0204 -0.0121 0.1603 -0.1584 -0.0810 0.1582 -0.0832 -0.1492 -0.1451 0.0097\n", 179 | "\n", 180 | "Columns 20 to 25 \n", 181 | " 0.1313 -0.0343 -0.1889 -0.1827 0.0981 0.0486\n", 182 | "-0.1885 -0.1633 0.0701 0.1635 -0.1131 0.1610\n", 183 | "[torch.FloatTensor of size 2x26]\n", 184 | "\n", 185 | "Parameter containing:\n", 186 | "1.00000e-02 *\n", 187 | " -9.1960\n", 188 | " -7.8866\n", 189 | "[torch.FloatTensor of size 2]\n", 190 | "\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)\n", 196 | "\n", 197 | "for param in model.parameters():\n", 198 | " print(param) \n", 199 | " \n", 200 | " # Ax + b\n", 201 | " # nn.Linear가 가지고 있는 2x26 A\n", 202 | " # b\n", 203 | " " 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "토치에서 모델로 넘겨주는 모든 변수는 autograd.Variable()로 wrapping해줘야 한다!!" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 10, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "Variable containing:\n", 223 | "-0.9966 -0.4607\n", 224 | "[torch.FloatTensor of size 1x2]\n", 225 | "\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "# To run the model, pass in a BoW vector, but wrapped in an autograd.Variable\n", 231 | "sample = data[0]\n", 232 | "bow_vector = make_bow_vector(sample[0], word_to_ix)\n", 233 | "log_probs = model(autograd.Variable(bow_vector)) # 이렇게 넣어주면 forward 함수로 바로 맵핑\n", 234 | "print(log_probs)" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 11, 240 | "metadata": { 241 | "collapsed": true 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "label_to_ix = { \"SPANISH\": 0, \"ENGLISH\": 1 }" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": 19, 251 | "metadata": { 252 | "collapsed": true 253 | }, 254 | "outputs": [], 255 | "source": [ 256 | "ix_to_label = {v:k for k,v in label_to_ix.items()}" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "### 4. 트레이닝!" 
264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "트레이닝 전 파라미터 확인 (before & after 해보려고) " 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 12, 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | "Variable containing:\n", 283 | "-0.6785 -0.7080\n", 284 | "[torch.FloatTensor of size 1x2]\n", 285 | "\n", 286 | "Variable containing:\n", 287 | "-0.8051 -0.5925\n", 288 | "[torch.FloatTensor of size 1x2]\n", 289 | "\n", 290 | "Variable containing:\n", 291 | " 0.1796\n", 292 | "-0.0204\n", 293 | "[torch.FloatTensor of size 2]\n", 294 | "\n" 295 | ] 296 | } 297 | ], 298 | "source": [ 299 | "# Run on test data before we train, just to see a before-and-after\n", 300 | "for instance, label in test_data:\n", 301 | " bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))\n", 302 | " log_probs = model(bow_vec)\n", 303 | " print(log_probs)\n", 304 | "print(next(model.parameters())[:,word_to_ix[\"creo\"]]) # Print the matrix column corresponding to \"creo\"" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 13, 310 | "metadata": { 311 | "collapsed": true 312 | }, 313 | "outputs": [], 314 | "source": [ 315 | "loss_function = nn.NLLLoss() # negative log likelihood 로스\n", 316 | "optimizer = optim.SGD(model.parameters(), lr=0.1) # 옵티마이저\n", 317 | "\n", 318 | "# Usually you want to pass over the training data several times.\n", 319 | "# 100 is much bigger than on a real data set, but real datasets have more than\n", 320 | "# two instances. Usually, somewhere between 5 and 30 epochs is reasonable.\n", 321 | "for epoch in range(100):\n", 322 | " for instance, label in data:\n", 323 | " # 1. Pytorch는 gradients를 누적하기 때문에 항상 초기화해줘야 함\n", 324 | " model.zero_grad()\n", 325 | " \n", 326 | " # 2. 문장을 벡터로 만들어 준 후 autograd.Variable로 wrapping하기\n", 327 | " # target 역시 autograd.Variable로 wrapping\n", 328 | " bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))\n", 329 | " target = autograd.Variable(make_target(label, label_to_ix))\n", 330 | " \n", 331 | " # 3. forward path\n", 332 | " log_probs = model(bow_vec)\n", 333 | " \n", 334 | " # 4. 
loss 계산 후, loss로부터 backward(), 그리고 optimizer.step()\n", 335 | " loss = loss_function(log_probs, target)\n", 336 | " loss.backward()\n", 337 | " optimizer.step()" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 33, 343 | "metadata": {}, 344 | "outputs": [ 345 | { 346 | "name": "stdout", 347 | "output_type": "stream", 348 | "text": [ 349 | "pred : SPANISH && label : SPANISH\n", 350 | "pred : ENGLISH && label : ENGLISH\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "for instance, label in test_data:\n", 356 | " bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))\n", 357 | " log_probs = model(bow_vec)\n", 358 | " values, indices = torch.max(log_probs,1)\n", 359 | " print('pred : ' ,ix_to_label[list(indices.data.numpy())[0][0]],'&& label : ', label)\n", 360 | " #print(ix_to_label[indice.numpy()[]])" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": { 367 | "collapsed": true 368 | }, 369 | "outputs": [], 370 | "source": [] 371 | } 372 | ], 373 | "metadata": { 374 | "kernelspec": { 375 | "display_name": "Python 3", 376 | "language": "python", 377 | "name": "python3" 378 | }, 379 | "language_info": { 380 | "codemirror_mode": { 381 | "name": "ipython", 382 | "version": 3 383 | }, 384 | "file_extension": ".py", 385 | "mimetype": "text/x-python", 386 | "name": "python", 387 | "nbconvert_exporter": "python", 388 | "pygments_lexer": "ipython3", 389 | "version": "3.5.2" 390 | } 391 | }, 392 | "nbformat": 4, 393 | "nbformat_minor": 2 394 | } 395 | -------------------------------------------------------------------------------- /deepnlp/02_DL_FOR_NLP_NGRAM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Word Embedding" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [ 15 | { 16 | "data": { 17 | "text/plain": [ 18 | "" 19 | ] 20 | }, 21 | "execution_count": 1, 22 | "metadata": {}, 23 | "output_type": "execute_result" 24 | } 25 | ], 26 | "source": [ 27 | "import torch\n", 28 | "import torch.autograd as autograd\n", 29 | "import torch.nn as nn\n", 30 | "import torch.nn.functional as F\n", 31 | "import torch.optim as optim\n", 32 | "\n", 33 | "torch.manual_seed(1)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### nn.Embedding : # of Vocab -> Dimension" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "Variable containing:\n", 53 | "-2.9718 1.7070 -0.4305 -2.2820 0.5237\n", 54 | "[torch.FloatTensor of size 1x5]\n", 55 | "\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "word_to_ix = { \"hello\": 0, \"world\": 1 }\n", 61 | "embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings\n", 62 | "lookup_tensor = torch.LongTensor([word_to_ix[\"hello\"]])\n", 63 | "hello_embed = embeds( autograd.Variable(lookup_tensor) )\n", 64 | "print(hello_embed)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "# N-Gram Language Modeling" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "$$ P(w_i | w_{i-1}, w_{i-2}, \\dots, w_{i-n+1} ) $$" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### 
1. 데이터 준비" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 158, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "CONTEXT_SIZE = 2\n", 103 | "EMBEDDING_DIM = 10\n", 104 | "# We will use Shakespeare Sonnet 2\n", 105 | "test_sentence = \"\"\"When forty winters shall besiege thy brow,\n", 106 | "And dig deep trenches in thy beauty's field,\n", 107 | "Thy youth's proud livery so gazed on now,\n", 108 | "Will be a totter'd weed of small worth held:\n", 109 | "Then being asked, where all thy beauty lies,\n", 110 | "Where all the treasure of thy lusty days;\n", 111 | "To say, within thine own deep sunken eyes,\n", 112 | "Were an all-eating shame, and thriftless praise.\n", 113 | "How much more praise deserv'd thy beauty's use,\n", 114 | "If thou couldst answer 'This fair child of mine\n", 115 | "Shall sum my count, and make my old excuse,'\n", 116 | "Proving his beauty by succession thine!\n", 117 | "This were to be new made when thou art old,\n", 118 | "And see thy blood warm when thou feel'st it cold.\"\"\".split()\n", 119 | "# we should tokenize the input, but we will ignore that for now\n", 120 | "# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)\n", 121 | "trigrams = [ ([test_sentence[i], test_sentence[i+1]], test_sentence[i+2]) for i in range(len(test_sentence) - 2) ]\n", 122 | "print(trigrams[:3]) # print the first 3, just so you can see what they look like" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "trigram 즉, 이전 2 단어가 주어지면 그 다음 단어를 예측하는 모델" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 177, 135 | "metadata": { 136 | "collapsed": true 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "vocab = set(test_sentence)\n", 141 | "word_to_ix = { word: i for i, word in enumerate(vocab) }\n", 142 | "ix_to_word = {v:k for k,v in word_to_ix.items()}" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "### 2. 모델링 " 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 160, 155 | "metadata": { 156 | "collapsed": true 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "class NGramLanguageModeler(nn.Module):\n", 161 | " \n", 162 | " # 역시나 부모 클래스 초기화 후,\n", 163 | " # 모델의 모듈을 차곡차곡 선언 후\n", 164 | " def __init__(self, vocab_size, embedding_dim, context_size):\n", 165 | " super(NGramLanguageModeler, self).__init__()\n", 166 | " self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", 167 | " self.linear1 = nn.Linear(context_size * embedding_dim, 128)\n", 168 | " self.linear2 = nn.Linear(128, vocab_size)\n", 169 | " \n", 170 | " # forward 함수에서 이어준다\n", 171 | " def forward(self, inputs):\n", 172 | " embeds = self.embeddings(inputs).view((1, -1))\n", 173 | " out = F.relu(self.linear1(embeds))\n", 174 | " out = self.linear2(out)\n", 175 | " log_probs = F.log_softmax(out)\n", 176 | " return log_probs" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "### 3. 
트레이닝" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "(['When', 'forty'], 'winters')" 195 | ] 196 | }, 197 | "execution_count": 6, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "trigrams[0]" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "When forty 다음에 올 단어로 winters" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 166, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "0\n", 223 | "100\n", 224 | "200\n", 225 | "300\n", 226 | "400\n", 227 | "500\n", 228 | "600\n", 229 | "700\n", 230 | "800\n", 231 | "900\n", 232 | "\n", 233 | " 520.9233\n", 234 | "[torch.FloatTensor of size 1]\n", 235 | " \n", 236 | " 5.3252\n", 237 | "[torch.FloatTensor of size 1]\n", 238 | "\n" 239 | ] 240 | } 241 | ], 242 | "source": [ 243 | "losses = []\n", 244 | "loss_function = nn.NLLLoss() # Negative Log Likelihood\n", 245 | "model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)\n", 246 | "optimizer = optim.SGD(model.parameters(), lr=0.001)\n", 247 | "\n", 248 | "for epoch in range(1000):\n", 249 | " total_loss = torch.Tensor([0])\n", 250 | " \n", 251 | " if epoch % 100 ==0: print(epoch)\n", 252 | " \n", 253 | " for context, target in trigrams:\n", 254 | " \n", 255 | " # 컨텍스트 워드들을 인덱스로 변환해서 인티저텐서(LongTensor)로 만든 후\n", 256 | " # autograd.Variable로 래핑\n", 257 | " context_idxs = list(map(lambda w: word_to_ix[w], context))\n", 258 | " context_var = autograd.Variable( torch.LongTensor(context_idxs) )\n", 259 | " \n", 260 | " # Torch는 gradient를 누적하기 떄문에 항상 초기화를 해줘야 함\n", 261 | " model.zero_grad()\n", 262 | " \n", 263 | " # forward path\n", 264 | " log_probs = model(context_var)\n", 265 | " \n", 266 | " # 예측값과 레이블값의 loss 계산\n", 267 | " # logits, labels 순서로 넣어준다\n", 268 | "\n", 269 | " loss = loss_function(log_probs, autograd.Variable(torch.LongTensor([word_to_ix[target]])))\n", 270 | " \n", 271 | " # Step 5. Do the backward pass and update the gradient\n", 272 | " loss.backward()\n", 273 | " optimizer.step()\n", 274 | " \n", 275 | " total_loss += loss.data\n", 276 | " losses.append(total_loss)\n", 277 | "print(losses[0],losses[-1]) # The loss decreased every iteration over the training data!\n" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "로스 줄어든다~" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "### 4. 
테스트" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 167, 297 | "metadata": { 298 | "collapsed": true 299 | }, 300 | "outputs": [], 301 | "source": [ 302 | "import random" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 168, 308 | "metadata": { 309 | "collapsed": true 310 | }, 311 | "outputs": [], 312 | "source": [ 313 | "test = random.choice(trigrams)\n", 314 | "test_context = list(map(lambda x:word_to_ix[x], test[0]))\n", 315 | "test_input = autograd.Variable(torch.LongTensor(test_context))\n", 316 | "hypothesis = model(test_input)\n", 317 | "v,i = torch.max(hypothesis,1) # argmax " 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 169, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "맥란 단어 : beauty by\n", 330 | "예측 단어 : succession\n", 331 | "실제 단어 : succession\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "pred_ix = i.data.numpy()[0][0]\n", 337 | "print('맥란 단어 : ', *test[0]) # * 붙이면 unpack 된다 \n", 338 | "print('예측 단어 : ',ix_to_word[pred_ix])\n", 339 | "print('실제 단어 : ',test[1])" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": { 345 | "collapsed": true 346 | }, 347 | "source": [ 348 | "# Continuous Bag-of-Words (CBOW)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "The CBOW model is as follows. Given a target word $w_i$ and an $N$ context window on each side, $w_{i-1}, \\dots, w_{i-N}$ and $w_{i+1}, \\dots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize $$ -\\log p(w_i | C) = \\log \\text{Softmax}(A(\\sum_{w \\in C} q_w) + b) $$ where $q_w$ is the embedding for word $w$.\n" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "양 옆에 2개씩 총 4개의 단어들 C가 주어졌을 때, 현재 단어 $w_i$ 를 예측하는 모델" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "metadata": {}, 368 | "source": [ 369 | "### 1. 데이터 준비 " 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 181, 375 | "metadata": {}, 376 | "outputs": [ 377 | { 378 | "name": "stdout", 379 | "output_type": "stream", 380 | "text": [ 381 | "[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]\n" 382 | ] 383 | } 384 | ], 385 | "source": [ 386 | "CONTEXT_SIZE = 2 # 2 words to the left, 2 to the right\n", 387 | "raw_text = \"\"\"We are about to study the idea of a computational process. Computational processes are abstract\n", 388 | "beings that inhabit computers. As they evolve, processes manipulate other abstract\n", 389 | "things called data. The evolution of a process is directed by a pattern of rules\n", 390 | "called a program. People create programs to direct processes. 
In effect,\n", 391 | "we conjure the spirits of the computer with our spells.\"\"\".split()\n", 392 | "word_to_ix = { word: i for i, word in enumerate(set(raw_text)) }\n", 393 | "ix_to_word = {v:k for k,v in word_to_ix.items()}\n", 394 | "data = []\n", 395 | "vocab = set(raw_text)\n", 396 | "for i in range(2, len(raw_text) - 2):\n", 397 | " context = [ raw_text[i-2], raw_text[i-1], raw_text[i+1], raw_text[i+2] ]\n", 398 | " target = raw_text[i]\n", 399 | " data.append( (context, target) )\n", 400 | "print(data[:5])" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "### 2. 모델링 " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 182, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "data": { 417 | "text/plain": [ 418 | "Variable containing:\n", 419 | " 4\n", 420 | " 48\n", 421 | " 23\n", 422 | " 29\n", 423 | "[torch.LongTensor of size 4]" 424 | ] 425 | }, 426 | "execution_count": 182, 427 | "metadata": {}, 428 | "output_type": "execute_result" 429 | } 430 | ], 431 | "source": [ 432 | "# create your model and train. here are some functions to help you make the data ready for use by your module\n", 433 | "def make_context_vector(context, word_to_ix):\n", 434 | " idxs = list(map(lambda w: word_to_ix[w], context))\n", 435 | " tensor = torch.LongTensor(idxs)\n", 436 | " return autograd.Variable(tensor)\n", 437 | "\n", 438 | "make_context_vector(data[0][0], word_to_ix) # example" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 183, 444 | "metadata": { 445 | "collapsed": true 446 | }, 447 | "outputs": [], 448 | "source": [ 449 | "class CBOW(nn.Module):\n", 450 | " \n", 451 | " def __init__(self, vocab_size,projection_dim):\n", 452 | " super(CBOW,self).__init__()\n", 453 | " self.embeddings = nn.Embedding(vocab_size, projection_dim)\n", 454 | " self.projection = nn.Linear(projection_dim, vocab_size)\n", 455 | "\n", 456 | " def forward(self, inputs):\n", 457 | " embeds = self.embeddings(inputs)\n", 458 | " sum_embeds = torch.sum(embeds,0) # row 기준으로 sum 혹은 average?\n", 459 | " out = self.projection(sum_embeds)\n", 460 | " probs = F.log_softmax(out)\n", 461 | " return probs\n", 462 | " \n", 463 | " def prediction(self, inputs):\n", 464 | " embeds = self.embeddings(inputs)\n", 465 | " \n", 466 | " return embeds" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "### 3. 
트레이닝 " 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 184, 479 | "metadata": { 480 | "collapsed": true 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "PROJECTION = 10" 485 | ] 486 | }, 487 | { 488 | "cell_type": "code", 489 | "execution_count": 187, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "name": "stdout", 494 | "output_type": "stream", 495 | "text": [ 496 | "0\n", 497 | "100\n", 498 | "200\n", 499 | "300\n", 500 | "400\n", 501 | "500\n", 502 | "600\n", 503 | "700\n", 504 | "800\n", 505 | "900\n", 506 | "\n", 507 | " 265.6400\n", 508 | "[torch.FloatTensor of size 1]\n", 509 | " \n", 510 | " 6.2804\n", 511 | "[torch.FloatTensor of size 1]\n", 512 | "\n" 513 | ] 514 | } 515 | ], 516 | "source": [ 517 | "losses = []\n", 518 | "loss_function = nn.NLLLoss() # Negative Log Likelihood\n", 519 | "model = CBOW(len(vocab),PROJECTION)\n", 520 | "optimizer = optim.SGD(model.parameters(), lr=0.001)\n", 521 | "\n", 522 | "for epoch in range(1000):\n", 523 | " total_loss = torch.Tensor([0])\n", 524 | " \n", 525 | " if epoch % 100 ==0: print(epoch)\n", 526 | "\n", 527 | " for context, target in data:\n", 528 | " \n", 529 | " model.zero_grad()\n", 530 | " \n", 531 | " inputs = make_context_vector(context,word_to_ix)\n", 532 | " pred = model(inputs)\n", 533 | " loss = loss_function(pred,autograd.Variable(torch.LongTensor([word_to_ix[target]])))\n", 534 | " \n", 535 | " \n", 536 | " loss.backward()\n", 537 | " optimizer.step()\n", 538 | " \n", 539 | " total_loss += loss.data\n", 540 | " losses.append(total_loss)\n", 541 | "print(losses[0],losses[-1]) " 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### 4. 테스트 " 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 197, 554 | "metadata": { 555 | "collapsed": true 556 | }, 557 | "outputs": [], 558 | "source": [ 559 | "from scipy.spatial.distance import cosine" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 192, 565 | "metadata": { 566 | "collapsed": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "def transform(word,dic):\n", 571 | " \n", 572 | " return autograd.Variable(torch.LongTensor([dic[word]]))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 264, 578 | "metadata": { 579 | "collapsed": true 580 | }, 581 | "outputs": [], 582 | "source": [ 583 | "def word_analogy(target,vocabs):\n", 584 | " target_idx = word_to_ix[target]\n", 585 | " target_V = model.prediction(transform(target,word_to_ix)).data.numpy()\n", 586 | " nearest_idx = -1\n", 587 | " minimum = 100\n", 588 | " \n", 589 | " for i in range(len(vocabs)):\n", 590 | " if i == target_idx: continue\n", 591 | " \n", 592 | " vector = model.prediction(transform(list(vocabs)[i],word_to_ix)).data.numpy()\n", 593 | " \n", 594 | " temp = cosine(target_V,vector)\n", 595 | " \n", 596 | " if temp < minimum:\n", 597 | " nearest_idx = i\n", 598 | " minimum = temp\n", 599 | " \n", 600 | " return ix_to_word[nearest_idx], minimum" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": 269, 606 | "metadata": {}, 607 | "outputs": [ 608 | { 609 | "data": { 610 | "text/plain": [ 611 | "'rules'" 612 | ] 613 | }, 614 | "execution_count": 269, 615 | "metadata": {}, 616 | "output_type": "execute_result" 617 | } 618 | ], 619 | "source": [ 620 | "test = random.choice(list(vocab))\n", 621 | "test" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 270, 627 | "metadata": {}, 628 | 
"outputs": [ 629 | { 630 | "data": { 631 | "text/plain": [ 632 | "('idea', 0.36502336690142312)" 633 | ] 634 | }, 635 | "execution_count": 270, 636 | "metadata": {}, 637 | "output_type": "execute_result" 638 | } 639 | ], 640 | "source": [ 641 | "word_analogy(test,vocab)" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "잘 된건가? 젠장,,,," 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": null, 654 | "metadata": { 655 | "collapsed": true 656 | }, 657 | "outputs": [], 658 | "source": [] 659 | } 660 | ], 661 | "metadata": { 662 | "kernelspec": { 663 | "display_name": "Python 3", 664 | "language": "python", 665 | "name": "python3" 666 | }, 667 | "language_info": { 668 | "codemirror_mode": { 669 | "name": "ipython", 670 | "version": 3 671 | }, 672 | "file_extension": ".py", 673 | "mimetype": "text/x-python", 674 | "name": "python", 675 | "nbconvert_exporter": "python", 676 | "pygments_lexer": "ipython3", 677 | "version": "3.5.2" 678 | } 679 | }, 680 | "nbformat": 4, 681 | "nbformat_minor": 2 682 | } 683 | -------------------------------------------------------------------------------- /deepnlp/05_LSTM_Batch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "text/plain": [ 11 | "" 12 | ] 13 | }, 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "output_type": "execute_result" 17 | } 18 | ], 19 | "source": [ 20 | "import torch\n", 21 | "from torch.autograd import Variable\n", 22 | "import torch.nn as nn\n", 23 | "import torch.nn.functional as F\n", 24 | "import torch.optim as optim\n", 25 | "import json\n", 26 | "import pickle\n", 27 | "import random\n", 28 | "from collections import Counter\n", 29 | "from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence\n", 30 | "\n", 31 | "torch.manual_seed(1)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "# 데이터 " 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "train = json.load(open('../../dataset/NER/NER_16000_train.json'))\n", 50 | "\n", 51 | "training_data=[]\n", 52 | "\n", 53 | "for sent in train:\n", 54 | " word=[]\n", 55 | " tag=[]\n", 56 | " for w,p,t in sent:\n", 57 | " word.append(w)\n", 58 | " tag.append(t)\n", 59 | " training_data.append((word,tag))" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 3, 65 | "metadata": { 66 | "collapsed": true 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "training_data = [t for t in training_data if len(t[0])!=0]" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "11196" 82 | ] 83 | }, 84 | "execution_count": 4, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "len(training_data)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 5, 96 | "metadata": { 97 | "collapsed": true 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "def prepare_sequence(seq, to_ix):\n", 102 | " idxs = list(map(lambda w: to_ix[w], seq))\n", 103 | " tensor = torch.LongTensor(idxs)\n", 104 | " return Variable(tensor)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 6, 110 | 
"metadata": { 111 | "collapsed": true 112 | }, 113 | "outputs": [], 114 | "source": [ 115 | "PAD = \"\"" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### 시퀀스 길이 분포 파악 " 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 7, 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "Length = [len(t) for t,l in training_data]\n", 134 | "distribution = Counter(Length)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 8, 140 | "metadata": { 141 | "collapsed": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "bucket_config = [(5,5),(10,10),(20,20),(30,30)]" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "### 버킷에 나눠 담으면서 동시에 <패딩까지> 나중에는 동적으로 패딩하기 " 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 9, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "bucket = [[],[],[],[]]" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 10, 169 | "metadata": { 170 | "collapsed": true 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "for tr,label in training_data:\n", 175 | " length = len(tr)\n", 176 | " \n", 177 | " for i in range(len(bucket_config)):\n", 178 | " if bucket_config[i][0] >= length:\n", 179 | " \n", 180 | " while len(tr) < bucket_config[i][0]:\n", 181 | " tr.append(PAD)\n", 182 | " label.append(\"O\")\n", 183 | " bucket[i].append((tr,label))\n", 184 | " break" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 11, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "3184\n", 197 | "2824\n", 198 | "2568\n", 199 | "998\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "for b in bucket:\n", 205 | " print(len(b))" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 12, 211 | "metadata": { 212 | "collapsed": true 213 | }, 214 | "outputs": [], 215 | "source": [ 216 | "def getBatch(bucket,bucket_id,batch_size):\n", 217 | " random.shuffle(bucket[bucket_id])\n", 218 | " train_x=[]\n", 219 | " train_y=[]\n", 220 | " lengths=[]\n", 221 | " for tr,label in bucket[bucket_id][:batch_size]:\n", 222 | " temp = prepare_sequence(tr, word_to_ix)\n", 223 | " temp = temp.view(1,-1)\n", 224 | " train_x.append(temp)\n", 225 | " \n", 226 | " temp2 = prepare_sequence(label,tag_to_ix)\n", 227 | " temp2 = temp2.view(1,-1)\n", 228 | " train_y.append(temp2)\n", 229 | " \n", 230 | " length = [t for t in tr if t !='']\n", 231 | " lengths.append(len(length))\n", 232 | " inputs = torch.cat(train_x)\n", 233 | " targets = torch.cat(train_y)\n", 234 | " \n", 235 | " ### PAD 제외하고 로스 계산 ###\n", 236 | " t_out=[]\n", 237 | " for i in range(len(lengths)):\n", 238 | " t_out.append(targets[i][:lengths[i]])\n", 239 | " \n", 240 | " r_targets = torch.cat(t_out)\n", 241 | " \n", 242 | " del train_x\n", 243 | " del train_y\n", 244 | " del t_out\n", 245 | "\n", 246 | " \n", 247 | " return inputs,r_targets, lengths" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "### word2index, tag2index 딕 준비" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 13, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "NER_LIST = ['B-PER','I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG','B-DATE', 
'I-DATE','B-TIME','I-TIME','B-MISC','I-MISC','O']\n", 266 | "\n", 267 | "word_to_ix = {}\n", 268 | "for sentence, tags in training_data:\n", 269 | " for word in sentence:\n", 270 | " if word not in word_to_ix:\n", 271 | " word_to_ix[word] = len(word_to_ix)\n", 272 | "\n", 273 | "ix_to_word = {v:k for k,v in word_to_ix.items()}\n", 274 | "\n", 275 | "tag_to_ix={}\n", 276 | "i=0\n", 277 | "for tag in NER_LIST: \n", 278 | " tag_to_ix[tag] = i\n", 279 | " i+=1\n", 280 | "\n", 281 | "ix_to_tag = {v:k for k,v in tag_to_ix.items()}" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "### Sanity Check" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "일단 가장 쉬운 길이 10개짜리로 고정해 놓고 배치
\n", 296 | "로스 계산 시에도 패딩까지 계산한다... (나중에 실제 길이 알려줘서 그것만 loss 계산하는 법 고민)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 14, 302 | "metadata": { 303 | "collapsed": true 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "import random\n", 308 | "\n", 309 | "#bucket_id = random.choice(range(len(bucket_config)))\n", 310 | "bucket_id = 1" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 15, 316 | "metadata": { 317 | "collapsed": true 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "train_x=[]\n", 322 | "train_y=[]\n", 323 | "for tr,label in bucket[bucket_id]:\n", 324 | " temp = prepare_sequence(tr, word_to_ix)\n", 325 | " temp = temp.view(1,-1)\n", 326 | " train_x.append(temp)\n", 327 | " \n", 328 | " temp2 = prepare_sequence(label,tag_to_ix)\n", 329 | " temp2 = temp2.view(1,-1)\n", 330 | " train_y.append(temp2)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 16, 336 | "metadata": { 337 | "collapsed": true 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "INPUT_SIZE = bucket_config[bucket_id][0]\n", 342 | "EMBEDDING_DIM = 100\n", 343 | "HIDDEN_DIM = 100\n", 344 | "BATCH_SIZE= 64\n", 345 | "NUM_LAYERS = 3" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 33, 351 | "metadata": { 352 | "collapsed": true 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "class RNN(nn.Module):\n", 357 | " def __init__(self,hidden_size, num_layers, num_classes,vocab_size,embedding_dim):\n", 358 | " super(RNN, self).__init__()\n", 359 | " self.hidden_size = hidden_size\n", 360 | " self.num_layers = num_layers\n", 361 | " self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)\n", 362 | " self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, batch_first=True)\n", 363 | " self.fc = nn.Linear(hidden_size, num_classes)\n", 364 | "\n", 365 | " \n", 366 | " def forward(self, x,length):\n", 367 | " # Set initial states \n", 368 | " h0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size)) \n", 369 | " c0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))\n", 370 | " \n", 371 | " embeds = self.word_embeddings(x)\n", 372 | " # Forward propagate RNN\n", 373 | " out, _ = self.lstm(embeds, (h0, c0)) \n", 374 | " \n", 375 | " # batch_size, input_length, hidden_size\n", 376 | " \n", 377 | "\n", 378 | " ### PAD 제외하고 로스 계산 ### \n", 379 | " t_out=[]\n", 380 | " for i in range(len(length)): # len(length) = batch_size\n", 381 | " t_out.append(out[i][:length[i]]) # 실제 길이만 담기\n", 382 | " \n", 383 | " outwithoutpad = torch.cat(t_out) # row-wise concat\n", 384 | " del t_out\n", 385 | " \n", 386 | " tag_space = self.fc(outwithoutpad) \n", 387 | " tag_scores = F.log_softmax(tag_space)\n", 388 | " \n", 389 | " \n", 390 | " return tag_scores" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 34, 396 | "metadata": { 397 | "collapsed": true 398 | }, 399 | "outputs": [], 400 | "source": [ 401 | "model = RNN(HIDDEN_DIM, NUM_LAYERS,len(tag_to_ix),len(word_to_ix),EMBEDDING_DIM)\n", 402 | "loss_function = nn.CrossEntropyLoss()\n", 403 | "optimizer = optim.Adam(model.parameters(), lr=0.001)" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 35, 409 | "metadata": { 410 | "collapsed": true 411 | }, 412 | "outputs": [], 413 | "source": [ 414 | "x,y,l=getBatch(bucket,1,BATCH_SIZE)" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 36, 420 | "metadata": { 421 | "collapsed": true 422 | }, 423 | 
"outputs": [], 424 | "source": [ 425 | "o = model(x,l)" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "### 버킷이랑 같이 쓰는 모델 " 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 22, 438 | "metadata": { 439 | "collapsed": true 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "class BUCKETRNN(nn.Module):\n", 444 | " \n", 445 | " def __init__(self,bucket_config,hidden_size, num_layers, num_classes,vocab_size,embedding_dim):\n", 446 | " self.models={}\n", 447 | " self.optims={}\n", 448 | " self.bucket_config=bucket_config\n", 449 | " for i in range(len(self.bucket_config)):\n", 450 | " self.models[i] = RNN(hidden_size, num_layers, num_classes,vocab_size,embedding_dim)\n", 451 | " self.optims[i] = optim.Adam(self.models[i].parameters(), lr=0.001)\n", 452 | " \n", 453 | " \n", 454 | " def select_bucket(self):\n", 455 | " bucket_id = random.choice(range(len(bucket_config)))\n", 456 | " \n", 457 | " return bucket_id\n", 458 | " \n", 459 | " " 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 23, 465 | "metadata": { 466 | "collapsed": true 467 | }, 468 | "outputs": [], 469 | "source": [ 470 | "bucket_model = BUCKETRNN(bucket_config,HIDDEN_DIM, NUM_LAYERS,len(tag_to_ix),len(word_to_ix),EMBEDDING_DIM)\n", 471 | "loss_function = nn.CrossEntropyLoss()" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 24, 477 | "metadata": { 478 | "collapsed": true 479 | }, 480 | "outputs": [], 481 | "source": [ 482 | "losses=[]" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 25, 488 | "metadata": {}, 489 | "outputs": [ 490 | { 491 | "name": "stdout", 492 | "output_type": "stream", 493 | "text": [ 494 | "[0] loss : 2.5338616371154785 , bucket : 0\n", 495 | "[100] loss : 1.021700382232666 , bucket : 3\n", 496 | "[200] loss : 0.46864187717437744 , bucket : 0\n", 497 | "[300] loss : 0.5949462652206421 , bucket : 1\n", 498 | "[400] loss : 0.9982641339302063 , bucket : 2\n", 499 | "[500] loss : 0.8244224786758423 , bucket : 2\n", 500 | "[600] loss : 0.6691949367523193 , bucket : 3\n", 501 | "[700] loss : 0.5334180593490601 , bucket : 1\n", 502 | "[800] loss : 0.3589295446872711 , bucket : 0\n", 503 | "[900] loss : 0.6886817216873169 , bucket : 3\n", 504 | "[1000] loss : 0.41534432768821716 , bucket : 2\n", 505 | "[1100] loss : 0.49127447605133057 , bucket : 3\n", 506 | "[1200] loss : 0.3872307240962982 , bucket : 2\n", 507 | "[1300] loss : 0.47533732652664185 , bucket : 2\n", 508 | "[1400] loss : 0.4393002986907959 , bucket : 3\n", 509 | "[1500] loss : 0.3600185215473175 , bucket : 3\n", 510 | "[1600] loss : 0.4524793326854706 , bucket : 2\n", 511 | "[1700] loss : 0.09196716547012329 , bucket : 0\n", 512 | "[1800] loss : 0.13422846794128418 , bucket : 0\n", 513 | "[1900] loss : 0.3615540564060211 , bucket : 2\n", 514 | "[2000] loss : 0.13525305688381195 , bucket : 0\n", 515 | "[2100] loss : 0.3106137812137604 , bucket : 1\n", 516 | "[2200] loss : 0.23308174312114716 , bucket : 3\n", 517 | "[2300] loss : 0.07280982285737991 , bucket : 0\n", 518 | "[2400] loss : 0.25790470838546753 , bucket : 2\n", 519 | "[2500] loss : 0.3075273633003235 , bucket : 2\n", 520 | "[2600] loss : 0.20128652453422546 , bucket : 1\n", 521 | "[2700] loss : 0.267413854598999 , bucket : 2\n", 522 | "[2800] loss : 0.2660099267959595 , bucket : 2\n", 523 | "[2900] loss : 0.2145916223526001 , bucket : 0\n", 524 | "[3000] loss : 0.18937240540981293 , bucket : 1\n", 525 | "[3100] loss : 
0.13038747012615204 , bucket : 0\n", 526 | "[3200] loss : 0.26689308881759644 , bucket : 1\n", 527 | "[3300] loss : 0.16859322786331177 , bucket : 2\n", 528 | "[3400] loss : 0.08819016814231873 , bucket : 3\n", 529 | "[3500] loss : 0.18127848207950592 , bucket : 1\n", 530 | "[3600] loss : 0.1321011483669281 , bucket : 1\n", 531 | "[3700] loss : 0.14424817264080048 , bucket : 1\n", 532 | "[3800] loss : 0.12812256813049316 , bucket : 2\n", 533 | "[3900] loss : 0.12484750151634216 , bucket : 1\n", 534 | "[4000] loss : 0.12390623986721039 , bucket : 1\n", 535 | "[4100] loss : 0.09196265041828156 , bucket : 1\n", 536 | "[4200] loss : 0.12934979796409607 , bucket : 1\n", 537 | "[4300] loss : 0.10172194987535477 , bucket : 1\n", 538 | "[4400] loss : 0.0912589579820633 , bucket : 2\n", 539 | "[4500] loss : 0.061736565083265305 , bucket : 1\n", 540 | "[4600] loss : 0.08378574997186661 , bucket : 1\n", 541 | "[4700] loss : 0.05071571096777916 , bucket : 3\n", 542 | "[4800] loss : 0.11878789961338043 , bucket : 1\n", 543 | "[4900] loss : 0.02820456586778164 , bucket : 3\n" 544 | ] 545 | } 546 | ], 547 | "source": [ 548 | "for epoch in range(5000):\n", 549 | " \n", 550 | " bucket_id = bucket_model.select_bucket()\n", 551 | " inputs, targets,lengths = getBatch(bucket,bucket_id,BATCH_SIZE)\n", 552 | " \n", 553 | " bucket_model.models[bucket_id].zero_grad()\n", 554 | " \n", 555 | " outputs = bucket_model.models[bucket_id](inputs,lengths)\n", 556 | " \n", 557 | " loss = loss_function(outputs,targets)\n", 558 | " losses.append(loss)\n", 559 | " loss.backward()\n", 560 | " bucket_model.optims[bucket_id].step()\n", 561 | " \n", 562 | " if epoch % 100==0:\n", 563 | " print(\"[{epoch}] loss : {loss} , bucket : {bucket_id}\".format(epoch=epoch,loss=loss.data.numpy()[0],bucket_id=bucket_id))" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "### 테스트 " 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 26, 576 | "metadata": {}, 577 | "outputs": [ 578 | { 579 | "name": "stdout", 580 | "output_type": "stream", 581 | "text": [ 582 | "혹시 강동구 보건소 도 한 번 물어봐 주 세요 \n", 583 | "\n", 584 | "O : O\n", 585 | "B-LOC : B-LOC\n", 586 | "I-LOC : I-LOC\n", 587 | "O : O\n", 588 | "O : O\n", 589 | "O : O\n", 590 | "O : O\n", 591 | "O : O\n", 592 | "O : O\n", 593 | "O : O\n" 594 | ] 595 | } 596 | ], 597 | "source": [ 598 | "test = random.choice(training_data)\n", 599 | "input_ = test[0]\n", 600 | "tag = test[1]\n", 601 | "print(' '.join(input_)+'\\n')\n", 602 | "\n", 603 | "length = len(input_)\n", 604 | "for i in range(len(bucket_config)):\n", 605 | " if bucket_config[i][0] == length:\n", 606 | " bucket_id = i\n", 607 | " break\n", 608 | "\n", 609 | "\n", 610 | "\n", 611 | "sentence_in = prepare_sequence(input_,word_to_ix)\n", 612 | "sentence_in=sentence_in.view(1,-1)\n", 613 | "\n", 614 | "scores = bucket_model.models[bucket_id](sentence_in,[len(input_)])\n", 615 | "v,i = torch.max(scores,1)\n", 616 | "for t in range(i.size()[0]):\n", 617 | " print(tag[t], ' : ', ix_to_tag[i.data.numpy()[t][0]])" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 27, 623 | "metadata": {}, 624 | "outputs": [ 625 | { 626 | "name": "stderr", 627 | "output_type": "stream", 628 | "text": [ 629 | "/home/dsksd/.local/lib/python3.5/site-packages/torch/serialization.py:147: UserWarning: Couldn't retrieve source code for container of type BUCKETRNN. It won't be checked for correctness upon loading.\n", 630 | " \"type \" + obj.__name__ + \". 
It won't be checked \"\n", 631 | "/home/dsksd/.local/lib/python3.5/site-packages/torch/serialization.py:147: UserWarning: Couldn't retrieve source code for container of type RNN. It won't be checked for correctness upon loading.\n", 632 | " \"type \" + obj.__name__ + \". It won't be checked \"\n" 633 | ] 634 | } 635 | ], 636 | "source": [ 637 | "torch.save(bucket_model,'NER_model.pkl')" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 28, 643 | "metadata": { 644 | "collapsed": true 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "restore = torch.load('NER_model.pkl')" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 45, 654 | "metadata": {}, 655 | "outputs": [ 656 | { 657 | "name": "stdout", 658 | "output_type": "stream", 659 | "text": [ 660 | "영화 예매 좀 해 줘\n", 661 | "\n", 662 | "O : O\n", 663 | "O : O\n", 664 | "O : O\n", 665 | "O : O\n", 666 | "O : O\n" 667 | ] 668 | } 669 | ], 670 | "source": [ 671 | "test = random.choice(training_data)\n", 672 | "input_ = test[0]\n", 673 | "tag = test[1]\n", 674 | "print(' '.join(input_)+'\\n')\n", 675 | "\n", 676 | "length = len(input_)\n", 677 | "for i in range(len(bucket_config)):\n", 678 | " if bucket_config[i][0] == length:\n", 679 | " bucket_id = i\n", 680 | " break\n", 681 | "\n", 682 | "\n", 683 | "\n", 684 | "sentence_in = prepare_sequence(input_,word_to_ix)\n", 685 | "sentence_in=sentence_in.view(1,-1)\n", 686 | "\n", 687 | "scores = restore.models[bucket_id](sentence_in,[len(input_)])\n", 688 | "v,i = torch.max(scores,1)\n", 689 | "for t in range(i.size()[0]):\n", 690 | " print(tag[t], ' : ', ix_to_tag[i.data.numpy()[t][0]])" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": null, 696 | "metadata": { 697 | "collapsed": true 698 | }, 699 | "outputs": [], 700 | "source": [] 701 | } 702 | ], 703 | "metadata": { 704 | "kernelspec": { 705 | "display_name": "Python 3", 706 | "language": "python", 707 | "name": "python3" 708 | }, 709 | "language_info": { 710 | "codemirror_mode": { 711 | "name": "ipython", 712 | "version": 3 713 | }, 714 | "file_extension": ".py", 715 | "mimetype": "text/x-python", 716 | "name": "python", 717 | "nbconvert_exporter": "python", 718 | "pygments_lexer": "ipython3", 719 | "version": "3.5.2" 720 | } 721 | }, 722 | "nbformat": 4, 723 | "nbformat_minor": 2 724 | } 725 | -------------------------------------------------------------------------------- /deepnlp/06_Seq2Seq_basic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "data": { 10 | "text/plain": [ 11 | "" 12 | ] 13 | }, 14 | "execution_count": 1, 15 | "metadata": {}, 16 | "output_type": "execute_result" 17 | } 18 | ], 19 | "source": [ 20 | "import torch\n", 21 | "from torch.autograd import Variable\n", 22 | "import torch.nn as nn\n", 23 | "import torch.nn.functional as F\n", 24 | "import torch.optim as optim\n", 25 | "import json\n", 26 | "import pickle\n", 27 | "import random\n", 28 | "import time\n", 29 | "import math\n", 30 | "import numpy as np\n", 31 | "from konlpy.tag import Mecab;tagger=Mecab()\n", 32 | "from collections import Counter\n", 33 | "from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence\n", 34 | "\n", 35 | "torch.manual_seed(1)" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | 
"source": [ 46 | "USE_CUDA = False" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "# 데이터 " 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "일단 최대 길이 (10,10)으로 고정하고 PAD & Batch" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": { 67 | "collapsed": true 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "SEQ_LENGTH=10\n", 72 | "SOS_token = 0\n", 73 | "EOS_token = 1" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "data = open('../../dataset/corpus/dsksd_chat.txt').readlines()\n", 85 | "data = [[t.split('\\\\t')[0],t.split('\\\\t')[1][:-1]] for t in data if t !='\\n']" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "data": { 95 | "text/plain": [ 96 | "153" 97 | ] 98 | }, 99 | "execution_count": 5, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "DATA_SIZE = len(data) # 배치 사이즈\n", 106 | "DATA_SIZE" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "### 전처리 " 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "1. 형태소 분석\n", 121 | "2. 최대 길이 10보다 긴 것들 10으로 제한\n", 122 | "3. EOS 태그 달기\n", 123 | "4. 길이 10이 안되는 것들 PADDING\n", 124 | "5. [[Q,A]...] " 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 6, 130 | "metadata": { 131 | "collapsed": true 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "train=[]" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 7, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "for t0,t1 in data:\n", 147 | " token0 = tagger.morphs(t0)\n", 148 | " \n", 149 | " if len(token0)>=SEQ_LENGTH:\n", 150 | " token0= token0[:SEQ_LENGTH-1]\n", 151 | " token0.append(\"EOS\")\n", 152 | "\n", 153 | " token1 = tagger.morphs(t1)\n", 154 | " if len(token1)>=SEQ_LENGTH:\n", 155 | " token1=token1[:SEQ_LENGTH-1]\n", 156 | " \n", 157 | " token1.append(\"EOS\")\n", 158 | " while len(token0)