├── .gitignore
├── README.md
├── dataset
│   ├── 94k_sentences
│   │   └── mn_sentences_94k.txt
│   ├── book
│   │   └── usan-dooguur-20k.txt
│   └── lyrics
│       └── mongolian-hiphop-lyrics.txt
├── images
│   ├── bert
│   │   └── mongolian-bert-attend-visualization.png
│   └── cnn-weights
│       ├── 1.png
│       ├── 2.png
│       ├── 3.png
│       └── 4.png
├── mn_bert_finetuning_notebooks
│   ├── Fine_tuning_Mongolian_BERT_for_eduge_classification_on_GPU,_32K_512.ipynb
│   ├── Mongolian_BERT_finetuning_for_EDUGE_on_TPU,_32K_512.ipynb
│   └── finetuning mongolian bert model on GPU.ipynb
├── neural_network_classifier_notebooks
│   ├── Mongolian_text_classification_01,_simple.ipynb
│   ├── Mongolian_text_classification_02,_word2vec_initialization,_static_embedding_matrix.ipynb
│   ├── Mongolian_text_classification_02,_word2vec_initialization,_trainable_embedding_matrix.ipynb
│   ├── Mongolian_text_classification_03,_1D_Convolution.ipynb
│   ├── Mongolian_text_classification_03,_Multiple_1D_Convolution_layers.ipynb
│   ├── Mongolian_text_classification_04,_RNN(LSTM).ipynb
│   └── Mongolian_text_classification_05,_Attention.ipynb
├── old_stuffs
│   ├── .gitignore
│   ├── README.md
│   ├── clear_create_word2vec.py
│   ├── clear_text_to_array.py
│   ├── convert_text_to_seqvector_through_embedmatrix.py
│   ├── corpuses_test
│   │   ├── economy_news_gogo_mn.txt
│   │   ├── health_news_gogo_mn.txt
│   │   ├── politics_news_ikon_mn.txt
│   │   ├── technology_news_gogo_mn.txt
│   │   └── world_news_gogo_mn.txt
│   ├── djangoapp
│   │   ├── app
│   │   │   ├── __init__.py
│   │   │   ├── admin.py
│   │   │   ├── apps.py
│   │   │   ├── forms.py
│   │   │   ├── migrations
│   │   │   │   ├── 0001_initial.py
│   │   │   │   └── __init__.py
│   │   │   ├── models.py
│   │   │   ├── tests.py
│   │   │   ├── urls.py
│   │   │   └── views.py
│   │   ├── djangoapp
│   │   │   ├── __init__.py
│   │   │   ├── settings.py
│   │   │   ├── urls.py
│   │   │   └── wsgi.py
│   │   ├── manage.py
│   │   └── templates
│   │       ├── base.html
│   │       ├── classify.html
│   │       └── home.html
│   ├── freeze_tf_model.py
│   ├── ikon_mn_scrape.py
│   ├── images
│   │   ├── accuracy.png
│   │   ├── classifiedresult.png
│   │   ├── loss.png
│   │   └── webinput.png
│   ├── mongolianstopwords.py
│   ├── numpy_embedding_matrix_tf.py
│   ├── prepare_trainingset.py
│   ├── requirements.txt
│   ├── research
│   │   ├── ikon-research.txt
│   │   └── nlp-research.txt
│   ├── stemmer.py
│   ├── training_bilstm_rnn.py
│   ├── training_helpers.py
│   ├── training_lstm_rnn.py
│   ├── training_stacked_lstm.py
│   ├── use_freezed_model_rpc.py
│   ├── using pretrained word2vec for mongolian text classification.ipynb
│   ├── wordtoken_to_id.py
│   └── wordvec_exp.py
├── preprocess_dataset
│   └── preprocess_eduge.ipynb
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
1 | env/
2 | 
3 | preprocess_dataset/.ipynb_checkpoints
4 | preprocess_dataset/eduge.csv
5 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mongolian-text-classification
2 | Mongolian Cyrillic text classification with modern TensorFlow, plus some fine-tuning of TugsTugi's pretrained BERT model.
3 | 
4 | # Load Mongolian BERT in TensorFlow 2
5 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ReDLH2DDiCt_Y800vGub8OuYJlR-TsZw)
6 | 
7 | # Generate text using Mongolian BERT
8 | 
9 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jJA-YSAsbq5gbpyGYE-8p-rCzgqSU9eX)
10 | 
11 | # Visualize Mongolian BERT using bertviz and the PyTorch model
12 | 
13 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1UEDNlfEmXxZy1jRrE7pCTZNu8DplWVQv)
14 | 
15 | ![Alt text](images/bert/mongolian-bert-attend-visualization.png?raw=true "Mongolian BERT attend")
16 | 
17 | 
18 | # Fine-tuning TugsTugi's Mongolian BERT model
19 | In TPU mode the BERT code cannot load checkpoints from the local file system, so a Google Cloud Storage (GCS) bucket has to be used; see the sketch below.
20 | 
21 | Fine-tune Mongolian BERT on TPU; you need your own GCS bucket to fine-tune on TPU [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CnGd2OnNDlxe6ZUjmOa7zg__CcKk5X85)
22 | 
23 | Fine-tune Mongolian BERT on GPU; this takes a lot of computation, and a small batch size is necessary because of the memory limit [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1u9mVeWRh7GWLONAzZ3XpJciPfv38vHaZ)
24 | 
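The TPU notebook therefore keeps both the pretrained checkpoint and the fine-tuning output directory in a bucket. The snippet below is only a minimal sketch of that layout; the bucket name, directory and checkpoint names, and the flag list in the comments are illustrative assumptions, not the notebook's exact values.

```python
# Minimal sketch of the GCS layout needed for TPU fine-tuning (all names are placeholders):
# the TPU workers cannot read the Colab VM's local disk, so both the initial checkpoint
# and the output directory must be gs:// URIs inside a bucket you own.
import os

BUCKET = "gs://your-bucket-name"   # assumption: your own bucket, in a TPU-accessible region

# Copy the pretrained Mongolian BERT files into the bucket once, e.g. from a Colab cell:
#   !gsutil -m cp -r mongolian_bert_model/* gs://your-bucket-name/mongolian_bert/
BERT_GCS_DIR    = os.path.join(BUCKET, "mongolian_bert")
INIT_CHECKPOINT = os.path.join(BERT_GCS_DIR, "model.ckpt")          # checkpoint prefix (placeholder name)
OUTPUT_DIR      = os.path.join(BUCKET, "eduge_finetuning_output")   # fine-tuned model is written here

# The BERT fine-tuning script is then pointed only at bucket paths, roughly:
#   --init_checkpoint=<INIT_CHECKPOINT> --output_dir=<OUTPUT_DIR> --use_tpu=True --tpu_name=<TPU address>
print(INIT_CHECKPOINT)
print(OUTPUT_DIR)
```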
25 | # Classifiers using simple neural networks
26 | 
27 | No 01, Simplest classifier [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ulv6tUAjOsp-jN4sTdef3lTuJb0yX4qy)
28 | 
29 | No 02, Pretrained word2vec initialization from Facebook's fastText, a simple form of transfer learning (a minimal sketch of this setup appears below, after the series overview). With a frozen, non-trainable embedding layer [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SfwdhIoRMi4kXeAN8eUjYXKuT5zig9WV) and with a trainable embedding layer [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WQvCa6KDOxQ2YjDdb48g4zsN60_Svbhg)
30 | 
31 | No 03, 1D convolution [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JgJN74E1w1x8RSjm9qi06uw6y0I_9k1J) and multiple 1D convolution layers [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lTh2dG64L4aJsCip714sCA_xQgMttxOb)
32 | 
33 | No 04, LSTM [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j0MN3UTGz-990bl61n5B1mrtjnq8hSdh)
34 | 
35 | Visualize RNN neuron firing during text generation [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ndM1G-0qZx4wi6E9kPL1D9IjaM0pq3r9)
36 | 
37 | No 05, LSTM with attention, including visualization of attention scores in text classification [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10nPgRmbZsjad46CdVJKRHklestXcEpZ5)
38 | 
39 | No 06, Classification with Mongolian BERT and TensorFlow 2.0, with frozen BERT layers [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JQ87pFlGkDMbpHQp9ZSiyrQwdiRUAm-N)
40 | 
41 | No 07, Classification with Mongolian BERT large, using HuggingFace and TensorFlow 2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e8t6MZMvpoINkXtv7o1h3ESZq04pXPUv?usp=sharing)
42 | 
43 | 
44 | 
45 | # Mongolian sentence interpolation experiments
46 | 
47 | Sequence loss in Keras and TF2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jlyB2fOi_JBAi4WPMVDJ_e8-_WHtQK_9)
48 | 
49 | Variational autoencoder for Mongolian text [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tBTudj9M5CGih3p8Uxj0R1SA6f3BJj-Z)
50 | 
51 | # Other experiments
52 | Predict the next word, greedy text generation [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1urjsJUuNTnTAAAqu_eXpIkwRWUi72xp_)
53 | 
54 | # Topics this series covers (or will cover)
55 | word2vec initialization, 1D convolution, RNN variants, attention, weight visualization for interpretability, Transformers, techniques for handling longer texts, and so on...
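As referenced in the No 02 item above, the word2vec-initialization notebooks build an embedding matrix from pretrained fastText vectors and pass it to a Keras `Embedding` layer, either frozen (the static variant) or trainable. The sketch below is minimal and self-contained; the tiny `word_index` and `word2vec_model` stand-ins replace the real Eduge vocabulary and the `cc.mn.300` fastText model loaded with gensim in the notebooks.

```python
# Minimal sketch of pretrained-embedding initialization (not the exact notebook code).
# word_index / word2vec_model are tiny stand-ins for the Eduge vocabulary and the
# fastText cc.mn.300 vectors used in the notebooks.
import numpy as np
from tensorflow import keras

EMBED_DIM      = 300
word_index     = {"<pad>": 0, "<unk>": 1, "монгол": 2, "спорт": 3}
word2vec_model = {"монгол": np.ones(EMBED_DIM), "спорт": np.full(EMBED_DIM, 0.5)}

# One row per vocabulary id; words without a pretrained vector keep a random row.
vocab_size       = len(word_index)
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, EMBED_DIM))
for word, idx in word_index.items():
    if word in word2vec_model:
        embedding_matrix[idx] = word2vec_model[word]

# Static variant: the pretrained vectors stay frozen during training.
static_embedding = keras.layers.Embedding(
    vocab_size, EMBED_DIM, weights=[embedding_matrix], trainable=False)

# Trainable variant: same initialization, but updated together with the classifier.
tuned_embedding = keras.layers.Embedding(
    vocab_size, EMBED_DIM, weights=[embedding_matrix], trainable=True)
```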
56 | 57 | 58 | # useful references and resources 59 | - Mongolian BERT models 60 | https://github.com/tugstugi/mongolian-bert 61 | - Mongolian NLP 62 | https://github.com/tugstugi/mongolian-nlp 63 | - Eduge classification baseline using SVM 64 | https://colab.research.google.com/github/tugstugi/mongolian-nlp/blob/master/misc/Eduge_SVM.ipynb 65 | - News crawler 66 | https://github.com/codelucas/newspaper 67 | 68 | # Images and screenshots 69 | 70 | ![Alt text](images/cnn-weights/1.png?raw=true "CNN weights 1") 71 | ![Alt text](images/cnn-weights/2.png?raw=true "CNN weights 2") 72 | ![Alt text](images/cnn-weights/3.png?raw=true "CNN weights 3") 73 | ![Alt text](images/cnn-weights/4.png?raw=true "CNN weights 4") 74 | -------------------------------------------------------------------------------- /images/bert/mongolian-bert-attend-visualization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/bert/mongolian-bert-attend-visualization.png -------------------------------------------------------------------------------- /images/cnn-weights/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/1.png -------------------------------------------------------------------------------- /images/cnn-weights/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/2.png -------------------------------------------------------------------------------- /images/cnn-weights/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/3.png -------------------------------------------------------------------------------- /images/cnn-weights/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/4.png -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_01,_simple.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #01, simple.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "In this notebook I'm gonna try to classify cyrillic mongolian texts using modern Tensorflow 2.0\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish (sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: 
https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "23b48c0b-bba9-4004-ad2e-40fdb909a037", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 34 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "import tensorflow as tf\n", 53 | "from tensorflow import keras\n", 54 | "\n", 55 | "import numpy as np\n", 56 | "\n", 57 | "print(tf.__version__)" 58 | ], 59 | "execution_count": 1, 60 | "outputs": [ 61 | { 62 | "output_type": "stream", 63 | "text": [ 64 | "2.0.0-alpha0\n" 65 | ], 66 | "name": "stdout" 67 | } 68 | ] 69 | }, 70 | { 71 | "metadata": { 72 | "id": "smJeJfoo4qcu", 73 | "colab_type": "text" 74 | }, 75 | "cell_type": "markdown", 76 | "source": [ 77 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) " 78 | ] 79 | }, 80 | { 81 | "metadata": { 82 | "id": "CDayX_Yx3REh", 83 | "colab_type": "code", 84 | "outputId": "660f4571-4c05-4391-dfb9-e1a89b897f29", 85 | "colab": { 86 | "base_uri": "https://localhost:8080/", 87 | "height": 476 88 | } 89 | }, 90 | "cell_type": "code", 91 | "source": [ 92 | "import os\n", 93 | "from os.path import exists, join, basename, splitext\n", 94 | "import sys\n", 95 | "\n", 96 | "def download_from_google_drive(file_id, file_name):\n", 97 | " !rm -f ./cookie\n", 98 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 99 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 100 | " confirm_text = confirm_text[0]\n", 101 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 102 | " \n", 103 | "# download eduge pickles\n", 104 | "file_path = 'eduge_pickles'\n", 105 | "if not exists(file_path):\n", 106 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 107 | " rar_file = file_path+\".rar\"\n", 108 | " !unrar x $rar_file" 109 | ], 110 | "execution_count": 2, 111 | "outputs": [ 112 | { 113 | "output_type": "stream", 114 | "text": [ 115 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 116 | " Dload Upload Total Spent Left Speed\n", 117 | "100 388 0 388 0 0 4511 0 --:--:-- --:--:-- --:--:-- 4511\n", 118 | "100 106M 0 106M 0 0 90.6M 0 --:--:-- 0:00:01 --:--:-- 126M\n", 119 | "\n", 120 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 121 | "\n", 122 | "\n", 123 | "Extracting from eduge_pickles.rar\n", 124 | "\n", 125 | "\n", 126 | "Would you like to replace the existing file word_index.pickle\n", 127 | "9178153 bytes, modified on 2019-04-13 01:44\n", 128 | "with a new one\n", 129 | "9178153 bytes, modified on 2019-04-13 01:44\n", 130 | "\n", 131 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit n\n", 132 | "\n", 133 | "\n", 134 | "Would you like to replace the existing file eduge.pickle\n", 135 | "359611555 bytes, modified on 2019-04-13 01:44\n", 136 | "with a new one\n", 137 | "359611555 bytes, modified on 2019-04-13 01:44\n", 138 | "\n", 139 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 140 | "\n", 141 | "Program aborted\n" 142 | ], 143 | "name": "stdout" 144 | } 145 | ] 146 | }, 147 | { 148 | "metadata": { 149 
| "id": "pPHJcnfi4Rzg", 150 | "colab_type": "code", 151 | "colab": {} 152 | }, 153 | "cell_type": "code", 154 | "source": [ 155 | "import pickle\n", 156 | "\n", 157 | "with open('word_index.pickle', 'rb') as handle:\n", 158 | " word_index = pickle.load(handle)\n", 159 | " \n", 160 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 161 | " reversed_word_index = pickle.load(handle)\n", 162 | " \n", 163 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 164 | " eduge_ds = pickle.load(handle)" 165 | ], 166 | "execution_count": 0, 167 | "outputs": [] 168 | }, 169 | { 170 | "metadata": { 171 | "id": "XFxd1QGR65VV", 172 | "colab_type": "code", 173 | "colab": {} 174 | }, 175 | "cell_type": "code", 176 | "source": [ 177 | "MAX_LEN = 512\n", 178 | "\n", 179 | "import itertools\n", 180 | "\n", 181 | "for item in eduge_ds:\n", 182 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 183 | ], 184 | "execution_count": 0, 185 | "outputs": [] 186 | }, 187 | { 188 | "metadata": { 189 | "id": "U8PTeX0WCbhR", 190 | "colab_type": "code", 191 | "colab": {} 192 | }, 193 | "cell_type": "code", 194 | "source": [ 195 | "from sklearn.model_selection import train_test_split\n", 196 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 197 | ], 198 | "execution_count": 0, 199 | "outputs": [] 200 | }, 201 | { 202 | "metadata": { 203 | "id": "8mgMCFcgDHH4", 204 | "colab_type": "code", 205 | "colab": {} 206 | }, 207 | "cell_type": "code", 208 | "source": [ 209 | "train_data_words = [i[0] for i in train]\n", 210 | "train_label_words = [i[1] for i in train]\n", 211 | "test_data_words = [i[0] for i in test ]\n", 212 | "test_label_words = [i[1] for i in test ]" 213 | ], 214 | "execution_count": 0, 215 | "outputs": [] 216 | }, 217 | { 218 | "metadata": { 219 | "id": "rrXC7UiuFkCH", 220 | "colab_type": "code", 221 | "colab": {} 222 | }, 223 | "cell_type": "code", 224 | "source": [ 225 | "def encode_news(text):\n", 226 | " return [word_index.get(i, 2) for i in text]\n", 227 | " \n", 228 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 229 | "test_data = [encode_news(sent) for sent in test_data_words ]" 230 | ], 231 | "execution_count": 0, 232 | "outputs": [] 233 | }, 234 | { 235 | "metadata": { 236 | "id": "FV-h_avPEzM1", 237 | "colab_type": "code", 238 | "colab": {} 239 | }, 240 | "cell_type": "code", 241 | "source": [ 242 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 243 | " value=word_index[\"\"],\n", 244 | " padding='post',\n", 245 | " maxlen=MAX_LEN)\n", 246 | "\n", 247 | "test_data = keras.preprocessing.sequence.pad_sequences(test_data,\n", 248 | " value=word_index[\"\"],\n", 249 | " padding='post',\n", 250 | " maxlen=MAX_LEN)" 251 | ], 252 | "execution_count": 0, 253 | "outputs": [] 254 | }, 255 | { 256 | "metadata": { 257 | "id": "gDVqmPqxIMid", 258 | "colab_type": "code", 259 | "outputId": "20a69735-0785-4267-ae33-da76cb72e475", 260 | "colab": { 261 | "base_uri": "https://localhost:8080/", 262 | "height": 170 263 | } 264 | }, 265 | "cell_type": "code", 266 | "source": [ 267 | "labels = list(set(test_label_words))\n", 268 | "labels" 269 | ], 270 | "execution_count": 9, 271 | "outputs": [ 272 | { 273 | "output_type": "execute_result", 274 | "data": { 275 | "text/plain": [ 276 | "['боловсрол',\n", 277 | " 'байгал орчин',\n", 278 | " 'хууль',\n", 279 | " 'эдийн засаг',\n", 280 | " 'улс төр',\n", 281 | " 'эрүүл мэнд',\n", 282 | " 'урлаг соёл',\n", 283 | " 'спорт',\n", 284 | " 'технологи']" 285 | ] 286 | }, 287 | 
"metadata": { 288 | "tags": [] 289 | }, 290 | "execution_count": 9 291 | } 292 | ] 293 | }, 294 | { 295 | "metadata": { 296 | "id": "PBKj3GQqJq29", 297 | "colab_type": "code", 298 | "colab": {} 299 | }, 300 | "cell_type": "code", 301 | "source": [ 302 | "from sklearn.preprocessing import LabelBinarizer\n", 303 | "encoder = LabelBinarizer()\n", 304 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 305 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 306 | ], 307 | "execution_count": 0, 308 | "outputs": [] 309 | }, 310 | { 311 | "metadata": { 312 | "id": "DPq45PN5HZ15", 313 | "colab_type": "code", 314 | "outputId": "8ad202f8-3c10-4cbe-86f0-b68183fa8c16", 315 | "colab": { 316 | "base_uri": "https://localhost:8080/", 317 | "height": 289 318 | } 319 | }, 320 | "cell_type": "code", 321 | "source": [ 322 | "vocab_size = len(word_index)\n", 323 | "\n", 324 | "model = keras.Sequential()\n", 325 | "model.add(keras.layers.Embedding(vocab_size, 16))\n", 326 | "model.add(keras.layers.GlobalAveragePooling1D())\n", 327 | "model.add(keras.layers.Dense(16, activation='relu'))\n", 328 | "model.add(keras.layers.Dense(len(labels), activation='sigmoid'))\n", 329 | "\n", 330 | "model.summary()" 331 | ], 332 | "execution_count": 11, 333 | "outputs": [ 334 | { 335 | "output_type": "stream", 336 | "text": [ 337 | "Model: \"sequential\"\n", 338 | "_________________________________________________________________\n", 339 | "Layer (type) Output Shape Param # \n", 340 | "=================================================================\n", 341 | "embedding (Embedding) (None, None, 16) 5932704 \n", 342 | "_________________________________________________________________\n", 343 | "global_average_pooling1d (Gl (None, 16) 0 \n", 344 | "_________________________________________________________________\n", 345 | "dense (Dense) (None, 16) 272 \n", 346 | "_________________________________________________________________\n", 347 | "dense_1 (Dense) (None, 9) 153 \n", 348 | "=================================================================\n", 349 | "Total params: 5,933,129\n", 350 | "Trainable params: 5,933,129\n", 351 | "Non-trainable params: 0\n", 352 | "_________________________________________________________________\n" 353 | ], 354 | "name": "stdout" 355 | } 356 | ] 357 | }, 358 | { 359 | "metadata": { 360 | "id": "cAgP1KlqHu2F", 361 | "colab_type": "code", 362 | "colab": {} 363 | }, 364 | "cell_type": "code", 365 | "source": [ 366 | "model.compile(optimizer='adam',\n", 367 | " loss='categorical_crossentropy',\n", 368 | " metrics=['accuracy'])" 369 | ], 370 | "execution_count": 0, 371 | "outputs": [] 372 | }, 373 | { 374 | "metadata": { 375 | "id": "ZPw8roFQKrHm", 376 | "colab_type": "code", 377 | "outputId": "c54c1e93-baaf-4097-ea8e-7da87bcb8080", 378 | "colab": { 379 | "base_uri": "https://localhost:8080/", 380 | "height": 51 381 | } 382 | }, 383 | "cell_type": "code", 384 | "source": [ 385 | "print(len(train_data), len(train_label))\n", 386 | "print(len(test_data ), len(test_label) )\n", 387 | "\n", 388 | "partial_index = 3000\n", 389 | "\n", 390 | "x_val = train_data[:partial_index]\n", 391 | "partial_x_train = train_data[partial_index:]\n", 392 | "\n", 393 | "y_val = train_label[:partial_index]\n", 394 | "partial_y_train = train_label[partial_index:]" 395 | ], 396 | "execution_count": 13, 397 | "outputs": [ 398 | { 399 | "output_type": "stream", 400 | "text": [ 401 | "68094 68094\n", 402 | "7567 7567\n" 403 | ], 404 | "name": "stdout" 405 | } 406 | ] 407 | }, 
408 | { 409 | "metadata": { 410 | "id": "iSTB4--RKacs", 411 | "colab_type": "code", 412 | "outputId": "ecdc60b8-05be-43a4-c72d-6056d86e7ba0", 413 | "colab": { 414 | "base_uri": "https://localhost:8080/", 415 | "height": 1074 416 | } 417 | }, 418 | "cell_type": "code", 419 | "source": [ 420 | "epochs = 30\n", 421 | "history = model.fit(partial_x_train,\n", 422 | " partial_y_train,\n", 423 | " epochs=epochs,\n", 424 | " batch_size=512,\n", 425 | " validation_data=(x_val, y_val),\n", 426 | " verbose=1)" 427 | ], 428 | "execution_count": 14, 429 | "outputs": [ 430 | { 431 | "output_type": "stream", 432 | "text": [ 433 | "Train on 65094 samples, validate on 3000 samples\n", 434 | "Epoch 1/30\n", 435 | "65094/65094 [==============================] - 5s 73us/sample - loss: 2.1506 - accuracy: 0.2542 - val_loss: 2.0977 - val_accuracy: 0.2807\n", 436 | "Epoch 2/30\n", 437 | "65094/65094 [==============================] - 4s 58us/sample - loss: 2.0284 - accuracy: 0.3043 - val_loss: 1.9403 - val_accuracy: 0.3053\n", 438 | "Epoch 3/30\n", 439 | "65094/65094 [==============================] - 4s 58us/sample - loss: 1.7681 - accuracy: 0.3700 - val_loss: 1.5439 - val_accuracy: 0.4700\n", 440 | "Epoch 4/30\n", 441 | "65094/65094 [==============================] - 4s 59us/sample - loss: 1.3003 - accuracy: 0.6406 - val_loss: 1.1447 - val_accuracy: 0.7063\n", 442 | "Epoch 5/30\n", 443 | "65094/65094 [==============================] - 4s 60us/sample - loss: 1.0101 - accuracy: 0.7528 - val_loss: 0.9562 - val_accuracy: 0.7673\n", 444 | "Epoch 6/30\n", 445 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.8498 - accuracy: 0.7863 - val_loss: 0.8369 - val_accuracy: 0.7790\n", 446 | "Epoch 7/30\n", 447 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.7358 - accuracy: 0.8130 - val_loss: 0.7477 - val_accuracy: 0.8180\n", 448 | "Epoch 8/30\n", 449 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.6472 - accuracy: 0.8410 - val_loss: 0.6793 - val_accuracy: 0.8297\n", 450 | "Epoch 9/30\n", 451 | "65094/65094 [==============================] - 4s 59us/sample - loss: 0.5772 - accuracy: 0.8601 - val_loss: 0.6277 - val_accuracy: 0.8407\n", 452 | "Epoch 10/30\n", 453 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.5213 - accuracy: 0.8747 - val_loss: 0.5863 - val_accuracy: 0.8520\n", 454 | "Epoch 11/30\n", 455 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.4766 - accuracy: 0.8854 - val_loss: 0.5565 - val_accuracy: 0.8577\n", 456 | "Epoch 12/30\n", 457 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.4396 - accuracy: 0.8928 - val_loss: 0.5314 - val_accuracy: 0.8637\n", 458 | "Epoch 13/30\n", 459 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.4087 - accuracy: 0.9005 - val_loss: 0.5110 - val_accuracy: 0.8710\n", 460 | "Epoch 14/30\n", 461 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.3825 - accuracy: 0.9061 - val_loss: 0.4965 - val_accuracy: 0.8693\n", 462 | "Epoch 15/30\n", 463 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3597 - accuracy: 0.9116 - val_loss: 0.4843 - val_accuracy: 0.8727\n", 464 | "Epoch 16/30\n", 465 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3387 - accuracy: 0.9166 - val_loss: 0.4732 - val_accuracy: 0.8783\n", 466 | "Epoch 17/30\n", 467 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.3199 - accuracy: 0.9215 - 
val_loss: 0.4653 - val_accuracy: 0.8767\n", 468 | "Epoch 18/30\n", 469 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3028 - accuracy: 0.9257 - val_loss: 0.4574 - val_accuracy: 0.8797\n", 470 | "Epoch 19/30\n", 471 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2869 - accuracy: 0.9298 - val_loss: 0.4520 - val_accuracy: 0.8810\n", 472 | "Epoch 20/30\n", 473 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2724 - accuracy: 0.9333 - val_loss: 0.4472 - val_accuracy: 0.8823\n", 474 | "Epoch 21/30\n", 475 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2586 - accuracy: 0.9362 - val_loss: 0.4437 - val_accuracy: 0.8807\n", 476 | "Epoch 22/30\n", 477 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2458 - accuracy: 0.9399 - val_loss: 0.4395 - val_accuracy: 0.8813\n", 478 | "Epoch 23/30\n", 479 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2338 - accuracy: 0.9427 - val_loss: 0.4363 - val_accuracy: 0.8830\n", 480 | "Epoch 24/30\n", 481 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2222 - accuracy: 0.9459 - val_loss: 0.4348 - val_accuracy: 0.8837\n", 482 | "Epoch 25/30\n", 483 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2116 - accuracy: 0.9487 - val_loss: 0.4333 - val_accuracy: 0.8827\n", 484 | "Epoch 26/30\n", 485 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2012 - accuracy: 0.9514 - val_loss: 0.4339 - val_accuracy: 0.8820\n", 486 | "Epoch 27/30\n", 487 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.1923 - accuracy: 0.9537 - val_loss: 0.4341 - val_accuracy: 0.8833\n", 488 | "Epoch 28/30\n", 489 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.1829 - accuracy: 0.9561 - val_loss: 0.4325 - val_accuracy: 0.8843\n", 490 | "Epoch 29/30\n", 491 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.1750 - accuracy: 0.9580 - val_loss: 0.4349 - val_accuracy: 0.8833\n", 492 | "Epoch 30/30\n", 493 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.1663 - accuracy: 0.9596 - val_loss: 0.4335 - val_accuracy: 0.8850\n" 494 | ], 495 | "name": "stdout" 496 | } 497 | ] 498 | }, 499 | { 500 | "metadata": { 501 | "id": "r8_mvDjYL3CX", 502 | "colab_type": "code", 503 | "outputId": "1f0a15f4-a747-4c07-b6c4-493efd4ee2a8", 504 | "colab": { 505 | "base_uri": "https://localhost:8080/", 506 | "height": 51 507 | } 508 | }, 509 | "cell_type": "code", 510 | "source": [ 511 | "results = model.evaluate(test_data, test_label)\n", 512 | "print(results)" 513 | ], 514 | "execution_count": 15, 515 | "outputs": [ 516 | { 517 | "output_type": "stream", 518 | "text": [ 519 | "7567/7567 [==============================] - 1s 88us/sample - loss: 0.4136 - accuracy: 0.8960\n", 520 | "[0.4136109899459213, 0.8959958]\n" 521 | ], 522 | "name": "stdout" 523 | } 524 | ] 525 | }, 526 | { 527 | "metadata": { 528 | "id": "VaIioR7EPfig", 529 | "colab_type": "code", 530 | "outputId": "f19a6548-45fe-4cc1-d906-f6f79bce804a", 531 | "colab": { 532 | "base_uri": "https://localhost:8080/", 533 | "height": 71 534 | } 535 | }, 536 | "cell_type": "code", 537 | "source": [ 538 | "data_index = 12\n", 539 | "data_words = \" \".join(test_data_words[data_index])\n", 540 | "data_indexes = test_data[data_index]\n", 541 | "print(data_words)\n", 542 | "#print(data_indexes)\n", 543 | "import numpy as np\n", 544 | "predicted = 
model.predict([data_indexes])\n", 545 | "print(encoder.classes_[np.argmax(predicted)])" 546 | ], 547 | "execution_count": 16, 548 | "outputs": [ 549 | { 550 | "output_type": "stream", 551 | "text": [ 552 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн дэмжигч үзэгч волейболын спортын мэргэжилтэн багш нар секцэнд хичээллэгч хүүхдүүд зохион байгуулалтай ирэхээр ялангуяа тэмцээн болох газартай хамгийн ойрхон хануул дүүргийн здтгазар дүүргийнхээ ард иргэд хөдөлмөрчид сургуулийн сурагчид оюутнууд цэргийн албан хаагчид буянтухаа орчимын албан байгууллага хамт олныг идэвхтэй оролцуулах арга хэмжээ авч эхлэжээ олон зуун оюутнууд тэмцээн үзэх боллоо тэмцээний өдрүүдэд нийслэлээс буянтухаагийн спортын ордонг чиглэсэн хүмүүсийн цуваа ихсэх төлөвтэй учир нийслэлд үйл ажиллагаа явуулж байгаа орчим идсийн оюутнууд тэмцээнийг анги сургууль хамт олноороо үзэх сонирхолтой байгаагаа монголын оюутны холбоо биеийн тамирын тэнхимдээ хүсчээ үүний дагуу бсшуяам мох монголын оюутны спортын холбоо мосх ноос тэмцээнийг өдөр бүр гаруй сургуулийн орчим оюутнууд нэгдсэн хуваарийн дагуу үзэх хуваарийг бсшуяны төрийн нарийн бичгийн дарга зохион байгуулах үндэсний хороо збх ны гишүүн ддалайжаргал батлан сургуулиудад албан тоотоор хүргүүлжээ мосхд тэмцээнийг үзэхээр олон арван сургуулиуд оюутны тоогоо өгч бүртгүүлж суудлын хувиарлалтанд орж байгаа ажээ ялангуяа биеийн тамирын мэргэжлийн дээд сургуулийн оюутнууд дадлага хичээлээ тэмцээний үеэр хийхээр хичээлийн хувиараа зохицуулсан нийслэлийн засаг дарга оюутнуудад туслав улаанбаатар хотноо болдог 
оюутны олон улс тив дэлхийн тэмцээн бүрт нийслэлийн засаг дарга гмөнхбаяр ихээхэн туслалцаа үзүүлэн оюутан залуусаа байнга дэмжин оролцдог ажээ тэрээр тус тэмцээнд оролцохоор бэлтгэж байгаа монголын оюутны шигшээ багийн тамирчидын хоногийн бэлтгэл сургалтын зардлыг хариуцан гаргасан хөрөнгө санхүүгийн хүндрэлтэй байгаа үеэд тэмцээнд бэлтгэж байгаа оюутан тамирчидаа цагаа олж хэрэгцээтэй үеэд дэмжлээ мосхолбоо монголын волейболын холбоо мвх тамирчидынхаа өмнөөс талархал илэрхийлжээ монголын баг тамирчид эрдэнэт хотод оны сарын өдрөөс эхлэн хоногийн бэлтгэл хийснийхээ дараа ийнхүү нийслэлийн засаг даргын туслалцаатайгаар гадаадын тамирчидтай хамт байрлах цэцэг зочид буудалдаа орж бэлтгэл сургуулиалтаа үргэлжүүлэх боломжтой нздтгазраас баг\n", 553 | "спорт\n" 554 | ], 555 | "name": "stdout" 556 | } 557 | ] 558 | } 559 | ] 560 | } -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_04,_RNN(LSTM).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #04, RNN(LSTM).ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "In this notebook I'm gonna try to classify cyrillic mongolian texts with LSTM.\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish (sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "70ae13ba-6931-476b-9d75-49ddb96c4bd9", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 360 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "!pip install gensim\n", 53 | "\n", 54 | "import tensorflow as tf\n", 55 | "from tensorflow import keras\n", 56 | "\n", 57 | "import numpy as np\n", 58 | "\n", 59 | "print(tf.__version__)" 60 | ], 61 | "execution_count": 1, 62 | "outputs": [ 63 | { 64 | "output_type": "stream", 65 | "text": [ 66 | "Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)\n", 67 | "Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.8.1)\n", 68 | "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)\n", 69 | "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.2.1)\n", 70 | "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.16.2)\n", 71 | "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)\n", 72 | "Requirement already satisfied: bz2file in /usr/local/lib/python3.6/dist-packages (from 
smart-open>=1.2.1->gensim) (0.98)\n", 73 | "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.18.4)\n", 74 | "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.9.130)\n", 75 | "Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2.6)\n", 76 | "Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (1.22)\n", 77 | "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (3.0.4)\n", 78 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2019.3.9)\n", 79 | "Requirement already satisfied: botocore<1.13.0,>=1.12.130 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.12.130)\n", 80 | "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.2.0)\n", 81 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.4)\n", 82 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (2.5.3)\n", 83 | "Requirement already satisfied: docutils>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (0.14)\n", 84 | "2.0.0-alpha0\n" 85 | ], 86 | "name": "stdout" 87 | } 88 | ] 89 | }, 90 | { 91 | "metadata": { 92 | "id": "smJeJfoo4qcu", 93 | "colab_type": "text" 94 | }, 95 | "cell_type": "markdown", 96 | "source": [ 97 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) preprocessing eats a lot of CPU cycle so it's good idea to cook it before using colab." 
98 | ] 99 | }, 100 | { 101 | "metadata": { 102 | "id": "CDayX_Yx3REh", 103 | "colab_type": "code", 104 | "outputId": "225460ef-a61d-486e-e831-6c132cd65f5b", 105 | "colab": { 106 | "base_uri": "https://localhost:8080/", 107 | "height": 340 108 | } 109 | }, 110 | "cell_type": "code", 111 | "source": [ 112 | "import os\n", 113 | "from os.path import exists, join, basename, splitext\n", 114 | "import sys\n", 115 | "\n", 116 | "def download_from_google_drive(file_id, file_name):\n", 117 | " !rm -f ./cookie\n", 118 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 119 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 120 | " confirm_text = confirm_text[0]\n", 121 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 122 | " \n", 123 | "# download eduge pickles\n", 124 | "file_path = 'eduge_pickles'\n", 125 | "if not exists(file_path):\n", 126 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 127 | " rar_file = file_path+\".rar\"\n", 128 | " !unrar x $rar_file" 129 | ], 130 | "execution_count": 2, 131 | "outputs": [ 132 | { 133 | "output_type": "stream", 134 | "text": [ 135 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 136 | " Dload Upload Total Spent Left Speed\n", 137 | "100 388 0 388 0 0 4974 0 --:--:-- --:--:-- --:--:-- 4974\n", 138 | "100 106M 0 106M 0 0 104M 0 --:--:-- 0:00:01 --:--:-- 231M\n", 139 | "\n", 140 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 141 | "\n", 142 | "\n", 143 | "Extracting from eduge_pickles.rar\n", 144 | "\n", 145 | "\n", 146 | "Would you like to replace the existing file word_index.pickle\n", 147 | "9178153 bytes, modified on 2019-04-13 01:44\n", 148 | "with a new one\n", 149 | "9178153 bytes, modified on 2019-04-13 01:44\n", 150 | "\n", 151 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 152 | "\n", 153 | "Program aborted\n" 154 | ], 155 | "name": "stdout" 156 | } 157 | ] 158 | }, 159 | { 160 | "metadata": { 161 | "id": "pPHJcnfi4Rzg", 162 | "colab_type": "code", 163 | "colab": {} 164 | }, 165 | "cell_type": "code", 166 | "source": [ 167 | "import pickle\n", 168 | "\n", 169 | "with open('word_index.pickle', 'rb') as handle:\n", 170 | " word_index = pickle.load(handle)\n", 171 | " \n", 172 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 173 | " reversed_word_index = pickle.load(handle)\n", 174 | " \n", 175 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 176 | " eduge_ds = pickle.load(handle)" 177 | ], 178 | "execution_count": 0, 179 | "outputs": [] 180 | }, 181 | { 182 | "metadata": { 183 | "id": "ASRW7ISNnbM-", 184 | "colab_type": "code", 185 | "outputId": "3db9b6a2-330f-400c-964c-60fb90b35182", 186 | "colab": { 187 | "base_uri": "https://localhost:8080/", 188 | "height": 51 189 | } 190 | }, 191 | "cell_type": "code", 192 | "source": [ 193 | "# facebook trained word2vec on both commoncrawl and wikipedia. 
So this model should contain enough representation about our mongolian words.\n", 194 | "mongolian_word2vec_download=\"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\"\n", 195 | "if not exists(\"cc.mn.300.bin.gz\"):\n", 196 | " !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 197 | "if exists('cc.mn.300.bin.gz'):\n", 198 | " !gunzip cc.mn.300.bin.gz" 199 | ], 200 | "execution_count": 4, 201 | "outputs": [ 202 | { 203 | "output_type": "stream", 204 | "text": [ 205 | "gzip: cc.mn.300.bin already exists; do you wish to overwrite (y or n)? n\n", 206 | "\tnot overwritten\n" 207 | ], 208 | "name": "stdout" 209 | } 210 | ] 211 | }, 212 | { 213 | "metadata": { 214 | "id": "BqGAauUZpnFz", 215 | "colab_type": "code", 216 | "outputId": "5efb56a9-8cae-44a4-8474-bc1db3d56564", 217 | "colab": { 218 | "base_uri": "https://localhost:8080/", 219 | "height": 88 220 | } 221 | }, 222 | "cell_type": "code", 223 | "source": [ 224 | "from gensim.models.wrappers import FastText\n", 225 | "\n", 226 | "word2vec_model = FastText.load_fasttext_format('cc.mn.300.bin')" 227 | ], 228 | "execution_count": 5, 229 | "outputs": [ 230 | { 231 | "output_type": "stream", 232 | "text": [ 233 | "WARNING: Logging before flag parsing goes to stderr.\n", 234 | "W0414 01:07:52.334683 139997640750976 ssh.py:33] paramiko missing, opening SSH/SCP/SFTP paths will be disabled. `pip install paramiko` to suppress\n", 235 | "W0414 01:07:52.809054 139997640750976 word2vec.py:573] Slow version of gensim.models.deprecated.word2vec is being used\n" 236 | ], 237 | "name": "stderr" 238 | } 239 | ] 240 | }, 241 | { 242 | "metadata": { 243 | "id": "kkc1iiqJp-CJ", 244 | "colab_type": "code", 245 | "outputId": "99442697-27b5-4474-e384-f4f5c35dc7ff", 246 | "colab": { 247 | "base_uri": "https://localhost:8080/", 248 | "height": 88 249 | } 250 | }, 251 | "cell_type": "code", 252 | "source": [ 253 | "print(word2vec_model.most_similar('монгол'))" 254 | ], 255 | "execution_count": 6, 256 | "outputs": [ 257 | { 258 | "output_type": "stream", 259 | "text": [ 260 | "[('Монгол', 0.6342526078224182), ('монголын', 0.6047513484954834), ('хятад', 0.5558866858482361), ('Монголын', 0.5087883472442627), ('судлалаараа', 0.48851606249809265), ('манай', 0.4853793680667877), ('уйгаржин', 0.4725492596626282), ('угсаатангууд', 0.47093287110328674), ('орос', 0.46463483572006226), ('худам', 0.4609120190143585)]\n" 261 | ], 262 | "name": "stdout" 263 | }, 264 | { 265 | "output_type": "stream", 266 | "text": [ 267 | "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", 268 | " if np.issubdtype(vec.dtype, np.int):\n" 269 | ], 270 | "name": "stderr" 271 | } 272 | ] 273 | }, 274 | { 275 | "metadata": { 276 | "id": "oF6vB3Qnq08I", 277 | "colab_type": "code", 278 | "colab": {} 279 | }, 280 | "cell_type": "code", 281 | "source": [ 282 | "# preparing embedding matrix\n", 283 | "import numpy as np\n", 284 | "\n", 285 | "words_not_found = []\n", 286 | "embed_dim = 300\n", 287 | "embedding_matrix = np.random.uniform(-1, 1, (len(word_index), embed_dim))\n", 288 | "for word, i in word_index.items():\n", 289 | " if i<4:\n", 290 | " continue\n", 291 | " try:\n", 292 | " embedding_vector = word2vec_model[word]\n", 293 | " if (embedding_vector is not None) and len(embedding_vector) > 0:\n", 294 | " embedding_matrix[i] = embedding_vector\n", 295 | " except:\n", 296 | " words_not_found.append(word)\n", 297 | " pass" 298 | ], 299 | "execution_count": 0, 300 | "outputs": [] 301 | }, 302 | { 303 | "metadata": { 304 | "id": "aQAaXWIgsxm9", 305 | "colab_type": "code", 306 | "outputId": "d07b1789-a789-4089-ff5d-6a6b58ab8d74", 307 | "colab": { 308 | "base_uri": "https://localhost:8080/", 309 | "height": 34 310 | } 311 | }, 312 | "cell_type": "code", 313 | "source": [ 314 | "print(embedding_matrix.shape)\n", 315 | "#print(embedding_matrix[5])" 316 | ], 317 | "execution_count": 8, 318 | "outputs": [ 319 | { 320 | "output_type": "stream", 321 | "text": [ 322 | "(370794, 300)\n" 323 | ], 324 | "name": "stdout" 325 | } 326 | ] 327 | }, 328 | { 329 | "metadata": { 330 | "id": "XFxd1QGR65VV", 331 | "colab_type": "code", 332 | "colab": {} 333 | }, 334 | "cell_type": "code", 335 | "source": [ 336 | "MAX_LEN = 512\n", 337 | "\n", 338 | "import itertools\n", 339 | "\n", 340 | "for item in eduge_ds:\n", 341 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 342 | ], 343 | "execution_count": 0, 344 | "outputs": [] 345 | }, 346 | { 347 | "metadata": { 348 | "id": "U8PTeX0WCbhR", 349 | "colab_type": "code", 350 | "colab": {} 351 | }, 352 | "cell_type": "code", 353 | "source": [ 354 | "from sklearn.model_selection import train_test_split\n", 355 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 356 | ], 357 | "execution_count": 0, 358 | "outputs": [] 359 | }, 360 | { 361 | "metadata": { 362 | "id": "8mgMCFcgDHH4", 363 | "colab_type": "code", 364 | "colab": {} 365 | }, 366 | "cell_type": "code", 367 | "source": [ 368 | "train_data_words = [i[0] for i in train]\n", 369 | "train_label_words = [i[1] for i in train]\n", 370 | "test_data_words = [i[0] for i in test ]\n", 371 | "test_label_words = [i[1] for i in test ]" 372 | ], 373 | "execution_count": 0, 374 | "outputs": [] 375 | }, 376 | { 377 | "metadata": { 378 | "id": "rrXC7UiuFkCH", 379 | "colab_type": "code", 380 | "colab": {} 381 | }, 382 | "cell_type": "code", 383 | "source": [ 384 | "def encode_news(text):\n", 385 | " return [word_index.get(i, 2) for i in text]\n", 386 | " \n", 387 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 388 | "test_data = [encode_news(sent) for sent in test_data_words ]" 389 | ], 390 | "execution_count": 0, 391 | "outputs": [] 392 | }, 393 | { 394 | "metadata": { 395 | "id": "FV-h_avPEzM1", 396 | "colab_type": "code", 397 | "colab": {} 398 | }, 399 | "cell_type": "code", 400 | "source": [ 401 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 402 | " value=word_index[\"\"],\n", 403 | " padding='post',\n", 404 | " maxlen=MAX_LEN)\n", 405 | "\n", 406 | "test_data = 
keras.preprocessing.sequence.pad_sequences(test_data,\n", 407 | " value=word_index[\"\"],\n", 408 | " padding='post',\n", 409 | " maxlen=MAX_LEN)" 410 | ], 411 | "execution_count": 0, 412 | "outputs": [] 413 | }, 414 | { 415 | "metadata": { 416 | "id": "gDVqmPqxIMid", 417 | "colab_type": "code", 418 | "outputId": "ec30beb7-996c-4508-8729-5d5111a96e58", 419 | "colab": { 420 | "base_uri": "https://localhost:8080/", 421 | "height": 170 422 | } 423 | }, 424 | "cell_type": "code", 425 | "source": [ 426 | "labels = list(set(test_label_words))\n", 427 | "labels" 428 | ], 429 | "execution_count": 14, 430 | "outputs": [ 431 | { 432 | "output_type": "execute_result", 433 | "data": { 434 | "text/plain": [ 435 | "['спорт',\n", 436 | " 'эрүүл мэнд',\n", 437 | " 'урлаг соёл',\n", 438 | " 'эдийн засаг',\n", 439 | " 'байгал орчин',\n", 440 | " 'хууль',\n", 441 | " 'технологи',\n", 442 | " 'улс төр',\n", 443 | " 'боловсрол']" 444 | ] 445 | }, 446 | "metadata": { 447 | "tags": [] 448 | }, 449 | "execution_count": 14 450 | } 451 | ] 452 | }, 453 | { 454 | "metadata": { 455 | "id": "PBKj3GQqJq29", 456 | "colab_type": "code", 457 | "colab": {} 458 | }, 459 | "cell_type": "code", 460 | "source": [ 461 | "from sklearn.preprocessing import LabelBinarizer\n", 462 | "encoder = LabelBinarizer()\n", 463 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 464 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 465 | ], 466 | "execution_count": 0, 467 | "outputs": [] 468 | }, 469 | { 470 | "metadata": { 471 | "id": "DPq45PN5HZ15", 472 | "colab_type": "code", 473 | "outputId": "16c06c24-94a0-4438-fa99-6c7e9cf9262c", 474 | "colab": { 475 | "base_uri": "https://localhost:8080/", 476 | "height": 394 477 | } 478 | }, 479 | "cell_type": "code", 480 | "source": [ 481 | "vocab_size = len(word_index)\n", 482 | "\n", 483 | "sequence_input = keras.layers.Input(shape=(MAX_LEN,), dtype='int32')\n", 484 | "embedded_sequences = keras.layers.Embedding(\n", 485 | " vocab_size, \n", 486 | " embed_dim , \n", 487 | " weights=[embedding_matrix], \n", 488 | " input_length=MAX_LEN, \n", 489 | " trainable=False)(sequence_input)\n", 490 | "x = keras.layers.LSTM(128)(embedded_sequences)\n", 491 | "x = keras.layers.Dense(245, activation='relu')(x)\n", 492 | "x = keras.layers.Dropout(0.5)(x) # prevents overfitting\n", 493 | "preds = keras.layers.Dense(len(labels), activation='softmax')(x)\n", 494 | "\n", 495 | "model = keras.models.Model(sequence_input, preds)\n", 496 | "model.summary()" 497 | ], 498 | "execution_count": 17, 499 | "outputs": [ 500 | { 501 | "output_type": "stream", 502 | "text": [ 503 | "W0414 01:09:45.882035 139997640750976 tf_logging.py:161] : Note that this layer is not optimized for performance. 
Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n" 504 | ], 505 | "name": "stderr" 506 | }, 507 | { 508 | "output_type": "stream", 509 | "text": [ 510 | "Model: \"model\"\n", 511 | "_________________________________________________________________\n", 512 | "Layer (type) Output Shape Param # \n", 513 | "=================================================================\n", 514 | "input_2 (InputLayer) [(None, 512)] 0 \n", 515 | "_________________________________________________________________\n", 516 | "embedding_1 (Embedding) (None, 512, 300) 111238200 \n", 517 | "_________________________________________________________________\n", 518 | "unified_lstm_1 (UnifiedLSTM) (None, 128) 219648 \n", 519 | "_________________________________________________________________\n", 520 | "dense (Dense) (None, 245) 31605 \n", 521 | "_________________________________________________________________\n", 522 | "dropout (Dropout) (None, 245) 0 \n", 523 | "_________________________________________________________________\n", 524 | "dense_1 (Dense) (None, 9) 2214 \n", 525 | "=================================================================\n", 526 | "Total params: 111,491,667\n", 527 | "Trainable params: 253,467\n", 528 | "Non-trainable params: 111,238,200\n", 529 | "_________________________________________________________________\n" 530 | ], 531 | "name": "stdout" 532 | } 533 | ] 534 | }, 535 | { 536 | "metadata": { 537 | "id": "cAgP1KlqHu2F", 538 | "colab_type": "code", 539 | "colab": {} 540 | }, 541 | "cell_type": "code", 542 | "source": [ 543 | "model.compile(optimizer='rmsprop',\n", 544 | " loss='categorical_crossentropy',\n", 545 | " metrics=['accuracy'])" 546 | ], 547 | "execution_count": 0, 548 | "outputs": [] 549 | }, 550 | { 551 | "metadata": { 552 | "id": "ZPw8roFQKrHm", 553 | "colab_type": "code", 554 | "outputId": "869ebf0b-2b21-47c0-bc42-dbb1bb46efa8", 555 | "colab": { 556 | "base_uri": "https://localhost:8080/", 557 | "height": 51 558 | } 559 | }, 560 | "cell_type": "code", 561 | "source": [ 562 | "print(len(train_data), len(train_label))\n", 563 | "print(len(test_data ), len(test_label) )\n", 564 | "\n", 565 | "partial_index = 3000\n", 566 | "\n", 567 | "x_val = train_data[:partial_index]\n", 568 | "partial_x_train = train_data[partial_index:]\n", 569 | "\n", 570 | "y_val = train_label[:partial_index]\n", 571 | "partial_y_train = train_label[partial_index:]" 572 | ], 573 | "execution_count": 19, 574 | "outputs": [ 575 | { 576 | "output_type": "stream", 577 | "text": [ 578 | "68094 68094\n", 579 | "7567 7567\n" 580 | ], 581 | "name": "stdout" 582 | } 583 | ] 584 | }, 585 | { 586 | "metadata": { 587 | "id": "iSTB4--RKacs", 588 | "colab_type": "code", 589 | "outputId": "da0e04a5-1f5a-41e7-d6f3-d8476069248a", 590 | "colab": { 591 | "base_uri": "https://localhost:8080/", 592 | "height": 1985 593 | } 594 | }, 595 | "cell_type": "code", 596 | "source": [ 597 | "epochs = 50\n", 598 | "history = model.fit(partial_x_train,\n", 599 | " partial_y_train,\n", 600 | " epochs=epochs ,\n", 601 | " batch_size=512 ,\n", 602 | " validation_data=(x_val, y_val),\n", 603 | " verbose=1)" 604 | ], 605 | "execution_count": 20, 606 | "outputs": [ 607 | { 608 | "output_type": "stream", 609 | "text": [ 610 | "Train on 65094 samples, validate on 3000 samples\n", 611 | "Epoch 1/50\n", 612 | "65094/65094 [==============================] - 43s 657us/sample - loss: 2.1111 - accuracy: 0.1992 - val_loss: 2.0996 - val_accuracy: 0.2053\n", 613 | "Epoch 2/50\n", 614 | "65094/65094 [==============================] 
- 40s 613us/sample - loss: 2.0969 - accuracy: 0.2025 - val_loss: 2.7476 - val_accuracy: 0.1823\n", 615 | "Epoch 3/50\n", 616 | "65094/65094 [==============================] - 40s 610us/sample - loss: 2.0993 - accuracy: 0.2055 - val_loss: 2.0971 - val_accuracy: 0.2083\n", 617 | "Epoch 4/50\n", 618 | "65094/65094 [==============================] - 39s 606us/sample - loss: 2.1472 - accuracy: 0.2054 - val_loss: 2.0850 - val_accuracy: 0.2063\n", 619 | "Epoch 5/50\n", 620 | "65094/65094 [==============================] - 39s 607us/sample - loss: 2.1068 - accuracy: 0.2094 - val_loss: 2.0656 - val_accuracy: 0.2107\n", 621 | "Epoch 6/50\n", 622 | "65094/65094 [==============================] - 39s 605us/sample - loss: 2.0949 - accuracy: 0.2136 - val_loss: 2.1074 - val_accuracy: 0.2140\n", 623 | "Epoch 7/50\n", 624 | "65094/65094 [==============================] - 39s 605us/sample - loss: 2.0030 - accuracy: 0.2641 - val_loss: 1.9556 - val_accuracy: 0.3073\n", 625 | "Epoch 8/50\n", 626 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.9544 - accuracy: 0.3111 - val_loss: 1.9054 - val_accuracy: 0.3263\n", 627 | "Epoch 9/50\n", 628 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.9012 - accuracy: 0.3251 - val_loss: 2.1235 - val_accuracy: 0.2920\n", 629 | "Epoch 10/50\n", 630 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.8758 - accuracy: 0.3344 - val_loss: 1.7436 - val_accuracy: 0.3987\n", 631 | "Epoch 11/50\n", 632 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7716 - accuracy: 0.3840 - val_loss: 1.7155 - val_accuracy: 0.4107\n", 633 | "Epoch 12/50\n", 634 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7536 - accuracy: 0.3985 - val_loss: 1.7148 - val_accuracy: 0.4077\n", 635 | "Epoch 13/50\n", 636 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7925 - accuracy: 0.3795 - val_loss: 1.8479 - val_accuracy: 0.3360\n", 637 | "Epoch 14/50\n", 638 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7657 - accuracy: 0.3835 - val_loss: 1.7782 - val_accuracy: 0.3720\n", 639 | "Epoch 15/50\n", 640 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.8404 - accuracy: 0.3434 - val_loss: 1.8095 - val_accuracy: 0.3343\n", 641 | "Epoch 16/50\n", 642 | "65094/65094 [==============================] - 39s 600us/sample - loss: 1.8219 - accuracy: 0.3431 - val_loss: 1.8222 - val_accuracy: 0.3450\n", 643 | "Epoch 17/50\n", 644 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7985 - accuracy: 0.3657 - val_loss: 1.6955 - val_accuracy: 0.4177\n", 645 | "Epoch 18/50\n", 646 | "65094/65094 [==============================] - 39s 602us/sample - loss: 1.7112 - accuracy: 0.4165 - val_loss: 1.7584 - val_accuracy: 0.3990\n", 647 | "Epoch 19/50\n", 648 | "65094/65094 [==============================] - 39s 606us/sample - loss: 1.7114 - accuracy: 0.4046 - val_loss: 1.8195 - val_accuracy: 0.3443\n", 649 | "Epoch 20/50\n", 650 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.8404 - accuracy: 0.3466 - val_loss: 1.8112 - val_accuracy: 0.3450\n", 651 | "Epoch 21/50\n", 652 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7253 - accuracy: 0.3978 - val_loss: 1.6797 - val_accuracy: 0.4187\n", 653 | "Epoch 22/50\n", 654 | "65094/65094 [==============================] - 39s 602us/sample - loss: 1.6953 - accuracy: 0.4241 - val_loss: 1.6971 - 
val_accuracy: 0.4050\n", 655 | "Epoch 23/50\n", 656 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7175 - accuracy: 0.4211 - val_loss: 1.7561 - val_accuracy: 0.3780\n", 657 | "Epoch 24/50\n", 658 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.6900 - accuracy: 0.4178 - val_loss: 2.1019 - val_accuracy: 0.3810\n", 659 | "Epoch 25/50\n", 660 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7528 - accuracy: 0.3679 - val_loss: 1.9362 - val_accuracy: 0.2957\n", 661 | "Epoch 26/50\n", 662 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7457 - accuracy: 0.3713 - val_loss: 1.7600 - val_accuracy: 0.3330\n", 663 | "Epoch 27/50\n", 664 | "25088/65094 [==========>...................] - ETA: 23s - loss: 1.7159 - accuracy: 0.3599" 665 | ], 666 | "name": "stdout" 667 | }, 668 | { 669 | "output_type": "error", 670 | "ename": "KeyboardInterrupt", 671 | "evalue": "ignored", 672 | "traceback": [ 673 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 674 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 675 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m512\u001b[0m \u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mvalidation_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_val\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m verbose=1)\n\u001b[0m", 676 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0mvalidation_freq\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_freq\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m steps_name='steps_per_epoch')\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m def evaluate(self,\n", 677 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mmodel_iteration\u001b[0;34m(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 351\u001b[0m \u001b[0;31m# Get outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 352\u001b[0;31m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 353\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 678 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3215\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3216\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3217\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3218\u001b[0m return nest.pack_sequence_as(self._outputs_structure,\n\u001b[1;32m 3219\u001b[0m [x.numpy() for x in outputs])\n", 679 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 556\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 557\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m--> 558\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 559\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 680 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args)\u001b[0m\n\u001b[1;32m 625\u001b[0m \u001b[0;31m# Only need to override the gradient in graph mode and when we have outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 626\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moutputs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 627\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_inference_function\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mctx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 628\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_register_gradient\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 681 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args)\u001b[0m\n\u001b[1;32m 413\u001b[0m attrs=(\"executor_type\", executor_type,\n\u001b[1;32m 414\u001b[0m \"config_proto\", config),\n\u001b[0;32m--> 415\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 416\u001b[0m \u001b[0;31m# Replace empty list with None\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 417\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 682 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 58\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 59\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 683 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 684 | ] 685 | } 686 | ] 687 | }, 688 | { 689 | "metadata": { 690 | "id": "r8_mvDjYL3CX", 691 | "colab_type": "code", 692 | "outputId": "5b436b6c-914a-4787-b9f0-a81b12a2f67b", 693 | "colab": { 694 | "base_uri": "https://localhost:8080/", 695 | "height": 51 696 | } 697 | }, 698 | "cell_type": "code", 699 | "source": [ 700 | "results = model.evaluate(test_data, test_label)\n", 701 | "print(results)" 702 | ], 703 | "execution_count": 21, 704 | "outputs": [ 705 | { 706 | "output_type": "stream", 707 | "text": [ 708 | "7567/7567 [==============================] - 11s 1ms/sample - loss: 1.6580 - accuracy: 0.3986\n", 709 | "[1.6579768257694387, 0.39857274]\n" 710 | ], 711 | "name": "stdout" 712 | } 713 | ] 714 | }, 715 | { 716 | "metadata": { 717 | "id": "VaIioR7EPfig", 718 | "colab_type": "code", 719 | "outputId": "b5bbe828-3ace-4291-f762-5717df7d0bea", 720 | "colab": { 721 | "base_uri": "https://localhost:8080/", 722 | "height": 71 723 | } 724 | }, 725 | "cell_type": "code", 726 | "source": [ 727 | "data_index = 12\n", 728 | "data_words = \" \".join(test_data_words[data_index])\n", 729 | "data_indexes = test_data[data_index]\n", 730 | "print(data_words)\n", 731 | "\n", 732 | "predicted = model.predict([[data_indexes]])\n", 733 | "print(encoder.classes_[np.argmax(predicted)])" 734 | ], 735 | "execution_count": 22, 736 | "outputs": [ 737 | { 738 | "output_type": "stream", 739 | "text": [ 740 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг 
телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн дэмжигч үзэгч волейболын спортын мэргэжилтэн багш нар секцэнд хичээллэгч хүүхдүүд зохион байгуулалтай ирэхээр ялангуяа тэмцээн болох газартай хамгийн ойрхон хануул дүүргийн здтгазар дүүргийнхээ ард иргэд хөдөлмөрчид сургуулийн сурагчид оюутнууд цэргийн албан хаагчид буянтухаа орчимын албан байгууллага хамт олныг идэвхтэй оролцуулах арга хэмжээ авч эхлэжээ олон зуун оюутнууд тэмцээн үзэх боллоо тэмцээний өдрүүдэд нийслэлээс буянтухаагийн спортын ордонг чиглэсэн хүмүүсийн цуваа ихсэх төлөвтэй учир нийслэлд үйл ажиллагаа явуулж байгаа орчим идсийн оюутнууд тэмцээнийг анги сургууль хамт олноороо үзэх сонирхолтой байгаагаа монголын оюутны холбоо биеийн тамирын тэнхимдээ хүсчээ үүний дагуу бсшуяам мох монголын оюутны спортын холбоо мосх ноос тэмцээнийг өдөр бүр гаруй сургуулийн орчим оюутнууд нэгдсэн хуваарийн дагуу үзэх хуваарийг бсшуяны төрийн нарийн бичгийн дарга зохион байгуулах үндэсний хороо збх ны гишүүн ддалайжаргал батлан сургуулиудад албан тоотоор хүргүүлжээ мосхд тэмцээнийг үзэхээр олон арван сургуулиуд оюутны тоогоо өгч бүртгүүлж суудлын хувиарлалтанд орж байгаа ажээ ялангуяа биеийн тамирын мэргэжлийн дээд сургуулийн оюутнууд дадлага хичээлээ тэмцээний үеэр хийхээр хичээлийн хувиараа зохицуулсан нийслэлийн засаг дарга оюутнуудад туслав улаанбаатар хотноо болдог оюутны олон улс тив дэлхийн тэмцээн бүрт нийслэлийн засаг дарга гмөнхбаяр ихээхэн туслалцаа үзүүлэн оюутан залуусаа байнга дэмжин оролцдог ажээ тэрээр тус тэмцээнд оролцохоор бэлтгэж байгаа монголын оюутны шигшээ багийн тамирчидын хоногийн бэлтгэл сургалтын зардлыг хариуцан гаргасан хөрөнгө санхүүгийн хүндрэлтэй байгаа үеэд тэмцээнд бэлтгэж байгаа оюутан тамирчидаа цагаа олж хэрэгцээтэй үеэд дэмжлээ мосхолбоо монголын волейболын холбоо мвх тамирчидынхаа өмнөөс талархал илэрхийлжээ монголын баг тамирчид эрдэнэт хотод оны сарын өдрөөс эхлэн хоногийн бэлтгэл хийснийхээ дараа ийнхүү нийслэлийн засаг даргын туслалцаатайгаар гадаадын тамирчидтай хамт байрлах цэцэг зочид буудалдаа орж бэлтгэл сургуулиалтаа үргэлжүүлэх боломжтой нздтгазраас баг\n", 741 | "спорт\n" 742 | ], 743 | "name": "stdout" 744 | } 745 | ] 746 | } 747 | ] 748 | } -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_05,_Attention.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #05, Attention.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "This notebook's purpose is to reveal attention mechanism by visualizing it.\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish 
(sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "d1d20ba2-495a-4194-d2c4-8376024e8d07", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 360 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "!pip install gensim\n", 53 | "\n", 54 | "import tensorflow as tf\n", 55 | "from tensorflow import keras\n", 56 | "\n", 57 | "import numpy as np\n", 58 | "\n", 59 | "print(tf.__version__)" 60 | ], 61 | "execution_count": 1, 62 | "outputs": [ 63 | { 64 | "output_type": "stream", 65 | "text": [ 66 | "Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)\n", 67 | "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.2.1)\n", 68 | "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.16.2)\n", 69 | "Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.8.1)\n", 70 | "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)\n", 71 | "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)\n", 72 | "Requirement already satisfied: bz2file in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (0.98)\n", 73 | "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.18.4)\n", 74 | "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.9.130)\n", 75 | "Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2.6)\n", 76 | "Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (1.22)\n", 77 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2019.3.9)\n", 78 | "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (3.0.4)\n", 79 | "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.2.0)\n", 80 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.4)\n", 81 | "Requirement already satisfied: botocore<1.13.0,>=1.12.130 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.12.130)\n", 82 | "Requirement already satisfied: docutils>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (0.14)\n", 83 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (2.5.3)\n", 84 | "2.0.0-alpha0\n" 85 | ], 86 | "name": "stdout" 87 | } 88 | ] 89 | }, 90 | { 91 | 
"metadata": { 92 | "id": "smJeJfoo4qcu", 93 | "colab_type": "text" 94 | }, 95 | "cell_type": "markdown", 96 | "source": [ 97 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) preprocessing eats a lot of CPU cycle so it's good idea to cook it before using colab." 98 | ] 99 | }, 100 | { 101 | "metadata": { 102 | "id": "CDayX_Yx3REh", 103 | "colab_type": "code", 104 | "outputId": "4ab773b3-001b-458a-c339-a4128c1dd426", 105 | "colab": { 106 | "base_uri": "https://localhost:8080/", 107 | "height": 340 108 | } 109 | }, 110 | "cell_type": "code", 111 | "source": [ 112 | "import os\n", 113 | "from os.path import exists, join, basename, splitext\n", 114 | "import sys\n", 115 | "\n", 116 | "def download_from_google_drive(file_id, file_name):\n", 117 | " !rm -f ./cookie\n", 118 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 119 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 120 | " confirm_text = confirm_text[0]\n", 121 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 122 | " \n", 123 | "# download eduge pickles\n", 124 | "file_path = 'eduge_pickles'\n", 125 | "if not exists(file_path):\n", 126 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 127 | " rar_file = file_path+\".rar\"\n", 128 | " !unrar x $rar_file" 129 | ], 130 | "execution_count": 2, 131 | "outputs": [ 132 | { 133 | "output_type": "stream", 134 | "text": [ 135 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 136 | " Dload Upload Total Spent Left Speed\n", 137 | "100 388 0 388 0 0 1385 0 --:--:-- --:--:-- --:--:-- 1385\n", 138 | "100 106M 0 106M 0 0 91.5M 0 --:--:-- 0:00:01 --:--:-- 242M\n", 139 | "\n", 140 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 141 | "\n", 142 | "\n", 143 | "Extracting from eduge_pickles.rar\n", 144 | "\n", 145 | "\n", 146 | "Would you like to replace the existing file word_index.pickle\n", 147 | "9178153 bytes, modified on 2019-04-13 01:44\n", 148 | "with a new one\n", 149 | "9178153 bytes, modified on 2019-04-13 01:44\n", 150 | "\n", 151 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 152 | "\n", 153 | "Program aborted\n" 154 | ], 155 | "name": "stdout" 156 | } 157 | ] 158 | }, 159 | { 160 | "metadata": { 161 | "id": "pPHJcnfi4Rzg", 162 | "colab_type": "code", 163 | "colab": {} 164 | }, 165 | "cell_type": "code", 166 | "source": [ 167 | "import pickle\n", 168 | "\n", 169 | "with open('word_index.pickle', 'rb') as handle:\n", 170 | " word_index = pickle.load(handle)\n", 171 | " \n", 172 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 173 | " reversed_word_index = pickle.load(handle)\n", 174 | " \n", 175 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 176 | " eduge_ds = pickle.load(handle)" 177 | ], 178 | "execution_count": 0, 179 | "outputs": [] 180 | }, 181 | { 182 | "metadata": { 183 | "id": "ASRW7ISNnbM-", 184 | "colab_type": "code", 185 | "outputId": "9d003960-fdd8-44c6-9424-37515e774f58", 186 | "colab": { 187 | "base_uri": "https://localhost:8080/", 188 | "height": 207 189 | } 190 | }, 191 | "cell_type": "code", 192 | "source": [ 193 | "# facebook trained word2vec on both commoncrawl and wikipedia. 
So this model should contain enough representation about our mongolian words.\n", 194 | "mongolian_word2vec_download=\"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\"\n", 195 | "if not exists(\"cc.mn.300.bin.gz\"):\n", 196 | " !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 197 | "if exists('cc.mn.300.bin.gz'):\n", 198 | " !gunzip cc.mn.300.bin.gz" 199 | ], 200 | "execution_count": 4, 201 | "outputs": [ 202 | { 203 | "output_type": "stream", 204 | "text": [ 205 | "--2019-04-14 11:03:40-- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 206 | "Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.22.166, 104.20.6.166, 2606:4700:10::6814:16a6, ...\n", 207 | "Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.22.166|:443... connected.\n", 208 | "HTTP request sent, awaiting response... 200 OK\n", 209 | "Length: 2937042399 (2.7G) [application/octet-stream]\n", 210 | "Saving to: ‘cc.mn.300.bin.gz’\n", 211 | "\n", 212 | "cc.mn.300.bin.gz 15%[==> ] 438.51M 12.7MB/s eta 3m 12s ^C\n", 213 | "gzip: cc.mn.300.bin already exists; do you wish to overwrite (y or n)? n\n", 214 | "\tnot overwritten\n" 215 | ], 216 | "name": "stdout" 217 | } 218 | ] 219 | }, 220 | { 221 | "metadata": { 222 | "id": "BqGAauUZpnFz", 223 | "colab_type": "code", 224 | "outputId": "75c0cfc6-12e4-4791-a16a-893a8a6749a2", 225 | "colab": { 226 | "base_uri": "https://localhost:8080/", 227 | "height": 88 228 | } 229 | }, 230 | "cell_type": "code", 231 | "source": [ 232 | "from gensim.models.wrappers import FastText\n", 233 | "\n", 234 | "word2vec_model = FastText.load_fasttext_format('cc.mn.300.bin')" 235 | ], 236 | "execution_count": 5, 237 | "outputs": [ 238 | { 239 | "output_type": "stream", 240 | "text": [ 241 | "WARNING: Logging before flag parsing goes to stderr.\n", 242 | "W0414 11:04:27.822593 140650259199872 ssh.py:33] paramiko missing, opening SSH/SCP/SFTP paths will be disabled. `pip install paramiko` to suppress\n", 243 | "W0414 11:04:28.318554 140650259199872 word2vec.py:573] Slow version of gensim.models.deprecated.word2vec is being used\n" 244 | ], 245 | "name": "stderr" 246 | } 247 | ] 248 | }, 249 | { 250 | "metadata": { 251 | "id": "kkc1iiqJp-CJ", 252 | "colab_type": "code", 253 | "outputId": "3558d2cb-aa96-4b16-8687-a3e8d0516ed3", 254 | "colab": { 255 | "base_uri": "https://localhost:8080/", 256 | "height": 88 257 | } 258 | }, 259 | "cell_type": "code", 260 | "source": [ 261 | "print(word2vec_model.most_similar('монгол'))" 262 | ], 263 | "execution_count": 6, 264 | "outputs": [ 265 | { 266 | "output_type": "stream", 267 | "text": [ 268 | "[('Монгол', 0.6342526078224182), ('монголын', 0.6047513484954834), ('хятад', 0.5558866858482361), ('Монголын', 0.5087883472442627), ('судлалаараа', 0.48851606249809265), ('манай', 0.4853793680667877), ('уйгаржин', 0.4725492596626282), ('угсаатангууд', 0.47093287110328674), ('орос', 0.46463483572006226), ('худам', 0.4609120190143585)]\n" 269 | ], 270 | "name": "stdout" 271 | }, 272 | { 273 | "output_type": "stream", 274 | "text": [ 275 | "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", 276 | " if np.issubdtype(vec.dtype, np.int):\n" 277 | ], 278 | "name": "stderr" 279 | } 280 | ] 281 | }, 282 | { 283 | "metadata": { 284 | "id": "oF6vB3Qnq08I", 285 | "colab_type": "code", 286 | "colab": {} 287 | }, 288 | "cell_type": "code", 289 | "source": [ 290 | "# preparing embedding matrix\n", 291 | "import numpy as np\n", 292 | "\n", 293 | "words_not_found = []\n", 294 | "embed_dim = 300\n", 295 | "embedding_matrix = np.random.uniform(-1, 1, (len(word_index), embed_dim))\n", 296 | "for word, i in word_index.items():\n", 297 | " if i<4:\n", 298 | " continue\n", 299 | " try:\n", 300 | " embedding_vector = word2vec_model[word]\n", 301 | " if (embedding_vector is not None) and len(embedding_vector) > 0:\n", 302 | " embedding_matrix[i] = embedding_vector\n", 303 | " except:\n", 304 | " words_not_found.append(word)\n", 305 | " pass" 306 | ], 307 | "execution_count": 0, 308 | "outputs": [] 309 | }, 310 | { 311 | "metadata": { 312 | "id": "aQAaXWIgsxm9", 313 | "colab_type": "code", 314 | "outputId": "b2c1d495-2ec7-44b5-ad8d-553753195efd", 315 | "colab": { 316 | "base_uri": "https://localhost:8080/", 317 | "height": 34 318 | } 319 | }, 320 | "cell_type": "code", 321 | "source": [ 322 | "print(embedding_matrix.shape)\n", 323 | "#print(embedding_matrix[5])" 324 | ], 325 | "execution_count": 8, 326 | "outputs": [ 327 | { 328 | "output_type": "stream", 329 | "text": [ 330 | "(370794, 300)\n" 331 | ], 332 | "name": "stdout" 333 | } 334 | ] 335 | }, 336 | { 337 | "metadata": { 338 | "id": "XFxd1QGR65VV", 339 | "colab_type": "code", 340 | "colab": {} 341 | }, 342 | "cell_type": "code", 343 | "source": [ 344 | "MAX_LEN = 256\n", 345 | "\n", 346 | "import itertools\n", 347 | "\n", 348 | "for item in eduge_ds:\n", 349 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 350 | ], 351 | "execution_count": 0, 352 | "outputs": [] 353 | }, 354 | { 355 | "metadata": { 356 | "id": "U8PTeX0WCbhR", 357 | "colab_type": "code", 358 | "colab": {} 359 | }, 360 | "cell_type": "code", 361 | "source": [ 362 | "from sklearn.model_selection import train_test_split\n", 363 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 364 | ], 365 | "execution_count": 0, 366 | "outputs": [] 367 | }, 368 | { 369 | "metadata": { 370 | "id": "8mgMCFcgDHH4", 371 | "colab_type": "code", 372 | "colab": {} 373 | }, 374 | "cell_type": "code", 375 | "source": [ 376 | "train_data_words = [i[0] for i in train]\n", 377 | "train_label_words = [i[1] for i in train]\n", 378 | "test_data_words = [i[0] for i in test ]\n", 379 | "test_label_words = [i[1] for i in test ]" 380 | ], 381 | "execution_count": 0, 382 | "outputs": [] 383 | }, 384 | { 385 | "metadata": { 386 | "id": "rrXC7UiuFkCH", 387 | "colab_type": "code", 388 | "colab": {} 389 | }, 390 | "cell_type": "code", 391 | "source": [ 392 | "def encode_news(text):\n", 393 | " return [word_index.get(i, 2) for i in text]\n", 394 | " \n", 395 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 396 | "test_data = [encode_news(sent) for sent in test_data_words ]" 397 | ], 398 | "execution_count": 0, 399 | "outputs": [] 400 | }, 401 | { 402 | "metadata": { 403 | "id": "FV-h_avPEzM1", 404 | "colab_type": "code", 405 | "colab": {} 406 | }, 407 | "cell_type": "code", 408 | "source": [ 409 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 410 | " value=word_index[\"\"],\n", 411 | " padding='post',\n", 412 | " maxlen=MAX_LEN)\n", 413 | "\n", 414 | "test_data = 
keras.preprocessing.sequence.pad_sequences(test_data,\n", 415 | " value=word_index[\"\"],\n", 416 | " padding='post',\n", 417 | " maxlen=MAX_LEN)" 418 | ], 419 | "execution_count": 0, 420 | "outputs": [] 421 | }, 422 | { 423 | "metadata": { 424 | "id": "gDVqmPqxIMid", 425 | "colab_type": "code", 426 | "outputId": "2eef5781-66f9-4357-ef18-4bdf2e5a07e5", 427 | "colab": { 428 | "base_uri": "https://localhost:8080/", 429 | "height": 170 430 | } 431 | }, 432 | "cell_type": "code", 433 | "source": [ 434 | "labels = list(set(test_label_words))\n", 435 | "labels" 436 | ], 437 | "execution_count": 14, 438 | "outputs": [ 439 | { 440 | "output_type": "execute_result", 441 | "data": { 442 | "text/plain": [ 443 | "['эрүүл мэнд',\n", 444 | " 'хууль',\n", 445 | " 'байгал орчин',\n", 446 | " 'улс төр',\n", 447 | " 'боловсрол',\n", 448 | " 'эдийн засаг',\n", 449 | " 'спорт',\n", 450 | " 'технологи',\n", 451 | " 'урлаг соёл']" 452 | ] 453 | }, 454 | "metadata": { 455 | "tags": [] 456 | }, 457 | "execution_count": 14 458 | } 459 | ] 460 | }, 461 | { 462 | "metadata": { 463 | "id": "PBKj3GQqJq29", 464 | "colab_type": "code", 465 | "colab": {} 466 | }, 467 | "cell_type": "code", 468 | "source": [ 469 | "from sklearn.preprocessing import LabelBinarizer\n", 470 | "encoder = LabelBinarizer()\n", 471 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 472 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 473 | ], 474 | "execution_count": 0, 475 | "outputs": [] 476 | }, 477 | { 478 | "metadata": { 479 | "id": "DPq45PN5HZ15", 480 | "colab_type": "code", 481 | "colab": { 482 | "base_uri": "https://localhost:8080/", 483 | "height": 632 484 | }, 485 | "outputId": "9e3057cc-41e3-4383-aa1c-7eca6fb7d85c" 486 | }, 487 | "cell_type": "code", 488 | "source": [ 489 | "class Attention(keras.Model):\n", 490 | " def __init__(self, units):\n", 491 | " super(Attention, self).__init__()\n", 492 | " self.W1 = keras.layers.Dense(units)\n", 493 | " self.W2 = keras.layers.Dense(units)\n", 494 | " self.V = keras.layers.Dense(1)\n", 495 | " def call(self, features, hidden):\n", 496 | " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", 497 | " score = tf.nn.tanh(self.W1(features)+self.W2(hidden_with_time_axis))\n", 498 | " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", 499 | " context_vector = attention_weights*features\n", 500 | " context_vector = tf.reduce_sum(context_vector, axis=1)\n", 501 | " return context_vector, attention_weights\n", 502 | "\n", 503 | "attention=Attention(64)\n", 504 | " \n", 505 | "vocab_size = len(word_index)\n", 506 | "\n", 507 | "sequence_input = keras.layers.Input(shape=(MAX_LEN,), dtype='int32')\n", 508 | "embedded_sequences = keras.layers.Embedding(\n", 509 | " vocab_size, \n", 510 | " embed_dim , \n", 511 | " weights=[embedding_matrix], \n", 512 | " input_length=MAX_LEN, \n", 513 | " trainable=False)(sequence_input)\n", 514 | "lstm = keras.layers.Bidirectional(\n", 515 | " keras.layers.LSTM(\n", 516 | " 64, # rnn cell size\n", 517 | " dropout = 0.3,\n", 518 | " return_sequences = True,\n", 519 | " return_state = True,\n", 520 | " recurrent_activation = 'relu',\n", 521 | " recurrent_initializer = 'glorot_uniform' \n", 522 | " )\n", 523 | " )(embedded_sequences)\n", 524 | "lstm, forward_h, forward_c, backward_h, backward_c = keras.layers.Bidirectional(\n", 525 | " keras.layers.LSTM(\n", 526 | " 64, # rnn cell size\n", 527 | " dropout = 0.2,\n", 528 | " return_sequences = True,\n", 529 | " return_state = True,\n", 530 | " 
recurrent_activation = 'relu',\n", 531 | " recurrent_initializer = 'glorot_uniform'\n", 532 | " \n", 533 | " )\n", 534 | ")(lstm)\n", 535 | "state_h = keras.layers.Concatenate()([forward_h, backward_h])\n", 536 | "state_c = keras.layers.Concatenate()([forward_c, backward_c])\n", 537 | "context_vector, attention_weights = attention(lstm, state_h)\n", 538 | "\n", 539 | "preds = keras.layers.Dense(len(labels), activation='sigmoid')(context_vector)\n", 540 | "\n", 541 | "model = keras.models.Model(inputs=sequence_input, outputs=preds)\n", 542 | "model.summary()" 543 | ], 544 | "execution_count": 16, 545 | "outputs": [ 546 | { 547 | "output_type": "stream", 548 | "text": [ 549 | "W0414 11:06:00.272124 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 550 | "W0414 11:06:00.289174 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 551 | "W0414 11:06:00.382109 140650259199872 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:4081: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", 552 | "Instructions for updating:\n", 553 | "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n", 554 | "W0414 11:06:01.074629 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 555 | "W0414 11:06:01.078676 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. 
Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n" 556 | ], 557 | "name": "stderr" 558 | }, 559 | { 560 | "output_type": "stream", 561 | "text": [ 562 | "Model: \"model\"\n", 563 | "__________________________________________________________________________________________________\n", 564 | "Layer (type) Output Shape Param # Connected to \n", 565 | "==================================================================================================\n", 566 | "input_1 (InputLayer) [(None, 256)] 0 \n", 567 | "__________________________________________________________________________________________________\n", 568 | "embedding (Embedding) (None, 256, 300) 111238200 input_1[0][0] \n", 569 | "__________________________________________________________________________________________________\n", 570 | "bidirectional (Bidirectional) [(None, 256, 128), ( 186880 embedding[0][0] \n", 571 | "__________________________________________________________________________________________________\n", 572 | "bidirectional_1 (Bidirectional) [(None, 256, 128), ( 98816 bidirectional[0][0] \n", 573 | " bidirectional[0][1] \n", 574 | " bidirectional[0][2] \n", 575 | " bidirectional[0][3] \n", 576 | " bidirectional[0][4] \n", 577 | "__________________________________________________________________________________________________\n", 578 | "concatenate (Concatenate) (None, 128) 0 bidirectional_1[0][1] \n", 579 | " bidirectional_1[0][3] \n", 580 | "__________________________________________________________________________________________________\n", 581 | "attention (Attention) ((None, 128), (None, 16577 bidirectional_1[0][0] \n", 582 | " concatenate[0][0] \n", 583 | "__________________________________________________________________________________________________\n", 584 | "dense_3 (Dense) (None, 9) 1161 attention[0][0] \n", 585 | "==================================================================================================\n", 586 | "Total params: 111,541,634\n", 587 | "Trainable params: 303,434\n", 588 | "Non-trainable params: 111,238,200\n", 589 | "__________________________________________________________________________________________________\n" 590 | ], 591 | "name": "stdout" 592 | } 593 | ] 594 | }, 595 | { 596 | "metadata": { 597 | "id": "cAgP1KlqHu2F", 598 | "colab_type": "code", 599 | "colab": {} 600 | }, 601 | "cell_type": "code", 602 | "source": [ 603 | "model.compile(optimizer='adam',\n", 604 | " loss='categorical_crossentropy',\n", 605 | " metrics=['accuracy'])" 606 | ], 607 | "execution_count": 0, 608 | "outputs": [] 609 | }, 610 | { 611 | "metadata": { 612 | "id": "ZPw8roFQKrHm", 613 | "colab_type": "code", 614 | "colab": { 615 | "base_uri": "https://localhost:8080/", 616 | "height": 51 617 | }, 618 | "outputId": "930796b7-13e0-4fce-f1e5-f7fd72030d92" 619 | }, 620 | "cell_type": "code", 621 | "source": [ 622 | "print(len(train_data), len(train_label))\n", 623 | "print(len(test_data ), len(test_label) )\n", 624 | "\n", 625 | "partial_index = 3000\n", 626 | "\n", 627 | "x_val = train_data[:partial_index]\n", 628 | "partial_x_train = train_data[partial_index:]\n", 629 | "\n", 630 | "y_val = train_label[:partial_index]\n", 631 | "partial_y_train = train_label[partial_index:]" 632 | ], 633 | "execution_count": 18, 634 | "outputs": [ 635 | { 636 | "output_type": "stream", 637 | "text": [ 638 | "68094 68094\n", 639 | "7567 7567\n" 640 | ], 641 | "name": "stdout" 642 | } 643 | ] 644 | }, 645 | { 646 | "metadata": { 647 | "id": "iSTB4--RKacs", 648 | "colab_type": "code", 649 | 
"colab": { 650 | "base_uri": "https://localhost:8080/", 651 | "height": 1101 652 | }, 653 | "outputId": "19106331-e667-42c7-cdf8-88c1f7e4b52d" 654 | }, 655 | "cell_type": "code", 656 | "source": [ 657 | "epochs = 50\n", 658 | "history = model.fit(partial_x_train,\n", 659 | " partial_y_train,\n", 660 | " epochs=epochs ,\n", 661 | " batch_size=128 ,\n", 662 | " validation_data=(x_val, y_val),\n", 663 | " verbose=1)" 664 | ], 665 | "execution_count": 19, 666 | "outputs": [ 667 | { 668 | "output_type": "stream", 669 | "text": [ 670 | "Train on 65094 samples, validate on 3000 samples\n", 671 | "Epoch 1/50\n", 672 | "27904/65094 [===========>..................] - ETA: 46:44 - loss: 2.1963 - accuracy: 0.0786" 673 | ], 674 | "name": "stdout" 675 | }, 676 | { 677 | "output_type": "error", 678 | "ename": "KeyboardInterrupt", 679 | "evalue": "ignored", 680 | "traceback": [ 681 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 682 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 683 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m128\u001b[0m \u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mvalidation_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_val\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m verbose=1)\n\u001b[0m", 684 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0mvalidation_freq\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_freq\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m steps_name='steps_per_epoch')\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m def evaluate(self,\n", 685 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mmodel_iteration\u001b[0;34m(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 351\u001b[0m \u001b[0;31m# Get outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 352\u001b[0;31m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 353\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 686 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3215\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3216\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3217\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3218\u001b[0m return nest.pack_sequence_as(self._outputs_structure,\n\u001b[1;32m 3219\u001b[0m [x.numpy() for x in outputs])\n", 687 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 556\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 557\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m--> 558\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 559\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 688 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args)\u001b[0m\n\u001b[1;32m 625\u001b[0m \u001b[0;31m# Only need to override the gradient in graph mode and when we have outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 626\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moutputs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 627\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_inference_function\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mctx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 628\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_register_gradient\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 689 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args)\u001b[0m\n\u001b[1;32m 413\u001b[0m attrs=(\"executor_type\", executor_type,\n\u001b[1;32m 414\u001b[0m \"config_proto\", config),\n\u001b[0;32m--> 415\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 416\u001b[0m \u001b[0;31m# Replace empty list with None\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 417\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 690 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 58\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 59\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 691 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 692 | ] 693 | } 694 | ] 695 | }, 696 | { 697 | "metadata": { 698 | "id": "r8_mvDjYL3CX", 699 | "colab_type": "code", 700 | "colab": { 701 | "base_uri": "https://localhost:8080/", 702 | "height": 51 703 | }, 704 | "outputId": "5e95c9f4-570b-41e4-b129-daccfa733e37" 705 | }, 706 | "cell_type": "code", 707 | "source": [ 708 | "results = model.evaluate(test_data, test_label)\n", 709 | "print(results)" 710 | ], 711 | "execution_count": 20, 712 | "outputs": [ 713 | { 714 | "output_type": "stream", 715 | "text": [ 716 | "7567/7567 [==============================] - 229s 30ms/sample - loss: 2.1969 - accuracy: 0.0650\n", 717 | "[2.1969342474674667, 0.06501916]\n" 718 | ], 719 | "name": "stdout" 720 | } 721 | ] 722 | }, 723 | { 724 | "metadata": { 725 | "id": "VaIioR7EPfig", 726 | "colab_type": "code", 727 | "colab": { 728 | "base_uri": "https://localhost:8080/", 729 | "height": 71 730 | }, 731 | "outputId": "44e8c4e6-a27f-4bbe-ce29-4a21714871a3" 732 | }, 733 | "cell_type": "code", 734 | "source": [ 735 | "data_index = 12\n", 736 | "data_words = \" \".join(test_data_words[data_index])\n", 737 | "data_indexes = test_data[data_index]\n", 738 | "print(data_words)\n", 739 | "\n", 740 | "predicted = model.predict([[data_indexes]])\n", 741 | "print(encoder.classes_[np.argmax(predicted)])" 742 | ], 743 | "execution_count": 21, 744 | "outputs": [ 745 | { 746 | "output_type": "stream", 747 | "text": [ 748 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг 
телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн\n", 749 | "эрүүл мэнд\n" 750 | ], 751 | "name": "stdout" 752 | } 753 | ] 754 | } 755 | ] 756 | } -------------------------------------------------------------------------------- /old_stuffs/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | .static_storage/ 57 | .media/ 58 | local_settings.py 59 | 60 | # Flask stuff: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy stuff: 65 | .scrapy 66 | 67 | # Sphinx documentation 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # Jupyter Notebook 74 | .ipynb_checkpoints 75 | 76 | # pyenv 77 | .python-version 78 | 79 | # celery beat schedule file 80 | celerybeat-schedule 81 | 82 | # SageMath parsed files 83 | *.sage.py 84 | 85 | # Environments 86 | .env 87 | .venv 88 | env/ 89 | venv/ 90 | ENV/ 91 | env.bak/ 92 | venv.bak/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | .spyproject 97 | 98 | # Rope project settings 99 | .ropeproject 100 | 101 | # mkdocs documentation 102 | /site 103 | 104 | # mypy 105 | .mypy_cache/ 106 | 107 | .vscode 108 | model.bin 109 | quotes.json 110 | ids_matrix.npy 111 | 112 | corpuses/politics 113 | corpuses/economy 114 | corpuses/society 115 | corpuses/health 116 | corpuses/world 117 | corpuses/technology 118 | 119 | temp_corpuses/ 120 | tensorboard/ 121 | 122 | models/lstm 123 | models/bilstm 124 | 125 | djangoapp/db.sqlite3 126 | 127 | pretrained_word2vec/ -------------------------------------------------------------------------------- /old_stuffs/README.md: -------------------------------------------------------------------------------- 1 | Mongolian text classifier in tensorflow. 2 | 3 | # STEPS 4 | 5 | - Run spider in order to collect corpuses and labels from ikon.mn 6 | > scrapy runspider ikon_mn_scrape.py 7 | 8 | - Download corpus from here, let's respect ikon.mn, scraping can be troubling. 
9 | > https://drive.google.com/file/d/14NvUkqZRapivmiWc2WOOwIEu9UWyYnnJ/view?usp=sharing 10 | 11 | - Create word2vec from all files inside 'corpuses' directory 12 | > python3 clear_create_word2vec.py 13 | 14 | - Convert word2vec file to ids matrix as a numpy file format in order to use with tensorflow 15 | > python3 numpy_embedding_matrix_tf.py 16 | 17 | - Use embedding matrix with tensorflow in eager mode 18 | > python3 convert_text_to_seqvector_through_embedmatrix.py 19 | 20 | - Prepare training and testing dataset 21 | > python3 prepare_trainingset.py 22 | 23 | - Train lstm recurrent neural network for news classification 24 | > python3 training_bilstm_rnn.py 25 | 26 | > python3 training_lstm_rnn.py 27 | 28 | - Freeze trained checkpoints to servable tf model, iteration number is depends on your trained result, see models/bilstm folder after training 29 | > python3 freeze_tf_model.py --name lstm --iteration 1000 30 | 31 | > python3 freeze_tf_model.py --name bilstm --iteration 3000 32 | 33 | - Start classifier RPC server 34 | > python3 use_freezed_model_rpc.py 35 | 36 | - Start Django to see web interface 37 | > cd djangoapp 38 | 39 | > python manage.py runserver 40 | 41 | 42 | # DONE 43 | - Write some scrapers for ikon.mn 44 | - Prepare training texts with its labels, label should be a type of news. For example: politics, economy, society, health, world, technology etc 45 | - Train lstm classifier, also other ones like bibirectional lstm 46 | - Try to classify text from other sites, for example: news.gogo.mn, write some web backend interface, maybe I can use django 2.0 47 | - Implement testing dataset evaluation metrics 48 | 49 | # IN PROGRESS 50 | 51 | # TODO 52 | - Evaluate current model on 20 percent of the dataset as testnig 53 | - Use pretrained word2vec weights from facebook's fasttext https://fasttext.cc/docs/en/crawl-vectors.html 54 | - Use transfer learning techniques such like ULMFiT, ELMo embedding etc... and compare results 55 | - Implement stacked lstm 56 | - Implement stacked bidirectional lstm 57 | - Implement stacked bidirectional lstm with dropouts 58 | - Implement previous ones with batch normalization 59 | - Compare testing performances 60 | - Handle very long text through Truncated BPTT 61 | - Handle gradient vanishing issue with gradient clipping 62 | - Add attention to the lstms 63 | - Use an IndRNN and compare the results to previous ones 64 | 65 | # RESOURCE 66 | 67 | - ImageNet moment in NLP 68 | > https://thegradient.pub/nlp-imagenet/ 69 | 70 | - checkpointing, save, restore and freeze tensorflow models 71 | > http://cv-tricks.com/tensorflow-tutorial/save-restore-tensorflow-models-quick-complete-tutorial/ 72 | > https://nathanbrixius.wordpress.com/2016/05/24/checkpointing-and-reusing-tensorflow-models/ 73 | > http://cv-tricks.com/how-to/freeze-tensorflow-models/ 74 | 75 | - develop word embeddings python gensim 76 | > https://machinelearningmastery.com/develop-word-embeddings-python-gensim/ 77 | 78 | - how to clean text for machine learning with python 79 | > https://machinelearningmastery.com/clean-text-machine-learning-python/ 80 | 81 | - using gensim word2vec embeddings in tensorflow 82 | > http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/ 83 | 84 | - perform sentimental analysis with lstms using tensorflow 85 | > https://www.oreilly.com/learning/perform-sentiment-analysis-with-lstms-using-tensorflow 86 | 87 | - What does tf.nn.embedding_lookup function do? 
88 | > https://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do 89 | 90 | - How to One Hot encode categorical sequence data in python 91 | > https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/ 92 | > https://www.tensorflow.org/api_docs/python/tf/one_hot 93 | 94 | - How to crawl the web politely with scrapy 95 | > https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/ 96 | 97 | -------------------------------------------------------------------------------- /old_stuffs/clear_create_word2vec.py: -------------------------------------------------------------------------------- 1 | from clear_text_to_array import * 2 | from gensim.models import Word2Vec 3 | import glob, json, re 4 | 5 | # корпусыг ачаалах 6 | all_corpuses = "" 7 | 8 | max_word_count = 0 9 | max_word_content = "" 10 | max_word_url = "" 11 | file_count = 0 12 | all_words = 0 13 | 14 | print("reading all corpuses, please wait for a little while...") 15 | 16 | for filename in glob.iglob('corpuses/**/*.txt', recursive=True): 17 | with open(filename, 'r', encoding="utf8") as f: 18 | json_content = json.load(f) 19 | all_corpuses = all_corpuses + " " +json_content['title'] + ". \n "+json_content["body"]+". \n " 20 | 21 | file_count = file_count + 1 22 | body_content = json_content['body'] 23 | count = len(re.findall(r'\w+', body_content)) 24 | all_words = all_words + count 25 | if count > max_word_count: 26 | max_word_count = count 27 | max_word_content = body_content 28 | max_word_url = json_content['url'] 29 | 30 | average_words_per_news = all_words/file_count 31 | 32 | print("Reading is done. Here is some stats" ) 33 | print("------------------------------------" ) 34 | print("Total file count : ", file_count ) 35 | print("Average words per news : ", average_words_per_news) 36 | print("Total word count : ", all_words ) 37 | print("Maximum word count : ", max_word_count ) 38 | print("Maximum word count url : ", max_word_url ) 39 | print("------------------------------------" ) 40 | 41 | 42 | 43 | print("converting to the sentence array...") 44 | all_corpuses = all_corpuses + ".АННОУНҮГ." 
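# ".АННОУНҮГ." is appended once as a placeholder token so the word2vec vocabulary
# always contains an "unknown word" entry; training_helpers.py later resolves it via
# wordtoken_to_id(word2vec, "анноунүг") whenever an out-of-vocabulary token is seen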
45 | sentences = clear_text_to_array(all_corpuses) 46 | print('done.') 47 | 48 | print("starting to create word2vec...") 49 | model = Word2Vec(sentences, min_count=1) 50 | model.save('model.bin') 51 | print('word2vec model is saved as gensim file format.') 52 | 53 | total_unique_word_count = len(model.wv.vocab) 54 | print("------------------------------------" ) 55 | print("Unique word count : ", total_unique_word_count) 56 | print("------------------------------------" ) 57 | 58 | #print(words) 59 | #print(model['дээд']) 60 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/clear_text_to_array.py: -------------------------------------------------------------------------------- 1 | from nltk import sent_tokenize 2 | from nltk.tokenize import word_tokenize 3 | from nltk import stem 4 | import string 5 | from mongolianstopwords import * 6 | from stemmer import * 7 | 8 | def clear_text_to_array(input_text): 9 | text_sentences = sent_tokenize(input_text) 10 | stemmer = Stemmer.instance().stemmer 11 | sentences = [] 12 | for text_sentence in text_sentences: 13 | # өгүүлбэрийн текстийг үгүүд болгож хувиргах 14 | tokens = word_tokenize(text_sentence) 15 | # том үсгүүдийг болиулах 16 | tokens = [w.lower() for w in tokens] 17 | # үг бүрээс тэмдэгтүүдийг хасах 18 | table = str.maketrans('', '', string.punctuation) 19 | stripped = [w.translate(table) for w in tokens] 20 | # текст бус үгүүдийг хасах 21 | words = [word for word in stripped if word.isalpha()] 22 | # stopword уудыг хасах 23 | stop_words = set(stopwordsmn) 24 | words = [w for w in words if not w in stop_words] 25 | # stemming 26 | words = [stemmer.stem(w) for w in words] 27 | sentences.append(words) 28 | return sentences -------------------------------------------------------------------------------- /old_stuffs/convert_text_to_seqvector_through_embedmatrix.py: -------------------------------------------------------------------------------- 1 | from clear_text_to_array import * 2 | from wordtoken_to_id import * 3 | import numpy as np 4 | import tensorflow as tf 5 | import tensorflow.contrib.eager as tfe 6 | from gensim.models import Word2Vec 7 | 8 | tfe.enable_eager_execution() 9 | 10 | word2vec = Word2Vec.load('model.bin') 11 | ids_matrix = np.load('ids_matrix.npy') 12 | 13 | input_sentence = "хоёр өдрийн уулзалтын үр дүнд дээд хэмжээний элчээ илгээсэн юм." 
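# sample Mongolian sentence (roughly: "as a result of the two-day meeting,
# a high-level envoy was sent") used below to walk a sentence through the
# token -> vocabulary id -> embedding vector pipeline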
14 | 15 | sentence_array = clear_text_to_array(input_sentence)[0] 16 | 17 | print("---------------------") 18 | first_word = sentence_array[0] 19 | second_word = sentence_array[1] 20 | last_word = sentence_array[-1] 21 | first_index = word2vec.wv.vocab[first_word ].index 22 | second_index = word2vec.wv.vocab[second_word].index 23 | last_index = word2vec.wv.vocab[last_word ].index 24 | print("эхний үг : ", first_word , ", index : ", first_index ) 25 | print("хоёрдугаар үг : ", second_word, ", index : ", second_index) 26 | print("сүүлийн үг : ", last_word , ", index : ", last_index ) 27 | print("нийт үгсийн тоо : ", len(word2vec.wv.vocab)) 28 | print("---------------------") 29 | #print(word2vec.wv[last_word]) 30 | if (np.array_equal(ids_matrix[last_index], word2vec.wv[last_word])): 31 | print("YES, conversion to id sequence can be implemented through gensim word2vec object.") 32 | else: 33 | print("NO") 34 | 35 | print("---------------------") 36 | print("Өгүүлбэр ") 37 | print(sentence_array) 38 | print("---------------------") 39 | 40 | # converting token sequence into sequence of ids 41 | sentence_in_tokenids = [] 42 | for token in sentence_array: 43 | token_id = wordtoken_to_id(word2vec, token) 44 | sentence_in_tokenids.append(token_id) 45 | print("id нуудын жагсаалт") 46 | print(sentence_in_tokenids) 47 | 48 | # trying to convert sequence of vectors through tensorflow embedding lookup stuff. 49 | embeddings = tf.constant(ids_matrix) 50 | ids = tf.constant(sentence_in_tokenids) 51 | sequence_vectors = tf.nn.embedding_lookup(embeddings, ids) 52 | print("үгэн векторуудын жагсаалт тензор хэлбэрээр:") 53 | print(sequence_vectors) -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/economy_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Д.Сумъяабазар уул уурхайн сайд нарын дээд хэмжээний уулзалтад оролцлоо 2 | 3 | Уул уурхай, хүнд үйлдвэрийн сайд Д.Сумъяабазар Олон улсын уул уурхайн сайд нарын гурав дахь удаагийн дээд хэмжээний уулзалтад оролцлоо. Торонто хотноо зохион байгуулагдаж буй Олон улсын хайгуулч, олборлогчдын чуулга уулзалтын үеэр тохиосон уг дээд хэмжээний уулзалтад Перу, ОХУ, Чили, Энэтхэг, Австрали, Канад, Аргентин, Ирланд, Саудын Араб зэрэг 28 орны уул уурхайн асуудал эрхэлсэн сайд нар оролцов. 4 | 5 | Ирээдүйд уул уурхайн салбарыг хөгжүүлэх, уул уурхайн эдийн засаг, нийгмийн өсөлтөд оруулах хувь нэмрийг нэмэгдүүлэхэд Засгийн газар, олон нийт, уул уурхайн компаниудын хоорондын “итгэлцэл” чухал болохыг тус уулзалт онцоллоо. 6 | 7 | Хайгуул, ашиглалтын салбарт итгэлцэл бий болгох, нэгэнт бий болсон итгэлцлийг хадгалахад Засгийн газрууд, салбарын байгууллагууд, ТББ-ууд болон орон нутгийн иргэд бүгд чухал үүрэг гүйцэтгэдэг. Эдгээр оролцогч талууд уул уурхайн салбарт итгэлцэл бий болгоход цаг ямагт хамтран ажиллах хүсэл эрмэлзэлтэй байх ёстой. Энэ нь ирээдүйн уул уурхайн салбарыг бүтээх, эрдэс баялгийн арвин нөөцтэй орнуудын эдийн засгийн өсөлтийг хангах түлхүүр болно хэмээн уулзалт оролцогчид санал нэгдлээ. 8 | 9 | Мөн Засгийн газар, орон нутгийн иргэд, уул уурхайн компаниудын хоорондын харилцаа буурч, улам бүр бэрхшээлтэй болж байгаа тухай онцлон дурдсан. Үл итгэлцэл нь хөрөнгө оруулалтад хамгийн том саад болж байгааг ухамсарлаж, итгэлцэл, нийтлэг зорилгод үндэслэсэн хамтын ажиллагааны шинэ загвар бий болгох хэрэгцээ, шаардлага байгааг уулзалтад оролцогчид хүлээн зөвшөөрлөө. 
10 | 11 | Уул уурхай, хүнд үйлдвэрийн сайд Д.Сумъяабазарын Торонтод хийж буй ажлын айлчлал үргэлжилж байна. -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/health_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Хатгалгааны эсрэг вакциныг бүрэн хийлгэвэл 10-15 жилийн дархлаа тогтоно. 2 | 3 | Энэ сарын 1-нээс эхлэн нийслэлийн бүх дүүргийн 2 сартай хүүхдүүдийг уушгины хатгалгаа буюу пневмококкийн эсрэг 13 цент вакцинаар дархлаажуулж эхэллээ. 4 | 5 | Эхний тунг хоёр сартай, хоёр дахь тунг 4 сартай, гурав дахь тунг 9 сартай хүүхдэд тус тус хийнэ. Одоогийн байдлаар вакцинжуулалт 40 хувьтай үргэлжилж байна. 6 | 7 | Эрүүл мэндийн яамны мэдээлснээр энэ онд 38 мянган хүүхдийг 3 удаагийн тунгаар дархлаажуулахад шаардлагатай 110 мянган хүн/тун вакциныг НҮБ-ын Хүүхдийн санд захиалж, 35 мянган хүн/тунг хүлээж авсан бөгөөд үлдсэн 75 мянган хүн/тун вакциныг хоёрдугаар улиралд багтаан авах аж. 8 | 9 | Вакцины зах зээлийн үнэ нэг хүн/тун нь 13-15 ам.доллар байдаг юм байна. Харин НҮБ-ын Хүүхдийн сангийн хөнгөлөлттэй үнэ болох 3,5 ам.доллараар худалдан авч байна. 10 | 11 | Уг вакцины гурван тунг бүрэн авсан хүүхэд уушгины хатгаа өвчний эсрэг 10-15 жилийн дархлаа тогтоно. 12 | 13 | 2019 онд улсын хэмжээнд 74,3 мянган хүүхдийг 3 удаа дархлаажуулахад 202,700 хүн/тун вакцин, дагалдах хэрэгсэл, тээврийн зардалд 2 тэрбум төгрөг шаардлагатай. 14 | 15 | Өнөөдрийн байдлаар хоёр сартай 520 хүүхэд вакцинд хамрагдаад байна. 16 | 17 | С.Отгонжаргал -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/politics_news_ikon_mn.txt: -------------------------------------------------------------------------------- 1 | УИХ-ын дарга М.Энхболд БНТУ-ын Ерөнхийлөгч Режеп Тайип Эрдоанд бараалхлаа. 2 | 3 | Бүгд Найрамдах Турк Улсад албан ёсны айлчлал хийж буй Монгол Улсын Их Хурлын дарга М.Энхболд тус улсын Ерөнхийлөгч Режеп Тайип Эрдоанд бараалхлаа. 4 | 5 | Нэг цаг гаруй үргэлжилсэн уулзалт элэгсэг дотно, нөхөрсөг, илэн далангүй нөхцөл байдалд болж, хоёр орны харилцаа, хамтын ажиллагааны бүхий л салбарыг хөндөн ярилцав. 6 | 7 | Улсын Их Хурлын дарга М.Энхболд уулзалтын эхэнд “Эрхэмсэг ноён Ерөнхийлөгч Тантай Монгол Улсын Их Хурлын даргын хувиар дахин уулзаж байгаадаа баяртай байна. Өмнө айлчилж байсан үетэй харьцуулахад Турк Улсад бүтээн байгуулалт ихээр хийгдэж, эдийн засаг, нийгмийн амьдралд томоохон өөрчлөлт гарчээ. Энэ бүхэнд ноён Ерөнхийлөгч Таны манлайлал чухал үүрэг гүйцэтгэж байгааг тэмдэглэхийг хүсэж байна” гэлээ. 8 | 9 | Тэрбээр Монгол Улсын Ерөнхий сайдаар ажиллаж байхдаа тухайн үеийн Ерөнхий сайд Режеп Тайип Эрдоаны урилгаар 2006 онд, нийслэлийн Засаг дарга бөгөөд Улаанбаатар хотын захирагчийн хувьд 2003 онд Анкара хотод айлчилж, ноён Режеп Тайип Эрдоан 2005, 2013 онд Ерөнхий сайдын хувиар Монгол Улсад айлчилж байсан, хуучны танил, дотнын нөхөд юм. 10 | 11 | Улсын Их Хурлын дарга М.Энхболд олон улсын терроризмын эсрэг тэмцлийн тэргүүн эгнээнд Турк зогсож, ачааны хүндийг үүрч байгааг Монгол Улс ойлгон дэмжиж байдгийг тэмдэглэв. Түүнчлэн хоёр орны харилцааг цаашид эдийн засгийн агуулгаар баяжуулж, нөөц бололцоог бүрэн дүүрэн ашиглах, үүний тулд тогтонги байдалд байгаа худалдаа, эдийн засгийн хамтын ажиллагааг идэвхжүүлэх, тухайлбал хөдөө аж ахуй, аялал жуулчлал, хөнгөн үйлдвэрлэл, хот төлөвлөлт, барилга, боловсрол, соёл, урлаг, эрүүл мэнд зэрэг салбарт өргөжүүлэн хөгжүүлэхэд бэлэн байгаагаа нотоллоо. 
Эдгээр хамтын ажиллагааны үндэс нь эцэг дээдсээс уламжилсан хоёр орны угсаа гарлын түүхэн хэлхээ холбоонд байдгийг тэрбээр онцлоод, хувийн хэвшлийнхэн, бизнес эрхлэгчдийн шууд хамтын ажиллагаа нэн чухал байгааг тэмдэглэлээ. 12 | 13 | 14 | Эрхэмсэг ноён Р.Т.Эрдоан Туркийн хөрөнгө оруулалтыг сонирхсон салбарт чиглүүлэхэд тус дөхөм үзүүлэхэд бэлэн байгаагаа нотлов. Тэрбээр Монгол Улс, Монголын ард түмнийг бид ах дүүсээ хэмээн хүндэтгэдэг хэмээн онцлоод, боловсрол, иргэний агаарын тээвэр, хот байгуулалт, хөрөнгө оруулалтын зэрэг салбарт хамтран ажиллах зарим саналыг тавив. 15 | 16 | Тухайлбал, Орхоны хөндийг түшиглэсэн аялал жуулчлалын цогцолбор байгуулах талаарх Улсын Их Хурлын дарга М.Энхболдын саналыг дэмжиж байна гэв. Түүнчлэн Истанбул хотын даргаар ажиллаж байхдаа нийтийн тээврийн асуудлыг хэрхэн шийдэж байсан туршлагаасаа хуваалцаж, энэ чиглэлээр хамтран ажиллах боломжтой гэдгээ илэрхийллээ. Мөн хөдөө аж ахуйн салбар, түүхий дотор мах, махан бүтээгдэхүүний импорт хийх боломжийг судалж үзэхээ амлалаа. 17 | 18 | Улсын Их Хурлын дарга М.Энхболд ноён Р.Т.Эрдоаны тавьсан Монголд үйл ажиллагаа явуулж буй турк сургуулиудын асуудлыг зохистой шийдвэрлэх саналыг дэмжиж байгаагаа тэмдэглэж, холбогдох байгууллагууд хоёр талын эрх ашигт нийцсэн, тохиромжтой шийдлийг олно гэдэгт итгэж байна гэлээ. 19 | 20 | М.Энхболд, Р.Т.Эрдоан нар ирэх онд тохиох, хоёр орны хооронд дипломат харилцаа тогтоосны 50 жилийн ойг хамтын ажиллагаагаа өргөжүүлж, илүү өндөр түвшинд хүргэсэн, тодорхой томоохон үр дүн гаргасан амжилтаар угтахын тулд энэ өдрөөс эхлэн хоёр тал хичээн чармайх нь чухал гэдэгт санал нэгтэй байгаагаа тэмдэглэлээ. 21 | 22 | Эрхэмсэг ноён Р.Т.Эрдоан БНТУ-ын Ерөнхийлөгчийн хувьд Монгол Улс, Монголын ард түмэнтэй улам илүү дотно харилцаж, ах дүү ёсоор туслан дэмжиж, үнэнч нөхөрлөлийн бэлгэдэл болохуйц хамтын ажиллагаа өрнүүлэхэд хэзээд бэлэн байх болно гэдгээ илэрхийлэв. 23 | 24 | Улсын Их Хурлын дарга М.Энхболдод мөн өдөр БНТУ-ын нийслэл Анкара хотын дарга Мустафа Туна бараалхсан юм. Энэ үеэр Улсын Их Хурлын дарга М.Энхболд Улаанбаатар хотын даргын хувиар 2003 онд Анкарад айлчилж, Улаанбаатар-Анкара хотын хооронд “Ах дүү хотуудын харилцаа тогтоох” тунхаглалд гарын үсэг зурж, 2004 онд Улаанбаатар хотын соёлын өдрүүдийг тус хотод зохион байгуулж байснаа дурслаа. 25 | 26 | Ноён Мустафа Туна өмнө нь Анкара хотын Синжан дүүргийн Засаг даргаар ажиллаж байхдаа нийслэлийн Чингэлтэй дүүрэгтэй хамтын ажиллагаа тогтоож, айлчлалын бүрэлдэхүүнд багтан яваа Улсын Их Хурлын гишүүн Д.Ганболдыг тус дүүргийн Засаг дарга байхад элэгсэг дотноор хамтран ажиллаж, ажил хэргийн анд нөхөд болсноо тэмдэглэлээ. 27 | 28 | Улсын Их Хурлын дарга М.Энхболд ноён Мустафа Тунаг БНТУ-ын нийслэл Анкара хотын даргаар ажиллах болсонд нь баяр хүргэж, ажлын амжилт хүсээд, Улаанбаатар хотод зочилж, хоёр хотын хамтын ажиллагааг өргөжүүлэхэд онцгой анхаарна гэдэгт итгэж байгаагаа илэрхийлэв. Тэрбээр Анкара хотын бүтээн байгуулалтын туршлагаас монгол анд нөхөдтэйгээ хуваалцаж, Улаанбаатар хотын агаар, орчны бохирдлыг бууруулах, зам тээврийн тулгамдсан асуудлыг шийдвэрлэх, хот тохижилтыг шинэ шатанд гаргах зэрэг чиглэлээр харилцан ашигтай хамтран ажиллахын чухлыг онцоллоо. 29 | 30 | Ноён Мустафа Туна хотыг дахин төлөвлөх чиглэлээр хамтран ажиллахад бэлэн байгаагаа илэрхийлээд, Улаанбаатар хотын агаар, усны бохирдлыг арилгах талаарх өмнө эхэлсэн ажлаа үргэлжлүүлэх илүү өргөн боломж нээгдэж байгааг тэмдэглэлээ. 
31 | 32 | 33 | Энэ өдөр Улсын Их Хурлын дарга М.Энхболд тус хотод буй Чингис хааны цэцэрлэгт хүрээлэнд зочилж, Эзэн Богдын дурсгалын хөшөөнд цэцэг өргөлөө. Тэрбээр энэхүү хүрээлэнг бий болгох, хөшөө босгох санаачилгыг Улаанбаатар хотын даргаар ажиллаж байхдаа гаргаж байжээ. Энэ нь Чингис хааны гадаад орон дахь анхны хөшөө болж байсан түүхтэй юм байна. 34 | 35 | Улсын Их Хурлын дарга М.Энхболд мөн өдөр Туркийн өдөр тутмын “Дэйли Сабах” сонинд ярилцлага өгөв. 36 | 37 | Тэрбээр ярилцлагадаа Монгол Улсын нийгэм, эдийн засгийн өнөөгийн байдал, цаашдын зорилтын талаар танилцуулж, Монгол-Туркийн найрсаг, ах дүү ёсны харилцаа, хамтын ажиллагааг зөвхөн парламентын түвшинд биш, бүхий л салбарт илүү өндөр түвшинд, үр ашигтай хөгжүүлэх талаарх санал бодлоо илэрхийллээ гэж Улсын Их Хурлын Хэвлэл мэдээлэл, олон нийттэй харилцах хэлтсийн ажилтнууд Анкара хотоос мэдээлэв. -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/technology_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Бээжингийн шүүх VR технологи ашиглаж эхэлжээ. 2 | 3 | VR технологи буюу виртуал орчныг үүсгэх технологийг өнөөдөр гэрийн нөхцөлд, музей, видео тоглоом зэрэгт ашигладаг болсон бол Бээжингийн шүүх гэмт хэргийн орчныг харуулахад ашиглаж эхэлжээ. 4 | 5 | Мягмар гарагт болсон шүүх хурлын үеэр анх удаа гэмт хэргийг шүүх явцад ийм технологи ашигласан аж. 2017 оны есдүгээр сард болсон хүн амины хэргийн гэрчийн мэдүүлэгт үндэслэн VR технологийн тусламжтай үйл явдлыг бодитоор дүрсэлжээ. 6 | 7 | 8 | 9 | Бээжингийн шүүхийн төлөөлөгч Жуан Вей “Өмнө нь прокурорууд баримтуудыг ихэвчлэн амаар эсвэл Powerpoint ашиглан үзүүлдэг байлаа. Харин технологийн дэвшлийн ачаар илүү бодитой бөгөөд шүүх танхимд байгаа хүмүүст нээлттэй харуулах боломж бүрдлээ” гэсэн байна. 10 | 11 | Өнөөдөр БНХАУ-д VR технологи ашиглан хар тамхинаас гаргах эмчилгээ хийж, дасгал хийхэд ашиглаж байгаа бөгөөд хамгийн өргөн ашиглаж буй салбар нь видео тоглоом юм. 12 | 13 | Эх сурвалж: Xinhua -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/world_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | АНУ-ын Ерөнхийлөгч Доналд Трамп БНАСАУ-ын удирдагч Ким Жон Унтай тавдугаар сард багтаан уулзалт зохион байгуулахаар тохиролцсон талаарх мэдээлэл дэлхий нийтийн анхааралд байна. 2 | 3 | Энэ уулзалт АНУ болон БНАСАУ-ын хооронд хийж буй анхны дээд хэмжээний уулзалт болох бөгөөд одоогийн байдлаар хэзээ, хаана гэдэг нь тодорхойгүй хэвээр байгаа юм. 4 | 5 | Thediplomat.com сайтаас энэ уулзалтыг хоёр Солонгосын хилийн заагт байрлах Панмүнжомд болох болов уу гэж таамаглаж байгаа талаараа мэдээлжээ. Гэвч Доналд Трампийн тамгын газар өөр газар хайж эхэлбэл Монголын нийслэл Улаанбаатар уулзалтад тохиромжтой хэмээн уг мэдээлэлдээ дурдсан байна. 6 | 7 | Түүнчлэн дээд хэмжээний уулзалт болох шийдвэр гарсан гэсэн мэдээлэл цацагдсанаас 22 цагийн дараа Ерөнхийлөгч асан Ц.Элбэгдорж өөрийн твиттер хуудаснаа “Санал байна: АНУ-ын Ерөнхийлөгч Доналд Трамп болон Хойд Солонгосын удирдагч Ким Жон Ун нар Улаанбаатар хотод уулзаж болно. Монгол Улс бол хамгийн тохиромжтой, төвийг сахисан нутаг дэвсгэр юм. Бид Япон ба Хойд Солонгосын уулзалт зэрэг олон чухал уулзалтуудад түлхэц болж байсан. Монгол орны хойшдын үлдээх өв бол Зүүн Хойд Азийн аюулгүй байдлын асуудлаарх Улаанбаатарын яриа хэлэлцээ юм” хэмээн жиргэснийг онцолжээ. 
8 | 9 | Thediplomat.com сайтад Жулиан Диркес, М.Жаргалсайхан нарын нийтэлсэн Доналд Трамп, Ким Жон Ун нарын уулзалт Улаанбаатарт болох нь яагаад зөв сонголт байх талаар найман шалтгааныг хүргэж байна. 10 | 11 | 1.Монгол Улс нь 1990 онд Ардчилсан хувьсгал гарснаас хойш бүс нутгийнхаа хөрш орнуудтай улс төрийн төвийг сахисан, найрсаг харилцаатай улс. 2015 онд албан ёсоор төвийг сахисан бодлогын талаар хэлэлцүүлэг өрнүүлж байсан. 12 | 13 | 2.АНУ-тай найрсаг харилцаатай. 1990 оноос АНУ-тай найрсаг харилцаа тогтоон, олон удаа төрийн дээд хэмжээний уулзалтуудыг зохион байгуулж байсан. Мөн АНУ Монгол Улсад тусламж, хөрөнгө оруулалт хийсэн төдийгүй Ардчиллыг бэхжүүлэхэд нэмэр оруулсан хөрш орны нэг. 14 | 15 | 3.БНАСАУ-тай ч найрсаг харилцаатай. Хоёрдугаар сард Пёньяанд болсон өвлийн олимпийн үеэр Монгол Улсын Гадаад харилцааны сайд Д.Цогтбаатартай БНАСАУ-аас Монголд ажилтан ажиллуулах гэрээ хийсэн. Магадгүй хоёр солонгосын дайны үеэр зуу зуун хүүхдийг Монгол уруу нүүлгэн шилжүүлж байсан нь БНАСАУ-ын хувьд хамгийн ач холбогдолтой зүйл байж болох бөгөөд энэ сэтгэл хөдөлгөсөн харилцаа цаашид ч үргэлжилнэ. 16 | 17 | 4.Уулзалт Азид болно. Улаанбаатар бол АНУ болон БНАСАУ-ын төлөөлөгчид амархан хүрч болох байршил. 18 | 19 | 5.Монгол Улсын Засгийн газар БНАСАУ болон 2007, 2012 онуудад Япон ба Хойд Солонгос засгийн газрын түвшний уулзалт Монгол Улсад зохион байгуулагдаж байсан. 2017 онд Монгол Улсын Засгийн газраас Зүүн Хойд Азийн аюулгүй байдлын асуудлаарх Улаанбаатарын яриа хэлэлцээ олон улсын бага хурлыг зохион байгуулсан. БНАСАУ-ын Гадаад хэргийн сайд Ри Ён Хо энэ бага хуралд оролцсон бөгөөд хоёр талын хэлэлцээ бүхий хэд хэдэн уулзалтад оролцсон. Хоёр орны төлөөлөгчид сэтгэл хангалуун үлдэж, эдгээр уулзалт амжилттай болсон. 20 | 21 | 6.Бие даасан найдвартай байдал. Цөмийн зэвсэг үл дэлгэрүүлэх нь Доналд Трамп болон Ким Жон Ун нарын уулзалтын нэг асуудал. Монгол Улсын хувьд “Цөмийн зэвсэггүй бүс” статустай. 22 | 23 | 7.АНУ болон БНАСАУ-ын холбоотон орнууд Улаанбаатарыг зөвшөөрнө. Мэдээж хэрэг Өмнөд Солонгосын эрх баригчид өөрт илүү ойр байршлыг илүүд үзэх ч Монгол Улс байж болохр сонголт. Япон улс өмнө нь Монгол Улсыг дундын орон болохыг хүлээн зөвшөөрч байсан төдийгүй уулзалтыг Солонгосын хойгоос өөр газар зохион байгуулах нь хулгайлагдсан хүмүүсийн асуудлыг хөтөлбөрт багтаахад хэрэгтэй юм. Одоогийн байдлаар Ерөнхийлөгч Путин, Ши Жиньпин нарын хувьд тусгайлан төлөвлөсөн газар байгаа эсэх нь тодорхойгүй бөгөөд аль аль Улаанбаатарт болохыг зөвшөөрөх магадлалтай. 24 | 25 | 8.Дээд хэмжээний уулзалтыг Улаанбаатарт хийхэд зуу зуун албаны хүмүүс очно. Үүнтэй төстэй уулзалтыг 2016 онд Улаанбаатарт зохион байгуулж байсан. Тодруулбал АСЕМ. Тавдугаар сар гэдэг бол Монголын хахир хатуу хүйтэн өвөл дуусаад удаагүй байх хугацаа тул жуулчид ч тийм их биш. Энэ нь зочид буудал, агаарын тээвэрт айлчлалын баг санаа зовох зүйлгүй. 
-------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/app/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/admin.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from django.contrib.auth import get_user_model 3 | from django.contrib.auth.admin import UserAdmin 4 | 5 | from .forms import CustomUserCreationForm, CustomUserChangeForm 6 | from .models import CustomUser 7 | 8 | class CustomUserAdmin(UserAdmin): 9 | model = CustomUser 10 | add_form = CustomUserCreationForm 11 | form = CustomUserChangeForm 12 | 13 | admin.site.register(CustomUser, CustomUserAdmin) -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/apps.py: -------------------------------------------------------------------------------- 1 | from django.apps import AppConfig 2 | 3 | 4 | class AppConfig(AppConfig): 5 | name = 'app' 6 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/forms.py: -------------------------------------------------------------------------------- 1 | from django import forms 2 | from django.contrib.auth.forms import UserCreationForm, UserChangeForm 3 | from .models import CustomUser 4 | 5 | class CustomUserCreationForm(UserCreationForm): 6 | class Meta(UserCreationForm.Meta): 7 | model = CustomUser 8 | fields = UserCreationForm.Meta.fields 9 | 10 | class CustomUserChangeForm(UserChangeForm): 11 | class Meta: 12 | model = CustomUser 13 | fields = UserChangeForm.Meta.fields 14 | 15 | 16 | class MongolianTextForm(forms.Form): 17 | content = forms.CharField( 18 | widget=forms.Textarea(attrs={'placeholder': 'Please copy and paste some mongolian text here...'}), 19 | label="" 20 | ) -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/migrations/0001_initial.py: -------------------------------------------------------------------------------- 1 | # Generated by Django 2.0.3 on 2018-03-10 16:23 2 | 3 | import app.models 4 | import django.contrib.auth.validators 5 | from django.db import migrations, models 6 | import django.utils.timezone 7 | 8 | 9 | class Migration(migrations.Migration): 10 | 11 | initial = True 12 | 13 | dependencies = [ 14 | ('auth', '0009_alter_user_last_name_max_length'), 15 | ] 16 | 17 | operations = [ 18 | migrations.CreateModel( 19 | name='CustomUser', 20 | fields=[ 21 | ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), 22 | ('password', models.CharField(max_length=128, verbose_name='password')), 23 | ('last_login', models.DateTimeField(blank=True, null=True, verbose_name='last login')), 24 | ('is_superuser', models.BooleanField(default=False, help_text='Designates that this user has all permissions without explicitly assigning them.', verbose_name='superuser status')), 25 | ('username', models.CharField(error_messages={'unique': 'A user with that username already exists.'}, help_text='Required. 150 characters or fewer. 
Letters, digits and @/./+/-/_ only.', max_length=150, unique=True, validators=[django.contrib.auth.validators.UnicodeUsernameValidator()], verbose_name='username')), 26 | ('first_name', models.CharField(blank=True, max_length=30, verbose_name='first name')), 27 | ('last_name', models.CharField(blank=True, max_length=150, verbose_name='last name')), 28 | ('email', models.EmailField(blank=True, max_length=254, verbose_name='email address')), 29 | ('is_staff', models.BooleanField(default=False, help_text='Designates whether the user can log into this admin site.', verbose_name='staff status')), 30 | ('is_active', models.BooleanField(default=True, help_text='Designates whether this user should be treated as active. Unselect this instead of deleting accounts.', verbose_name='active')), 31 | ('date_joined', models.DateTimeField(default=django.utils.timezone.now, verbose_name='date joined')), 32 | ('groups', models.ManyToManyField(blank=True, help_text='The groups this user belongs to. A user will get all permissions granted to each of their groups.', related_name='user_set', related_query_name='user', to='auth.Group', verbose_name='groups')), 33 | ('user_permissions', models.ManyToManyField(blank=True, help_text='Specific permissions for this user.', related_name='user_set', related_query_name='user', to='auth.Permission', verbose_name='user permissions')), 34 | ], 35 | options={ 36 | 'verbose_name': 'user', 37 | 'verbose_name_plural': 'users', 38 | 'abstract': False, 39 | }, 40 | managers=[ 41 | ('objects', app.models.CustomUserManager()), 42 | ], 43 | ), 44 | ] 45 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/migrations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/app/migrations/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/models.py: -------------------------------------------------------------------------------- 1 | from django.db import models 2 | from django.contrib.auth.models import AbstractUser, UserManager 3 | 4 | class CustomUserManager(UserManager): 5 | pass 6 | 7 | class CustomUser(AbstractUser): 8 | objects = CustomUserManager() 9 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/tests.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | # Create your tests here. 4 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/urls.py: -------------------------------------------------------------------------------- 1 | from django.urls import path 2 | from . 
import views 3 | 4 | urlpatterns = [ 5 | path('', views.index, name='index'), 6 | path('classify', views.classify, name='classify'), 7 | ] -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/views.py: -------------------------------------------------------------------------------- 1 | from django.shortcuts import render 2 | from django.http import HttpResponse 3 | from django.shortcuts import render, redirect 4 | from .forms import MongolianTextForm 5 | import xmlrpc.client 6 | 7 | def index(request): 8 | form = MongolianTextForm() 9 | return render(request, 'home.html', {'form' : form}) 10 | 11 | def classify(request): 12 | if request.method == "POST": 13 | form = MongolianTextForm(request.POST) 14 | if form.is_valid(): 15 | content = form.cleaned_data['content'] 16 | with xmlrpc.client.ServerProxy("http://localhost:50001/") as proxy: 17 | news_class = proxy.predict_class_from_text(content) 18 | return render(request, 'classify.html', {'content' : content, 'news_class': news_class}) 19 | return redirect('index') 20 | else: 21 | return redirect('index') 22 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/djangoapp/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/settings.py: -------------------------------------------------------------------------------- 1 | """ 2 | Django settings for djangoapp project. 3 | 4 | Generated by 'django-admin startproject' using Django 2.0.3. 5 | 6 | For more information on this file, see 7 | https://docs.djangoproject.com/en/2.0/topics/settings/ 8 | 9 | For the full list of settings and their values, see 10 | https://docs.djangoproject.com/en/2.0/ref/settings/ 11 | """ 12 | 13 | import os 14 | 15 | # Build paths inside the project like this: os.path.join(BASE_DIR, ...) 16 | BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) 17 | 18 | 19 | # Quick-start development settings - unsuitable for production 20 | # See https://docs.djangoproject.com/en/2.0/howto/deployment/checklist/ 21 | 22 | # SECURITY WARNING: keep the secret key used in production secret! 23 | SECRET_KEY = '(a_8w^*gzx=w$nt2l&94e!^6#$e5+lq1qa0x0z0d4^-+ir(=i2' 24 | 25 | # SECURITY WARNING: don't run with debug turned on in production! 
26 | DEBUG = True 27 | 28 | ALLOWED_HOSTS = [] 29 | 30 | 31 | # Application definition 32 | 33 | INSTALLED_APPS = [ 34 | 'django.contrib.admin', 35 | 'django.contrib.auth', 36 | 'django.contrib.contenttypes', 37 | 'django.contrib.sessions', 38 | 'django.contrib.messages', 39 | 'django.contrib.staticfiles', 40 | 41 | 'app' 42 | ] 43 | 44 | MIDDLEWARE = [ 45 | 'django.middleware.security.SecurityMiddleware', 46 | 'django.contrib.sessions.middleware.SessionMiddleware', 47 | 'django.middleware.common.CommonMiddleware', 48 | 'django.middleware.csrf.CsrfViewMiddleware', 49 | 'django.contrib.auth.middleware.AuthenticationMiddleware', 50 | 'django.contrib.messages.middleware.MessageMiddleware', 51 | 'django.middleware.clickjacking.XFrameOptionsMiddleware', 52 | ] 53 | 54 | ROOT_URLCONF = 'djangoapp.urls' 55 | 56 | TEMPLATES = [ 57 | { 58 | 'BACKEND': 'django.template.backends.django.DjangoTemplates', 59 | 'DIRS': [os.path.join(BASE_DIR, 'templates')], 60 | 'APP_DIRS': True, 61 | 'OPTIONS': { 62 | 'context_processors': [ 63 | 'django.template.context_processors.debug', 64 | 'django.template.context_processors.request', 65 | 'django.contrib.auth.context_processors.auth', 66 | 'django.contrib.messages.context_processors.messages', 67 | ], 68 | }, 69 | }, 70 | ] 71 | 72 | WSGI_APPLICATION = 'djangoapp.wsgi.application' 73 | 74 | 75 | # Database 76 | # https://docs.djangoproject.com/en/2.0/ref/settings/#databases 77 | 78 | DATABASES = { 79 | 'default': { 80 | 'ENGINE': 'django.db.backends.sqlite3', 81 | 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'), 82 | } 83 | } 84 | 85 | 86 | # Password validation 87 | # https://docs.djangoproject.com/en/2.0/ref/settings/#auth-password-validators 88 | 89 | AUTH_PASSWORD_VALIDATORS = [ 90 | { 91 | 'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator', 92 | }, 93 | { 94 | 'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator', 95 | }, 96 | { 97 | 'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator', 98 | }, 99 | { 100 | 'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator', 101 | }, 102 | ] 103 | 104 | 105 | # Internationalization 106 | # https://docs.djangoproject.com/en/2.0/topics/i18n/ 107 | 108 | LANGUAGE_CODE = 'en-us' 109 | 110 | TIME_ZONE = 'UTC' 111 | 112 | USE_I18N = True 113 | 114 | USE_L10N = True 115 | 116 | USE_TZ = True 117 | 118 | 119 | # Static files (CSS, JavaScript, Images) 120 | # https://docs.djangoproject.com/en/2.0/howto/static-files/ 121 | 122 | STATIC_URL = '/static/' 123 | 124 | # custom settings 125 | AUTH_USER_MODEL = 'app.CustomUser' -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/urls.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from django.urls import path, include 3 | from django.views.generic.base import TemplateView 4 | 5 | urlpatterns = [ 6 | #path('', TemplateView.as_view(template_name='home.html'), name='home'), 7 | path('', include('app.urls')), 8 | path('', include('django.contrib.auth.urls')), 9 | path('admin/', admin.site.urls), 10 | ] 11 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/wsgi.py: -------------------------------------------------------------------------------- 1 | """ 2 | WSGI config for djangoapp project. 3 | 4 | It exposes the WSGI callable as a module-level variable named ``application``. 
5 | 6 | For more information on this file, see 7 | https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/ 8 | """ 9 | 10 | import os 11 | 12 | from django.core.wsgi import get_wsgi_application 13 | 14 | os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangoapp.settings") 15 | 16 | application = get_wsgi_application() 17 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/manage.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | import sys 4 | 5 | if __name__ == "__main__": 6 | os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangoapp.settings") 7 | try: 8 | from django.core.management import execute_from_command_line 9 | except ImportError as exc: 10 | raise ImportError( 11 | "Couldn't import Django. Are you sure it's installed and " 12 | "available on your PYTHONPATH environment variable? Did you " 13 | "forget to activate a virtual environment?" 14 | ) from exc 15 | execute_from_command_line(sys.argv) 16 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/base.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | {% block title %}Mongolian text classification app{% endblock %} 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 |
15 |
16 | Home 17 |
18 |
19 |
20 |
21 | {% block content %} {% endblock %} 22 |
23 |
24 |
25 |
26 | 27 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/classify.html: -------------------------------------------------------------------------------- 1 | 2 | {% extends 'base.html' %} 3 | 4 | {% block title %}Mongolian text classification app{% endblock %} 5 | 6 | {% block content %} 7 | 8 | class : {{ news_class }}
9 | {{ content }} 10 | 11 | {% endblock %} -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/home.html: -------------------------------------------------------------------------------- 1 | 2 | {% extends 'base.html' %} 3 | 4 | {% block title %}Mongolian text classification app{% endblock %} 5 | 6 | {% block content %} 7 | 8 |
9 | {% csrf_token %} 10 | {{ form.as_p }} 11 | 12 |
13 | 14 | {% endblock %} -------------------------------------------------------------------------------- /old_stuffs/freeze_tf_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import argparse, sys 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--name", help="name of nn architecture") 6 | parser.add_argument("--iteration", help="iteration value") 7 | args = parser.parse_args() 8 | 9 | if args.name is None or args.iteration is None: 10 | print("please provide parameters, more info") 11 | print("python freeze_tf_model.py -h") 12 | sys.exit() 13 | 14 | name = args.name 15 | iteration = args.iteration 16 | 17 | saver = tf.train.import_meta_graph('./models/{name}/pretrained_{name}.ckpt-{iteration}.meta'.format(name=name, iteration=iteration), clear_devices=True) 18 | graph = tf.get_default_graph() 19 | input_graph_def = graph.as_graph_def() 20 | sess = tf.Session() 21 | saver.restore(sess, "./models/{name}/pretrained_{name}.ckpt-{iteration}".format(name=name, iteration=iteration)) 22 | 23 | # output variable name 24 | output_node_names = "input_placeholder,prediction_op" 25 | output_graph_def = tf.graph_util.convert_variables_to_constants( 26 | sess, input_graph_def, output_node_names.split(",") 27 | ) 28 | 29 | output_graph = "./models/{name}/pretrained_{name}-{iteration}.pb".format(name=name, iteration=iteration) 30 | with tf.gfile.GFile(output_graph, "wb") as f: 31 | f.write(output_graph_def.SerializeToString()) 32 | print("saved to ", output_graph) 33 | 34 | sess.close() -------------------------------------------------------------------------------- /old_stuffs/ikon_mn_scrape.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy import Request 3 | from scrapy.shell import inspect_response 4 | import json, os 5 | from hashlib import md5 6 | 7 | root_link = "http://ikon.mn" 8 | 9 | class IkonSpider(scrapy.Spider): 10 | name='ikonspider' 11 | robotstxt_obey = True 12 | download_delay = 0.5 13 | user_agent = 'sharavaa-crawler-for-nlp (sharavsambuu@gmail.com)' 14 | autothrottle_enabled = True 15 | httpcache_enabled = True 16 | 17 | def start_requests(self): 18 | start_urls = [ 19 | (root_link+'/l/1' , "politics" ), # улс төр 20 | (root_link+'/l/2' , "economy" ), # эдийн засаг 21 | (root_link+'/l/3' , "society" ), # нийгэм 22 | (root_link+'/l/16', "health" ), # эрүүл мэнд 23 | (root_link+'/l/4' , "world" ), # дэлхийд 24 | (root_link+'/l/7' , "technology"), # технологи 25 | ] 26 | for index, url_tuple in enumerate(start_urls): 27 | url = url_tuple[0] 28 | category = url_tuple[1] 29 | yield Request(url, meta={'category': category}) 30 | 31 | def parse(self, response): 32 | news_title = response.xpath("//*[contains(@class, 'inews')]//h1/text()").extract() 33 | if (len(news_title)==0): 34 | print(">>>>>>>>>>>>> I'M GROOOOOOT ") 35 | else: 36 | news_title = news_title[0].strip() 37 | news_body = response.xpath("//*[contains(@class, 'icontent')]/descendant::*/text()[normalize-space() and not(ancestor::a | ancestor::script | ancestor::style)]").extract() 38 | news_body = " ".join(news_body) 39 | category = response.meta.get('category', 'default') 40 | url = response.request.url 41 | hashed_name = md5(news_title.encode("utf-8")).hexdigest() 42 | file_name = "./corpuses/"+category+"/"+hashed_name+".txt" 43 | print("saving to ", file_name) 44 | data = {} 45 | data['title'] = news_title 46 | data['body' ] = news_body 47 | data['url' ] = url 48 | 
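# create the per-category corpus directory on first use, then persist the scraped
# article (title, body, url) as UTF-8 JSON named by the md5 hash of the title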
os.makedirs(os.path.dirname(file_name), exist_ok=True) 49 | with open(file_name, "w", encoding="utf8") as outfile: 50 | json.dump(data, outfile, ensure_ascii=False) 51 | 52 | #import pdb; pdb.set_trace() 53 | 54 | for next_page in response.xpath("//*[contains(@class, 'nlitem')]//a"): 55 | yield response.follow(next_page, self.parse, meta={'category': response.meta.get('category', 'default')}) 56 | 57 | for next_page in response.xpath("//*[contains(@class, 'ikon-right-dir')]/parent::a"): 58 | yield response.follow(next_page, self.parse, meta={'category': response.meta.get('category', 'default')}) 59 | -------------------------------------------------------------------------------- /old_stuffs/images/accuracy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/accuracy.png -------------------------------------------------------------------------------- /old_stuffs/images/classifiedresult.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/classifiedresult.png -------------------------------------------------------------------------------- /old_stuffs/images/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/loss.png -------------------------------------------------------------------------------- /old_stuffs/images/webinput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/webinput.png -------------------------------------------------------------------------------- /old_stuffs/mongolianstopwords.py: -------------------------------------------------------------------------------- 1 | # lists from https://github.com/Xangis/extra-stopwords/blob/master/mongolian 2 | # and https://github.com/erkhemee/chatbot-ub-hackathon 3 | stopwordsmn = [ 4 | 'аа', 5 | 'аанхаа', 6 | 'алив', 7 | 'ба', 8 | 'байдаг', 9 | 'байжээ', 10 | 'байна', 11 | 'байсаар', 12 | 'байсан', 13 | 'байхаа', 14 | 'бас', 15 | 'бишүү', 16 | 'бол', 17 | 'болжээ', 18 | 'болно', 19 | 'болоо' 20 | 'бэ', 21 | 'вэ', 22 | 'гэж', 23 | 'гэжээ', 24 | 'гэлтгүй', 25 | 'гэсэн', 26 | 'гэтэл', 27 | 'за', 28 | 'л', 29 | 'мөн', 30 | 'нь', 31 | 'тэр', 32 | 'уу', 33 | 'харин', 34 | 'хэн', 35 | 'ч', 36 | 'энэ', 37 | 'ээ', 38 | 'юм', 39 | 'үү', 40 | '?', 41 | '', 42 | '.', 43 | ',', 44 | '-', 45 | '-ийн', 46 | '-ын', 47 | '-тай', 48 | '-г', 49 | '-ийг', 50 | '-д', 51 | '-н', 52 | '-ний', 53 | '-дээр', 54 | 'юу', 55 | ] -------------------------------------------------------------------------------- /old_stuffs/numpy_embedding_matrix_tf.py: -------------------------------------------------------------------------------- 1 | from gensim.models import Word2Vec 2 | import numpy as np 3 | 4 | model = Word2Vec.load('model.bin') 5 | 6 | vector_dim = 100 7 | 8 | #word_to_id = {} 9 | embedding_matrix = np.zeros((len(model.wv.vocab), vector_dim)) 10 | for i in range(len(model.wv.vocab)): 11 | embedding_vector = model.wv[model.wv.index2word[i]] 12 | if embedding_vector is not None: 13 | 
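# row i of the matrix gets the gensim vector of the i-th vocabulary word, so row
# order stays aligned with word2vec's index2word ordering; this is what lets the
# later tf.nn.embedding_lookup by word id return the matching vector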
embedding_matrix[i] = embedding_vector 14 | #word_to_id[i] = model.wv.index2word 15 | 16 | np.save('ids_matrix', embedding_matrix) 17 | #import pdb; pdb.set_trace() 18 | print('embedded ids matrix is saved as a numpy file format.') 19 | 20 | -------------------------------------------------------------------------------- /old_stuffs/prepare_trainingset.py: -------------------------------------------------------------------------------- 1 | from os import listdir 2 | from os.path import isfile, join 3 | import shutil 4 | import glob, os, os.path 5 | import random 6 | import math 7 | import json 8 | 9 | import numpy as np 10 | import tensorflow as tf 11 | import tensorflow.contrib.eager as tfe 12 | 13 | from gensim.models import Word2Vec 14 | 15 | from wordtoken_to_id import * 16 | from clear_text_to_array import * 17 | 18 | global_corpuses = [] 19 | 20 | def convert_to_onehot(name): 21 | switcher = { 22 | "economy" : lambda: [1,0,0,0,0,0], 23 | "health" : lambda: [0,1,0,0,0,0], 24 | "politics" : lambda: [0,0,1,0,0,0], 25 | "society" : lambda: [0,0,0,1,0,0], 26 | "technology" : lambda: [0,0,0,0,1,0], 27 | "world" : lambda: [0,0,0,0,0,1] 28 | } 29 | return switcher.get(name, lambda: [0,0,0,0,0,0])() 30 | 31 | def fix_news_body(filename): 32 | found = False 33 | jsoncontent = "" 34 | with open(filename, encoding="utf8") as f: 35 | jsoncontent = json.load(f) 36 | body = jsoncontent['body'].strip() 37 | if not body: 38 | print("YES EMPTY BODY FOUND...") 39 | found = True 40 | jsoncontent['body'] = jsoncontent['title'] 41 | if found: 42 | with open(filename, "w", encoding="utf8") as outfile: 43 | print("FIXING...", filename) 44 | json.dump(jsoncontent, outfile, ensure_ascii=False) 45 | 46 | for filename in glob.iglob('corpuses/**/*.txt', recursive=True): 47 | fix_news_body(filename) # some news body is empty, fix it by replacing its title 48 | current_file_path = os.path.abspath(filename) 49 | current_directory = os.path.abspath(os.path.join(current_file_path, os.pardir)) 50 | current_directory_name = os.path.split(current_directory) 51 | category = current_directory_name[1] 52 | one_hot = convert_to_onehot(category) 53 | only_file_name = os.path.basename(filename) 54 | global_corpuses.append((only_file_name, one_hot, category)) 55 | 56 | random.shuffle(global_corpuses) 57 | random.shuffle(global_corpuses) 58 | random.shuffle(global_corpuses) 59 | 60 | split_location = math.floor(80*len(global_corpuses)/100) # 80% for training, 20% for testing 61 | training_set = global_corpuses[:split_location] 62 | test_set = global_corpuses[split_location:] 63 | dataset_info = { 64 | 'training' : training_set, 65 | 'testing' : test_set 66 | } 67 | 68 | temp_corpus_dir = 'temp_corpuses' 69 | if os.path.exists(temp_corpus_dir): 70 | shutil.rmtree(temp_corpus_dir) 71 | os.makedirs(temp_corpus_dir) 72 | 73 | with open("temp_corpuses/dataset.json", "w", encoding="utf8") as outfile: 74 | json.dump(dataset_info, outfile, ensure_ascii=False) 75 | 76 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.1.10 2 | asn1crypto==0.24.0 3 | astor==0.6.2 4 | astroid==1.6.1 5 | attrs==17.4.0 6 | Automat==0.6.0 7 | autopep8==1.3.4 8 | backcall==0.1.0 9 | beautifulsoup4==4.6.0 10 | bleach==3.3.0 11 | boto==2.48.0 12 | boto3==1.6.3 13 | botocore==1.9.3 14 | bz2file==0.98 15 | certifi==2018.1.18 16 | cffi==1.11.5 17 | chardet==3.0.4 18 | 
colorama==0.4.0 19 | constantly==15.1.0 20 | cryptography==3.3.2 21 | cssselect==1.0.3 22 | decorator==4.3.0 23 | defusedxml==0.5.0 24 | Django==2.2.26 25 | docutils==0.14 26 | entrypoints==0.2.3 27 | gast==0.2.0 28 | gensim==3.4.0 29 | grpcio==1.10.0 30 | h5py==2.8.0 31 | html5lib==0.9999999 32 | hyperlink==18.0.0 33 | idna==2.6 34 | incremental==17.5.0 35 | ipykernel==5.1.0 36 | ipython==7.16.3 37 | ipython-genutils==0.2.0 38 | ipywidgets==7.4.2 39 | isort==4.3.4 40 | jedi==0.13.1 41 | Jinja2==2.11.3 42 | jmespath==0.9.3 43 | jsonschema==2.6.0 44 | jupyter==1.0.0 45 | jupyter-client==5.2.3 46 | jupyter-console==6.0.0 47 | jupyter-core==4.4.0 48 | Keras-Applications==1.0.6 49 | Keras-Preprocessing==1.0.5 50 | lazy-object-proxy==1.3.1 51 | lxml==4.6.5 52 | Markdown==2.6.11 53 | MarkupSafe==1.1.0 54 | mccabe==0.6.1 55 | mistune==0.8.4 56 | nbconvert==5.4.0 57 | nbformat==4.4.0 58 | nltk==3.6.6 59 | notebook==6.4.1 60 | numpy==1.21.0 61 | pandocfilters==1.4.2 62 | parsel==1.4.0 63 | parso==0.3.1 64 | pickleshare==0.7.5 65 | prometheus-client==0.4.2 66 | prompt-toolkit==2.0.7 67 | protobuf==3.6.1 68 | pyasn1==0.4.2 69 | pyasn1-modules==0.2.1 70 | pycodestyle==2.3.1 71 | pycparser==2.18 72 | PyDispatcher==2.0.5 73 | Pygments==2.7.4 74 | pylint==1.8.2 75 | pyOpenSSL==17.5.0 76 | python-dateutil==2.6.1 77 | pytz==2018.3 78 | pywinpty==0.5.4 79 | pyzmq==17.1.2 80 | qtconsole==4.4.3 81 | queuelib==1.4.2 82 | requests==2.18.4 83 | s3transfer==0.1.13 84 | scipy==1.0.0 85 | Scrapy==1.8.1 86 | scrapy-splash==0.8.0 87 | Send2Trash==1.5.0 88 | service-identity==17.0.0 89 | six==1.11.0 90 | smart-open==1.5.6 91 | termcolor==1.1.0 92 | terminado==0.8.1 93 | testpath==0.4.2 94 | tornado==5.1.1 95 | traitlets==4.3.2 96 | Twisted==20.3.0 97 | typed-ast==1.1.0 98 | urllib3==1.26.5 99 | w3lib==1.19.0 100 | wcwidth==0.1.7 101 | Werkzeug==0.14.1 102 | widgetsnbextension==3.4.2 103 | wrapt==1.10.11 104 | zope.interface==4.4.3 105 | -------------------------------------------------------------------------------- /old_stuffs/research/ikon-research.txt: -------------------------------------------------------------------------------- 1 | Дараагийн хуудас 2 | $('.ikon-right-dir').trigger('click'); 3 | 4 | Хуудсан дахь item-үүдийн жагсаалт 5 | $('.nlitem') 6 | 7 | Item-ээс холбоос хаягийг нь авах 8 | $($('.nlitem')[0]).find('a')[0] 9 | 10 | XPath хэрэглэж item ийн холбоосыг авах 11 | item_link = response.selector.xpath("//*[contains(@class, 'nlitem')]//a/@href").extract()[0] 12 | 13 | sentiment утгуудыг авах 14 | let arr = []; 15 | $('.reaction .value').each((index, emoji) => { 16 | let s = $(emoji).text(); 17 | let score = isNaN(parseInt(s)) ? 0 : parseInt(s); 18 | arr.push(score) 19 | }); 20 | console.log(arr); 21 | 22 | Сургалтын өгөгдлийн статистик 23 | 24 | Категорийн тоо : 6 25 | Файлын тоо : 6124 ширхэг 26 | Нийтлэлийн дундаж урт : 418.6 үг 27 | ХИ үгтэй нийтлэл : 7418, http://ikon.mn/n/dh7 28 | Нийт татсан үгийн тоо : 2563792 29 | Нийт ялгаатай үгийн тоо : 102153, stemming хийсний дараа 68767 болж цөөрлөө. 
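(English summary of the stats above: 6 categories; 6,124 files; average article length 418.6 words; longest article 7,418 words, http://ikon.mn/n/dh7; 2,563,792 words scraped in total; 102,153 unique words, reduced to 68,767 after stemming.)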
30 | -------------------------------------------------------------------------------- /old_stuffs/research/nlp-research.txt: -------------------------------------------------------------------------------- 1 | some stopwords list 2 | https://github.com/Xangis/extra-stopwords/blob/master/mongolia 3 | 4 | stemmers and some stopwords for mongolian language 5 | https://github.com/erkhemee/chatbot-ub-hackathon -------------------------------------------------------------------------------- /old_stuffs/stemmer.py: -------------------------------------------------------------------------------- 1 | from nltk import stem 2 | 3 | # code from https://github.com/erkhemee/chatbot-ub-hackathon 4 | stemmer_rules_tuple = ("ы$|ий$|ыг$|ийг$|ны$|ний$|лаа$|лээ$|лоо$|лөө$|даа$|дээ$|доо$|дөө$|уулж$|үүлж$|гдах$|гдэх$|гдох$|гдөх$|лалт$|лэлт$|лолт$|лөлт$|дахад$|дэхэд$|доход$|дөхөд$|уудыг$|үүдийг$|нүүдийг$|нуудыг$|чихлаа$|чихлээ$|чихлоо$|чихлөө$|ид$|ын$|ийн$|уудын$|үүдийн$|тай$|той$|төй$|тэй$|лал$|лэл$|лол$|лөл$|аар$|ээр$|оор$|өөр$|лах$|лэх$|лох$|лөх$|ахад$|эхэд$|оход$|өхөд$|лалд$|лэлд$|лөлд$|лолд$|ууд$|үүд$|даг$|дэг$|дог$|дөг$|ддаг$|ддэг$|ддог$|ддөг$|даж$|дэж$|дож$|дөж$|сэн$|сан$|сөн$|сон$|гах$|гэх$|гох$|гөх$|рах$|рэх$|рох$|рөх$|иас$|иэс$|иос$|иөс$|нд$|нт$|ад$|эд$|од$|өд$|ааж$|ээж$|оож$|өөж$|лаар$|лээр$|лоор$|лөөр$|аг$|эг$|ог$|өг$|уг$|өөг$|ээг$|оог$|үүг$|ууг$|бар$|бэр$|бор$|бөр$|бур$|бүр$|раа$|рээ$|рөө$|рүү$|роо$|руу$|аас$|ээс$|оос$|өөс$|аарх$|ээрх$|оорх$|уурх$|үүрх$|өөрх$|аад$|ээд$|ууд$|үүд$|оод$|өөд$|нууд$|нүүд$|дсан$|дсэн$|дсөн$|дсон$|дах$|дэх$|дох$|дөх$|хлаа$|хлээ$|хлоо$|хлөө$|лахдаа$|лэхдээ$|лохдоо$|лөхдөө$|лдаг$|лдэг$|лж$|аагүй$|ээгүй$|оогүй$|өөгүй$|даггүй$|дэггүй$|доггүй$|дөггүй$|нуудийнхаа$|нүүдийнхээ$|гүй$") 5 | 6 | class Singleton: 7 | def __init__(self, decorated): 8 | self._decorated = decorated 9 | 10 | def instance(self): 11 | try: 12 | return self._instance 13 | except AttributeError: 14 | self._instance = self._decorated() 15 | return self._instance 16 | 17 | def __call__(self): 18 | raise TypeError('Singletons must be accessed through `instance()`.') 19 | 20 | def __instancecheck__(self, inst): 21 | return isinstance(inst, self._decorated) 22 | 23 | @Singleton 24 | class Stemmer: 25 | def __init__(self): 26 | self.stemmer = stem.RegexpStemmer(stemmer_rules_tuple, min=6) -------------------------------------------------------------------------------- /old_stuffs/training_bilstm_rnn.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | 13 | dataset = DataSetHelper() 14 | 15 | tf.reset_default_graph() 16 | 17 | input_placeholder = tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 18 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 19 | 20 | ids_matrix = np.load('ids_matrix.npy') 21 | embeddings_tf = tf.constant(ids_matrix) 22 | 23 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 24 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 25 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 26 | 27 | # composing bidirectional lstm 28 | batch_unstack = tf.unstack(batch_data, max_seq_length, 1) 29 | fw_lstm_cell = 
tf.nn.rnn_cell.LSTMCell(lstm_units) # forward lstm cell 30 | fw_lstm_cell = tf.nn.rnn_cell.DropoutWrapper(cell=fw_lstm_cell, output_keep_prob=0.75) 31 | bw_lstm_cell = tf.nn.rnn_cell.LSTMCell(lstm_units) # backward lstm cell 32 | bw_lstm_cell = tf.nn.rnn_cell.DropoutWrapper(cell=bw_lstm_cell, output_keep_prob=0.75) 33 | outputs, _, _ = tf.nn.static_bidirectional_rnn( 34 | fw_lstm_cell , 35 | bw_lstm_cell , 36 | batch_unstack, 37 | dtype=tf.float32 38 | ) 39 | 40 | weight = tf.Variable(tf.truncated_normal([2*lstm_units, num_classes])) 41 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 42 | prediction = tf.add(tf.matmul(outputs[-1], weight), bias, name="prediction_op") 43 | 44 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 45 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 46 | 47 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 48 | optimizer = tf.train.AdamOptimizer().minimize(loss) 49 | #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) 50 | 51 | print("starting at ", datetime.datetime.now()) 52 | 53 | loss_summary = tf.summary.scalar('Loss' , loss ) 54 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 55 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 56 | log_dir = "tensorboard/bilstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 57 | 58 | init = tf.global_variables_initializer() 59 | with tf.Session() as sess: 60 | writer = tf.summary.FileWriter(log_dir, sess.graph) 61 | saver = tf.train.Saver() 62 | sess.run(init) 63 | 64 | for i in range(iterations): 65 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 66 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 67 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 68 | if (i%50 == 0): 69 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 70 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 71 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 72 | print("___________________________________") 73 | print("Iteration : ", i ) 74 | print("Validation acc : ", acc) 75 | print("Loss : ", los) 76 | print("Test acc : ", tes) 77 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 78 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 79 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 80 | writer.add_summary(validation_accuracy_result, i) 81 | writer.add_summary(testing_accuracy_result , i) 82 | writer.add_summary(loss_result , i) 83 | if (i%1000 == 0 and i != 0): 84 | save_path = saver.save(sess, "models/bilstm/pretrained_bilstm.ckpt", global_step=i) 85 | print("model is saved to %s"%save_path) 86 | writer.close() 87 | 88 | print("ending at ", datetime.datetime.now()) -------------------------------------------------------------------------------- /old_stuffs/training_helpers.py: 
-------------------------------------------------------------------------------- 1 | from random import randint 2 | import random 3 | from clear_text_to_array import * 4 | from gensim.models import Word2Vec 5 | import glob, json, re 6 | import json 7 | import numpy as np 8 | from wordtoken_to_id import * 9 | from clear_text_to_array import * 10 | from itertools import chain 11 | 12 | class DataSetHelper(): 13 | def __init__(self,): 14 | self.word2vec = Word2Vec.load('model.bin') 15 | self.ids_matrix = np.load('ids_matrix.npy') 16 | self.unknown_word_id = wordtoken_to_id(self.word2vec, "анноунүг") 17 | with open("temp_corpuses/dataset.json", "r", encoding="utf8") as f: 18 | self.dataset_json = json.load(f) 19 | self.training_set = self.dataset_json['training'] 20 | self.testing_set = self.dataset_json['testing' ] 21 | pass 22 | 23 | def sentence_to_ids(self, sentence, max_seq_length): 24 | sentence_array = clear_text_to_array(sentence) 25 | sentence_array = list(chain(*sentence_array)) 26 | sentence_array = sentence_array[:max_seq_length] 27 | ids_of_sentence = np.zeros((max_seq_length), dtype='int32') 28 | for index, word in enumerate(sentence_array): 29 | try: 30 | ids_of_sentence[index] = wordtoken_to_id(self.word2vec, word) 31 | except KeyError: 32 | ids_of_sentence[index] = self.unknown_word_id # unknown word, АННОУНҮГ 33 | return ids_of_sentence 34 | 35 | def get_training_batch(self, batch_size, max_seq_length): 36 | batch_labels = [] 37 | batch_arr = np.zeros([batch_size, max_seq_length]) 38 | for i in range(batch_size): 39 | random_corpus = random.choice(self.training_set) 40 | file_name = random_corpus[0] 41 | one_hot = random_corpus[1] 42 | category = random_corpus[2] 43 | file_path = "corpuses/"+category+"/"+file_name 44 | with open(file_path, encoding="utf8") as f: 45 | sentence = json.load(f)['body'] 46 | #print("##########################") 47 | #print(file_path) 48 | #print(sentence) 49 | ids_of_sentence = self.sentence_to_ids(sentence, max_seq_length) 50 | batch_arr[i] = ids_of_sentence 51 | batch_labels.append(one_hot) 52 | return (batch_arr, batch_labels) 53 | 54 | def get_testing_batch(self, batch_size, max_seq_length): 55 | batch_labels = [] 56 | batch_arr = np.zeros([batch_size, max_seq_length]) 57 | for i in range(batch_size): 58 | random_corpus = random.choice(self.testing_set) 59 | file_name = random_corpus[0] 60 | one_hot = random_corpus[1] 61 | category = random_corpus[2] 62 | file_path = "corpuses/"+category+"/"+file_name 63 | with open(file_path, encoding="utf8") as f: 64 | sentence = json.load(f)['body'] 65 | ids_of_sentence = self.sentence_to_ids(sentence, max_seq_length) 66 | batch_arr[i] = ids_of_sentence 67 | batch_labels.append(one_hot) 68 | return (batch_arr, batch_labels) 69 | 70 | #batch_size, seq_length = 50, 500 71 | #dataset = DataSetHelper() 72 | #for i in range(10000): 73 | # inp, label = dataset.get_training_batch(batch_size, seq_length) 74 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/training_lstm_rnn.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | 13 | dataset = DataSetHelper() 14 | 15 | tf.reset_default_graph() 16 | 17 | input_placeholder = 
tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 18 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 19 | 20 | ids_matrix = np.load('ids_matrix.npy') 21 | embeddings_tf = tf.constant(ids_matrix) 22 | 23 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 24 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 25 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 26 | 27 | lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units) 28 | lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.75) 29 | value, _ = tf.nn.dynamic_rnn(lstm_cell, batch_data, dtype=tf.float32) 30 | 31 | weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes])) 32 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 33 | value = tf.transpose(value, [1, 0, 2]) 34 | last = tf.gather(value, int(value.get_shape()[0]) - 1) 35 | prediction = tf.add(tf.matmul(last, weight), bias, name='prediction_op') 36 | 37 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 38 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 39 | 40 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 41 | optimizer = tf.train.AdamOptimizer().minimize(loss) 42 | #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) 43 | 44 | print("started at ", datetime.datetime.now()) 45 | 46 | loss_summary = tf.summary.scalar('Loss' , loss ) 47 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 48 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 49 | 50 | log_dir = "tensorboard/lstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 51 | 52 | init = tf.global_variables_initializer() 53 | with tf.Session() as sess: 54 | writer = tf.summary.FileWriter(log_dir, sess.graph) 55 | saver = tf.train.Saver() 56 | sess.run(init) 57 | 58 | for i in range(iterations): 59 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 60 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 61 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 62 | if (i%10 == 0): 63 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 64 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 65 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 66 | print("___________________________________") 67 | print("Iteration : ", i ) 68 | print("Validation : ", acc) 69 | print("Loss : ", los) 70 | print("Test acc : ", tes) 71 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 72 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 73 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 74 | writer.add_summary(validation_accuracy_result, i) 75 | writer.add_summary(testing_accuracy_result , i) 76 | 
writer.add_summary(loss_result , i) 77 | if (i%1000 == 0 and i != 0): 78 | save_path = saver.save(sess, "models/lstm/pretrained_lstm.ckpt", global_step=i) 79 | print("model is saved to %s"%save_path) 80 | writer.close() 81 | 82 | print("ended at ", datetime.datetime.now()) 83 | -------------------------------------------------------------------------------- /old_stuffs/training_stacked_lstm.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | stack_count = 5 13 | 14 | dataset = DataSetHelper() 15 | 16 | tf.reset_default_graph() 17 | 18 | def shape_detective(sess, tensor, explainer=""): 19 | print("-------------------------") 20 | print(explainer, sess.run(tf.shape(tensor))) 21 | 22 | input_placeholder = tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 23 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 24 | 25 | ids_matrix = np.load('ids_matrix.npy') 26 | embeddings_tf = tf.constant(ids_matrix) 27 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 28 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 29 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 30 | 31 | stacked_lstms = [] 32 | for i in range(stack_count): 33 | stacked_lstms.append(tf.contrib.rnn.BasicLSTMCell(lstm_units)) 34 | stacked_rnn = tf.contrib.rnn.MultiRNNCell(stacked_lstms) 35 | value_before_transpose, _ = tf.nn.dynamic_rnn(stacked_rnn, batch_data, dtype=tf.float32) 36 | 37 | weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes])) 38 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 39 | value_after_transpose = tf.transpose(value_before_transpose, [1, 0, 2]) 40 | last = tf.gather(value_after_transpose, int(value_after_transpose.get_shape()[0]) - 1) 41 | prediction = tf.add(tf.matmul(last, weight), bias, name='prediction_op') 42 | 43 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 44 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 45 | 46 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 47 | optimizer = tf.train.AdamOptimizer().minimize(loss) 48 | 49 | print("started at ", datetime.datetime.now()) 50 | 51 | loss_summary = tf.summary.scalar('Loss' , loss ) 52 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 53 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 54 | 55 | log_dir = "tensorboard/stackedlstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 56 | 57 | init = tf.global_variables_initializer() 58 | with tf.Session() as sess: 59 | writer = tf.summary.FileWriter(log_dir, sess.graph) 60 | saver = tf.train.Saver() 61 | sess.run(init) 62 | print("detecting shape flow and changes...") 63 | 64 | for i in range(iterations): 65 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 66 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 67 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: 
next_label_batch}) 68 | if (i%10 == 0): 69 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 70 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 71 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 72 | print("__________________________________") 73 | print("Iteration : ", i ) 74 | print("Validation : ", acc) 75 | print("Loss : ", los) 76 | print("Test acc : ", tes) 77 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 78 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 79 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 80 | writer.add_summary(validation_accuracy_result, i) 81 | writer.add_summary(testing_accuracy_result , i) 82 | writer.add_summary(loss_result , i) 83 | if (i%1000 == 0 and i != 0): 84 | save_path = saver.save(sess, "models/stackedlstm/pretrained_lstm.ckpt", global_step=i) 85 | print("model is saved to %s"%save_path) 86 | 87 | print("__________________________________") 88 | shape_detective(sess, input_placeholder , explainer="input_placeholder :") 89 | shape_detective(sess, label_placeholder , explainer="label_placeholder :") 90 | shape_detective(sess, embeddings_tf , explainer="embeddings :") 91 | shape_detective(sess, batch_data , explainer="batch_data before unstacking :") 92 | shape_detective(sess, weight , explainer="weight :") 93 | shape_detective(sess, bias , explainer="bias :") 94 | shape_detective(sess, value_before_transpose, explainer="value shape before transpose stacked 2 lstms :") 95 | shape_detective(sess, value_after_transpose , explainer="value shape after_transpose :") 96 | shape_detective(sess, last , explainer="shape after gather transposed value :") 97 | shape_detective(sess, prediction , explainer="dense connection, prediction shape :") 98 | 99 | 100 | writer.close() 101 | 102 | print("ended at ", datetime.datetime.now()) 103 | -------------------------------------------------------------------------------- /old_stuffs/use_freezed_model_rpc.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | from training_helpers import * 4 | from itertools import chain 5 | from clear_text_to_array import * 6 | from xmlrpc.server import SimpleXMLRPCServer # for django app 7 | import sys # handle interrupt 8 | 9 | def softmax(x): 10 | score_math_exp = np.exp(np.asarray(x)) 11 | return score_math_exp / score_math_exp.sum(0) 12 | 13 | frozen_graph = './models/bilstm/pretrained_bilstm-23000.pb' 14 | 15 | with tf.gfile.GFile(frozen_graph, "rb") as f: 16 | restored_graph_def = tf.GraphDef() 17 | restored_graph_def.ParseFromString(f.read()) 18 | 19 | with tf.Graph().as_default() as graph: 20 | input_placeholder, prediction = tf.import_graph_def( 21 | restored_graph_def, 22 | input_map = None, 23 | return_elements = ['input_placeholder', 'prediction_op'], 24 | name = '' 25 | ) 26 | 27 | input_placeholder = graph.get_tensor_by_name("input_placeholder:0") 28 | prediction_op = graph.get_tensor_by_name("prediction_op:0") 29 | 30 | dataset_helper = DataSetHelper() 31 | 32 | 33 | def get_class_name(x): 34 | switcher = { 35 | 0: 
lambda: "economy" , 36 | 1: lambda: "health" , 37 | 2: lambda: "politics" , 38 | 3: lambda: "society" , 39 | 4: lambda: "technology", 40 | 5: lambda: "world" 41 | } 42 | return switcher.get(x, lambda: "UNKNOWN")() 43 | 44 | sess = tf.Session(graph=graph) 45 | 46 | def predict_class(sess, filename): 47 | with open(filename, 'r') as content_file: 48 | content = content_file.read() 49 | 50 | max_seq_length = 500 51 | num_classes = 6 52 | 53 | word_ids = dataset_helper.sentence_to_ids(content, max_seq_length) 54 | x_batch = [] 55 | for i in range(24): 56 | x_batch.append(word_ids) 57 | x_batch = np.array(x_batch) 58 | results = [] 59 | 60 | results_tf = sess.run(prediction_op, feed_dict={input_placeholder: x_batch}) 61 | for i in results_tf: 62 | softmax_result = softmax(i) 63 | argmax = softmax_result.argmax(axis=0) 64 | name = get_class_name(argmax) 65 | results.append([softmax_result, argmax, name]) 66 | 67 | print("result: ", results[0][2]) 68 | print(results[0][0]) 69 | 70 | print('----------------------------') 71 | print("trying to predict world news") 72 | predict_class(sess, "./corpuses_test/world_news_gogo_mn.txt") 73 | 74 | print('----------------------------') 75 | print("trying to predict economy news") 76 | predict_class(sess, "./corpuses_test/economy_news_gogo_mn.txt") 77 | 78 | print('----------------------------') 79 | print("trying to predict technology news") 80 | predict_class(sess, "./corpuses_test/technology_news_gogo_mn.txt") 81 | 82 | print('----------------------------') 83 | print("trying to predict health news") 84 | predict_class(sess, "./corpuses_test/health_news_gogo_mn.txt") 85 | 86 | 87 | print('----------------------------') 88 | print("trying to predict political news") 89 | predict_class(sess, "./corpuses_test/politics_news_ikon_mn.txt") 90 | 91 | def predict_class_from_text(content): 92 | max_seq_length = 500 93 | 94 | word_ids = dataset_helper.sentence_to_ids(content, max_seq_length) 95 | x_batch = [] 96 | for i in range(24): 97 | x_batch.append(word_ids) 98 | x_batch = np.array(x_batch) 99 | results = [] 100 | 101 | results_tf = sess.run(prediction_op, feed_dict={input_placeholder: x_batch}) 102 | for i in results_tf: 103 | softmax_result = softmax(i) 104 | argmax = softmax_result.argmax(axis=0) 105 | name = get_class_name(argmax) 106 | results.append([softmax_result, argmax, name]) 107 | 108 | return str(results[0][2]) 109 | 110 | try: 111 | rpc_server = SimpleXMLRPCServer(("localhost", 50001)) 112 | print("----------------------------") 113 | print("classifier RPC server is listening on port 50001...") 114 | rpc_server.register_function(predict_class_from_text, "predict_class_from_text") 115 | rpc_server.serve_forever() 116 | except KeyboardInterrupt: 117 | sess.close() 118 | sys.exit() 119 | 120 | sess.close() 121 | -------------------------------------------------------------------------------- /old_stuffs/using pretrained word2vec for mongolian text classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Download pre trained mongolian word2vec from fasttext\n", 8 | "https://fasttext.cc/docs/en/crawl-vectors.html " 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "!wget --no-check-certificate http://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.bin.gz -P ./pretrained_word2vec" 18 | ] 19 | } 20 | ], 
21 | "metadata": { 22 | "kernelspec": { 23 | "display_name": "Python 3", 24 | "language": "python", 25 | "name": "python3" 26 | }, 27 | "language_info": { 28 | "codemirror_mode": { 29 | "name": "ipython", 30 | "version": 3 31 | }, 32 | "file_extension": ".py", 33 | "mimetype": "text/x-python", 34 | "name": "python", 35 | "nbconvert_exporter": "python", 36 | "pygments_lexer": "ipython3", 37 | "version": "3.6.7" 38 | } 39 | }, 40 | "nbformat": 4, 41 | "nbformat_minor": 2 42 | } 43 | -------------------------------------------------------------------------------- /old_stuffs/wordtoken_to_id.py: -------------------------------------------------------------------------------- 1 | 2 | def wordtoken_to_id(model, word): 3 | token_id = model.wv.vocab[word].index 4 | return token_id -------------------------------------------------------------------------------- /old_stuffs/wordvec_exp.py: -------------------------------------------------------------------------------- 1 | from gensim.models import Word2Vec 2 | 3 | sentences = [ 4 | ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'], 5 | ['this', 'is', 'the', 'second', 'sentence'], 6 | ['yet', 'another', 'sentence'], 7 | ['one', 'more', 'sentence'], 8 | ['and', 'the', 'final', 'sentence'] 9 | ] 10 | 11 | model = Word2Vec(sentences, min_count=1) 12 | print(model) 13 | words = list(model.wv.vocab) 14 | print(words) 15 | print(model['sentence']) 16 | model.save('model.bin') 17 | new_model = Word2Vec.load('model.bin') 18 | print(new_model) 19 | -------------------------------------------------------------------------------- /preprocess_dataset/preprocess_eduge.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc\n", 13 | "syswgetrc = C:\\Program Files (x86)\\GnuWin32/etc/wgetrc\n", 14 | "--2019-04-13 07:30:06-- https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz\n", 15 | "Resolving github.com... 13.250.177.223, 13.229.188.59, 52.74.223.119\n", 16 | "Connecting to github.com|13.250.177.223|:443... 
connected.\n", 17 | "OpenSSL: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version\n", 18 | "Unable to establish SSL connection.\n", 19 | "'gunzip' is not recognized as an internal or external command,\n", 20 | "operable program or batch file.\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "import os\n", 26 | "if not os.path.exists(\"eduge.csv.gz\"):\n", 27 | " !wget https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz\n", 28 | " !gunzip eduge.csv.gz" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "data": { 38 | "text/plain": [ 39 | "['урлаг соёл',\n", 40 | " 'эдийн засаг',\n", 41 | " 'эрүүл мэнд',\n", 42 | " 'хууль',\n", 43 | " 'улс төр',\n", 44 | " 'спорт',\n", 45 | " 'технологи',\n", 46 | " 'боловсрол',\n", 47 | " 'байгал орчин']" 48 | ] 49 | }, 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "import pandas as pd\n", 57 | "df = pd.read_csv(\"eduge.csv\")\n", 58 | "df = df.rename(columns=lambda x: x.strip())\n", 59 | "labels = df['label'].unique().tolist()\n", 60 | "labels" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 2, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stderr", 70 | "output_type": "stream", 71 | "text": [ 72 | "[nltk_data] Downloading package punkt to C:\\Users\\sharavsambuu-\n", 73 | "[nltk_data] laptop\\AppData\\Roaming\\nltk_data...\n" 74 | ] 75 | }, 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "['Сайн байна уу?', 'Танд энэ өдрийн мэнд хүргье.', 'Монгол текст ангилах гэж байна.']\n", 81 | "['Монгол', 'улсын', 'их', 'хурал']\n" 82 | ] 83 | }, 84 | { 85 | "name": "stderr", 86 | "output_type": "stream", 87 | "text": [ 88 | "[nltk_data] Package punkt is already up-to-date!\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "import nltk\n", 94 | "nltk.download('punkt')\n", 95 | "print(nltk.sent_tokenize(\"Сайн байна уу? Танд энэ өдрийн мэнд хүргье. 
Монгол текст ангилах гэж байна.\"))\n", 96 | "print(nltk.word_tokenize(\"Монгол улсын их хурал\"))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "import string\n", 106 | "stopwordsmn = ['аа','аанхаа','алив','ба','байдаг','байжээ','байна','байсаар','байсан','байхаа','бас','бишүү','бол','болжээ','болно','болоо','бэ','вэ','гэж','гэжээ','гэлтгүй','гэсэн','гэтэл','за','л','мөн','нь','тэр','уу','харин','хэн','ч','энэ','ээ','юм','үү','?','', '.', ',', '-','ийн','ын','тай','г','ийг','д','н','ний','дээр','юу']\n", 107 | "eduge_preprocessed = []\n", 108 | "eduge_preprocessed_stopwords = []\n", 109 | "word_dict = {}\n", 110 | "for idx, row in df.iterrows():\n", 111 | " news = row['news']\n", 112 | " label = row['label']\n", 113 | " sentences = nltk.sent_tokenize(news)\n", 114 | " news_sentences = []\n", 115 | " news_sentences_stopwords = []\n", 116 | " for sentence in sentences:\n", 117 | " tokens = nltk.word_tokenize(sentence)\n", 118 | " tokens = [w.lower() for w in tokens]\n", 119 | " table = str.maketrans('', '', string.punctuation)\n", 120 | " stripped = [w.translate(table) for w in tokens]\n", 121 | " words = [word for word in stripped if word.isalpha()]\n", 122 | " words_stopwords = [w for w in words if not w in stopwordsmn]\n", 123 | " news_sentences.append(words)\n", 124 | " news_sentences_stopwords.append(words_stopwords)\n", 125 | " for w in words:\n", 126 | " word_dict[w] = 0\n", 127 | " eduge_preprocessed.append([news_sentences, label])\n", 128 | " eduge_preprocessed_stopwords.append([news_sentences_stopwords, label])" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "import pickle\n", 138 | "\n", 139 | "with open('eduge.pickle', 'wb') as handle:\n", 140 | " pickle.dump(eduge_preprocessed, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 141 | " print(\"saved to eduge.pickle\")" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 10, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "saved to eduge_stopwords_removed.pickle\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "with open('eduge_stopwords_removed.pickle', 'wb') as handle:\n", 159 | " pickle.dump(eduge_preprocessed_stopwords, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 160 | " print(\"saved to eduge_stopwords_removed.pickle\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 12, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "word_index = {}\n", 170 | "word_index[\"<PAD>\" ] = 0\n", 171 | "word_index[\"<START>\" ] = 1\n", 172 | "word_index[\"<UNK>\" ] = 2\n", 173 | "word_index[\"<UNUSED>\"] = 3\n", 174 | "cnt = 4\n", 175 | "for k, v in word_dict.items():\n", 176 | " word_index[k] = cnt\n", 177 | " cnt += 1\n", 178 | "#print(word_index)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 15, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "reversed_word_index = dict([(value, key) for (key, value) in word_index.items()])" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 16, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "saved to word_index.pickle\n", 200 | "saved to reversed_word_index.pickle\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "with 
open('word_index.pickle', 'wb') as handle:\n", 206 | " pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 207 | " print(\"saved to word_index.pickle\")\n", 208 | " \n", 209 | "with open('reversed_word_index.pickle', 'wb') as handle:\n", 210 | " pickle.dump(reversed_word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 211 | " print(\"saved to reversed_word_index.pickle\")" 212 | ] 213 | } 214 | ], 215 | "metadata": { 216 | "kernelspec": { 217 | "display_name": "Python 3", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.6.7" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 2 236 | } 237 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | jupyter 2 | nltk 3 | pandas --------------------------------------------------------------------------------