├── .gitignore
├── README.md
├── dataset
│   ├── 94k_sentences
│   │   └── mn_sentences_94k.txt
│   ├── book
│   │   └── usan-dooguur-20k.txt
│   └── lyrics
│       └── mongolian-hiphop-lyrics.txt
├── images
│   ├── bert
│   │   └── mongolian-bert-attend-visualization.png
│   └── cnn-weights
│       ├── 1.png
│       ├── 2.png
│       ├── 3.png
│       └── 4.png
├── mn_bert_finetuning_notebooks
│   ├── Fine_tuning_Mongolian_BERT_for_eduge_classification_on_GPU,_32K_512.ipynb
│   ├── Mongolian_BERT_finetuning_for_EDUGE_on_TPU,_32K_512.ipynb
│   └── finetuning mongolian bert model on GPU.ipynb
├── neural_network_classifier_notebooks
│   ├── Mongolian_text_classification_01,_simple.ipynb
│   ├── Mongolian_text_classification_02,_word2vec_initialization,_static_embedding_matrix.ipynb
│   ├── Mongolian_text_classification_02,_word2vec_initialization,_trainable_embedding_matrix.ipynb
│   ├── Mongolian_text_classification_03,_1D_Convolution.ipynb
│   ├── Mongolian_text_classification_03,_Multiple_1D_Convolution_layers.ipynb
│   ├── Mongolian_text_classification_04,_RNN(LSTM).ipynb
│   └── Mongolian_text_classification_05,_Attention.ipynb
├── old_stuffs
│   ├── .gitignore
│   ├── README.md
│   ├── clear_create_word2vec.py
│   ├── clear_text_to_array.py
│   ├── convert_text_to_seqvector_through_embedmatrix.py
│   ├── corpuses_test
│   │   ├── economy_news_gogo_mn.txt
│   │   ├── health_news_gogo_mn.txt
│   │   ├── politics_news_ikon_mn.txt
│   │   ├── technology_news_gogo_mn.txt
│   │   └── world_news_gogo_mn.txt
│   ├── djangoapp
│   │   ├── app
│   │   │   ├── __init__.py
│   │   │   ├── admin.py
│   │   │   ├── apps.py
│   │   │   ├── forms.py
│   │   │   ├── migrations
│   │   │   │   ├── 0001_initial.py
│   │   │   │   └── __init__.py
│   │   │   ├── models.py
│   │   │   ├── tests.py
│   │   │   ├── urls.py
│   │   │   └── views.py
│   │   ├── djangoapp
│   │   │   ├── __init__.py
│   │   │   ├── settings.py
│   │   │   ├── urls.py
│   │   │   └── wsgi.py
│   │   ├── manage.py
│   │   └── templates
│   │       ├── base.html
│   │       ├── classify.html
│   │       └── home.html
│   ├── freeze_tf_model.py
│   ├── ikon_mn_scrape.py
│   ├── images
│   │   ├── accuracy.png
│   │   ├── classifiedresult.png
│   │   ├── loss.png
│   │   └── webinput.png
│   ├── mongolianstopwords.py
│   ├── numpy_embedding_matrix_tf.py
│   ├── prepare_trainingset.py
│   ├── requirements.txt
│   ├── research
│   │   ├── ikon-research.txt
│   │   └── nlp-research.txt
│   ├── stemmer.py
│   ├── training_bilstm_rnn.py
│   ├── training_helpers.py
│   ├── training_lstm_rnn.py
│   ├── training_stacked_lstm.py
│   ├── use_freezed_model_rpc.py
│   ├── using pretrained word2vec for mongolian text classification.ipynb
│   ├── wordtoken_to_id.py
│   └── wordvec_exp.py
├── preprocess_dataset
│   └── preprocess_eduge.ipynb
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
1 | env/
2 | 
3 | preprocess_dataset/.ipynb_checkpoints
4 | preprocess_dataset/eduge.csv
5 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mongolian-text-classification
2 | Mongolian Cyrillic text classification with modern TensorFlow, plus some fine-tuning of TugsTugi's pretrained BERT model.
3 | 
4 | # Load Mongolian BERT in TensorFlow 2
5 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ReDLH2DDiCt_Y800vGub8OuYJlR-TsZw)
6 | 
7 | # Generate text using Mongolian BERT
8 | 
9 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jJA-YSAsbq5gbpyGYE-8p-rCzgqSU9eX)
10 | 
11 | # Visualize Mongolian BERT using bertviz and the PyTorch model
12 | 
13 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1UEDNlfEmXxZy1jRrE7pCTZNu8DplWVQv)
14 | 
15 | ![Alt text](images/bert/mongolian-bert-attend-visualization.png?raw=true "Mongolian BERT attend")
16 | 
17 | 
18 | # Fine-tuning TugsTugi's Mongolian BERT model
19 | In TPU mode the BERT code cannot load checkpoints from the local file system, so a Google Cloud Storage (GCS) bucket has to be used; see the sketch below.
20 | 
21 | Fine-tune Mongolian BERT on TPU; you need your own GCS bucket to fine-tune on TPU [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CnGd2OnNDlxe6ZUjmOa7zg__CcKk5X85)
22 | 
23 | Fine-tune Mongolian BERT on GPU; this takes a lot of computation, and a small batch size is necessary because of the memory limit [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1u9mVeWRh7GWLONAzZ3XpJciPfv38vHaZ)
24 | 
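The TPU notebook therefore keeps both the pretrained checkpoint and the fine-tuning output directory in a bucket. The snippet below is only a minimal sketch of that layout; the bucket name, directory and checkpoint names, and the flag list in the comments are illustrative assumptions, not the notebook's exact values.

```python
# Minimal sketch of the GCS layout needed for TPU fine-tuning (all names are placeholders):
# the TPU workers cannot read the Colab VM's local disk, so both the initial checkpoint
# and the output directory must be gs:// URIs inside a bucket you own.
import os

BUCKET = "gs://your-bucket-name"   # assumption: your own bucket, in a TPU-accessible region

# Copy the pretrained Mongolian BERT files into the bucket once, e.g. from a Colab cell:
#   !gsutil -m cp -r mongolian_bert_model/* gs://your-bucket-name/mongolian_bert/
BERT_GCS_DIR    = os.path.join(BUCKET, "mongolian_bert")
INIT_CHECKPOINT = os.path.join(BERT_GCS_DIR, "model.ckpt")          # checkpoint prefix (placeholder name)
OUTPUT_DIR      = os.path.join(BUCKET, "eduge_finetuning_output")   # fine-tuned model is written here

# The BERT fine-tuning script is then pointed only at bucket paths, roughly:
#   --init_checkpoint=<INIT_CHECKPOINT> --output_dir=<OUTPUT_DIR> --use_tpu=True --tpu_name=<TPU address>
print(INIT_CHECKPOINT)
print(OUTPUT_DIR)
```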
25 | # Classifiers using simple neural networks
26 | 
27 | No 01, Simplest classifier [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ulv6tUAjOsp-jN4sTdef3lTuJb0yX4qy)
28 | 
29 | No 02, Pretrained word2vec initialization from Facebook's fastText, a simple form of transfer learning (a minimal sketch of this setup appears below, after the series overview). With a frozen, non-trainable embedding layer [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SfwdhIoRMi4kXeAN8eUjYXKuT5zig9WV) and with a trainable embedding layer [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1WQvCa6KDOxQ2YjDdb48g4zsN60_Svbhg)
30 | 
31 | No 03, 1D convolution [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JgJN74E1w1x8RSjm9qi06uw6y0I_9k1J) and multiple 1D convolution layers [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lTh2dG64L4aJsCip714sCA_xQgMttxOb)
32 | 
33 | No 04, LSTM [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j0MN3UTGz-990bl61n5B1mrtjnq8hSdh)
34 | 
35 | Visualize RNN neuron firing during text generation [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ndM1G-0qZx4wi6E9kPL1D9IjaM0pq3r9)
36 | 
37 | No 05, LSTM with attention, including visualization of attention scores in text classification [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10nPgRmbZsjad46CdVJKRHklestXcEpZ5)
38 | 
39 | No 06, Classification with Mongolian BERT and TensorFlow 2.0, with frozen BERT layers [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JQ87pFlGkDMbpHQp9ZSiyrQwdiRUAm-N)
40 | 
41 | No 07, Classification with Mongolian BERT large, using HuggingFace and TensorFlow 2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e8t6MZMvpoINkXtv7o1h3ESZq04pXPUv?usp=sharing)
42 | 
43 | 
44 | 
45 | # Mongolian sentence interpolation experiments
46 | 
47 | Sequence loss in Keras and TF2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jlyB2fOi_JBAi4WPMVDJ_e8-_WHtQK_9)
48 | 
49 | Variational autoencoder for Mongolian text [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tBTudj9M5CGih3p8Uxj0R1SA6f3BJj-Z)
50 | 
51 | # Other experiments
52 | Predict the next word, greedy text generation [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1urjsJUuNTnTAAAqu_eXpIkwRWUi72xp_)
53 | 
54 | # Topics this series covers (or will cover)
55 | word2vec initialization, 1D convolution, RNN variants, attention, weight visualization for interpretability, Transformers, techniques for handling longer texts, and so on...
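As referenced in the No 02 item above, the word2vec-initialization notebooks build an embedding matrix from pretrained fastText vectors and pass it to a Keras `Embedding` layer, either frozen (the static variant) or trainable. The sketch below is minimal and self-contained; the tiny `word_index` and `word2vec_model` stand-ins replace the real Eduge vocabulary and the `cc.mn.300` fastText model loaded with gensim in the notebooks.

```python
# Minimal sketch of pretrained-embedding initialization (not the exact notebook code).
# word_index / word2vec_model are tiny stand-ins for the Eduge vocabulary and the
# fastText cc.mn.300 vectors used in the notebooks.
import numpy as np
from tensorflow import keras

EMBED_DIM      = 300
word_index     = {"<pad>": 0, "<unk>": 1, "монгол": 2, "спорт": 3}
word2vec_model = {"монгол": np.ones(EMBED_DIM), "спорт": np.full(EMBED_DIM, 0.5)}

# One row per vocabulary id; words without a pretrained vector keep a random row.
vocab_size       = len(word_index)
embedding_matrix = np.random.uniform(-1, 1, (vocab_size, EMBED_DIM))
for word, idx in word_index.items():
    if word in word2vec_model:
        embedding_matrix[idx] = word2vec_model[word]

# Static variant: the pretrained vectors stay frozen during training.
static_embedding = keras.layers.Embedding(
    vocab_size, EMBED_DIM, weights=[embedding_matrix], trainable=False)

# Trainable variant: same initialization, but updated together with the classifier.
tuned_embedding = keras.layers.Embedding(
    vocab_size, EMBED_DIM, weights=[embedding_matrix], trainable=True)
```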
56 | 57 | 58 | # useful references and resources 59 | - Mongolian BERT models 60 | https://github.com/tugstugi/mongolian-bert 61 | - Mongolian NLP 62 | https://github.com/tugstugi/mongolian-nlp 63 | - Eduge classification baseline using SVM 64 | https://colab.research.google.com/github/tugstugi/mongolian-nlp/blob/master/misc/Eduge_SVM.ipynb 65 | - News crawler 66 | https://github.com/codelucas/newspaper 67 | 68 | # Images and screenshots 69 | 70 | ![Alt text](images/cnn-weights/1.png?raw=true "CNN weights 1") 71 | ![Alt text](images/cnn-weights/2.png?raw=true "CNN weights 2") 72 | ![Alt text](images/cnn-weights/3.png?raw=true "CNN weights 3") 73 | ![Alt text](images/cnn-weights/4.png?raw=true "CNN weights 4") 74 | -------------------------------------------------------------------------------- /images/bert/mongolian-bert-attend-visualization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/bert/mongolian-bert-attend-visualization.png -------------------------------------------------------------------------------- /images/cnn-weights/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/1.png -------------------------------------------------------------------------------- /images/cnn-weights/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/2.png -------------------------------------------------------------------------------- /images/cnn-weights/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/3.png -------------------------------------------------------------------------------- /images/cnn-weights/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/images/cnn-weights/4.png -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_01,_simple.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #01, simple.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "In this notebook I'm gonna try to classify cyrillic mongolian texts using modern Tensorflow 2.0\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish (sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: 
https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "23b48c0b-bba9-4004-ad2e-40fdb909a037", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 34 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "import tensorflow as tf\n", 53 | "from tensorflow import keras\n", 54 | "\n", 55 | "import numpy as np\n", 56 | "\n", 57 | "print(tf.__version__)" 58 | ], 59 | "execution_count": 1, 60 | "outputs": [ 61 | { 62 | "output_type": "stream", 63 | "text": [ 64 | "2.0.0-alpha0\n" 65 | ], 66 | "name": "stdout" 67 | } 68 | ] 69 | }, 70 | { 71 | "metadata": { 72 | "id": "smJeJfoo4qcu", 73 | "colab_type": "text" 74 | }, 75 | "cell_type": "markdown", 76 | "source": [ 77 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) " 78 | ] 79 | }, 80 | { 81 | "metadata": { 82 | "id": "CDayX_Yx3REh", 83 | "colab_type": "code", 84 | "outputId": "660f4571-4c05-4391-dfb9-e1a89b897f29", 85 | "colab": { 86 | "base_uri": "https://localhost:8080/", 87 | "height": 476 88 | } 89 | }, 90 | "cell_type": "code", 91 | "source": [ 92 | "import os\n", 93 | "from os.path import exists, join, basename, splitext\n", 94 | "import sys\n", 95 | "\n", 96 | "def download_from_google_drive(file_id, file_name):\n", 97 | " !rm -f ./cookie\n", 98 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 99 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 100 | " confirm_text = confirm_text[0]\n", 101 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 102 | " \n", 103 | "# download eduge pickles\n", 104 | "file_path = 'eduge_pickles'\n", 105 | "if not exists(file_path):\n", 106 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 107 | " rar_file = file_path+\".rar\"\n", 108 | " !unrar x $rar_file" 109 | ], 110 | "execution_count": 2, 111 | "outputs": [ 112 | { 113 | "output_type": "stream", 114 | "text": [ 115 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 116 | " Dload Upload Total Spent Left Speed\n", 117 | "100 388 0 388 0 0 4511 0 --:--:-- --:--:-- --:--:-- 4511\n", 118 | "100 106M 0 106M 0 0 90.6M 0 --:--:-- 0:00:01 --:--:-- 126M\n", 119 | "\n", 120 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 121 | "\n", 122 | "\n", 123 | "Extracting from eduge_pickles.rar\n", 124 | "\n", 125 | "\n", 126 | "Would you like to replace the existing file word_index.pickle\n", 127 | "9178153 bytes, modified on 2019-04-13 01:44\n", 128 | "with a new one\n", 129 | "9178153 bytes, modified on 2019-04-13 01:44\n", 130 | "\n", 131 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit n\n", 132 | "\n", 133 | "\n", 134 | "Would you like to replace the existing file eduge.pickle\n", 135 | "359611555 bytes, modified on 2019-04-13 01:44\n", 136 | "with a new one\n", 137 | "359611555 bytes, modified on 2019-04-13 01:44\n", 138 | "\n", 139 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 140 | "\n", 141 | "Program aborted\n" 142 | ], 143 | "name": "stdout" 144 | } 145 | ] 146 | }, 147 | { 148 | "metadata": { 149 
| "id": "pPHJcnfi4Rzg", 150 | "colab_type": "code", 151 | "colab": {} 152 | }, 153 | "cell_type": "code", 154 | "source": [ 155 | "import pickle\n", 156 | "\n", 157 | "with open('word_index.pickle', 'rb') as handle:\n", 158 | " word_index = pickle.load(handle)\n", 159 | " \n", 160 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 161 | " reversed_word_index = pickle.load(handle)\n", 162 | " \n", 163 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 164 | " eduge_ds = pickle.load(handle)" 165 | ], 166 | "execution_count": 0, 167 | "outputs": [] 168 | }, 169 | { 170 | "metadata": { 171 | "id": "XFxd1QGR65VV", 172 | "colab_type": "code", 173 | "colab": {} 174 | }, 175 | "cell_type": "code", 176 | "source": [ 177 | "MAX_LEN = 512\n", 178 | "\n", 179 | "import itertools\n", 180 | "\n", 181 | "for item in eduge_ds:\n", 182 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 183 | ], 184 | "execution_count": 0, 185 | "outputs": [] 186 | }, 187 | { 188 | "metadata": { 189 | "id": "U8PTeX0WCbhR", 190 | "colab_type": "code", 191 | "colab": {} 192 | }, 193 | "cell_type": "code", 194 | "source": [ 195 | "from sklearn.model_selection import train_test_split\n", 196 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 197 | ], 198 | "execution_count": 0, 199 | "outputs": [] 200 | }, 201 | { 202 | "metadata": { 203 | "id": "8mgMCFcgDHH4", 204 | "colab_type": "code", 205 | "colab": {} 206 | }, 207 | "cell_type": "code", 208 | "source": [ 209 | "train_data_words = [i[0] for i in train]\n", 210 | "train_label_words = [i[1] for i in train]\n", 211 | "test_data_words = [i[0] for i in test ]\n", 212 | "test_label_words = [i[1] for i in test ]" 213 | ], 214 | "execution_count": 0, 215 | "outputs": [] 216 | }, 217 | { 218 | "metadata": { 219 | "id": "rrXC7UiuFkCH", 220 | "colab_type": "code", 221 | "colab": {} 222 | }, 223 | "cell_type": "code", 224 | "source": [ 225 | "def encode_news(text):\n", 226 | " return [word_index.get(i, 2) for i in text]\n", 227 | " \n", 228 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 229 | "test_data = [encode_news(sent) for sent in test_data_words ]" 230 | ], 231 | "execution_count": 0, 232 | "outputs": [] 233 | }, 234 | { 235 | "metadata": { 236 | "id": "FV-h_avPEzM1", 237 | "colab_type": "code", 238 | "colab": {} 239 | }, 240 | "cell_type": "code", 241 | "source": [ 242 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 243 | " value=word_index[\"\"],\n", 244 | " padding='post',\n", 245 | " maxlen=MAX_LEN)\n", 246 | "\n", 247 | "test_data = keras.preprocessing.sequence.pad_sequences(test_data,\n", 248 | " value=word_index[\"\"],\n", 249 | " padding='post',\n", 250 | " maxlen=MAX_LEN)" 251 | ], 252 | "execution_count": 0, 253 | "outputs": [] 254 | }, 255 | { 256 | "metadata": { 257 | "id": "gDVqmPqxIMid", 258 | "colab_type": "code", 259 | "outputId": "20a69735-0785-4267-ae33-da76cb72e475", 260 | "colab": { 261 | "base_uri": "https://localhost:8080/", 262 | "height": 170 263 | } 264 | }, 265 | "cell_type": "code", 266 | "source": [ 267 | "labels = list(set(test_label_words))\n", 268 | "labels" 269 | ], 270 | "execution_count": 9, 271 | "outputs": [ 272 | { 273 | "output_type": "execute_result", 274 | "data": { 275 | "text/plain": [ 276 | "['боловсрол',\n", 277 | " 'байгал орчин',\n", 278 | " 'хууль',\n", 279 | " 'эдийн засаг',\n", 280 | " 'улс төр',\n", 281 | " 'эрүүл мэнд',\n", 282 | " 'урлаг соёл',\n", 283 | " 'спорт',\n", 284 | " 'технологи']" 285 | ] 286 | }, 287 | 
"metadata": { 288 | "tags": [] 289 | }, 290 | "execution_count": 9 291 | } 292 | ] 293 | }, 294 | { 295 | "metadata": { 296 | "id": "PBKj3GQqJq29", 297 | "colab_type": "code", 298 | "colab": {} 299 | }, 300 | "cell_type": "code", 301 | "source": [ 302 | "from sklearn.preprocessing import LabelBinarizer\n", 303 | "encoder = LabelBinarizer()\n", 304 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 305 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 306 | ], 307 | "execution_count": 0, 308 | "outputs": [] 309 | }, 310 | { 311 | "metadata": { 312 | "id": "DPq45PN5HZ15", 313 | "colab_type": "code", 314 | "outputId": "8ad202f8-3c10-4cbe-86f0-b68183fa8c16", 315 | "colab": { 316 | "base_uri": "https://localhost:8080/", 317 | "height": 289 318 | } 319 | }, 320 | "cell_type": "code", 321 | "source": [ 322 | "vocab_size = len(word_index)\n", 323 | "\n", 324 | "model = keras.Sequential()\n", 325 | "model.add(keras.layers.Embedding(vocab_size, 16))\n", 326 | "model.add(keras.layers.GlobalAveragePooling1D())\n", 327 | "model.add(keras.layers.Dense(16, activation='relu'))\n", 328 | "model.add(keras.layers.Dense(len(labels), activation='sigmoid'))\n", 329 | "\n", 330 | "model.summary()" 331 | ], 332 | "execution_count": 11, 333 | "outputs": [ 334 | { 335 | "output_type": "stream", 336 | "text": [ 337 | "Model: \"sequential\"\n", 338 | "_________________________________________________________________\n", 339 | "Layer (type) Output Shape Param # \n", 340 | "=================================================================\n", 341 | "embedding (Embedding) (None, None, 16) 5932704 \n", 342 | "_________________________________________________________________\n", 343 | "global_average_pooling1d (Gl (None, 16) 0 \n", 344 | "_________________________________________________________________\n", 345 | "dense (Dense) (None, 16) 272 \n", 346 | "_________________________________________________________________\n", 347 | "dense_1 (Dense) (None, 9) 153 \n", 348 | "=================================================================\n", 349 | "Total params: 5,933,129\n", 350 | "Trainable params: 5,933,129\n", 351 | "Non-trainable params: 0\n", 352 | "_________________________________________________________________\n" 353 | ], 354 | "name": "stdout" 355 | } 356 | ] 357 | }, 358 | { 359 | "metadata": { 360 | "id": "cAgP1KlqHu2F", 361 | "colab_type": "code", 362 | "colab": {} 363 | }, 364 | "cell_type": "code", 365 | "source": [ 366 | "model.compile(optimizer='adam',\n", 367 | " loss='categorical_crossentropy',\n", 368 | " metrics=['accuracy'])" 369 | ], 370 | "execution_count": 0, 371 | "outputs": [] 372 | }, 373 | { 374 | "metadata": { 375 | "id": "ZPw8roFQKrHm", 376 | "colab_type": "code", 377 | "outputId": "c54c1e93-baaf-4097-ea8e-7da87bcb8080", 378 | "colab": { 379 | "base_uri": "https://localhost:8080/", 380 | "height": 51 381 | } 382 | }, 383 | "cell_type": "code", 384 | "source": [ 385 | "print(len(train_data), len(train_label))\n", 386 | "print(len(test_data ), len(test_label) )\n", 387 | "\n", 388 | "partial_index = 3000\n", 389 | "\n", 390 | "x_val = train_data[:partial_index]\n", 391 | "partial_x_train = train_data[partial_index:]\n", 392 | "\n", 393 | "y_val = train_label[:partial_index]\n", 394 | "partial_y_train = train_label[partial_index:]" 395 | ], 396 | "execution_count": 13, 397 | "outputs": [ 398 | { 399 | "output_type": "stream", 400 | "text": [ 401 | "68094 68094\n", 402 | "7567 7567\n" 403 | ], 404 | "name": "stdout" 405 | } 406 | ] 407 | }, 
408 | { 409 | "metadata": { 410 | "id": "iSTB4--RKacs", 411 | "colab_type": "code", 412 | "outputId": "ecdc60b8-05be-43a4-c72d-6056d86e7ba0", 413 | "colab": { 414 | "base_uri": "https://localhost:8080/", 415 | "height": 1074 416 | } 417 | }, 418 | "cell_type": "code", 419 | "source": [ 420 | "epochs = 30\n", 421 | "history = model.fit(partial_x_train,\n", 422 | " partial_y_train,\n", 423 | " epochs=epochs,\n", 424 | " batch_size=512,\n", 425 | " validation_data=(x_val, y_val),\n", 426 | " verbose=1)" 427 | ], 428 | "execution_count": 14, 429 | "outputs": [ 430 | { 431 | "output_type": "stream", 432 | "text": [ 433 | "Train on 65094 samples, validate on 3000 samples\n", 434 | "Epoch 1/30\n", 435 | "65094/65094 [==============================] - 5s 73us/sample - loss: 2.1506 - accuracy: 0.2542 - val_loss: 2.0977 - val_accuracy: 0.2807\n", 436 | "Epoch 2/30\n", 437 | "65094/65094 [==============================] - 4s 58us/sample - loss: 2.0284 - accuracy: 0.3043 - val_loss: 1.9403 - val_accuracy: 0.3053\n", 438 | "Epoch 3/30\n", 439 | "65094/65094 [==============================] - 4s 58us/sample - loss: 1.7681 - accuracy: 0.3700 - val_loss: 1.5439 - val_accuracy: 0.4700\n", 440 | "Epoch 4/30\n", 441 | "65094/65094 [==============================] - 4s 59us/sample - loss: 1.3003 - accuracy: 0.6406 - val_loss: 1.1447 - val_accuracy: 0.7063\n", 442 | "Epoch 5/30\n", 443 | "65094/65094 [==============================] - 4s 60us/sample - loss: 1.0101 - accuracy: 0.7528 - val_loss: 0.9562 - val_accuracy: 0.7673\n", 444 | "Epoch 6/30\n", 445 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.8498 - accuracy: 0.7863 - val_loss: 0.8369 - val_accuracy: 0.7790\n", 446 | "Epoch 7/30\n", 447 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.7358 - accuracy: 0.8130 - val_loss: 0.7477 - val_accuracy: 0.8180\n", 448 | "Epoch 8/30\n", 449 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.6472 - accuracy: 0.8410 - val_loss: 0.6793 - val_accuracy: 0.8297\n", 450 | "Epoch 9/30\n", 451 | "65094/65094 [==============================] - 4s 59us/sample - loss: 0.5772 - accuracy: 0.8601 - val_loss: 0.6277 - val_accuracy: 0.8407\n", 452 | "Epoch 10/30\n", 453 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.5213 - accuracy: 0.8747 - val_loss: 0.5863 - val_accuracy: 0.8520\n", 454 | "Epoch 11/30\n", 455 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.4766 - accuracy: 0.8854 - val_loss: 0.5565 - val_accuracy: 0.8577\n", 456 | "Epoch 12/30\n", 457 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.4396 - accuracy: 0.8928 - val_loss: 0.5314 - val_accuracy: 0.8637\n", 458 | "Epoch 13/30\n", 459 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.4087 - accuracy: 0.9005 - val_loss: 0.5110 - val_accuracy: 0.8710\n", 460 | "Epoch 14/30\n", 461 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.3825 - accuracy: 0.9061 - val_loss: 0.4965 - val_accuracy: 0.8693\n", 462 | "Epoch 15/30\n", 463 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3597 - accuracy: 0.9116 - val_loss: 0.4843 - val_accuracy: 0.8727\n", 464 | "Epoch 16/30\n", 465 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3387 - accuracy: 0.9166 - val_loss: 0.4732 - val_accuracy: 0.8783\n", 466 | "Epoch 17/30\n", 467 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.3199 - accuracy: 0.9215 - 
val_loss: 0.4653 - val_accuracy: 0.8767\n", 468 | "Epoch 18/30\n", 469 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.3028 - accuracy: 0.9257 - val_loss: 0.4574 - val_accuracy: 0.8797\n", 470 | "Epoch 19/30\n", 471 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2869 - accuracy: 0.9298 - val_loss: 0.4520 - val_accuracy: 0.8810\n", 472 | "Epoch 20/30\n", 473 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2724 - accuracy: 0.9333 - val_loss: 0.4472 - val_accuracy: 0.8823\n", 474 | "Epoch 21/30\n", 475 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2586 - accuracy: 0.9362 - val_loss: 0.4437 - val_accuracy: 0.8807\n", 476 | "Epoch 22/30\n", 477 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2458 - accuracy: 0.9399 - val_loss: 0.4395 - val_accuracy: 0.8813\n", 478 | "Epoch 23/30\n", 479 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2338 - accuracy: 0.9427 - val_loss: 0.4363 - val_accuracy: 0.8830\n", 480 | "Epoch 24/30\n", 481 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2222 - accuracy: 0.9459 - val_loss: 0.4348 - val_accuracy: 0.8837\n", 482 | "Epoch 25/30\n", 483 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.2116 - accuracy: 0.9487 - val_loss: 0.4333 - val_accuracy: 0.8827\n", 484 | "Epoch 26/30\n", 485 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.2012 - accuracy: 0.9514 - val_loss: 0.4339 - val_accuracy: 0.8820\n", 486 | "Epoch 27/30\n", 487 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.1923 - accuracy: 0.9537 - val_loss: 0.4341 - val_accuracy: 0.8833\n", 488 | "Epoch 28/30\n", 489 | "65094/65094 [==============================] - 4s 61us/sample - loss: 0.1829 - accuracy: 0.9561 - val_loss: 0.4325 - val_accuracy: 0.8843\n", 490 | "Epoch 29/30\n", 491 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.1750 - accuracy: 0.9580 - val_loss: 0.4349 - val_accuracy: 0.8833\n", 492 | "Epoch 30/30\n", 493 | "65094/65094 [==============================] - 4s 60us/sample - loss: 0.1663 - accuracy: 0.9596 - val_loss: 0.4335 - val_accuracy: 0.8850\n" 494 | ], 495 | "name": "stdout" 496 | } 497 | ] 498 | }, 499 | { 500 | "metadata": { 501 | "id": "r8_mvDjYL3CX", 502 | "colab_type": "code", 503 | "outputId": "1f0a15f4-a747-4c07-b6c4-493efd4ee2a8", 504 | "colab": { 505 | "base_uri": "https://localhost:8080/", 506 | "height": 51 507 | } 508 | }, 509 | "cell_type": "code", 510 | "source": [ 511 | "results = model.evaluate(test_data, test_label)\n", 512 | "print(results)" 513 | ], 514 | "execution_count": 15, 515 | "outputs": [ 516 | { 517 | "output_type": "stream", 518 | "text": [ 519 | "7567/7567 [==============================] - 1s 88us/sample - loss: 0.4136 - accuracy: 0.8960\n", 520 | "[0.4136109899459213, 0.8959958]\n" 521 | ], 522 | "name": "stdout" 523 | } 524 | ] 525 | }, 526 | { 527 | "metadata": { 528 | "id": "VaIioR7EPfig", 529 | "colab_type": "code", 530 | "outputId": "f19a6548-45fe-4cc1-d906-f6f79bce804a", 531 | "colab": { 532 | "base_uri": "https://localhost:8080/", 533 | "height": 71 534 | } 535 | }, 536 | "cell_type": "code", 537 | "source": [ 538 | "data_index = 12\n", 539 | "data_words = \" \".join(test_data_words[data_index])\n", 540 | "data_indexes = test_data[data_index]\n", 541 | "print(data_words)\n", 542 | "#print(data_indexes)\n", 543 | "import numpy as np\n", 544 | "predicted = 
model.predict([data_indexes])\n", 545 | "print(encoder.classes_[np.argmax(predicted)])" 546 | ], 547 | "execution_count": 16, 548 | "outputs": [ 549 | { 550 | "output_type": "stream", 551 | "text": [ 552 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн дэмжигч үзэгч волейболын спортын мэргэжилтэн багш нар секцэнд хичээллэгч хүүхдүүд зохион байгуулалтай ирэхээр ялангуяа тэмцээн болох газартай хамгийн ойрхон хануул дүүргийн здтгазар дүүргийнхээ ард иргэд хөдөлмөрчид сургуулийн сурагчид оюутнууд цэргийн албан хаагчид буянтухаа орчимын албан байгууллага хамт олныг идэвхтэй оролцуулах арга хэмжээ авч эхлэжээ олон зуун оюутнууд тэмцээн үзэх боллоо тэмцээний өдрүүдэд нийслэлээс буянтухаагийн спортын ордонг чиглэсэн хүмүүсийн цуваа ихсэх төлөвтэй учир нийслэлд үйл ажиллагаа явуулж байгаа орчим идсийн оюутнууд тэмцээнийг анги сургууль хамт олноороо үзэх сонирхолтой байгаагаа монголын оюутны холбоо биеийн тамирын тэнхимдээ хүсчээ үүний дагуу бсшуяам мох монголын оюутны спортын холбоо мосх ноос тэмцээнийг өдөр бүр гаруй сургуулийн орчим оюутнууд нэгдсэн хуваарийн дагуу үзэх хуваарийг бсшуяны төрийн нарийн бичгийн дарга зохион байгуулах үндэсний хороо збх ны гишүүн ддалайжаргал батлан сургуулиудад албан тоотоор хүргүүлжээ мосхд тэмцээнийг үзэхээр олон арван сургуулиуд оюутны тоогоо өгч бүртгүүлж суудлын хувиарлалтанд орж байгаа ажээ ялангуяа биеийн тамирын мэргэжлийн дээд сургуулийн оюутнууд дадлага хичээлээ тэмцээний үеэр хийхээр хичээлийн хувиараа зохицуулсан нийслэлийн засаг дарга оюутнуудад туслав улаанбаатар хотноо болдог 
оюутны олон улс тив дэлхийн тэмцээн бүрт нийслэлийн засаг дарга гмөнхбаяр ихээхэн туслалцаа үзүүлэн оюутан залуусаа байнга дэмжин оролцдог ажээ тэрээр тус тэмцээнд оролцохоор бэлтгэж байгаа монголын оюутны шигшээ багийн тамирчидын хоногийн бэлтгэл сургалтын зардлыг хариуцан гаргасан хөрөнгө санхүүгийн хүндрэлтэй байгаа үеэд тэмцээнд бэлтгэж байгаа оюутан тамирчидаа цагаа олж хэрэгцээтэй үеэд дэмжлээ мосхолбоо монголын волейболын холбоо мвх тамирчидынхаа өмнөөс талархал илэрхийлжээ монголын баг тамирчид эрдэнэт хотод оны сарын өдрөөс эхлэн хоногийн бэлтгэл хийснийхээ дараа ийнхүү нийслэлийн засаг даргын туслалцаатайгаар гадаадын тамирчидтай хамт байрлах цэцэг зочид буудалдаа орж бэлтгэл сургуулиалтаа үргэлжүүлэх боломжтой нздтгазраас баг\n", 553 | "спорт\n" 554 | ], 555 | "name": "stdout" 556 | } 557 | ] 558 | } 559 | ] 560 | } -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_04,_RNN(LSTM).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #04, RNN(LSTM).ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "In this notebook I'm gonna try to classify cyrillic mongolian texts with LSTM.\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish (sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "70ae13ba-6931-476b-9d75-49ddb96c4bd9", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 360 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "!pip install gensim\n", 53 | "\n", 54 | "import tensorflow as tf\n", 55 | "from tensorflow import keras\n", 56 | "\n", 57 | "import numpy as np\n", 58 | "\n", 59 | "print(tf.__version__)" 60 | ], 61 | "execution_count": 1, 62 | "outputs": [ 63 | { 64 | "output_type": "stream", 65 | "text": [ 66 | "Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)\n", 67 | "Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.8.1)\n", 68 | "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)\n", 69 | "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.2.1)\n", 70 | "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.16.2)\n", 71 | "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)\n", 72 | "Requirement already satisfied: bz2file in /usr/local/lib/python3.6/dist-packages (from 
smart-open>=1.2.1->gensim) (0.98)\n", 73 | "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.18.4)\n", 74 | "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.9.130)\n", 75 | "Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2.6)\n", 76 | "Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (1.22)\n", 77 | "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (3.0.4)\n", 78 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2019.3.9)\n", 79 | "Requirement already satisfied: botocore<1.13.0,>=1.12.130 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.12.130)\n", 80 | "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.2.0)\n", 81 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.4)\n", 82 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (2.5.3)\n", 83 | "Requirement already satisfied: docutils>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (0.14)\n", 84 | "2.0.0-alpha0\n" 85 | ], 86 | "name": "stdout" 87 | } 88 | ] 89 | }, 90 | { 91 | "metadata": { 92 | "id": "smJeJfoo4qcu", 93 | "colab_type": "text" 94 | }, 95 | "cell_type": "markdown", 96 | "source": [ 97 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) preprocessing eats a lot of CPU cycle so it's good idea to cook it before using colab." 
98 | ] 99 | }, 100 | { 101 | "metadata": { 102 | "id": "CDayX_Yx3REh", 103 | "colab_type": "code", 104 | "outputId": "225460ef-a61d-486e-e831-6c132cd65f5b", 105 | "colab": { 106 | "base_uri": "https://localhost:8080/", 107 | "height": 340 108 | } 109 | }, 110 | "cell_type": "code", 111 | "source": [ 112 | "import os\n", 113 | "from os.path import exists, join, basename, splitext\n", 114 | "import sys\n", 115 | "\n", 116 | "def download_from_google_drive(file_id, file_name):\n", 117 | " !rm -f ./cookie\n", 118 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 119 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 120 | " confirm_text = confirm_text[0]\n", 121 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 122 | " \n", 123 | "# download eduge pickles\n", 124 | "file_path = 'eduge_pickles'\n", 125 | "if not exists(file_path):\n", 126 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 127 | " rar_file = file_path+\".rar\"\n", 128 | " !unrar x $rar_file" 129 | ], 130 | "execution_count": 2, 131 | "outputs": [ 132 | { 133 | "output_type": "stream", 134 | "text": [ 135 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 136 | " Dload Upload Total Spent Left Speed\n", 137 | "100 388 0 388 0 0 4974 0 --:--:-- --:--:-- --:--:-- 4974\n", 138 | "100 106M 0 106M 0 0 104M 0 --:--:-- 0:00:01 --:--:-- 231M\n", 139 | "\n", 140 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 141 | "\n", 142 | "\n", 143 | "Extracting from eduge_pickles.rar\n", 144 | "\n", 145 | "\n", 146 | "Would you like to replace the existing file word_index.pickle\n", 147 | "9178153 bytes, modified on 2019-04-13 01:44\n", 148 | "with a new one\n", 149 | "9178153 bytes, modified on 2019-04-13 01:44\n", 150 | "\n", 151 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 152 | "\n", 153 | "Program aborted\n" 154 | ], 155 | "name": "stdout" 156 | } 157 | ] 158 | }, 159 | { 160 | "metadata": { 161 | "id": "pPHJcnfi4Rzg", 162 | "colab_type": "code", 163 | "colab": {} 164 | }, 165 | "cell_type": "code", 166 | "source": [ 167 | "import pickle\n", 168 | "\n", 169 | "with open('word_index.pickle', 'rb') as handle:\n", 170 | " word_index = pickle.load(handle)\n", 171 | " \n", 172 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 173 | " reversed_word_index = pickle.load(handle)\n", 174 | " \n", 175 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 176 | " eduge_ds = pickle.load(handle)" 177 | ], 178 | "execution_count": 0, 179 | "outputs": [] 180 | }, 181 | { 182 | "metadata": { 183 | "id": "ASRW7ISNnbM-", 184 | "colab_type": "code", 185 | "outputId": "3db9b6a2-330f-400c-964c-60fb90b35182", 186 | "colab": { 187 | "base_uri": "https://localhost:8080/", 188 | "height": 51 189 | } 190 | }, 191 | "cell_type": "code", 192 | "source": [ 193 | "# facebook trained word2vec on both commoncrawl and wikipedia. 
So this model should contain enough representation about our mongolian words.\n", 194 | "mongolian_word2vec_download=\"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\"\n", 195 | "if not exists(\"cc.mn.300.bin.gz\"):\n", 196 | " !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 197 | "if exists('cc.mn.300.bin.gz'):\n", 198 | " !gunzip cc.mn.300.bin.gz" 199 | ], 200 | "execution_count": 4, 201 | "outputs": [ 202 | { 203 | "output_type": "stream", 204 | "text": [ 205 | "gzip: cc.mn.300.bin already exists; do you wish to overwrite (y or n)? n\n", 206 | "\tnot overwritten\n" 207 | ], 208 | "name": "stdout" 209 | } 210 | ] 211 | }, 212 | { 213 | "metadata": { 214 | "id": "BqGAauUZpnFz", 215 | "colab_type": "code", 216 | "outputId": "5efb56a9-8cae-44a4-8474-bc1db3d56564", 217 | "colab": { 218 | "base_uri": "https://localhost:8080/", 219 | "height": 88 220 | } 221 | }, 222 | "cell_type": "code", 223 | "source": [ 224 | "from gensim.models.wrappers import FastText\n", 225 | "\n", 226 | "word2vec_model = FastText.load_fasttext_format('cc.mn.300.bin')" 227 | ], 228 | "execution_count": 5, 229 | "outputs": [ 230 | { 231 | "output_type": "stream", 232 | "text": [ 233 | "WARNING: Logging before flag parsing goes to stderr.\n", 234 | "W0414 01:07:52.334683 139997640750976 ssh.py:33] paramiko missing, opening SSH/SCP/SFTP paths will be disabled. `pip install paramiko` to suppress\n", 235 | "W0414 01:07:52.809054 139997640750976 word2vec.py:573] Slow version of gensim.models.deprecated.word2vec is being used\n" 236 | ], 237 | "name": "stderr" 238 | } 239 | ] 240 | }, 241 | { 242 | "metadata": { 243 | "id": "kkc1iiqJp-CJ", 244 | "colab_type": "code", 245 | "outputId": "99442697-27b5-4474-e384-f4f5c35dc7ff", 246 | "colab": { 247 | "base_uri": "https://localhost:8080/", 248 | "height": 88 249 | } 250 | }, 251 | "cell_type": "code", 252 | "source": [ 253 | "print(word2vec_model.most_similar('монгол'))" 254 | ], 255 | "execution_count": 6, 256 | "outputs": [ 257 | { 258 | "output_type": "stream", 259 | "text": [ 260 | "[('Монгол', 0.6342526078224182), ('монголын', 0.6047513484954834), ('хятад', 0.5558866858482361), ('Монголын', 0.5087883472442627), ('судлалаараа', 0.48851606249809265), ('манай', 0.4853793680667877), ('уйгаржин', 0.4725492596626282), ('угсаатангууд', 0.47093287110328674), ('орос', 0.46463483572006226), ('худам', 0.4609120190143585)]\n" 261 | ], 262 | "name": "stdout" 263 | }, 264 | { 265 | "output_type": "stream", 266 | "text": [ 267 | "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", 268 | " if np.issubdtype(vec.dtype, np.int):\n" 269 | ], 270 | "name": "stderr" 271 | } 272 | ] 273 | }, 274 | { 275 | "metadata": { 276 | "id": "oF6vB3Qnq08I", 277 | "colab_type": "code", 278 | "colab": {} 279 | }, 280 | "cell_type": "code", 281 | "source": [ 282 | "# preparing embedding matrix\n", 283 | "import numpy as np\n", 284 | "\n", 285 | "words_not_found = []\n", 286 | "embed_dim = 300\n", 287 | "embedding_matrix = np.random.uniform(-1, 1, (len(word_index), embed_dim))\n", 288 | "for word, i in word_index.items():\n", 289 | " if i<4:\n", 290 | " continue\n", 291 | " try:\n", 292 | " embedding_vector = word2vec_model[word]\n", 293 | " if (embedding_vector is not None) and len(embedding_vector) > 0:\n", 294 | " embedding_matrix[i] = embedding_vector\n", 295 | " except:\n", 296 | " words_not_found.append(word)\n", 297 | " pass" 298 | ], 299 | "execution_count": 0, 300 | "outputs": [] 301 | }, 302 | { 303 | "metadata": { 304 | "id": "aQAaXWIgsxm9", 305 | "colab_type": "code", 306 | "outputId": "d07b1789-a789-4089-ff5d-6a6b58ab8d74", 307 | "colab": { 308 | "base_uri": "https://localhost:8080/", 309 | "height": 34 310 | } 311 | }, 312 | "cell_type": "code", 313 | "source": [ 314 | "print(embedding_matrix.shape)\n", 315 | "#print(embedding_matrix[5])" 316 | ], 317 | "execution_count": 8, 318 | "outputs": [ 319 | { 320 | "output_type": "stream", 321 | "text": [ 322 | "(370794, 300)\n" 323 | ], 324 | "name": "stdout" 325 | } 326 | ] 327 | }, 328 | { 329 | "metadata": { 330 | "id": "XFxd1QGR65VV", 331 | "colab_type": "code", 332 | "colab": {} 333 | }, 334 | "cell_type": "code", 335 | "source": [ 336 | "MAX_LEN = 512\n", 337 | "\n", 338 | "import itertools\n", 339 | "\n", 340 | "for item in eduge_ds:\n", 341 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 342 | ], 343 | "execution_count": 0, 344 | "outputs": [] 345 | }, 346 | { 347 | "metadata": { 348 | "id": "U8PTeX0WCbhR", 349 | "colab_type": "code", 350 | "colab": {} 351 | }, 352 | "cell_type": "code", 353 | "source": [ 354 | "from sklearn.model_selection import train_test_split\n", 355 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 356 | ], 357 | "execution_count": 0, 358 | "outputs": [] 359 | }, 360 | { 361 | "metadata": { 362 | "id": "8mgMCFcgDHH4", 363 | "colab_type": "code", 364 | "colab": {} 365 | }, 366 | "cell_type": "code", 367 | "source": [ 368 | "train_data_words = [i[0] for i in train]\n", 369 | "train_label_words = [i[1] for i in train]\n", 370 | "test_data_words = [i[0] for i in test ]\n", 371 | "test_label_words = [i[1] for i in test ]" 372 | ], 373 | "execution_count": 0, 374 | "outputs": [] 375 | }, 376 | { 377 | "metadata": { 378 | "id": "rrXC7UiuFkCH", 379 | "colab_type": "code", 380 | "colab": {} 381 | }, 382 | "cell_type": "code", 383 | "source": [ 384 | "def encode_news(text):\n", 385 | " return [word_index.get(i, 2) for i in text]\n", 386 | " \n", 387 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 388 | "test_data = [encode_news(sent) for sent in test_data_words ]" 389 | ], 390 | "execution_count": 0, 391 | "outputs": [] 392 | }, 393 | { 394 | "metadata": { 395 | "id": "FV-h_avPEzM1", 396 | "colab_type": "code", 397 | "colab": {} 398 | }, 399 | "cell_type": "code", 400 | "source": [ 401 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 402 | " value=word_index[\"\"],\n", 403 | " padding='post',\n", 404 | " maxlen=MAX_LEN)\n", 405 | "\n", 406 | "test_data = 
keras.preprocessing.sequence.pad_sequences(test_data,\n", 407 | " value=word_index[\"\"],\n", 408 | " padding='post',\n", 409 | " maxlen=MAX_LEN)" 410 | ], 411 | "execution_count": 0, 412 | "outputs": [] 413 | }, 414 | { 415 | "metadata": { 416 | "id": "gDVqmPqxIMid", 417 | "colab_type": "code", 418 | "outputId": "ec30beb7-996c-4508-8729-5d5111a96e58", 419 | "colab": { 420 | "base_uri": "https://localhost:8080/", 421 | "height": 170 422 | } 423 | }, 424 | "cell_type": "code", 425 | "source": [ 426 | "labels = list(set(test_label_words))\n", 427 | "labels" 428 | ], 429 | "execution_count": 14, 430 | "outputs": [ 431 | { 432 | "output_type": "execute_result", 433 | "data": { 434 | "text/plain": [ 435 | "['спорт',\n", 436 | " 'эрүүл мэнд',\n", 437 | " 'урлаг соёл',\n", 438 | " 'эдийн засаг',\n", 439 | " 'байгал орчин',\n", 440 | " 'хууль',\n", 441 | " 'технологи',\n", 442 | " 'улс төр',\n", 443 | " 'боловсрол']" 444 | ] 445 | }, 446 | "metadata": { 447 | "tags": [] 448 | }, 449 | "execution_count": 14 450 | } 451 | ] 452 | }, 453 | { 454 | "metadata": { 455 | "id": "PBKj3GQqJq29", 456 | "colab_type": "code", 457 | "colab": {} 458 | }, 459 | "cell_type": "code", 460 | "source": [ 461 | "from sklearn.preprocessing import LabelBinarizer\n", 462 | "encoder = LabelBinarizer()\n", 463 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 464 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 465 | ], 466 | "execution_count": 0, 467 | "outputs": [] 468 | }, 469 | { 470 | "metadata": { 471 | "id": "DPq45PN5HZ15", 472 | "colab_type": "code", 473 | "outputId": "16c06c24-94a0-4438-fa99-6c7e9cf9262c", 474 | "colab": { 475 | "base_uri": "https://localhost:8080/", 476 | "height": 394 477 | } 478 | }, 479 | "cell_type": "code", 480 | "source": [ 481 | "vocab_size = len(word_index)\n", 482 | "\n", 483 | "sequence_input = keras.layers.Input(shape=(MAX_LEN,), dtype='int32')\n", 484 | "embedded_sequences = keras.layers.Embedding(\n", 485 | " vocab_size, \n", 486 | " embed_dim , \n", 487 | " weights=[embedding_matrix], \n", 488 | " input_length=MAX_LEN, \n", 489 | " trainable=False)(sequence_input)\n", 490 | "x = keras.layers.LSTM(128)(embedded_sequences)\n", 491 | "x = keras.layers.Dense(245, activation='relu')(x)\n", 492 | "x = keras.layers.Dropout(0.5)(x) # prevents overfitting\n", 493 | "preds = keras.layers.Dense(len(labels), activation='softmax')(x)\n", 494 | "\n", 495 | "model = keras.models.Model(sequence_input, preds)\n", 496 | "model.summary()" 497 | ], 498 | "execution_count": 17, 499 | "outputs": [ 500 | { 501 | "output_type": "stream", 502 | "text": [ 503 | "W0414 01:09:45.882035 139997640750976 tf_logging.py:161] : Note that this layer is not optimized for performance. 
Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n" 504 | ], 505 | "name": "stderr" 506 | }, 507 | { 508 | "output_type": "stream", 509 | "text": [ 510 | "Model: \"model\"\n", 511 | "_________________________________________________________________\n", 512 | "Layer (type) Output Shape Param # \n", 513 | "=================================================================\n", 514 | "input_2 (InputLayer) [(None, 512)] 0 \n", 515 | "_________________________________________________________________\n", 516 | "embedding_1 (Embedding) (None, 512, 300) 111238200 \n", 517 | "_________________________________________________________________\n", 518 | "unified_lstm_1 (UnifiedLSTM) (None, 128) 219648 \n", 519 | "_________________________________________________________________\n", 520 | "dense (Dense) (None, 245) 31605 \n", 521 | "_________________________________________________________________\n", 522 | "dropout (Dropout) (None, 245) 0 \n", 523 | "_________________________________________________________________\n", 524 | "dense_1 (Dense) (None, 9) 2214 \n", 525 | "=================================================================\n", 526 | "Total params: 111,491,667\n", 527 | "Trainable params: 253,467\n", 528 | "Non-trainable params: 111,238,200\n", 529 | "_________________________________________________________________\n" 530 | ], 531 | "name": "stdout" 532 | } 533 | ] 534 | }, 535 | { 536 | "metadata": { 537 | "id": "cAgP1KlqHu2F", 538 | "colab_type": "code", 539 | "colab": {} 540 | }, 541 | "cell_type": "code", 542 | "source": [ 543 | "model.compile(optimizer='rmsprop',\n", 544 | " loss='categorical_crossentropy',\n", 545 | " metrics=['accuracy'])" 546 | ], 547 | "execution_count": 0, 548 | "outputs": [] 549 | }, 550 | { 551 | "metadata": { 552 | "id": "ZPw8roFQKrHm", 553 | "colab_type": "code", 554 | "outputId": "869ebf0b-2b21-47c0-bc42-dbb1bb46efa8", 555 | "colab": { 556 | "base_uri": "https://localhost:8080/", 557 | "height": 51 558 | } 559 | }, 560 | "cell_type": "code", 561 | "source": [ 562 | "print(len(train_data), len(train_label))\n", 563 | "print(len(test_data ), len(test_label) )\n", 564 | "\n", 565 | "partial_index = 3000\n", 566 | "\n", 567 | "x_val = train_data[:partial_index]\n", 568 | "partial_x_train = train_data[partial_index:]\n", 569 | "\n", 570 | "y_val = train_label[:partial_index]\n", 571 | "partial_y_train = train_label[partial_index:]" 572 | ], 573 | "execution_count": 19, 574 | "outputs": [ 575 | { 576 | "output_type": "stream", 577 | "text": [ 578 | "68094 68094\n", 579 | "7567 7567\n" 580 | ], 581 | "name": "stdout" 582 | } 583 | ] 584 | }, 585 | { 586 | "metadata": { 587 | "id": "iSTB4--RKacs", 588 | "colab_type": "code", 589 | "outputId": "da0e04a5-1f5a-41e7-d6f3-d8476069248a", 590 | "colab": { 591 | "base_uri": "https://localhost:8080/", 592 | "height": 1985 593 | } 594 | }, 595 | "cell_type": "code", 596 | "source": [ 597 | "epochs = 50\n", 598 | "history = model.fit(partial_x_train,\n", 599 | " partial_y_train,\n", 600 | " epochs=epochs ,\n", 601 | " batch_size=512 ,\n", 602 | " validation_data=(x_val, y_val),\n", 603 | " verbose=1)" 604 | ], 605 | "execution_count": 20, 606 | "outputs": [ 607 | { 608 | "output_type": "stream", 609 | "text": [ 610 | "Train on 65094 samples, validate on 3000 samples\n", 611 | "Epoch 1/50\n", 612 | "65094/65094 [==============================] - 43s 657us/sample - loss: 2.1111 - accuracy: 0.1992 - val_loss: 2.0996 - val_accuracy: 0.2053\n", 613 | "Epoch 2/50\n", 614 | "65094/65094 [==============================] 
- 40s 613us/sample - loss: 2.0969 - accuracy: 0.2025 - val_loss: 2.7476 - val_accuracy: 0.1823\n", 615 | "Epoch 3/50\n", 616 | "65094/65094 [==============================] - 40s 610us/sample - loss: 2.0993 - accuracy: 0.2055 - val_loss: 2.0971 - val_accuracy: 0.2083\n", 617 | "Epoch 4/50\n", 618 | "65094/65094 [==============================] - 39s 606us/sample - loss: 2.1472 - accuracy: 0.2054 - val_loss: 2.0850 - val_accuracy: 0.2063\n", 619 | "Epoch 5/50\n", 620 | "65094/65094 [==============================] - 39s 607us/sample - loss: 2.1068 - accuracy: 0.2094 - val_loss: 2.0656 - val_accuracy: 0.2107\n", 621 | "Epoch 6/50\n", 622 | "65094/65094 [==============================] - 39s 605us/sample - loss: 2.0949 - accuracy: 0.2136 - val_loss: 2.1074 - val_accuracy: 0.2140\n", 623 | "Epoch 7/50\n", 624 | "65094/65094 [==============================] - 39s 605us/sample - loss: 2.0030 - accuracy: 0.2641 - val_loss: 1.9556 - val_accuracy: 0.3073\n", 625 | "Epoch 8/50\n", 626 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.9544 - accuracy: 0.3111 - val_loss: 1.9054 - val_accuracy: 0.3263\n", 627 | "Epoch 9/50\n", 628 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.9012 - accuracy: 0.3251 - val_loss: 2.1235 - val_accuracy: 0.2920\n", 629 | "Epoch 10/50\n", 630 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.8758 - accuracy: 0.3344 - val_loss: 1.7436 - val_accuracy: 0.3987\n", 631 | "Epoch 11/50\n", 632 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7716 - accuracy: 0.3840 - val_loss: 1.7155 - val_accuracy: 0.4107\n", 633 | "Epoch 12/50\n", 634 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7536 - accuracy: 0.3985 - val_loss: 1.7148 - val_accuracy: 0.4077\n", 635 | "Epoch 13/50\n", 636 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7925 - accuracy: 0.3795 - val_loss: 1.8479 - val_accuracy: 0.3360\n", 637 | "Epoch 14/50\n", 638 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7657 - accuracy: 0.3835 - val_loss: 1.7782 - val_accuracy: 0.3720\n", 639 | "Epoch 15/50\n", 640 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.8404 - accuracy: 0.3434 - val_loss: 1.8095 - val_accuracy: 0.3343\n", 641 | "Epoch 16/50\n", 642 | "65094/65094 [==============================] - 39s 600us/sample - loss: 1.8219 - accuracy: 0.3431 - val_loss: 1.8222 - val_accuracy: 0.3450\n", 643 | "Epoch 17/50\n", 644 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7985 - accuracy: 0.3657 - val_loss: 1.6955 - val_accuracy: 0.4177\n", 645 | "Epoch 18/50\n", 646 | "65094/65094 [==============================] - 39s 602us/sample - loss: 1.7112 - accuracy: 0.4165 - val_loss: 1.7584 - val_accuracy: 0.3990\n", 647 | "Epoch 19/50\n", 648 | "65094/65094 [==============================] - 39s 606us/sample - loss: 1.7114 - accuracy: 0.4046 - val_loss: 1.8195 - val_accuracy: 0.3443\n", 649 | "Epoch 20/50\n", 650 | "65094/65094 [==============================] - 40s 607us/sample - loss: 1.8404 - accuracy: 0.3466 - val_loss: 1.8112 - val_accuracy: 0.3450\n", 651 | "Epoch 21/50\n", 652 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7253 - accuracy: 0.3978 - val_loss: 1.6797 - val_accuracy: 0.4187\n", 653 | "Epoch 22/50\n", 654 | "65094/65094 [==============================] - 39s 602us/sample - loss: 1.6953 - accuracy: 0.4241 - val_loss: 1.6971 - 
val_accuracy: 0.4050\n", 655 | "Epoch 23/50\n", 656 | "65094/65094 [==============================] - 39s 603us/sample - loss: 1.7175 - accuracy: 0.4211 - val_loss: 1.7561 - val_accuracy: 0.3780\n", 657 | "Epoch 24/50\n", 658 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.6900 - accuracy: 0.4178 - val_loss: 2.1019 - val_accuracy: 0.3810\n", 659 | "Epoch 25/50\n", 660 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7528 - accuracy: 0.3679 - val_loss: 1.9362 - val_accuracy: 0.2957\n", 661 | "Epoch 26/50\n", 662 | "65094/65094 [==============================] - 39s 604us/sample - loss: 1.7457 - accuracy: 0.3713 - val_loss: 1.7600 - val_accuracy: 0.3330\n", 663 | "Epoch 27/50\n", 664 | "25088/65094 [==========>...................] - ETA: 23s - loss: 1.7159 - accuracy: 0.3599" 665 | ], 666 | "name": "stdout" 667 | }, 668 | { 669 | "output_type": "error", 670 | "ename": "KeyboardInterrupt", 671 | "evalue": "ignored", 672 | "traceback": [ 673 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 674 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 675 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m512\u001b[0m \u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mvalidation_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_val\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m verbose=1)\n\u001b[0m", 676 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0mvalidation_freq\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_freq\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m steps_name='steps_per_epoch')\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m def evaluate(self,\n", 677 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mmodel_iteration\u001b[0;34m(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 351\u001b[0m \u001b[0;31m# Get outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 352\u001b[0;31m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 353\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 678 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3215\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3216\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3217\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3218\u001b[0m return nest.pack_sequence_as(self._outputs_structure,\n\u001b[1;32m 3219\u001b[0m [x.numpy() for x in outputs])\n", 679 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 556\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 557\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m--> 558\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 559\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 680 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args)\u001b[0m\n\u001b[1;32m 625\u001b[0m \u001b[0;31m# Only need to override the gradient in graph mode and when we have outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 626\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moutputs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 627\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_inference_function\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mctx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 628\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_register_gradient\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 681 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args)\u001b[0m\n\u001b[1;32m 413\u001b[0m attrs=(\"executor_type\", executor_type,\n\u001b[1;32m 414\u001b[0m \"config_proto\", config),\n\u001b[0;32m--> 415\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 416\u001b[0m \u001b[0;31m# Replace empty list with None\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 417\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 682 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 58\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 59\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 683 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 684 | ] 685 | } 686 | ] 687 | }, 688 | { 689 | "metadata": { 690 | "id": "r8_mvDjYL3CX", 691 | "colab_type": "code", 692 | "outputId": "5b436b6c-914a-4787-b9f0-a81b12a2f67b", 693 | "colab": { 694 | "base_uri": "https://localhost:8080/", 695 | "height": 51 696 | } 697 | }, 698 | "cell_type": "code", 699 | "source": [ 700 | "results = model.evaluate(test_data, test_label)\n", 701 | "print(results)" 702 | ], 703 | "execution_count": 21, 704 | "outputs": [ 705 | { 706 | "output_type": "stream", 707 | "text": [ 708 | "7567/7567 [==============================] - 11s 1ms/sample - loss: 1.6580 - accuracy: 0.3986\n", 709 | "[1.6579768257694387, 0.39857274]\n" 710 | ], 711 | "name": "stdout" 712 | } 713 | ] 714 | }, 715 | { 716 | "metadata": { 717 | "id": "VaIioR7EPfig", 718 | "colab_type": "code", 719 | "outputId": "b5bbe828-3ace-4291-f762-5717df7d0bea", 720 | "colab": { 721 | "base_uri": "https://localhost:8080/", 722 | "height": 71 723 | } 724 | }, 725 | "cell_type": "code", 726 | "source": [ 727 | "data_index = 12\n", 728 | "data_words = \" \".join(test_data_words[data_index])\n", 729 | "data_indexes = test_data[data_index]\n", 730 | "print(data_words)\n", 731 | "\n", 732 | "predicted = model.predict([[data_indexes]])\n", 733 | "print(encoder.classes_[np.argmax(predicted)])" 734 | ], 735 | "execution_count": 22, 736 | "outputs": [ 737 | { 738 | "output_type": "stream", 739 | "text": [ 740 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг 
телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн дэмжигч үзэгч волейболын спортын мэргэжилтэн багш нар секцэнд хичээллэгч хүүхдүүд зохион байгуулалтай ирэхээр ялангуяа тэмцээн болох газартай хамгийн ойрхон хануул дүүргийн здтгазар дүүргийнхээ ард иргэд хөдөлмөрчид сургуулийн сурагчид оюутнууд цэргийн албан хаагчид буянтухаа орчимын албан байгууллага хамт олныг идэвхтэй оролцуулах арга хэмжээ авч эхлэжээ олон зуун оюутнууд тэмцээн үзэх боллоо тэмцээний өдрүүдэд нийслэлээс буянтухаагийн спортын ордонг чиглэсэн хүмүүсийн цуваа ихсэх төлөвтэй учир нийслэлд үйл ажиллагаа явуулж байгаа орчим идсийн оюутнууд тэмцээнийг анги сургууль хамт олноороо үзэх сонирхолтой байгаагаа монголын оюутны холбоо биеийн тамирын тэнхимдээ хүсчээ үүний дагуу бсшуяам мох монголын оюутны спортын холбоо мосх ноос тэмцээнийг өдөр бүр гаруй сургуулийн орчим оюутнууд нэгдсэн хуваарийн дагуу үзэх хуваарийг бсшуяны төрийн нарийн бичгийн дарга зохион байгуулах үндэсний хороо збх ны гишүүн ддалайжаргал батлан сургуулиудад албан тоотоор хүргүүлжээ мосхд тэмцээнийг үзэхээр олон арван сургуулиуд оюутны тоогоо өгч бүртгүүлж суудлын хувиарлалтанд орж байгаа ажээ ялангуяа биеийн тамирын мэргэжлийн дээд сургуулийн оюутнууд дадлага хичээлээ тэмцээний үеэр хийхээр хичээлийн хувиараа зохицуулсан нийслэлийн засаг дарга оюутнуудад туслав улаанбаатар хотноо болдог оюутны олон улс тив дэлхийн тэмцээн бүрт нийслэлийн засаг дарга гмөнхбаяр ихээхэн туслалцаа үзүүлэн оюутан залуусаа байнга дэмжин оролцдог ажээ тэрээр тус тэмцээнд оролцохоор бэлтгэж байгаа монголын оюутны шигшээ багийн тамирчидын хоногийн бэлтгэл сургалтын зардлыг хариуцан гаргасан хөрөнгө санхүүгийн хүндрэлтэй байгаа үеэд тэмцээнд бэлтгэж байгаа оюутан тамирчидаа цагаа олж хэрэгцээтэй үеэд дэмжлээ мосхолбоо монголын волейболын холбоо мвх тамирчидынхаа өмнөөс талархал илэрхийлжээ монголын баг тамирчид эрдэнэт хотод оны сарын өдрөөс эхлэн хоногийн бэлтгэл хийснийхээ дараа ийнхүү нийслэлийн засаг даргын туслалцаатайгаар гадаадын тамирчидтай хамт байрлах цэцэг зочид буудалдаа орж бэлтгэл сургуулиалтаа үргэлжүүлэх боломжтой нздтгазраас баг\n", 741 | "спорт\n" 742 | ], 743 | "name": "stdout" 744 | } 745 | ] 746 | } 747 | ] 748 | } -------------------------------------------------------------------------------- /neural_network_classifier_notebooks/Mongolian_text_classification_05,_Attention.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Mongolian text classification #05, Attention.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "name": "python3", 13 | "display_name": "Python 3" 14 | }, 15 | "accelerator": "GPU" 16 | }, 17 | "cells": [ 18 | { 19 | "metadata": { 20 | "id": "muNP8k9fqaJb", 21 | "colab_type": "text" 22 | }, 23 | "cell_type": "markdown", 24 | "source": [ 25 | "Mongolian text classification series #01\n", 26 | "\n", 27 | "This notebook's purpose is to reveal attention mechanism by visualizing it.\n", 28 | "\n", 29 | "Eduge dataset provided by Bolorsoft LLC\n", 30 | "\n", 31 | "Author : Sharavsambuu Gunchinish 
(sharavsambuu@gmail.com)\n", 32 | "\n", 33 | "Github: https://github.com/sharavsambuu/mongolian-text-classification \n", 34 | "\n" 35 | ] 36 | }, 37 | { 38 | "metadata": { 39 | "id": "iY9jwdg6qT8M", 40 | "colab_type": "code", 41 | "outputId": "d1d20ba2-495a-4194-d2c4-8376024e8d07", 42 | "colab": { 43 | "base_uri": "https://localhost:8080/", 44 | "height": 360 45 | } 46 | }, 47 | "cell_type": "code", 48 | "source": [ 49 | "from __future__ import absolute_import, division, print_function, unicode_literals\n", 50 | "\n", 51 | "!pip install -q tensorflow-gpu==2.0.0-alpha0\n", 52 | "!pip install gensim\n", 53 | "\n", 54 | "import tensorflow as tf\n", 55 | "from tensorflow import keras\n", 56 | "\n", 57 | "import numpy as np\n", 58 | "\n", 59 | "print(tf.__version__)" 60 | ], 61 | "execution_count": 1, 62 | "outputs": [ 63 | { 64 | "output_type": "stream", 65 | "text": [ 66 | "Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)\n", 67 | "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.2.1)\n", 68 | "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.16.2)\n", 69 | "Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.8.1)\n", 70 | "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)\n", 71 | "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)\n", 72 | "Requirement already satisfied: bz2file in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (0.98)\n", 73 | "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.18.4)\n", 74 | "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.9.130)\n", 75 | "Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2.6)\n", 76 | "Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (1.22)\n", 77 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2019.3.9)\n", 78 | "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (3.0.4)\n", 79 | "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.2.0)\n", 80 | "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.4)\n", 81 | "Requirement already satisfied: botocore<1.13.0,>=1.12.130 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.12.130)\n", 82 | "Requirement already satisfied: docutils>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (0.14)\n", 83 | "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.130->boto3->smart-open>=1.2.1->gensim) (2.5.3)\n", 84 | "2.0.0-alpha0\n" 85 | ], 86 | "name": "stdout" 87 | } 88 | ] 89 | }, 90 | { 91 | 
"metadata": { 92 | "id": "smJeJfoo4qcu", 93 | "colab_type": "text" 94 | }, 95 | "cell_type": "markdown", 96 | "source": [ 97 | "[More info about creation of eduge dataset pickles](https://github.com/sharavsambuu/mongolian-text-classification/blob/master/preprocess_dataset/preprocess_eduge.ipynb) preprocessing eats a lot of CPU cycle so it's good idea to cook it before using colab." 98 | ] 99 | }, 100 | { 101 | "metadata": { 102 | "id": "CDayX_Yx3REh", 103 | "colab_type": "code", 104 | "outputId": "4ab773b3-001b-458a-c339-a4128c1dd426", 105 | "colab": { 106 | "base_uri": "https://localhost:8080/", 107 | "height": 340 108 | } 109 | }, 110 | "cell_type": "code", 111 | "source": [ 112 | "import os\n", 113 | "from os.path import exists, join, basename, splitext\n", 114 | "import sys\n", 115 | "\n", 116 | "def download_from_google_drive(file_id, file_name):\n", 117 | " !rm -f ./cookie\n", 118 | " !curl -c ./cookie -s -L \"https://drive.google.com/uc?export=download&id=$file_id\" > /dev/null\n", 119 | " confirm_text = !awk '/download/ {print $NF}' ./cookie\n", 120 | " confirm_text = confirm_text[0]\n", 121 | " !curl -Lb ./cookie \"https://drive.google.com/uc?export=download&confirm=$confirm_text&id=$file_id\" -o $file_name\n", 122 | " \n", 123 | "# download eduge pickles\n", 124 | "file_path = 'eduge_pickles'\n", 125 | "if not exists(file_path):\n", 126 | " download_from_google_drive('1vjJ9YgIe8o0ErhbN0lH1XqPv3KFP8acv', '%s.rar' % file_path)\n", 127 | " rar_file = file_path+\".rar\"\n", 128 | " !unrar x $rar_file" 129 | ], 130 | "execution_count": 2, 131 | "outputs": [ 132 | { 133 | "output_type": "stream", 134 | "text": [ 135 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 136 | " Dload Upload Total Spent Left Speed\n", 137 | "100 388 0 388 0 0 1385 0 --:--:-- --:--:-- --:--:-- 1385\n", 138 | "100 106M 0 106M 0 0 91.5M 0 --:--:-- 0:00:01 --:--:-- 242M\n", 139 | "\n", 140 | "UNRAR 5.50 freeware Copyright (c) 1993-2017 Alexander Roshal\n", 141 | "\n", 142 | "\n", 143 | "Extracting from eduge_pickles.rar\n", 144 | "\n", 145 | "\n", 146 | "Would you like to replace the existing file word_index.pickle\n", 147 | "9178153 bytes, modified on 2019-04-13 01:44\n", 148 | "with a new one\n", 149 | "9178153 bytes, modified on 2019-04-13 01:44\n", 150 | "\n", 151 | "[Y]es, [N]o, [A]ll, n[E]ver, [R]ename, [Q]uit q\n", 152 | "\n", 153 | "Program aborted\n" 154 | ], 155 | "name": "stdout" 156 | } 157 | ] 158 | }, 159 | { 160 | "metadata": { 161 | "id": "pPHJcnfi4Rzg", 162 | "colab_type": "code", 163 | "colab": {} 164 | }, 165 | "cell_type": "code", 166 | "source": [ 167 | "import pickle\n", 168 | "\n", 169 | "with open('word_index.pickle', 'rb') as handle:\n", 170 | " word_index = pickle.load(handle)\n", 171 | " \n", 172 | "with open('reversed_word_index.pickle', 'rb') as handle:\n", 173 | " reversed_word_index = pickle.load(handle)\n", 174 | " \n", 175 | "with open('eduge_stopwords_removed.pickle', 'rb') as handle:\n", 176 | " eduge_ds = pickle.load(handle)" 177 | ], 178 | "execution_count": 0, 179 | "outputs": [] 180 | }, 181 | { 182 | "metadata": { 183 | "id": "ASRW7ISNnbM-", 184 | "colab_type": "code", 185 | "outputId": "9d003960-fdd8-44c6-9424-37515e774f58", 186 | "colab": { 187 | "base_uri": "https://localhost:8080/", 188 | "height": 207 189 | } 190 | }, 191 | "cell_type": "code", 192 | "source": [ 193 | "# facebook trained word2vec on both commoncrawl and wikipedia. 
So this model should contain enough representation about our mongolian words.\n", 194 | "mongolian_word2vec_download=\"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\"\n", 195 | "if not exists(\"cc.mn.300.bin.gz\"):\n", 196 | " !wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 197 | "if exists('cc.mn.300.bin.gz'):\n", 198 | " !gunzip cc.mn.300.bin.gz" 199 | ], 200 | "execution_count": 4, 201 | "outputs": [ 202 | { 203 | "output_type": "stream", 204 | "text": [ 205 | "--2019-04-14 11:03:40-- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mn.300.bin.gz\n", 206 | "Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.20.22.166, 104.20.6.166, 2606:4700:10::6814:16a6, ...\n", 207 | "Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.20.22.166|:443... connected.\n", 208 | "HTTP request sent, awaiting response... 200 OK\n", 209 | "Length: 2937042399 (2.7G) [application/octet-stream]\n", 210 | "Saving to: ‘cc.mn.300.bin.gz’\n", 211 | "\n", 212 | "cc.mn.300.bin.gz 15%[==> ] 438.51M 12.7MB/s eta 3m 12s ^C\n", 213 | "gzip: cc.mn.300.bin already exists; do you wish to overwrite (y or n)? n\n", 214 | "\tnot overwritten\n" 215 | ], 216 | "name": "stdout" 217 | } 218 | ] 219 | }, 220 | { 221 | "metadata": { 222 | "id": "BqGAauUZpnFz", 223 | "colab_type": "code", 224 | "outputId": "75c0cfc6-12e4-4791-a16a-893a8a6749a2", 225 | "colab": { 226 | "base_uri": "https://localhost:8080/", 227 | "height": 88 228 | } 229 | }, 230 | "cell_type": "code", 231 | "source": [ 232 | "from gensim.models.wrappers import FastText\n", 233 | "\n", 234 | "word2vec_model = FastText.load_fasttext_format('cc.mn.300.bin')" 235 | ], 236 | "execution_count": 5, 237 | "outputs": [ 238 | { 239 | "output_type": "stream", 240 | "text": [ 241 | "WARNING: Logging before flag parsing goes to stderr.\n", 242 | "W0414 11:04:27.822593 140650259199872 ssh.py:33] paramiko missing, opening SSH/SCP/SFTP paths will be disabled. `pip install paramiko` to suppress\n", 243 | "W0414 11:04:28.318554 140650259199872 word2vec.py:573] Slow version of gensim.models.deprecated.word2vec is being used\n" 244 | ], 245 | "name": "stderr" 246 | } 247 | ] 248 | }, 249 | { 250 | "metadata": { 251 | "id": "kkc1iiqJp-CJ", 252 | "colab_type": "code", 253 | "outputId": "3558d2cb-aa96-4b16-8687-a3e8d0516ed3", 254 | "colab": { 255 | "base_uri": "https://localhost:8080/", 256 | "height": 88 257 | } 258 | }, 259 | "cell_type": "code", 260 | "source": [ 261 | "print(word2vec_model.most_similar('монгол'))" 262 | ], 263 | "execution_count": 6, 264 | "outputs": [ 265 | { 266 | "output_type": "stream", 267 | "text": [ 268 | "[('Монгол', 0.6342526078224182), ('монголын', 0.6047513484954834), ('хятад', 0.5558866858482361), ('Монголын', 0.5087883472442627), ('судлалаараа', 0.48851606249809265), ('манай', 0.4853793680667877), ('уйгаржин', 0.4725492596626282), ('угсаатангууд', 0.47093287110328674), ('орос', 0.46463483572006226), ('худам', 0.4609120190143585)]\n" 269 | ], 270 | "name": "stdout" 271 | }, 272 | { 273 | "output_type": "stream", 274 | "text": [ 275 | "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. 
In future, it will be treated as `np.int64 == np.dtype(int).type`.\n", 276 | " if np.issubdtype(vec.dtype, np.int):\n" 277 | ], 278 | "name": "stderr" 279 | } 280 | ] 281 | }, 282 | { 283 | "metadata": { 284 | "id": "oF6vB3Qnq08I", 285 | "colab_type": "code", 286 | "colab": {} 287 | }, 288 | "cell_type": "code", 289 | "source": [ 290 | "# preparing embedding matrix\n", 291 | "import numpy as np\n", 292 | "\n", 293 | "words_not_found = []\n", 294 | "embed_dim = 300\n", 295 | "embedding_matrix = np.random.uniform(-1, 1, (len(word_index), embed_dim))\n", 296 | "for word, i in word_index.items():\n", 297 | " if i<4:\n", 298 | " continue\n", 299 | " try:\n", 300 | " embedding_vector = word2vec_model[word]\n", 301 | " if (embedding_vector is not None) and len(embedding_vector) > 0:\n", 302 | " embedding_matrix[i] = embedding_vector\n", 303 | " except:\n", 304 | " words_not_found.append(word)\n", 305 | " pass" 306 | ], 307 | "execution_count": 0, 308 | "outputs": [] 309 | }, 310 | { 311 | "metadata": { 312 | "id": "aQAaXWIgsxm9", 313 | "colab_type": "code", 314 | "outputId": "b2c1d495-2ec7-44b5-ad8d-553753195efd", 315 | "colab": { 316 | "base_uri": "https://localhost:8080/", 317 | "height": 34 318 | } 319 | }, 320 | "cell_type": "code", 321 | "source": [ 322 | "print(embedding_matrix.shape)\n", 323 | "#print(embedding_matrix[5])" 324 | ], 325 | "execution_count": 8, 326 | "outputs": [ 327 | { 328 | "output_type": "stream", 329 | "text": [ 330 | "(370794, 300)\n" 331 | ], 332 | "name": "stdout" 333 | } 334 | ] 335 | }, 336 | { 337 | "metadata": { 338 | "id": "XFxd1QGR65VV", 339 | "colab_type": "code", 340 | "colab": {} 341 | }, 342 | "cell_type": "code", 343 | "source": [ 344 | "MAX_LEN = 256\n", 345 | "\n", 346 | "import itertools\n", 347 | "\n", 348 | "for item in eduge_ds:\n", 349 | " item[0] = list(itertools.chain(*item[0]))[:MAX_LEN]" 350 | ], 351 | "execution_count": 0, 352 | "outputs": [] 353 | }, 354 | { 355 | "metadata": { 356 | "id": "U8PTeX0WCbhR", 357 | "colab_type": "code", 358 | "colab": {} 359 | }, 360 | "cell_type": "code", 361 | "source": [ 362 | "from sklearn.model_selection import train_test_split\n", 363 | "train, test = train_test_split(eduge_ds, test_size=0.1, random_state=999)" 364 | ], 365 | "execution_count": 0, 366 | "outputs": [] 367 | }, 368 | { 369 | "metadata": { 370 | "id": "8mgMCFcgDHH4", 371 | "colab_type": "code", 372 | "colab": {} 373 | }, 374 | "cell_type": "code", 375 | "source": [ 376 | "train_data_words = [i[0] for i in train]\n", 377 | "train_label_words = [i[1] for i in train]\n", 378 | "test_data_words = [i[0] for i in test ]\n", 379 | "test_label_words = [i[1] for i in test ]" 380 | ], 381 | "execution_count": 0, 382 | "outputs": [] 383 | }, 384 | { 385 | "metadata": { 386 | "id": "rrXC7UiuFkCH", 387 | "colab_type": "code", 388 | "colab": {} 389 | }, 390 | "cell_type": "code", 391 | "source": [ 392 | "def encode_news(text):\n", 393 | " return [word_index.get(i, 2) for i in text]\n", 394 | " \n", 395 | "train_data = [encode_news(sent) for sent in train_data_words]\n", 396 | "test_data = [encode_news(sent) for sent in test_data_words ]" 397 | ], 398 | "execution_count": 0, 399 | "outputs": [] 400 | }, 401 | { 402 | "metadata": { 403 | "id": "FV-h_avPEzM1", 404 | "colab_type": "code", 405 | "colab": {} 406 | }, 407 | "cell_type": "code", 408 | "source": [ 409 | "train_data = keras.preprocessing.sequence.pad_sequences(train_data,\n", 410 | " value=word_index[\"\"],\n", 411 | " padding='post',\n", 412 | " maxlen=MAX_LEN)\n", 413 | "\n", 414 | "test_data = 
keras.preprocessing.sequence.pad_sequences(test_data,\n", 415 | " value=word_index[\"\"],\n", 416 | " padding='post',\n", 417 | " maxlen=MAX_LEN)" 418 | ], 419 | "execution_count": 0, 420 | "outputs": [] 421 | }, 422 | { 423 | "metadata": { 424 | "id": "gDVqmPqxIMid", 425 | "colab_type": "code", 426 | "outputId": "2eef5781-66f9-4357-ef18-4bdf2e5a07e5", 427 | "colab": { 428 | "base_uri": "https://localhost:8080/", 429 | "height": 170 430 | } 431 | }, 432 | "cell_type": "code", 433 | "source": [ 434 | "labels = list(set(test_label_words))\n", 435 | "labels" 436 | ], 437 | "execution_count": 14, 438 | "outputs": [ 439 | { 440 | "output_type": "execute_result", 441 | "data": { 442 | "text/plain": [ 443 | "['эрүүл мэнд',\n", 444 | " 'хууль',\n", 445 | " 'байгал орчин',\n", 446 | " 'улс төр',\n", 447 | " 'боловсрол',\n", 448 | " 'эдийн засаг',\n", 449 | " 'спорт',\n", 450 | " 'технологи',\n", 451 | " 'урлаг соёл']" 452 | ] 453 | }, 454 | "metadata": { 455 | "tags": [] 456 | }, 457 | "execution_count": 14 458 | } 459 | ] 460 | }, 461 | { 462 | "metadata": { 463 | "id": "PBKj3GQqJq29", 464 | "colab_type": "code", 465 | "colab": {} 466 | }, 467 | "cell_type": "code", 468 | "source": [ 469 | "from sklearn.preprocessing import LabelBinarizer\n", 470 | "encoder = LabelBinarizer()\n", 471 | "train_label = transfomed_label = encoder.fit_transform(train_label_words)\n", 472 | "test_label = transfomed_label = encoder.fit_transform(test_label_words )" 473 | ], 474 | "execution_count": 0, 475 | "outputs": [] 476 | }, 477 | { 478 | "metadata": { 479 | "id": "DPq45PN5HZ15", 480 | "colab_type": "code", 481 | "colab": { 482 | "base_uri": "https://localhost:8080/", 483 | "height": 632 484 | }, 485 | "outputId": "9e3057cc-41e3-4383-aa1c-7eca6fb7d85c" 486 | }, 487 | "cell_type": "code", 488 | "source": [ 489 | "class Attention(keras.Model):\n", 490 | " def __init__(self, units):\n", 491 | " super(Attention, self).__init__()\n", 492 | " self.W1 = keras.layers.Dense(units)\n", 493 | " self.W2 = keras.layers.Dense(units)\n", 494 | " self.V = keras.layers.Dense(1)\n", 495 | " def call(self, features, hidden):\n", 496 | " hidden_with_time_axis = tf.expand_dims(hidden, 1)\n", 497 | " score = tf.nn.tanh(self.W1(features)+self.W2(hidden_with_time_axis))\n", 498 | " attention_weights = tf.nn.softmax(self.V(score), axis=1)\n", 499 | " context_vector = attention_weights*features\n", 500 | " context_vector = tf.reduce_sum(context_vector, axis=1)\n", 501 | " return context_vector, attention_weights\n", 502 | "\n", 503 | "attention=Attention(64)\n", 504 | " \n", 505 | "vocab_size = len(word_index)\n", 506 | "\n", 507 | "sequence_input = keras.layers.Input(shape=(MAX_LEN,), dtype='int32')\n", 508 | "embedded_sequences = keras.layers.Embedding(\n", 509 | " vocab_size, \n", 510 | " embed_dim , \n", 511 | " weights=[embedding_matrix], \n", 512 | " input_length=MAX_LEN, \n", 513 | " trainable=False)(sequence_input)\n", 514 | "lstm = keras.layers.Bidirectional(\n", 515 | " keras.layers.LSTM(\n", 516 | " 64, # rnn cell size\n", 517 | " dropout = 0.3,\n", 518 | " return_sequences = True,\n", 519 | " return_state = True,\n", 520 | " recurrent_activation = 'relu',\n", 521 | " recurrent_initializer = 'glorot_uniform' \n", 522 | " )\n", 523 | " )(embedded_sequences)\n", 524 | "lstm, forward_h, forward_c, backward_h, backward_c = keras.layers.Bidirectional(\n", 525 | " keras.layers.LSTM(\n", 526 | " 64, # rnn cell size\n", 527 | " dropout = 0.2,\n", 528 | " return_sequences = True,\n", 529 | " return_state = True,\n", 530 | " 
recurrent_activation = 'relu',\n", 531 | " recurrent_initializer = 'glorot_uniform'\n", 532 | " \n", 533 | " )\n", 534 | ")(lstm)\n", 535 | "state_h = keras.layers.Concatenate()([forward_h, backward_h])\n", 536 | "state_c = keras.layers.Concatenate()([forward_c, backward_c])\n", 537 | "context_vector, attention_weights = attention(lstm, state_h)\n", 538 | "\n", 539 | "preds = keras.layers.Dense(len(labels), activation='sigmoid')(context_vector)\n", 540 | "\n", 541 | "model = keras.models.Model(inputs=sequence_input, outputs=preds)\n", 542 | "model.summary()" 543 | ], 544 | "execution_count": 16, 545 | "outputs": [ 546 | { 547 | "output_type": "stream", 548 | "text": [ 549 | "W0414 11:06:00.272124 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 550 | "W0414 11:06:00.289174 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 551 | "W0414 11:06:00.382109 140650259199872 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py:4081: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.\n", 552 | "Instructions for updating:\n", 553 | "Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.\n", 554 | "W0414 11:06:01.074629 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n", 555 | "W0414 11:06:01.078676 140650259199872 tf_logging.py:161] : Note that this layer is not optimized for performance. 
Please use tf.keras.layers.CuDNNLSTM for better performance on GPU.\n" 556 | ], 557 | "name": "stderr" 558 | }, 559 | { 560 | "output_type": "stream", 561 | "text": [ 562 | "Model: \"model\"\n", 563 | "__________________________________________________________________________________________________\n", 564 | "Layer (type) Output Shape Param # Connected to \n", 565 | "==================================================================================================\n", 566 | "input_1 (InputLayer) [(None, 256)] 0 \n", 567 | "__________________________________________________________________________________________________\n", 568 | "embedding (Embedding) (None, 256, 300) 111238200 input_1[0][0] \n", 569 | "__________________________________________________________________________________________________\n", 570 | "bidirectional (Bidirectional) [(None, 256, 128), ( 186880 embedding[0][0] \n", 571 | "__________________________________________________________________________________________________\n", 572 | "bidirectional_1 (Bidirectional) [(None, 256, 128), ( 98816 bidirectional[0][0] \n", 573 | " bidirectional[0][1] \n", 574 | " bidirectional[0][2] \n", 575 | " bidirectional[0][3] \n", 576 | " bidirectional[0][4] \n", 577 | "__________________________________________________________________________________________________\n", 578 | "concatenate (Concatenate) (None, 128) 0 bidirectional_1[0][1] \n", 579 | " bidirectional_1[0][3] \n", 580 | "__________________________________________________________________________________________________\n", 581 | "attention (Attention) ((None, 128), (None, 16577 bidirectional_1[0][0] \n", 582 | " concatenate[0][0] \n", 583 | "__________________________________________________________________________________________________\n", 584 | "dense_3 (Dense) (None, 9) 1161 attention[0][0] \n", 585 | "==================================================================================================\n", 586 | "Total params: 111,541,634\n", 587 | "Trainable params: 303,434\n", 588 | "Non-trainable params: 111,238,200\n", 589 | "__________________________________________________________________________________________________\n" 590 | ], 591 | "name": "stdout" 592 | } 593 | ] 594 | }, 595 | { 596 | "metadata": { 597 | "id": "cAgP1KlqHu2F", 598 | "colab_type": "code", 599 | "colab": {} 600 | }, 601 | "cell_type": "code", 602 | "source": [ 603 | "model.compile(optimizer='adam',\n", 604 | " loss='categorical_crossentropy',\n", 605 | " metrics=['accuracy'])" 606 | ], 607 | "execution_count": 0, 608 | "outputs": [] 609 | }, 610 | { 611 | "metadata": { 612 | "id": "ZPw8roFQKrHm", 613 | "colab_type": "code", 614 | "colab": { 615 | "base_uri": "https://localhost:8080/", 616 | "height": 51 617 | }, 618 | "outputId": "930796b7-13e0-4fce-f1e5-f7fd72030d92" 619 | }, 620 | "cell_type": "code", 621 | "source": [ 622 | "print(len(train_data), len(train_label))\n", 623 | "print(len(test_data ), len(test_label) )\n", 624 | "\n", 625 | "partial_index = 3000\n", 626 | "\n", 627 | "x_val = train_data[:partial_index]\n", 628 | "partial_x_train = train_data[partial_index:]\n", 629 | "\n", 630 | "y_val = train_label[:partial_index]\n", 631 | "partial_y_train = train_label[partial_index:]" 632 | ], 633 | "execution_count": 18, 634 | "outputs": [ 635 | { 636 | "output_type": "stream", 637 | "text": [ 638 | "68094 68094\n", 639 | "7567 7567\n" 640 | ], 641 | "name": "stdout" 642 | } 643 | ] 644 | }, 645 | { 646 | "metadata": { 647 | "id": "iSTB4--RKacs", 648 | "colab_type": "code", 649 | 
"colab": { 650 | "base_uri": "https://localhost:8080/", 651 | "height": 1101 652 | }, 653 | "outputId": "19106331-e667-42c7-cdf8-88c1f7e4b52d" 654 | }, 655 | "cell_type": "code", 656 | "source": [ 657 | "epochs = 50\n", 658 | "history = model.fit(partial_x_train,\n", 659 | " partial_y_train,\n", 660 | " epochs=epochs ,\n", 661 | " batch_size=128 ,\n", 662 | " validation_data=(x_val, y_val),\n", 663 | " verbose=1)" 664 | ], 665 | "execution_count": 19, 666 | "outputs": [ 667 | { 668 | "output_type": "stream", 669 | "text": [ 670 | "Train on 65094 samples, validate on 3000 samples\n", 671 | "Epoch 1/50\n", 672 | "27904/65094 [===========>..................] - ETA: 46:44 - loss: 2.1963 - accuracy: 0.0786" 673 | ], 674 | "name": "stdout" 675 | }, 676 | { 677 | "output_type": "error", 678 | "ename": "KeyboardInterrupt", 679 | "evalue": "ignored", 680 | "traceback": [ 681 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 682 | "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", 683 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mbatch_size\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m128\u001b[0m \u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mvalidation_data\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_val\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_val\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m verbose=1)\n\u001b[0m", 684 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)\u001b[0m\n\u001b[1;32m 871\u001b[0m \u001b[0mvalidation_steps\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_steps\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 872\u001b[0m \u001b[0mvalidation_freq\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalidation_freq\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 873\u001b[0;31m steps_name='steps_per_epoch')\n\u001b[0m\u001b[1;32m 874\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 875\u001b[0m def evaluate(self,\n", 685 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_arrays.py\u001b[0m in \u001b[0;36mmodel_iteration\u001b[0;34m(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)\u001b[0m\n\u001b[1;32m 350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 351\u001b[0m \u001b[0;31m# Get outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 352\u001b[0;31m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mins_batch\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 353\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 354\u001b[0m \u001b[0mbatch_outs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mbatch_outs\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 686 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, inputs)\u001b[0m\n\u001b[1;32m 3215\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath_ops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcast\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtensor\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3216\u001b[0m \u001b[0mconverted_inputs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3217\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_graph_fn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mconverted_inputs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3218\u001b[0m return nest.pack_sequence_as(self._outputs_structure,\n\u001b[1;32m 3219\u001b[0m [x.numpy() for x in outputs])\n", 687 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 556\u001b[0m raise TypeError(\"Keyword arguments {} unknown. 
Expected {}.\".format(\n\u001b[1;32m 557\u001b[0m list(kwargs.keys()), list(self._arg_keywords)))\n\u001b[0;32m--> 558\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_call_flat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 559\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 560\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_filtered_call\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 688 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36m_call_flat\u001b[0;34m(self, args)\u001b[0m\n\u001b[1;32m 625\u001b[0m \u001b[0;31m# Only need to override the gradient in graph mode and when we have outputs.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 626\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecuting_eagerly\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moutputs\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 627\u001b[0;31m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_inference_function\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcall\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mctx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 628\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_register_gradient\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 689 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py\u001b[0m in \u001b[0;36mcall\u001b[0;34m(self, ctx, args)\u001b[0m\n\u001b[1;32m 413\u001b[0m attrs=(\"executor_type\", executor_type,\n\u001b[1;32m 414\u001b[0m \"config_proto\", config),\n\u001b[0;32m--> 415\u001b[0;31m ctx=ctx)\n\u001b[0m\u001b[1;32m 416\u001b[0m \u001b[0;31m# Replace empty list with None\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 417\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0moutputs\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 690 | "\u001b[0;32m/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py\u001b[0m in \u001b[0;36mquick_execute\u001b[0;34m(op_name, num_outputs, inputs, attrs, ctx, name)\u001b[0m\n\u001b[1;32m 58\u001b[0m tensors = pywrap_tensorflow.TFE_Py_Execute(ctx._handle, device_name,\n\u001b[1;32m 59\u001b[0m \u001b[0mop_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 60\u001b[0;31m num_outputs)\n\u001b[0m\u001b[1;32m 61\u001b[0m \u001b[0;32mexcept\u001b[0m 
\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_NotOkStatusException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 62\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mname\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 691 | "\u001b[0;31mKeyboardInterrupt\u001b[0m: " 692 | ] 693 | } 694 | ] 695 | }, 696 | { 697 | "metadata": { 698 | "id": "r8_mvDjYL3CX", 699 | "colab_type": "code", 700 | "colab": { 701 | "base_uri": "https://localhost:8080/", 702 | "height": 51 703 | }, 704 | "outputId": "5e95c9f4-570b-41e4-b129-daccfa733e37" 705 | }, 706 | "cell_type": "code", 707 | "source": [ 708 | "results = model.evaluate(test_data, test_label)\n", 709 | "print(results)" 710 | ], 711 | "execution_count": 20, 712 | "outputs": [ 713 | { 714 | "output_type": "stream", 715 | "text": [ 716 | "7567/7567 [==============================] - 229s 30ms/sample - loss: 2.1969 - accuracy: 0.0650\n", 717 | "[2.1969342474674667, 0.06501916]\n" 718 | ], 719 | "name": "stdout" 720 | } 721 | ] 722 | }, 723 | { 724 | "metadata": { 725 | "id": "VaIioR7EPfig", 726 | "colab_type": "code", 727 | "colab": { 728 | "base_uri": "https://localhost:8080/", 729 | "height": 71 730 | }, 731 | "outputId": "44e8c4e6-a27f-4bbe-ce29-4a21714871a3" 732 | }, 733 | "cell_type": "code", 734 | "source": [ 735 | "data_index = 12\n", 736 | "data_words = \" \".join(test_data_words[data_index])\n", 737 | "data_indexes = test_data[data_index]\n", 738 | "print(data_words)\n", 739 | "\n", 740 | "predicted = model.predict([[data_indexes]])\n", 741 | "print(encoder.classes_[np.argmax(predicted)])" 742 | ], 743 | "execution_count": 21, 744 | "outputs": [ 745 | { 746 | "output_type": "stream", 747 | "text": [ 748 | "спортын төв ордонд өнөөдөр азийн оюутны аварга шалгаруулах эмэгтэй волейболчдын хоёр дахь удаагийн тэмцээний талаар мэдээлэл хийлээ анхны тэмцээн онд тайландын бангконг хотноо болж хоёрдугаар тэмцээнийг азийн оюутны спортын холбооноос аосх олгосон эрхийн дагуу оны дөрөвдүгээр сарын ны өдрүүдэд монгол улсын нийслэл улаанбаатар хотноо зохион байгуулах тэмцээний эрхийг монгол улс оны тавдугаар сарын хуралдсан аосхны гүйцэтгэх хорооны хурлаар хоёр оронтой өрсөлдөн авчээ уг тэмцээнийг монгол улсад авах талаар мосхолбоо оноос санаачлага гарган хөөцөлдөж эхэлсэн тэмцээний эрхийг авахад муын засгийн газрын санхүүгийн дэмжлэг мэргэжлийн холбоодын ажлын туршлага манай улсын олон улсын нэр хүнд ихээхэн тус хүргэжээ зохион байгуулах хороог с ламбаа удирдаж тэмцээний зохион байгуулах хороог збх эрүүл мэндийн сайдын оны тоот тушаалаар батлаж даргаар уихын гишүүн монголын волейболын холбооны мвх хүндэт ерөнхийлөгч сламбаа ажиллаж збхны орлогч даргаар згхагентлагбтсгын дарга чнаранбаатар збхны нарийн бичгийн даргаар монголын оюутны спортын холбооны мосх ерөнхий нарийн бичгийн дарга джаргалсайхан збхны гишүүдэд бсшуяны төрийн нарийн бичгийн дарга ддалайжаргал нийслэлийн здтгазрын дарга цболдсайхан сяны газрын дарга дбатжаргал гхяамны консулын газрын дарга дганхуяг бсшуяны мэргэжлийн боловсролын газрын дарга мбаасанжав гихалбаны дарга дмөрөн мосхны ерөнхийлөгч оуосхны ерөнхий санхүүч дбаясгалан муисийн ректор стөмөрочир мубисийн ректор бжадамба залуу монгол корпорацийн ерөнхийлөгч мсономпил мохны ерөнхий нарийн бичгийн дарга нбямбагэрэл мвхолбооны ерөнхий нарий бичгийн дарга цбатэнх миат хкийн маркетинг борлуулатын хэлтсийн дарга тмэндсайхан боловсрол суваг 
телевизийн ерөнхий захирал аамундра нар сонгогдон ажиллаж тэмцээнийг үнэ төлбөргүй үзүүлнэ волейболын болон оюутны спортыг сурталчилах дэлгэрүүлэх үүднээс тэмцээнийг үнэ төлбөргүй үзүүлэхээр збхорооны анхдугаар хурлаас шийдвэрлэсэн нийслэлийн иргэдийг тэмцээнийг өргөнөөр үзэхийг здтгазраас уриалсан тэмцээнийг зөвхөн улаанбаатар хотын иргэд бус аймгаас волейболын спортыг сонирхон хөгжөөн\n", 749 | "эрүүл мэнд\n" 750 | ], 751 | "name": "stdout" 752 | } 753 | ] 754 | } 755 | ] 756 | } -------------------------------------------------------------------------------- /old_stuffs/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | .static_storage/ 57 | .media/ 58 | local_settings.py 59 | 60 | # Flask stuff: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy stuff: 65 | .scrapy 66 | 67 | # Sphinx documentation 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # Jupyter Notebook 74 | .ipynb_checkpoints 75 | 76 | # pyenv 77 | .python-version 78 | 79 | # celery beat schedule file 80 | celerybeat-schedule 81 | 82 | # SageMath parsed files 83 | *.sage.py 84 | 85 | # Environments 86 | .env 87 | .venv 88 | env/ 89 | venv/ 90 | ENV/ 91 | env.bak/ 92 | venv.bak/ 93 | 94 | # Spyder project settings 95 | .spyderproject 96 | .spyproject 97 | 98 | # Rope project settings 99 | .ropeproject 100 | 101 | # mkdocs documentation 102 | /site 103 | 104 | # mypy 105 | .mypy_cache/ 106 | 107 | .vscode 108 | model.bin 109 | quotes.json 110 | ids_matrix.npy 111 | 112 | corpuses/politics 113 | corpuses/economy 114 | corpuses/society 115 | corpuses/health 116 | corpuses/world 117 | corpuses/technology 118 | 119 | temp_corpuses/ 120 | tensorboard/ 121 | 122 | models/lstm 123 | models/bilstm 124 | 125 | djangoapp/db.sqlite3 126 | 127 | pretrained_word2vec/ -------------------------------------------------------------------------------- /old_stuffs/README.md: -------------------------------------------------------------------------------- 1 | Mongolian text classifier in tensorflow. 2 | 3 | # STEPS 4 | 5 | - Run spider in order to collect corpuses and labels from ikon.mn 6 | > scrapy runspider ikon_mn_scrape.py 7 | 8 | - Download corpus from here, let's respect ikon.mn, scraping can be troubling. 
9 | > https://drive.google.com/file/d/14NvUkqZRapivmiWc2WOOwIEu9UWyYnnJ/view?usp=sharing 10 | 11 | - Create word2vec from all files inside 'corpuses' directory 12 | > python3 clear_create_word2vec.py 13 | 14 | - Convert word2vec file to ids matrix as a numpy file format in order to use with tensorflow 15 | > python3 numpy_embedding_matrix_tf.py 16 | 17 | - Use embedding matrix with tensorflow in eager mode 18 | > python3 convert_text_to_seqvector_through_embedmatrix.py 19 | 20 | - Prepare training and testing dataset 21 | > python3 prepare_trainingset.py 22 | 23 | - Train lstm recurrent neural network for news classification 24 | > python3 training_bilstm_rnn.py 25 | 26 | > python3 training_lstm_rnn.py 27 | 28 | - Freeze trained checkpoints to servable tf model, iteration number is depends on your trained result, see models/bilstm folder after training 29 | > python3 freeze_tf_model.py --name lstm --iteration 1000 30 | 31 | > python3 freeze_tf_model.py --name bilstm --iteration 3000 32 | 33 | - Start classifier RPC server 34 | > python3 use_freezed_model_rpc.py 35 | 36 | - Start Django to see web interface 37 | > cd djangoapp 38 | 39 | > python manage.py runserver 40 | 41 | 42 | # DONE 43 | - Write some scrapers for ikon.mn 44 | - Prepare training texts with its labels, label should be a type of news. For example: politics, economy, society, health, world, technology etc 45 | - Train lstm classifier, also other ones like bibirectional lstm 46 | - Try to classify text from other sites, for example: news.gogo.mn, write some web backend interface, maybe I can use django 2.0 47 | - Implement testing dataset evaluation metrics 48 | 49 | # IN PROGRESS 50 | 51 | # TODO 52 | - Evaluate current model on 20 percent of the dataset as testnig 53 | - Use pretrained word2vec weights from facebook's fasttext https://fasttext.cc/docs/en/crawl-vectors.html 54 | - Use transfer learning techniques such like ULMFiT, ELMo embedding etc... and compare results 55 | - Implement stacked lstm 56 | - Implement stacked bidirectional lstm 57 | - Implement stacked bidirectional lstm with dropouts 58 | - Implement previous ones with batch normalization 59 | - Compare testing performances 60 | - Handle very long text through Truncated BPTT 61 | - Handle gradient vanishing issue with gradient clipping 62 | - Add attention to the lstms 63 | - Use an IndRNN and compare the results to previous ones 64 | 65 | # RESOURCE 66 | 67 | - ImageNet moment in NLP 68 | > https://thegradient.pub/nlp-imagenet/ 69 | 70 | - checkpointing, save, restore and freeze tensorflow models 71 | > http://cv-tricks.com/tensorflow-tutorial/save-restore-tensorflow-models-quick-complete-tutorial/ 72 | > https://nathanbrixius.wordpress.com/2016/05/24/checkpointing-and-reusing-tensorflow-models/ 73 | > http://cv-tricks.com/how-to/freeze-tensorflow-models/ 74 | 75 | - develop word embeddings python gensim 76 | > https://machinelearningmastery.com/develop-word-embeddings-python-gensim/ 77 | 78 | - how to clean text for machine learning with python 79 | > https://machinelearningmastery.com/clean-text-machine-learning-python/ 80 | 81 | - using gensim word2vec embeddings in tensorflow 82 | > http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/ 83 | 84 | - perform sentimental analysis with lstms using tensorflow 85 | > https://www.oreilly.com/learning/perform-sentiment-analysis-with-lstms-using-tensorflow 86 | 87 | - What does tf.nn.embedding_lookup function do? 
88 | > https://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do 89 | 90 | - How to One Hot encode categorical sequence data in python 91 | > https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/ 92 | > https://www.tensorflow.org/api_docs/python/tf/one_hot 93 | 94 | - How to crawl the web politely with scrapy 95 | > https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/ 96 | 97 | -------------------------------------------------------------------------------- /old_stuffs/clear_create_word2vec.py: -------------------------------------------------------------------------------- 1 | from clear_text_to_array import * 2 | from gensim.models import Word2Vec 3 | import glob, json, re 4 | 5 | # корпусыг ачаалах 6 | all_corpuses = "" 7 | 8 | max_word_count = 0 9 | max_word_content = "" 10 | max_word_url = "" 11 | file_count = 0 12 | all_words = 0 13 | 14 | print("reading all corpuses, please wait for a little while...") 15 | 16 | for filename in glob.iglob('corpuses/**/*.txt', recursive=True): 17 | with open(filename, 'r', encoding="utf8") as f: 18 | json_content = json.load(f) 19 | all_corpuses = all_corpuses + " " +json_content['title'] + ". \n "+json_content["body"]+". \n " 20 | 21 | file_count = file_count + 1 22 | body_content = json_content['body'] 23 | count = len(re.findall(r'\w+', body_content)) 24 | all_words = all_words + count 25 | if count > max_word_count: 26 | max_word_count = count 27 | max_word_content = body_content 28 | max_word_url = json_content['url'] 29 | 30 | average_words_per_news = all_words/file_count 31 | 32 | print("Reading is done. Here is some stats" ) 33 | print("------------------------------------" ) 34 | print("Total file count : ", file_count ) 35 | print("Average words per news : ", average_words_per_news) 36 | print("Total word count : ", all_words ) 37 | print("Maximum word count : ", max_word_count ) 38 | print("Maximum word count url : ", max_word_url ) 39 | print("------------------------------------" ) 40 | 41 | 42 | 43 | print("converting to the sentence array...") 44 | all_corpuses = all_corpuses + ".АННОУНҮГ." 
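# ".АННОУНҮГ." is appended once as a placeholder token so the word2vec vocabulary
# always contains an "unknown word" entry; training_helpers.py later resolves it via
# wordtoken_to_id(word2vec, "анноунүг") whenever an out-of-vocabulary token is seen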
45 | sentences = clear_text_to_array(all_corpuses) 46 | print('done.') 47 | 48 | print("starting to create word2vec...") 49 | model = Word2Vec(sentences, min_count=1) 50 | model.save('model.bin') 51 | print('word2vec model is saved as gensim file format.') 52 | 53 | total_unique_word_count = len(model.wv.vocab) 54 | print("------------------------------------" ) 55 | print("Unique word count : ", total_unique_word_count) 56 | print("------------------------------------" ) 57 | 58 | #print(words) 59 | #print(model['дээд']) 60 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/clear_text_to_array.py: -------------------------------------------------------------------------------- 1 | from nltk import sent_tokenize 2 | from nltk.tokenize import word_tokenize 3 | from nltk import stem 4 | import string 5 | from mongolianstopwords import * 6 | from stemmer import * 7 | 8 | def clear_text_to_array(input_text): 9 | text_sentences = sent_tokenize(input_text) 10 | stemmer = Stemmer.instance().stemmer 11 | sentences = [] 12 | for text_sentence in text_sentences: 13 | # өгүүлбэрийн текстийг үгүүд болгож хувиргах 14 | tokens = word_tokenize(text_sentence) 15 | # том үсгүүдийг болиулах 16 | tokens = [w.lower() for w in tokens] 17 | # үг бүрээс тэмдэгтүүдийг хасах 18 | table = str.maketrans('', '', string.punctuation) 19 | stripped = [w.translate(table) for w in tokens] 20 | # текст бус үгүүдийг хасах 21 | words = [word for word in stripped if word.isalpha()] 22 | # stopword уудыг хасах 23 | stop_words = set(stopwordsmn) 24 | words = [w for w in words if not w in stop_words] 25 | # stemming 26 | words = [stemmer.stem(w) for w in words] 27 | sentences.append(words) 28 | return sentences -------------------------------------------------------------------------------- /old_stuffs/convert_text_to_seqvector_through_embedmatrix.py: -------------------------------------------------------------------------------- 1 | from clear_text_to_array import * 2 | from wordtoken_to_id import * 3 | import numpy as np 4 | import tensorflow as tf 5 | import tensorflow.contrib.eager as tfe 6 | from gensim.models import Word2Vec 7 | 8 | tfe.enable_eager_execution() 9 | 10 | word2vec = Word2Vec.load('model.bin') 11 | ids_matrix = np.load('ids_matrix.npy') 12 | 13 | input_sentence = "хоёр өдрийн уулзалтын үр дүнд дээд хэмжээний элчээ илгээсэн юм." 
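# sample Mongolian sentence (roughly: "as a result of the two-day meeting,
# a high-level envoy was sent") used below to walk a sentence through the
# token -> vocabulary id -> embedding vector pipeline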
14 | 15 | sentence_array = clear_text_to_array(input_sentence)[0] 16 | 17 | print("---------------------") 18 | first_word = sentence_array[0] 19 | second_word = sentence_array[1] 20 | last_word = sentence_array[-1] 21 | first_index = word2vec.wv.vocab[first_word ].index 22 | second_index = word2vec.wv.vocab[second_word].index 23 | last_index = word2vec.wv.vocab[last_word ].index 24 | print("эхний үг : ", first_word , ", index : ", first_index ) 25 | print("хоёрдугаар үг : ", second_word, ", index : ", second_index) 26 | print("сүүлийн үг : ", last_word , ", index : ", last_index ) 27 | print("нийт үгсийн тоо : ", len(word2vec.wv.vocab)) 28 | print("---------------------") 29 | #print(word2vec.wv[last_word]) 30 | if (np.array_equal(ids_matrix[last_index], word2vec.wv[last_word])): 31 | print("YES, conversion to id sequence can be implemented through gensim word2vec object.") 32 | else: 33 | print("NO") 34 | 35 | print("---------------------") 36 | print("Өгүүлбэр ") 37 | print(sentence_array) 38 | print("---------------------") 39 | 40 | # converting token sequence into sequence of ids 41 | sentence_in_tokenids = [] 42 | for token in sentence_array: 43 | token_id = wordtoken_to_id(word2vec, token) 44 | sentence_in_tokenids.append(token_id) 45 | print("id нуудын жагсаалт") 46 | print(sentence_in_tokenids) 47 | 48 | # trying to convert sequence of vectors through tensorflow embedding lookup stuff. 49 | embeddings = tf.constant(ids_matrix) 50 | ids = tf.constant(sentence_in_tokenids) 51 | sequence_vectors = tf.nn.embedding_lookup(embeddings, ids) 52 | print("үгэн векторуудын жагсаалт тензор хэлбэрээр:") 53 | print(sequence_vectors) -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/economy_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Д.Сумъяабазар уул уурхайн сайд нарын дээд хэмжээний уулзалтад оролцлоо 2 | 3 | Уул уурхай, хүнд үйлдвэрийн сайд Д.Сумъяабазар Олон улсын уул уурхайн сайд нарын гурав дахь удаагийн дээд хэмжээний уулзалтад оролцлоо. Торонто хотноо зохион байгуулагдаж буй Олон улсын хайгуулч, олборлогчдын чуулга уулзалтын үеэр тохиосон уг дээд хэмжээний уулзалтад Перу, ОХУ, Чили, Энэтхэг, Австрали, Канад, Аргентин, Ирланд, Саудын Араб зэрэг 28 орны уул уурхайн асуудал эрхэлсэн сайд нар оролцов. 4 | 5 | Ирээдүйд уул уурхайн салбарыг хөгжүүлэх, уул уурхайн эдийн засаг, нийгмийн өсөлтөд оруулах хувь нэмрийг нэмэгдүүлэхэд Засгийн газар, олон нийт, уул уурхайн компаниудын хоорондын “итгэлцэл” чухал болохыг тус уулзалт онцоллоо. 6 | 7 | Хайгуул, ашиглалтын салбарт итгэлцэл бий болгох, нэгэнт бий болсон итгэлцлийг хадгалахад Засгийн газрууд, салбарын байгууллагууд, ТББ-ууд болон орон нутгийн иргэд бүгд чухал үүрэг гүйцэтгэдэг. Эдгээр оролцогч талууд уул уурхайн салбарт итгэлцэл бий болгоход цаг ямагт хамтран ажиллах хүсэл эрмэлзэлтэй байх ёстой. Энэ нь ирээдүйн уул уурхайн салбарыг бүтээх, эрдэс баялгийн арвин нөөцтэй орнуудын эдийн засгийн өсөлтийг хангах түлхүүр болно хэмээн уулзалт оролцогчид санал нэгдлээ. 8 | 9 | Мөн Засгийн газар, орон нутгийн иргэд, уул уурхайн компаниудын хоорондын харилцаа буурч, улам бүр бэрхшээлтэй болж байгаа тухай онцлон дурдсан. Үл итгэлцэл нь хөрөнгө оруулалтад хамгийн том саад болж байгааг ухамсарлаж, итгэлцэл, нийтлэг зорилгод үндэслэсэн хамтын ажиллагааны шинэ загвар бий болгох хэрэгцээ, шаардлага байгааг уулзалтад оролцогчид хүлээн зөвшөөрлөө. 
10 | 11 | Уул уурхай, хүнд үйлдвэрийн сайд Д.Сумъяабазарын Торонтод хийж буй ажлын айлчлал үргэлжилж байна. -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/health_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Хатгалгааны эсрэг вакциныг бүрэн хийлгэвэл 10-15 жилийн дархлаа тогтоно. 2 | 3 | Энэ сарын 1-нээс эхлэн нийслэлийн бүх дүүргийн 2 сартай хүүхдүүдийг уушгины хатгалгаа буюу пневмококкийн эсрэг 13 цент вакцинаар дархлаажуулж эхэллээ. 4 | 5 | Эхний тунг хоёр сартай, хоёр дахь тунг 4 сартай, гурав дахь тунг 9 сартай хүүхдэд тус тус хийнэ. Одоогийн байдлаар вакцинжуулалт 40 хувьтай үргэлжилж байна. 6 | 7 | Эрүүл мэндийн яамны мэдээлснээр энэ онд 38 мянган хүүхдийг 3 удаагийн тунгаар дархлаажуулахад шаардлагатай 110 мянган хүн/тун вакциныг НҮБ-ын Хүүхдийн санд захиалж, 35 мянган хүн/тунг хүлээж авсан бөгөөд үлдсэн 75 мянган хүн/тун вакциныг хоёрдугаар улиралд багтаан авах аж. 8 | 9 | Вакцины зах зээлийн үнэ нэг хүн/тун нь 13-15 ам.доллар байдаг юм байна. Харин НҮБ-ын Хүүхдийн сангийн хөнгөлөлттэй үнэ болох 3,5 ам.доллараар худалдан авч байна. 10 | 11 | Уг вакцины гурван тунг бүрэн авсан хүүхэд уушгины хатгаа өвчний эсрэг 10-15 жилийн дархлаа тогтоно. 12 | 13 | 2019 онд улсын хэмжээнд 74,3 мянган хүүхдийг 3 удаа дархлаажуулахад 202,700 хүн/тун вакцин, дагалдах хэрэгсэл, тээврийн зардалд 2 тэрбум төгрөг шаардлагатай. 14 | 15 | Өнөөдрийн байдлаар хоёр сартай 520 хүүхэд вакцинд хамрагдаад байна. 16 | 17 | С.Отгонжаргал -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/politics_news_ikon_mn.txt: -------------------------------------------------------------------------------- 1 | УИХ-ын дарга М.Энхболд БНТУ-ын Ерөнхийлөгч Режеп Тайип Эрдоанд бараалхлаа. 2 | 3 | Бүгд Найрамдах Турк Улсад албан ёсны айлчлал хийж буй Монгол Улсын Их Хурлын дарга М.Энхболд тус улсын Ерөнхийлөгч Режеп Тайип Эрдоанд бараалхлаа. 4 | 5 | Нэг цаг гаруй үргэлжилсэн уулзалт элэгсэг дотно, нөхөрсөг, илэн далангүй нөхцөл байдалд болж, хоёр орны харилцаа, хамтын ажиллагааны бүхий л салбарыг хөндөн ярилцав. 6 | 7 | Улсын Их Хурлын дарга М.Энхболд уулзалтын эхэнд “Эрхэмсэг ноён Ерөнхийлөгч Тантай Монгол Улсын Их Хурлын даргын хувиар дахин уулзаж байгаадаа баяртай байна. Өмнө айлчилж байсан үетэй харьцуулахад Турк Улсад бүтээн байгуулалт ихээр хийгдэж, эдийн засаг, нийгмийн амьдралд томоохон өөрчлөлт гарчээ. Энэ бүхэнд ноён Ерөнхийлөгч Таны манлайлал чухал үүрэг гүйцэтгэж байгааг тэмдэглэхийг хүсэж байна” гэлээ. 8 | 9 | Тэрбээр Монгол Улсын Ерөнхий сайдаар ажиллаж байхдаа тухайн үеийн Ерөнхий сайд Режеп Тайип Эрдоаны урилгаар 2006 онд, нийслэлийн Засаг дарга бөгөөд Улаанбаатар хотын захирагчийн хувьд 2003 онд Анкара хотод айлчилж, ноён Режеп Тайип Эрдоан 2005, 2013 онд Ерөнхий сайдын хувиар Монгол Улсад айлчилж байсан, хуучны танил, дотнын нөхөд юм. 10 | 11 | Улсын Их Хурлын дарга М.Энхболд олон улсын терроризмын эсрэг тэмцлийн тэргүүн эгнээнд Турк зогсож, ачааны хүндийг үүрч байгааг Монгол Улс ойлгон дэмжиж байдгийг тэмдэглэв. Түүнчлэн хоёр орны харилцааг цаашид эдийн засгийн агуулгаар баяжуулж, нөөц бололцоог бүрэн дүүрэн ашиглах, үүний тулд тогтонги байдалд байгаа худалдаа, эдийн засгийн хамтын ажиллагааг идэвхжүүлэх, тухайлбал хөдөө аж ахуй, аялал жуулчлал, хөнгөн үйлдвэрлэл, хот төлөвлөлт, барилга, боловсрол, соёл, урлаг, эрүүл мэнд зэрэг салбарт өргөжүүлэн хөгжүүлэхэд бэлэн байгаагаа нотоллоо. 
Эдгээр хамтын ажиллагааны үндэс нь эцэг дээдсээс уламжилсан хоёр орны угсаа гарлын түүхэн хэлхээ холбоонд байдгийг тэрбээр онцлоод, хувийн хэвшлийнхэн, бизнес эрхлэгчдийн шууд хамтын ажиллагаа нэн чухал байгааг тэмдэглэлээ. 12 | 13 | 14 | Эрхэмсэг ноён Р.Т.Эрдоан Туркийн хөрөнгө оруулалтыг сонирхсон салбарт чиглүүлэхэд тус дөхөм үзүүлэхэд бэлэн байгаагаа нотлов. Тэрбээр Монгол Улс, Монголын ард түмнийг бид ах дүүсээ хэмээн хүндэтгэдэг хэмээн онцлоод, боловсрол, иргэний агаарын тээвэр, хот байгуулалт, хөрөнгө оруулалтын зэрэг салбарт хамтран ажиллах зарим саналыг тавив. 15 | 16 | Тухайлбал, Орхоны хөндийг түшиглэсэн аялал жуулчлалын цогцолбор байгуулах талаарх Улсын Их Хурлын дарга М.Энхболдын саналыг дэмжиж байна гэв. Түүнчлэн Истанбул хотын даргаар ажиллаж байхдаа нийтийн тээврийн асуудлыг хэрхэн шийдэж байсан туршлагаасаа хуваалцаж, энэ чиглэлээр хамтран ажиллах боломжтой гэдгээ илэрхийллээ. Мөн хөдөө аж ахуйн салбар, түүхий дотор мах, махан бүтээгдэхүүний импорт хийх боломжийг судалж үзэхээ амлалаа. 17 | 18 | Улсын Их Хурлын дарга М.Энхболд ноён Р.Т.Эрдоаны тавьсан Монголд үйл ажиллагаа явуулж буй турк сургуулиудын асуудлыг зохистой шийдвэрлэх саналыг дэмжиж байгаагаа тэмдэглэж, холбогдох байгууллагууд хоёр талын эрх ашигт нийцсэн, тохиромжтой шийдлийг олно гэдэгт итгэж байна гэлээ. 19 | 20 | М.Энхболд, Р.Т.Эрдоан нар ирэх онд тохиох, хоёр орны хооронд дипломат харилцаа тогтоосны 50 жилийн ойг хамтын ажиллагаагаа өргөжүүлж, илүү өндөр түвшинд хүргэсэн, тодорхой томоохон үр дүн гаргасан амжилтаар угтахын тулд энэ өдрөөс эхлэн хоёр тал хичээн чармайх нь чухал гэдэгт санал нэгтэй байгаагаа тэмдэглэлээ. 21 | 22 | Эрхэмсэг ноён Р.Т.Эрдоан БНТУ-ын Ерөнхийлөгчийн хувьд Монгол Улс, Монголын ард түмэнтэй улам илүү дотно харилцаж, ах дүү ёсоор туслан дэмжиж, үнэнч нөхөрлөлийн бэлгэдэл болохуйц хамтын ажиллагаа өрнүүлэхэд хэзээд бэлэн байх болно гэдгээ илэрхийлэв. 23 | 24 | Улсын Их Хурлын дарга М.Энхболдод мөн өдөр БНТУ-ын нийслэл Анкара хотын дарга Мустафа Туна бараалхсан юм. Энэ үеэр Улсын Их Хурлын дарга М.Энхболд Улаанбаатар хотын даргын хувиар 2003 онд Анкарад айлчилж, Улаанбаатар-Анкара хотын хооронд “Ах дүү хотуудын харилцаа тогтоох” тунхаглалд гарын үсэг зурж, 2004 онд Улаанбаатар хотын соёлын өдрүүдийг тус хотод зохион байгуулж байснаа дурслаа. 25 | 26 | Ноён Мустафа Туна өмнө нь Анкара хотын Синжан дүүргийн Засаг даргаар ажиллаж байхдаа нийслэлийн Чингэлтэй дүүрэгтэй хамтын ажиллагаа тогтоож, айлчлалын бүрэлдэхүүнд багтан яваа Улсын Их Хурлын гишүүн Д.Ганболдыг тус дүүргийн Засаг дарга байхад элэгсэг дотноор хамтран ажиллаж, ажил хэргийн анд нөхөд болсноо тэмдэглэлээ. 27 | 28 | Улсын Их Хурлын дарга М.Энхболд ноён Мустафа Тунаг БНТУ-ын нийслэл Анкара хотын даргаар ажиллах болсонд нь баяр хүргэж, ажлын амжилт хүсээд, Улаанбаатар хотод зочилж, хоёр хотын хамтын ажиллагааг өргөжүүлэхэд онцгой анхаарна гэдэгт итгэж байгаагаа илэрхийлэв. Тэрбээр Анкара хотын бүтээн байгуулалтын туршлагаас монгол анд нөхөдтэйгээ хуваалцаж, Улаанбаатар хотын агаар, орчны бохирдлыг бууруулах, зам тээврийн тулгамдсан асуудлыг шийдвэрлэх, хот тохижилтыг шинэ шатанд гаргах зэрэг чиглэлээр харилцан ашигтай хамтран ажиллахын чухлыг онцоллоо. 29 | 30 | Ноён Мустафа Туна хотыг дахин төлөвлөх чиглэлээр хамтран ажиллахад бэлэн байгаагаа илэрхийлээд, Улаанбаатар хотын агаар, усны бохирдлыг арилгах талаарх өмнө эхэлсэн ажлаа үргэлжлүүлэх илүү өргөн боломж нээгдэж байгааг тэмдэглэлээ. 
31 | 32 | 33 | Энэ өдөр Улсын Их Хурлын дарга М.Энхболд тус хотод буй Чингис хааны цэцэрлэгт хүрээлэнд зочилж, Эзэн Богдын дурсгалын хөшөөнд цэцэг өргөлөө. Тэрбээр энэхүү хүрээлэнг бий болгох, хөшөө босгох санаачилгыг Улаанбаатар хотын даргаар ажиллаж байхдаа гаргаж байжээ. Энэ нь Чингис хааны гадаад орон дахь анхны хөшөө болж байсан түүхтэй юм байна. 34 | 35 | Улсын Их Хурлын дарга М.Энхболд мөн өдөр Туркийн өдөр тутмын “Дэйли Сабах” сонинд ярилцлага өгөв. 36 | 37 | Тэрбээр ярилцлагадаа Монгол Улсын нийгэм, эдийн засгийн өнөөгийн байдал, цаашдын зорилтын талаар танилцуулж, Монгол-Туркийн найрсаг, ах дүү ёсны харилцаа, хамтын ажиллагааг зөвхөн парламентын түвшинд биш, бүхий л салбарт илүү өндөр түвшинд, үр ашигтай хөгжүүлэх талаарх санал бодлоо илэрхийллээ гэж Улсын Их Хурлын Хэвлэл мэдээлэл, олон нийттэй харилцах хэлтсийн ажилтнууд Анкара хотоос мэдээлэв. -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/technology_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | Бээжингийн шүүх VR технологи ашиглаж эхэлжээ. 2 | 3 | VR технологи буюу виртуал орчныг үүсгэх технологийг өнөөдөр гэрийн нөхцөлд, музей, видео тоглоом зэрэгт ашигладаг болсон бол Бээжингийн шүүх гэмт хэргийн орчныг харуулахад ашиглаж эхэлжээ. 4 | 5 | Мягмар гарагт болсон шүүх хурлын үеэр анх удаа гэмт хэргийг шүүх явцад ийм технологи ашигласан аж. 2017 оны есдүгээр сард болсон хүн амины хэргийн гэрчийн мэдүүлэгт үндэслэн VR технологийн тусламжтай үйл явдлыг бодитоор дүрсэлжээ. 6 | 7 | 8 | 9 | Бээжингийн шүүхийн төлөөлөгч Жуан Вей “Өмнө нь прокурорууд баримтуудыг ихэвчлэн амаар эсвэл Powerpoint ашиглан үзүүлдэг байлаа. Харин технологийн дэвшлийн ачаар илүү бодитой бөгөөд шүүх танхимд байгаа хүмүүст нээлттэй харуулах боломж бүрдлээ” гэсэн байна. 10 | 11 | Өнөөдөр БНХАУ-д VR технологи ашиглан хар тамхинаас гаргах эмчилгээ хийж, дасгал хийхэд ашиглаж байгаа бөгөөд хамгийн өргөн ашиглаж буй салбар нь видео тоглоом юм. 12 | 13 | Эх сурвалж: Xinhua -------------------------------------------------------------------------------- /old_stuffs/corpuses_test/world_news_gogo_mn.txt: -------------------------------------------------------------------------------- 1 | АНУ-ын Ерөнхийлөгч Доналд Трамп БНАСАУ-ын удирдагч Ким Жон Унтай тавдугаар сард багтаан уулзалт зохион байгуулахаар тохиролцсон талаарх мэдээлэл дэлхий нийтийн анхааралд байна. 2 | 3 | Энэ уулзалт АНУ болон БНАСАУ-ын хооронд хийж буй анхны дээд хэмжээний уулзалт болох бөгөөд одоогийн байдлаар хэзээ, хаана гэдэг нь тодорхойгүй хэвээр байгаа юм. 4 | 5 | Thediplomat.com сайтаас энэ уулзалтыг хоёр Солонгосын хилийн заагт байрлах Панмүнжомд болох болов уу гэж таамаглаж байгаа талаараа мэдээлжээ. Гэвч Доналд Трампийн тамгын газар өөр газар хайж эхэлбэл Монголын нийслэл Улаанбаатар уулзалтад тохиромжтой хэмээн уг мэдээлэлдээ дурдсан байна. 6 | 7 | Түүнчлэн дээд хэмжээний уулзалт болох шийдвэр гарсан гэсэн мэдээлэл цацагдсанаас 22 цагийн дараа Ерөнхийлөгч асан Ц.Элбэгдорж өөрийн твиттер хуудаснаа “Санал байна: АНУ-ын Ерөнхийлөгч Доналд Трамп болон Хойд Солонгосын удирдагч Ким Жон Ун нар Улаанбаатар хотод уулзаж болно. Монгол Улс бол хамгийн тохиромжтой, төвийг сахисан нутаг дэвсгэр юм. Бид Япон ба Хойд Солонгосын уулзалт зэрэг олон чухал уулзалтуудад түлхэц болж байсан. Монгол орны хойшдын үлдээх өв бол Зүүн Хойд Азийн аюулгүй байдлын асуудлаарх Улаанбаатарын яриа хэлэлцээ юм” хэмээн жиргэснийг онцолжээ. 
8 | 9 | Thediplomat.com сайтад Жулиан Диркес, М.Жаргалсайхан нарын нийтэлсэн Доналд Трамп, Ким Жон Ун нарын уулзалт Улаанбаатарт болох нь яагаад зөв сонголт байх талаар найман шалтгааныг хүргэж байна. 10 | 11 | 1.Монгол Улс нь 1990 онд Ардчилсан хувьсгал гарснаас хойш бүс нутгийнхаа хөрш орнуудтай улс төрийн төвийг сахисан, найрсаг харилцаатай улс. 2015 онд албан ёсоор төвийг сахисан бодлогын талаар хэлэлцүүлэг өрнүүлж байсан. 12 | 13 | 2.АНУ-тай найрсаг харилцаатай. 1990 оноос АНУ-тай найрсаг харилцаа тогтоон, олон удаа төрийн дээд хэмжээний уулзалтуудыг зохион байгуулж байсан. Мөн АНУ Монгол Улсад тусламж, хөрөнгө оруулалт хийсэн төдийгүй Ардчиллыг бэхжүүлэхэд нэмэр оруулсан хөрш орны нэг. 14 | 15 | 3.БНАСАУ-тай ч найрсаг харилцаатай. Хоёрдугаар сард Пёньяанд болсон өвлийн олимпийн үеэр Монгол Улсын Гадаад харилцааны сайд Д.Цогтбаатартай БНАСАУ-аас Монголд ажилтан ажиллуулах гэрээ хийсэн. Магадгүй хоёр солонгосын дайны үеэр зуу зуун хүүхдийг Монгол уруу нүүлгэн шилжүүлж байсан нь БНАСАУ-ын хувьд хамгийн ач холбогдолтой зүйл байж болох бөгөөд энэ сэтгэл хөдөлгөсөн харилцаа цаашид ч үргэлжилнэ. 16 | 17 | 4.Уулзалт Азид болно. Улаанбаатар бол АНУ болон БНАСАУ-ын төлөөлөгчид амархан хүрч болох байршил. 18 | 19 | 5.Монгол Улсын Засгийн газар БНАСАУ болон 2007, 2012 онуудад Япон ба Хойд Солонгос засгийн газрын түвшний уулзалт Монгол Улсад зохион байгуулагдаж байсан. 2017 онд Монгол Улсын Засгийн газраас Зүүн Хойд Азийн аюулгүй байдлын асуудлаарх Улаанбаатарын яриа хэлэлцээ олон улсын бага хурлыг зохион байгуулсан. БНАСАУ-ын Гадаад хэргийн сайд Ри Ён Хо энэ бага хуралд оролцсон бөгөөд хоёр талын хэлэлцээ бүхий хэд хэдэн уулзалтад оролцсон. Хоёр орны төлөөлөгчид сэтгэл хангалуун үлдэж, эдгээр уулзалт амжилттай болсон. 20 | 21 | 6.Бие даасан найдвартай байдал. Цөмийн зэвсэг үл дэлгэрүүлэх нь Доналд Трамп болон Ким Жон Ун нарын уулзалтын нэг асуудал. Монгол Улсын хувьд “Цөмийн зэвсэггүй бүс” статустай. 22 | 23 | 7.АНУ болон БНАСАУ-ын холбоотон орнууд Улаанбаатарыг зөвшөөрнө. Мэдээж хэрэг Өмнөд Солонгосын эрх баригчид өөрт илүү ойр байршлыг илүүд үзэх ч Монгол Улс байж болохр сонголт. Япон улс өмнө нь Монгол Улсыг дундын орон болохыг хүлээн зөвшөөрч байсан төдийгүй уулзалтыг Солонгосын хойгоос өөр газар зохион байгуулах нь хулгайлагдсан хүмүүсийн асуудлыг хөтөлбөрт багтаахад хэрэгтэй юм. Одоогийн байдлаар Ерөнхийлөгч Путин, Ши Жиньпин нарын хувьд тусгайлан төлөвлөсөн газар байгаа эсэх нь тодорхойгүй бөгөөд аль аль Улаанбаатарт болохыг зөвшөөрөх магадлалтай. 24 | 25 | 8.Дээд хэмжээний уулзалтыг Улаанбаатарт хийхэд зуу зуун албаны хүмүүс очно. Үүнтэй төстэй уулзалтыг 2016 онд Улаанбаатарт зохион байгуулж байсан. Тодруулбал АСЕМ. Тавдугаар сар гэдэг бол Монголын хахир хатуу хүйтэн өвөл дуусаад удаагүй байх хугацаа тул жуулчид ч тийм их биш. Энэ нь зочид буудал, агаарын тээвэрт айлчлалын баг санаа зовох зүйлгүй. 
-------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/app/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/admin.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from django.contrib.auth import get_user_model 3 | from django.contrib.auth.admin import UserAdmin 4 | 5 | from .forms import CustomUserCreationForm, CustomUserChangeForm 6 | from .models import CustomUser 7 | 8 | class CustomUserAdmin(UserAdmin): 9 | model = CustomUser 10 | add_form = CustomUserCreationForm 11 | form = CustomUserChangeForm 12 | 13 | admin.site.register(CustomUser, CustomUserAdmin) -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/apps.py: -------------------------------------------------------------------------------- 1 | from django.apps import AppConfig 2 | 3 | 4 | class AppConfig(AppConfig): 5 | name = 'app' 6 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/forms.py: -------------------------------------------------------------------------------- 1 | from django import forms 2 | from django.contrib.auth.forms import UserCreationForm, UserChangeForm 3 | from .models import CustomUser 4 | 5 | class CustomUserCreationForm(UserCreationForm): 6 | class Meta(UserCreationForm.Meta): 7 | model = CustomUser 8 | fields = UserCreationForm.Meta.fields 9 | 10 | class CustomUserChangeForm(UserChangeForm): 11 | class Meta: 12 | model = CustomUser 13 | fields = UserChangeForm.Meta.fields 14 | 15 | 16 | class MongolianTextForm(forms.Form): 17 | content = forms.CharField( 18 | widget=forms.Textarea(attrs={'placeholder': 'Please copy and paste some mongolian text here...'}), 19 | label="" 20 | ) -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/migrations/0001_initial.py: -------------------------------------------------------------------------------- 1 | # Generated by Django 2.0.3 on 2018-03-10 16:23 2 | 3 | import app.models 4 | import django.contrib.auth.validators 5 | from django.db import migrations, models 6 | import django.utils.timezone 7 | 8 | 9 | class Migration(migrations.Migration): 10 | 11 | initial = True 12 | 13 | dependencies = [ 14 | ('auth', '0009_alter_user_last_name_max_length'), 15 | ] 16 | 17 | operations = [ 18 | migrations.CreateModel( 19 | name='CustomUser', 20 | fields=[ 21 | ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')), 22 | ('password', models.CharField(max_length=128, verbose_name='password')), 23 | ('last_login', models.DateTimeField(blank=True, null=True, verbose_name='last login')), 24 | ('is_superuser', models.BooleanField(default=False, help_text='Designates that this user has all permissions without explicitly assigning them.', verbose_name='superuser status')), 25 | ('username', models.CharField(error_messages={'unique': 'A user with that username already exists.'}, help_text='Required. 150 characters or fewer. 
Letters, digits and @/./+/-/_ only.', max_length=150, unique=True, validators=[django.contrib.auth.validators.UnicodeUsernameValidator()], verbose_name='username')), 26 | ('first_name', models.CharField(blank=True, max_length=30, verbose_name='first name')), 27 | ('last_name', models.CharField(blank=True, max_length=150, verbose_name='last name')), 28 | ('email', models.EmailField(blank=True, max_length=254, verbose_name='email address')), 29 | ('is_staff', models.BooleanField(default=False, help_text='Designates whether the user can log into this admin site.', verbose_name='staff status')), 30 | ('is_active', models.BooleanField(default=True, help_text='Designates whether this user should be treated as active. Unselect this instead of deleting accounts.', verbose_name='active')), 31 | ('date_joined', models.DateTimeField(default=django.utils.timezone.now, verbose_name='date joined')), 32 | ('groups', models.ManyToManyField(blank=True, help_text='The groups this user belongs to. A user will get all permissions granted to each of their groups.', related_name='user_set', related_query_name='user', to='auth.Group', verbose_name='groups')), 33 | ('user_permissions', models.ManyToManyField(blank=True, help_text='Specific permissions for this user.', related_name='user_set', related_query_name='user', to='auth.Permission', verbose_name='user permissions')), 34 | ], 35 | options={ 36 | 'verbose_name': 'user', 37 | 'verbose_name_plural': 'users', 38 | 'abstract': False, 39 | }, 40 | managers=[ 41 | ('objects', app.models.CustomUserManager()), 42 | ], 43 | ), 44 | ] 45 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/migrations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/app/migrations/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/models.py: -------------------------------------------------------------------------------- 1 | from django.db import models 2 | from django.contrib.auth.models import AbstractUser, UserManager 3 | 4 | class CustomUserManager(UserManager): 5 | pass 6 | 7 | class CustomUser(AbstractUser): 8 | objects = CustomUserManager() 9 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/tests.py: -------------------------------------------------------------------------------- 1 | from django.test import TestCase 2 | 3 | # Create your tests here. 4 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/urls.py: -------------------------------------------------------------------------------- 1 | from django.urls import path 2 | from . 
import views 3 | 4 | urlpatterns = [ 5 | path('', views.index, name='index'), 6 | path('classify', views.classify, name='classify'), 7 | ] -------------------------------------------------------------------------------- /old_stuffs/djangoapp/app/views.py: -------------------------------------------------------------------------------- 1 | from django.shortcuts import render 2 | from django.http import HttpResponse 3 | from django.shortcuts import render, redirect 4 | from .forms import MongolianTextForm 5 | import xmlrpc.client 6 | 7 | def index(request): 8 | form = MongolianTextForm() 9 | return render(request, 'home.html', {'form' : form}) 10 | 11 | def classify(request): 12 | if request.method == "POST": 13 | form = MongolianTextForm(request.POST) 14 | if form.is_valid(): 15 | content = form.cleaned_data['content'] 16 | with xmlrpc.client.ServerProxy("http://localhost:50001/") as proxy: 17 | news_class = proxy.predict_class_from_text(content) 18 | return render(request, 'classify.html', {'content' : content, 'news_class': news_class}) 19 | return redirect('index') 20 | else: 21 | return redirect('index') 22 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/djangoapp/djangoapp/__init__.py -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/settings.py: -------------------------------------------------------------------------------- 1 | """ 2 | Django settings for djangoapp project. 3 | 4 | Generated by 'django-admin startproject' using Django 2.0.3. 5 | 6 | For more information on this file, see 7 | https://docs.djangoproject.com/en/2.0/topics/settings/ 8 | 9 | For the full list of settings and their values, see 10 | https://docs.djangoproject.com/en/2.0/ref/settings/ 11 | """ 12 | 13 | import os 14 | 15 | # Build paths inside the project like this: os.path.join(BASE_DIR, ...) 16 | BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) 17 | 18 | 19 | # Quick-start development settings - unsuitable for production 20 | # See https://docs.djangoproject.com/en/2.0/howto/deployment/checklist/ 21 | 22 | # SECURITY WARNING: keep the secret key used in production secret! 23 | SECRET_KEY = '(a_8w^*gzx=w$nt2l&94e!^6#$e5+lq1qa0x0z0d4^-+ir(=i2' 24 | 25 | # SECURITY WARNING: don't run with debug turned on in production! 
26 | DEBUG = True 27 | 28 | ALLOWED_HOSTS = [] 29 | 30 | 31 | # Application definition 32 | 33 | INSTALLED_APPS = [ 34 | 'django.contrib.admin', 35 | 'django.contrib.auth', 36 | 'django.contrib.contenttypes', 37 | 'django.contrib.sessions', 38 | 'django.contrib.messages', 39 | 'django.contrib.staticfiles', 40 | 41 | 'app' 42 | ] 43 | 44 | MIDDLEWARE = [ 45 | 'django.middleware.security.SecurityMiddleware', 46 | 'django.contrib.sessions.middleware.SessionMiddleware', 47 | 'django.middleware.common.CommonMiddleware', 48 | 'django.middleware.csrf.CsrfViewMiddleware', 49 | 'django.contrib.auth.middleware.AuthenticationMiddleware', 50 | 'django.contrib.messages.middleware.MessageMiddleware', 51 | 'django.middleware.clickjacking.XFrameOptionsMiddleware', 52 | ] 53 | 54 | ROOT_URLCONF = 'djangoapp.urls' 55 | 56 | TEMPLATES = [ 57 | { 58 | 'BACKEND': 'django.template.backends.django.DjangoTemplates', 59 | 'DIRS': [os.path.join(BASE_DIR, 'templates')], 60 | 'APP_DIRS': True, 61 | 'OPTIONS': { 62 | 'context_processors': [ 63 | 'django.template.context_processors.debug', 64 | 'django.template.context_processors.request', 65 | 'django.contrib.auth.context_processors.auth', 66 | 'django.contrib.messages.context_processors.messages', 67 | ], 68 | }, 69 | }, 70 | ] 71 | 72 | WSGI_APPLICATION = 'djangoapp.wsgi.application' 73 | 74 | 75 | # Database 76 | # https://docs.djangoproject.com/en/2.0/ref/settings/#databases 77 | 78 | DATABASES = { 79 | 'default': { 80 | 'ENGINE': 'django.db.backends.sqlite3', 81 | 'NAME': os.path.join(BASE_DIR, 'db.sqlite3'), 82 | } 83 | } 84 | 85 | 86 | # Password validation 87 | # https://docs.djangoproject.com/en/2.0/ref/settings/#auth-password-validators 88 | 89 | AUTH_PASSWORD_VALIDATORS = [ 90 | { 91 | 'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator', 92 | }, 93 | { 94 | 'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator', 95 | }, 96 | { 97 | 'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator', 98 | }, 99 | { 100 | 'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator', 101 | }, 102 | ] 103 | 104 | 105 | # Internationalization 106 | # https://docs.djangoproject.com/en/2.0/topics/i18n/ 107 | 108 | LANGUAGE_CODE = 'en-us' 109 | 110 | TIME_ZONE = 'UTC' 111 | 112 | USE_I18N = True 113 | 114 | USE_L10N = True 115 | 116 | USE_TZ = True 117 | 118 | 119 | # Static files (CSS, JavaScript, Images) 120 | # https://docs.djangoproject.com/en/2.0/howto/static-files/ 121 | 122 | STATIC_URL = '/static/' 123 | 124 | # custom settings 125 | AUTH_USER_MODEL = 'app.CustomUser' -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/urls.py: -------------------------------------------------------------------------------- 1 | from django.contrib import admin 2 | from django.urls import path, include 3 | from django.views.generic.base import TemplateView 4 | 5 | urlpatterns = [ 6 | #path('', TemplateView.as_view(template_name='home.html'), name='home'), 7 | path('', include('app.urls')), 8 | path('', include('django.contrib.auth.urls')), 9 | path('admin/', admin.site.urls), 10 | ] 11 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/djangoapp/wsgi.py: -------------------------------------------------------------------------------- 1 | """ 2 | WSGI config for djangoapp project. 3 | 4 | It exposes the WSGI callable as a module-level variable named ``application``. 
5 | 6 | For more information on this file, see 7 | https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/ 8 | """ 9 | 10 | import os 11 | 12 | from django.core.wsgi import get_wsgi_application 13 | 14 | os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangoapp.settings") 15 | 16 | application = get_wsgi_application() 17 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/manage.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import os 3 | import sys 4 | 5 | if __name__ == "__main__": 6 | os.environ.setdefault("DJANGO_SETTINGS_MODULE", "djangoapp.settings") 7 | try: 8 | from django.core.management import execute_from_command_line 9 | except ImportError as exc: 10 | raise ImportError( 11 | "Couldn't import Django. Are you sure it's installed and " 12 | "available on your PYTHONPATH environment variable? Did you " 13 | "forget to activate a virtual environment?" 14 | ) from exc 15 | execute_from_command_line(sys.argv) 16 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/base.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | {% block title %}Mongolian text classification app{% endblock %} 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 |
15 |
16 | Home 17 |
18 |
19 |
20 |
21 | {% block content %} {% endblock %} 22 |
23 |
24 |
25 |
26 | 27 | -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/classify.html: -------------------------------------------------------------------------------- 1 | 2 | {% extends 'base.html' %} 3 | 4 | {% block title %}Mongolian text classification app{% endblock %} 5 | 6 | {% block content %} 7 | 8 | class : {{ news_class }}
9 | {{ content }} 10 | 11 | {% endblock %} -------------------------------------------------------------------------------- /old_stuffs/djangoapp/templates/home.html: -------------------------------------------------------------------------------- 1 | 2 | {% extends 'base.html' %} 3 | 4 | {% block title %}Mongolian text classification app{% endblock %} 5 | 6 | {% block content %} 7 | 8 |
9 | {% csrf_token %} 10 | {{ form.as_p }} 11 | 12 |
13 | 14 | {% endblock %} -------------------------------------------------------------------------------- /old_stuffs/freeze_tf_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import argparse, sys 3 | 4 | parser = argparse.ArgumentParser() 5 | parser.add_argument("--name", help="name of nn architecture") 6 | parser.add_argument("--iteration", help="iteration value") 7 | args = parser.parse_args() 8 | 9 | if args.name is None or args.iteration is None: 10 | print("please provide parameters, more info") 11 | print("python freeze_tf_model.py -h") 12 | sys.exit() 13 | 14 | name = args.name 15 | iteration = args.iteration 16 | 17 | saver = tf.train.import_meta_graph('./models/{name}/pretrained_{name}.ckpt-{iteration}.meta'.format(name=name, iteration=iteration), clear_devices=True) 18 | graph = tf.get_default_graph() 19 | input_graph_def = graph.as_graph_def() 20 | sess = tf.Session() 21 | saver.restore(sess, "./models/{name}/pretrained_{name}.ckpt-{iteration}".format(name=name, iteration=iteration)) 22 | 23 | # output variable name 24 | output_node_names = "input_placeholder,prediction_op" 25 | output_graph_def = tf.graph_util.convert_variables_to_constants( 26 | sess, input_graph_def, output_node_names.split(",") 27 | ) 28 | 29 | output_graph = "./models/{name}/pretrained_{name}-{iteration}.pb".format(name=name, iteration=iteration) 30 | with tf.gfile.GFile(output_graph, "wb") as f: 31 | f.write(output_graph_def.SerializeToString()) 32 | print("saved to ", output_graph) 33 | 34 | sess.close() -------------------------------------------------------------------------------- /old_stuffs/ikon_mn_scrape.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from scrapy import Request 3 | from scrapy.shell import inspect_response 4 | import json, os 5 | from hashlib import md5 6 | 7 | root_link = "http://ikon.mn" 8 | 9 | class IkonSpider(scrapy.Spider): 10 | name='ikonspider' 11 | robotstxt_obey = True 12 | download_delay = 0.5 13 | user_agent = 'sharavaa-crawler-for-nlp (sharavsambuu@gmail.com)' 14 | autothrottle_enabled = True 15 | httpcache_enabled = True 16 | 17 | def start_requests(self): 18 | start_urls = [ 19 | (root_link+'/l/1' , "politics" ), # улс төр 20 | (root_link+'/l/2' , "economy" ), # эдийн засаг 21 | (root_link+'/l/3' , "society" ), # нийгэм 22 | (root_link+'/l/16', "health" ), # эрүүл мэнд 23 | (root_link+'/l/4' , "world" ), # дэлхийд 24 | (root_link+'/l/7' , "technology"), # технологи 25 | ] 26 | for index, url_tuple in enumerate(start_urls): 27 | url = url_tuple[0] 28 | category = url_tuple[1] 29 | yield Request(url, meta={'category': category}) 30 | 31 | def parse(self, response): 32 | news_title = response.xpath("//*[contains(@class, 'inews')]//h1/text()").extract() 33 | if (len(news_title)==0): 34 | print(">>>>>>>>>>>>> I'M GROOOOOOT ") 35 | else: 36 | news_title = news_title[0].strip() 37 | news_body = response.xpath("//*[contains(@class, 'icontent')]/descendant::*/text()[normalize-space() and not(ancestor::a | ancestor::script | ancestor::style)]").extract() 38 | news_body = " ".join(news_body) 39 | category = response.meta.get('category', 'default') 40 | url = response.request.url 41 | hashed_name = md5(news_title.encode("utf-8")).hexdigest() 42 | file_name = "./corpuses/"+category+"/"+hashed_name+".txt" 43 | print("saving to ", file_name) 44 | data = {} 45 | data['title'] = news_title 46 | data['body' ] = news_body 47 | data['url' ] = url 48 | 
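# create the per-category corpus directory on first use, then persist the scraped
# article (title, body, url) as UTF-8 JSON named by the md5 hash of the title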
os.makedirs(os.path.dirname(file_name), exist_ok=True) 49 | with open(file_name, "w", encoding="utf8") as outfile: 50 | json.dump(data, outfile, ensure_ascii=False) 51 | 52 | #import pdb; pdb.set_trace() 53 | 54 | for next_page in response.xpath("//*[contains(@class, 'nlitem')]//a"): 55 | yield response.follow(next_page, self.parse, meta={'category': response.meta.get('category', 'default')}) 56 | 57 | for next_page in response.xpath("//*[contains(@class, 'ikon-right-dir')]/parent::a"): 58 | yield response.follow(next_page, self.parse, meta={'category': response.meta.get('category', 'default')}) 59 | -------------------------------------------------------------------------------- /old_stuffs/images/accuracy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/accuracy.png -------------------------------------------------------------------------------- /old_stuffs/images/classifiedresult.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/classifiedresult.png -------------------------------------------------------------------------------- /old_stuffs/images/loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/loss.png -------------------------------------------------------------------------------- /old_stuffs/images/webinput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sharavsambuu/mongolian-text-classification/1b82259dbc093864f00b00c5a36cecb0b6af0f5e/old_stuffs/images/webinput.png -------------------------------------------------------------------------------- /old_stuffs/mongolianstopwords.py: -------------------------------------------------------------------------------- 1 | # lists from https://github.com/Xangis/extra-stopwords/blob/master/mongolian 2 | # and https://github.com/erkhemee/chatbot-ub-hackathon 3 | stopwordsmn = [ 4 | 'аа', 5 | 'аанхаа', 6 | 'алив', 7 | 'ба', 8 | 'байдаг', 9 | 'байжээ', 10 | 'байна', 11 | 'байсаар', 12 | 'байсан', 13 | 'байхаа', 14 | 'бас', 15 | 'бишүү', 16 | 'бол', 17 | 'болжээ', 18 | 'болно', 19 | 'болоо' 20 | 'бэ', 21 | 'вэ', 22 | 'гэж', 23 | 'гэжээ', 24 | 'гэлтгүй', 25 | 'гэсэн', 26 | 'гэтэл', 27 | 'за', 28 | 'л', 29 | 'мөн', 30 | 'нь', 31 | 'тэр', 32 | 'уу', 33 | 'харин', 34 | 'хэн', 35 | 'ч', 36 | 'энэ', 37 | 'ээ', 38 | 'юм', 39 | 'үү', 40 | '?', 41 | '', 42 | '.', 43 | ',', 44 | '-', 45 | '-ийн', 46 | '-ын', 47 | '-тай', 48 | '-г', 49 | '-ийг', 50 | '-д', 51 | '-н', 52 | '-ний', 53 | '-дээр', 54 | 'юу', 55 | ] -------------------------------------------------------------------------------- /old_stuffs/numpy_embedding_matrix_tf.py: -------------------------------------------------------------------------------- 1 | from gensim.models import Word2Vec 2 | import numpy as np 3 | 4 | model = Word2Vec.load('model.bin') 5 | 6 | vector_dim = 100 7 | 8 | #word_to_id = {} 9 | embedding_matrix = np.zeros((len(model.wv.vocab), vector_dim)) 10 | for i in range(len(model.wv.vocab)): 11 | embedding_vector = model.wv[model.wv.index2word[i]] 12 | if embedding_vector is not None: 13 | 
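# row i of the matrix gets the gensim vector of the i-th vocabulary word, so row
# order stays aligned with word2vec's index2word ordering; this is what lets the
# later tf.nn.embedding_lookup by word id return the matching vector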
embedding_matrix[i] = embedding_vector 14 | #word_to_id[i] = model.wv.index2word 15 | 16 | np.save('ids_matrix', embedding_matrix) 17 | #import pdb; pdb.set_trace() 18 | print('embedded ids matrix is saved as a numpy file format.') 19 | 20 | -------------------------------------------------------------------------------- /old_stuffs/prepare_trainingset.py: -------------------------------------------------------------------------------- 1 | from os import listdir 2 | from os.path import isfile, join 3 | import shutil 4 | import glob, os, os.path 5 | import random 6 | import math 7 | import json 8 | 9 | import numpy as np 10 | import tensorflow as tf 11 | import tensorflow.contrib.eager as tfe 12 | 13 | from gensim.models import Word2Vec 14 | 15 | from wordtoken_to_id import * 16 | from clear_text_to_array import * 17 | 18 | global_corpuses = [] 19 | 20 | def convert_to_onehot(name): 21 | switcher = { 22 | "economy" : lambda: [1,0,0,0,0,0], 23 | "health" : lambda: [0,1,0,0,0,0], 24 | "politics" : lambda: [0,0,1,0,0,0], 25 | "society" : lambda: [0,0,0,1,0,0], 26 | "technology" : lambda: [0,0,0,0,1,0], 27 | "world" : lambda: [0,0,0,0,0,1] 28 | } 29 | return switcher.get(name, lambda: [0,0,0,0,0,0])() 30 | 31 | def fix_news_body(filename): 32 | found = False 33 | jsoncontent = "" 34 | with open(filename, encoding="utf8") as f: 35 | jsoncontent = json.load(f) 36 | body = jsoncontent['body'].strip() 37 | if not body: 38 | print("YES EMPTY BODY FOUND...") 39 | found = True 40 | jsoncontent['body'] = jsoncontent['title'] 41 | if found: 42 | with open(filename, "w", encoding="utf8") as outfile: 43 | print("FIXING...", filename) 44 | json.dump(jsoncontent, outfile, ensure_ascii=False) 45 | 46 | for filename in glob.iglob('corpuses/**/*.txt', recursive=True): 47 | fix_news_body(filename) # some news body is empty, fix it by replacing its title 48 | current_file_path = os.path.abspath(filename) 49 | current_directory = os.path.abspath(os.path.join(current_file_path, os.pardir)) 50 | current_directory_name = os.path.split(current_directory) 51 | category = current_directory_name[1] 52 | one_hot = convert_to_onehot(category) 53 | only_file_name = os.path.basename(filename) 54 | global_corpuses.append((only_file_name, one_hot, category)) 55 | 56 | random.shuffle(global_corpuses) 57 | random.shuffle(global_corpuses) 58 | random.shuffle(global_corpuses) 59 | 60 | split_location = math.floor(80*len(global_corpuses)/100) # 80% for training, 20% for testing 61 | training_set = global_corpuses[:split_location] 62 | test_set = global_corpuses[split_location:] 63 | dataset_info = { 64 | 'training' : training_set, 65 | 'testing' : test_set 66 | } 67 | 68 | temp_corpus_dir = 'temp_corpuses' 69 | if os.path.exists(temp_corpus_dir): 70 | shutil.rmtree(temp_corpus_dir) 71 | os.makedirs(temp_corpus_dir) 72 | 73 | with open("temp_corpuses/dataset.json", "w", encoding="utf8") as outfile: 74 | json.dump(dataset_info, outfile, ensure_ascii=False) 75 | 76 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.1.10 2 | asn1crypto==0.24.0 3 | astor==0.6.2 4 | astroid==1.6.1 5 | attrs==17.4.0 6 | Automat==0.6.0 7 | autopep8==1.3.4 8 | backcall==0.1.0 9 | beautifulsoup4==4.6.0 10 | bleach==3.3.0 11 | boto==2.48.0 12 | boto3==1.6.3 13 | botocore==1.9.3 14 | bz2file==0.98 15 | certifi==2018.1.18 16 | cffi==1.11.5 17 | chardet==3.0.4 18 | 
colorama==0.4.0 19 | constantly==15.1.0 20 | cryptography==3.3.2 21 | cssselect==1.0.3 22 | decorator==4.3.0 23 | defusedxml==0.5.0 24 | Django==2.2.26 25 | docutils==0.14 26 | entrypoints==0.2.3 27 | gast==0.2.0 28 | gensim==3.4.0 29 | grpcio==1.10.0 30 | h5py==2.8.0 31 | html5lib==0.9999999 32 | hyperlink==18.0.0 33 | idna==2.6 34 | incremental==17.5.0 35 | ipykernel==5.1.0 36 | ipython==7.16.3 37 | ipython-genutils==0.2.0 38 | ipywidgets==7.4.2 39 | isort==4.3.4 40 | jedi==0.13.1 41 | Jinja2==2.11.3 42 | jmespath==0.9.3 43 | jsonschema==2.6.0 44 | jupyter==1.0.0 45 | jupyter-client==5.2.3 46 | jupyter-console==6.0.0 47 | jupyter-core==4.4.0 48 | Keras-Applications==1.0.6 49 | Keras-Preprocessing==1.0.5 50 | lazy-object-proxy==1.3.1 51 | lxml==4.6.5 52 | Markdown==2.6.11 53 | MarkupSafe==1.1.0 54 | mccabe==0.6.1 55 | mistune==0.8.4 56 | nbconvert==5.4.0 57 | nbformat==4.4.0 58 | nltk==3.6.6 59 | notebook==6.4.1 60 | numpy==1.21.0 61 | pandocfilters==1.4.2 62 | parsel==1.4.0 63 | parso==0.3.1 64 | pickleshare==0.7.5 65 | prometheus-client==0.4.2 66 | prompt-toolkit==2.0.7 67 | protobuf==3.6.1 68 | pyasn1==0.4.2 69 | pyasn1-modules==0.2.1 70 | pycodestyle==2.3.1 71 | pycparser==2.18 72 | PyDispatcher==2.0.5 73 | Pygments==2.7.4 74 | pylint==1.8.2 75 | pyOpenSSL==17.5.0 76 | python-dateutil==2.6.1 77 | pytz==2018.3 78 | pywinpty==0.5.4 79 | pyzmq==17.1.2 80 | qtconsole==4.4.3 81 | queuelib==1.4.2 82 | requests==2.18.4 83 | s3transfer==0.1.13 84 | scipy==1.0.0 85 | Scrapy==1.8.1 86 | scrapy-splash==0.8.0 87 | Send2Trash==1.5.0 88 | service-identity==17.0.0 89 | six==1.11.0 90 | smart-open==1.5.6 91 | termcolor==1.1.0 92 | terminado==0.8.1 93 | testpath==0.4.2 94 | tornado==5.1.1 95 | traitlets==4.3.2 96 | Twisted==20.3.0 97 | typed-ast==1.1.0 98 | urllib3==1.26.5 99 | w3lib==1.19.0 100 | wcwidth==0.1.7 101 | Werkzeug==0.14.1 102 | widgetsnbextension==3.4.2 103 | wrapt==1.10.11 104 | zope.interface==4.4.3 105 | -------------------------------------------------------------------------------- /old_stuffs/research/ikon-research.txt: -------------------------------------------------------------------------------- 1 | Дараагийн хуудас 2 | $('.ikon-right-dir').trigger('click'); 3 | 4 | Хуудсан дахь item-үүдийн жагсаалт 5 | $('.nlitem') 6 | 7 | Item-ээс холбоос хаягийг нь авах 8 | $($('.nlitem')[0]).find('a')[0] 9 | 10 | XPath хэрэглэж item ийн холбоосыг авах 11 | item_link = response.selector.xpath("//*[contains(@class, 'nlitem')]//a/@href").extract()[0] 12 | 13 | sentiment утгуудыг авах 14 | let arr = []; 15 | $('.reaction .value').each((index, emoji) => { 16 | let s = $(emoji).text(); 17 | let score = isNaN(parseInt(s)) ? 0 : parseInt(s); 18 | arr.push(score) 19 | }); 20 | console.log(arr); 21 | 22 | Сургалтын өгөгдлийн статистик 23 | 24 | Категорийн тоо : 6 25 | Файлын тоо : 6124 ширхэг 26 | Нийтлэлийн дундаж урт : 418.6 үг 27 | ХИ үгтэй нийтлэл : 7418, http://ikon.mn/n/dh7 28 | Нийт татсан үгийн тоо : 2563792 29 | Нийт ялгаатай үгийн тоо : 102153, stemming хийсний дараа 68767 болж цөөрлөө. 
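(English summary of the stats above: 6 categories; 6,124 files; average article length 418.6 words; longest article 7,418 words, http://ikon.mn/n/dh7; 2,563,792 words scraped in total; 102,153 unique words, reduced to 68,767 after stemming.)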
30 | -------------------------------------------------------------------------------- /old_stuffs/research/nlp-research.txt: -------------------------------------------------------------------------------- 1 | some stopwords list 2 | https://github.com/Xangis/extra-stopwords/blob/master/mongolia 3 | 4 | stemmers and some stopwords for mongolian language 5 | https://github.com/erkhemee/chatbot-ub-hackathon -------------------------------------------------------------------------------- /old_stuffs/stemmer.py: -------------------------------------------------------------------------------- 1 | from nltk import stem 2 | 3 | # code from https://github.com/erkhemee/chatbot-ub-hackathon 4 | stemmer_rules_tuple = ("ы$|ий$|ыг$|ийг$|ны$|ний$|лаа$|лээ$|лоо$|лөө$|даа$|дээ$|доо$|дөө$|уулж$|үүлж$|гдах$|гдэх$|гдох$|гдөх$|лалт$|лэлт$|лолт$|лөлт$|дахад$|дэхэд$|доход$|дөхөд$|уудыг$|үүдийг$|нүүдийг$|нуудыг$|чихлаа$|чихлээ$|чихлоо$|чихлөө$|ид$|ын$|ийн$|уудын$|үүдийн$|тай$|той$|төй$|тэй$|лал$|лэл$|лол$|лөл$|аар$|ээр$|оор$|өөр$|лах$|лэх$|лох$|лөх$|ахад$|эхэд$|оход$|өхөд$|лалд$|лэлд$|лөлд$|лолд$|ууд$|үүд$|даг$|дэг$|дог$|дөг$|ддаг$|ддэг$|ддог$|ддөг$|даж$|дэж$|дож$|дөж$|сэн$|сан$|сөн$|сон$|гах$|гэх$|гох$|гөх$|рах$|рэх$|рох$|рөх$|иас$|иэс$|иос$|иөс$|нд$|нт$|ад$|эд$|од$|өд$|ааж$|ээж$|оож$|өөж$|лаар$|лээр$|лоор$|лөөр$|аг$|эг$|ог$|өг$|уг$|өөг$|ээг$|оог$|үүг$|ууг$|бар$|бэр$|бор$|бөр$|бур$|бүр$|раа$|рээ$|рөө$|рүү$|роо$|руу$|аас$|ээс$|оос$|өөс$|аарх$|ээрх$|оорх$|уурх$|үүрх$|өөрх$|аад$|ээд$|ууд$|үүд$|оод$|өөд$|нууд$|нүүд$|дсан$|дсэн$|дсөн$|дсон$|дах$|дэх$|дох$|дөх$|хлаа$|хлээ$|хлоо$|хлөө$|лахдаа$|лэхдээ$|лохдоо$|лөхдөө$|лдаг$|лдэг$|лж$|аагүй$|ээгүй$|оогүй$|өөгүй$|даггүй$|дэггүй$|доггүй$|дөггүй$|нуудийнхаа$|нүүдийнхээ$|гүй$") 5 | 6 | class Singleton: 7 | def __init__(self, decorated): 8 | self._decorated = decorated 9 | 10 | def instance(self): 11 | try: 12 | return self._instance 13 | except AttributeError: 14 | self._instance = self._decorated() 15 | return self._instance 16 | 17 | def __call__(self): 18 | raise TypeError('Singletons must be accessed through `instance()`.') 19 | 20 | def __instancecheck__(self, inst): 21 | return isinstance(inst, self._decorated) 22 | 23 | @Singleton 24 | class Stemmer: 25 | def __init__(self): 26 | self.stemmer = stem.RegexpStemmer(stemmer_rules_tuple, min=6) -------------------------------------------------------------------------------- /old_stuffs/training_bilstm_rnn.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | 13 | dataset = DataSetHelper() 14 | 15 | tf.reset_default_graph() 16 | 17 | input_placeholder = tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 18 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 19 | 20 | ids_matrix = np.load('ids_matrix.npy') 21 | embeddings_tf = tf.constant(ids_matrix) 22 | 23 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 24 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 25 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 26 | 27 | # composing bidirectional lstm 28 | batch_unstack = tf.unstack(batch_data, max_seq_length, 1) 29 | fw_lstm_cell = 
tf.nn.rnn_cell.LSTMCell(lstm_units) # forward lstm cell 30 | fw_lstm_cell = tf.nn.rnn_cell.DropoutWrapper(cell=fw_lstm_cell, output_keep_prob=0.75) 31 | bw_lstm_cell = tf.nn.rnn_cell.LSTMCell(lstm_units) # backward lstm cell 32 | bw_lstm_cell = tf.nn.rnn_cell.DropoutWrapper(cell=bw_lstm_cell, output_keep_prob=0.75) 33 | outputs, _, _ = tf.nn.static_bidirectional_rnn( 34 | fw_lstm_cell , 35 | bw_lstm_cell , 36 | batch_unstack, 37 | dtype=tf.float32 38 | ) 39 | 40 | weight = tf.Variable(tf.truncated_normal([2*lstm_units, num_classes])) 41 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 42 | prediction = tf.add(tf.matmul(outputs[-1], weight), bias, name="prediction_op") 43 | 44 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 45 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 46 | 47 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 48 | optimizer = tf.train.AdamOptimizer().minimize(loss) 49 | #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) 50 | 51 | print("starting at ", datetime.datetime.now()) 52 | 53 | loss_summary = tf.summary.scalar('Loss' , loss ) 54 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 55 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 56 | log_dir = "tensorboard/bilstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 57 | 58 | init = tf.global_variables_initializer() 59 | with tf.Session() as sess: 60 | writer = tf.summary.FileWriter(log_dir, sess.graph) 61 | saver = tf.train.Saver() 62 | sess.run(init) 63 | 64 | for i in range(iterations): 65 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 66 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 67 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 68 | if (i%50 == 0): 69 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 70 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 71 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 72 | print("___________________________________") 73 | print("Iteration : ", i ) 74 | print("Validation acc : ", acc) 75 | print("Loss : ", los) 76 | print("Test acc : ", tes) 77 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 78 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 79 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 80 | writer.add_summary(validation_accuracy_result, i) 81 | writer.add_summary(testing_accuracy_result , i) 82 | writer.add_summary(loss_result , i) 83 | if (i%1000 == 0 and i != 0): 84 | save_path = saver.save(sess, "models/bilstm/pretrained_bilstm.ckpt", global_step=i) 85 | print("model is saved to %s"%save_path) 86 | writer.close() 87 | 88 | print("ending at ", datetime.datetime.now()) -------------------------------------------------------------------------------- /old_stuffs/training_helpers.py: 
-------------------------------------------------------------------------------- 1 | from random import randint 2 | import random 3 | from clear_text_to_array import * 4 | from gensim.models import Word2Vec 5 | import glob, json, re 6 | import json 7 | import numpy as np 8 | from wordtoken_to_id import * 9 | from clear_text_to_array import * 10 | from itertools import chain 11 | 12 | class DataSetHelper(): 13 | def __init__(self,): 14 | self.word2vec = Word2Vec.load('model.bin') 15 | self.ids_matrix = np.load('ids_matrix.npy') 16 | self.unknown_word_id = wordtoken_to_id(self.word2vec, "анноунүг") 17 | with open("temp_corpuses/dataset.json", "r", encoding="utf8") as f: 18 | self.dataset_json = json.load(f) 19 | self.training_set = self.dataset_json['training'] 20 | self.testing_set = self.dataset_json['testing' ] 21 | pass 22 | 23 | def sentence_to_ids(self, sentence, max_seq_length): 24 | sentence_array = clear_text_to_array(sentence) 25 | sentence_array = list(chain(*sentence_array)) 26 | sentence_array = sentence_array[:max_seq_length] 27 | ids_of_sentence = np.zeros((max_seq_length), dtype='int32') 28 | for index, word in enumerate(sentence_array): 29 | try: 30 | ids_of_sentence[index] = wordtoken_to_id(self.word2vec, word) 31 | except KeyError: 32 | ids_of_sentence[index] = self.unknown_word_id # unknown word, АННОУНҮГ 33 | return ids_of_sentence 34 | 35 | def get_training_batch(self, batch_size, max_seq_length): 36 | batch_labels = [] 37 | batch_arr = np.zeros([batch_size, max_seq_length]) 38 | for i in range(batch_size): 39 | random_corpus = random.choice(self.training_set) 40 | file_name = random_corpus[0] 41 | one_hot = random_corpus[1] 42 | category = random_corpus[2] 43 | file_path = "corpuses/"+category+"/"+file_name 44 | with open(file_path, encoding="utf8") as f: 45 | sentence = json.load(f)['body'] 46 | #print("##########################") 47 | #print(file_path) 48 | #print(sentence) 49 | ids_of_sentence = self.sentence_to_ids(sentence, max_seq_length) 50 | batch_arr[i] = ids_of_sentence 51 | batch_labels.append(one_hot) 52 | return (batch_arr, batch_labels) 53 | 54 | def get_testing_batch(self, batch_size, max_seq_length): 55 | batch_labels = [] 56 | batch_arr = np.zeros([batch_size, max_seq_length]) 57 | for i in range(batch_size): 58 | random_corpus = random.choice(self.testing_set) 59 | file_name = random_corpus[0] 60 | one_hot = random_corpus[1] 61 | category = random_corpus[2] 62 | file_path = "corpuses/"+category+"/"+file_name 63 | with open(file_path, encoding="utf8") as f: 64 | sentence = json.load(f)['body'] 65 | ids_of_sentence = self.sentence_to_ids(sentence, max_seq_length) 66 | batch_arr[i] = ids_of_sentence 67 | batch_labels.append(one_hot) 68 | return (batch_arr, batch_labels) 69 | 70 | #batch_size, seq_length = 50, 500 71 | #dataset = DataSetHelper() 72 | #for i in range(10000): 73 | # inp, label = dataset.get_training_batch(batch_size, seq_length) 74 | #import pdb; pdb.set_trace() -------------------------------------------------------------------------------- /old_stuffs/training_lstm_rnn.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | 13 | dataset = DataSetHelper() 14 | 15 | tf.reset_default_graph() 16 | 17 | input_placeholder = 
tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 18 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 19 | 20 | ids_matrix = np.load('ids_matrix.npy') 21 | embeddings_tf = tf.constant(ids_matrix) 22 | 23 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 24 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 25 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 26 | 27 | lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units) 28 | lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.75) 29 | value, _ = tf.nn.dynamic_rnn(lstm_cell, batch_data, dtype=tf.float32) 30 | 31 | weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes])) 32 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 33 | value = tf.transpose(value, [1, 0, 2]) 34 | last = tf.gather(value, int(value.get_shape()[0]) - 1) 35 | prediction = tf.add(tf.matmul(last, weight), bias, name='prediction_op') 36 | 37 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 38 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 39 | 40 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 41 | optimizer = tf.train.AdamOptimizer().minimize(loss) 42 | #optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss) 43 | 44 | print("started at ", datetime.datetime.now()) 45 | 46 | loss_summary = tf.summary.scalar('Loss' , loss ) 47 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 48 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 49 | 50 | log_dir = "tensorboard/lstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 51 | 52 | init = tf.global_variables_initializer() 53 | with tf.Session() as sess: 54 | writer = tf.summary.FileWriter(log_dir, sess.graph) 55 | saver = tf.train.Saver() 56 | sess.run(init) 57 | 58 | for i in range(iterations): 59 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 60 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 61 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 62 | if (i%10 == 0): 63 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 64 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 65 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 66 | print("___________________________________") 67 | print("Iteration : ", i ) 68 | print("Validation : ", acc) 69 | print("Loss : ", los) 70 | print("Test acc : ", tes) 71 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 72 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 73 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 74 | writer.add_summary(validation_accuracy_result, i) 75 | writer.add_summary(testing_accuracy_result , i) 76 | 
writer.add_summary(loss_result , i) 77 | if (i%1000 == 0 and i != 0): 78 | save_path = saver.save(sess, "models/lstm/pretrained_lstm.ckpt", global_step=i) 79 | print("model is saved to %s"%save_path) 80 | writer.close() 81 | 82 | print("ended at ", datetime.datetime.now()) 83 | -------------------------------------------------------------------------------- /old_stuffs/training_stacked_lstm.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import tensorflow as tf 3 | import numpy as np 4 | from training_helpers import * 5 | 6 | batch_size = 24 7 | lstm_units = 128 8 | num_classes = 6 9 | max_seq_length = 500 10 | vector_length = 100 # word2vec dimensions 11 | iterations = 100000 # 100000 12 | stack_count = 5 13 | 14 | dataset = DataSetHelper() 15 | 16 | tf.reset_default_graph() 17 | 18 | def shape_detective(sess, tensor, explainer=""): 19 | print("-------------------------") 20 | print(explainer, sess.run(tf.shape(tensor))) 21 | 22 | input_placeholder = tf.placeholder(tf.int32 , [batch_size, max_seq_length], name='input_placeholder') 23 | label_placeholder = tf.placeholder(tf.float32, [batch_size, num_classes ]) 24 | 25 | ids_matrix = np.load('ids_matrix.npy') 26 | embeddings_tf = tf.constant(ids_matrix) 27 | batch_data = tf.Variable(tf.zeros([batch_size, max_seq_length, vector_length]), dtype=tf.float32) 28 | batch_data = tf.nn.embedding_lookup(embeddings_tf, input_placeholder) 29 | batch_data = tf.cast(batch_data, tf.float32) # https://github.com/tensorflow/tensorflow/issues/8281 30 | 31 | stacked_lstms = [] 32 | for i in range(stack_count): 33 | stacked_lstms.append(tf.contrib.rnn.BasicLSTMCell(lstm_units)) 34 | stacked_rnn = tf.contrib.rnn.MultiRNNCell(stacked_lstms) 35 | value_before_transpose, _ = tf.nn.dynamic_rnn(stacked_rnn, batch_data, dtype=tf.float32) 36 | 37 | weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes])) 38 | bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) 39 | value_after_transpose = tf.transpose(value_before_transpose, [1, 0, 2]) 40 | last = tf.gather(value_after_transpose, int(value_after_transpose.get_shape()[0]) - 1) 41 | prediction = tf.add(tf.matmul(last, weight), bias, name='prediction_op') 42 | 43 | correct_prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(label_placeholder, 1)) 44 | accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 45 | 46 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=label_placeholder)) 47 | optimizer = tf.train.AdamOptimizer().minimize(loss) 48 | 49 | print("started at ", datetime.datetime.now()) 50 | 51 | loss_summary = tf.summary.scalar('Loss' , loss ) 52 | validation_accuracy_summary = tf.summary.scalar('Batch Validation Accuracy', accuracy) 53 | testing_accuracy_summary = tf.summary.scalar('Testing Dataset Accuracy' , accuracy) 54 | 55 | log_dir = "tensorboard/stackedlstm/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/" 56 | 57 | init = tf.global_variables_initializer() 58 | with tf.Session() as sess: 59 | writer = tf.summary.FileWriter(log_dir, sess.graph) 60 | saver = tf.train.Saver() 61 | sess.run(init) 62 | print("detecting shape flow and changes...") 63 | 64 | for i in range(iterations): 65 | next_input_batch, next_label_batch = dataset.get_training_batch(batch_size, max_seq_length) 66 | test_input_batch, test_label_batch = dataset.get_testing_batch (batch_size, max_seq_length) 67 | sess.run(optimizer, feed_dict={input_placeholder: next_input_batch, label_placeholder: 
next_label_batch}) 68 | if (i%10 == 0): 69 | acc = sess.run(accuracy, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 70 | los = sess.run(loss , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 71 | tes = sess.run(accuracy, feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 72 | print("__________________________________") 73 | print("Iteration : ", i ) 74 | print("Validation : ", acc) 75 | print("Loss : ", los) 76 | print("Test acc : ", tes) 77 | validation_accuracy_result = sess.run(validation_accuracy_summary, feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 78 | testing_accuracy_result = sess.run(testing_accuracy_summary , feed_dict={input_placeholder: test_input_batch, label_placeholder: test_label_batch}) 79 | loss_result = sess.run(loss_summary , feed_dict={input_placeholder: next_input_batch, label_placeholder: next_label_batch}) 80 | writer.add_summary(validation_accuracy_result, i) 81 | writer.add_summary(testing_accuracy_result , i) 82 | writer.add_summary(loss_result , i) 83 | if (i%1000 == 0 and i != 0): 84 | save_path = saver.save(sess, "models/stackedlstm/pretrained_lstm.ckpt", global_step=i) 85 | print("model is saved to %s"%save_path) 86 | 87 | print("__________________________________") 88 | shape_detective(sess, input_placeholder , explainer="input_placeholder :") 89 | shape_detective(sess, label_placeholder , explainer="label_placeholder :") 90 | shape_detective(sess, embeddings_tf , explainer="embeddings :") 91 | shape_detective(sess, batch_data , explainer="batch_data before unstacking :") 92 | shape_detective(sess, weight , explainer="weight :") 93 | shape_detective(sess, bias , explainer="bias :") 94 | shape_detective(sess, value_before_transpose, explainer="value shape before transpose stacked 2 lstms :") 95 | shape_detective(sess, value_after_transpose , explainer="value shape after_transpose :") 96 | shape_detective(sess, last , explainer="shape after gather transposed value :") 97 | shape_detective(sess, prediction , explainer="dense connection, prediction shape :") 98 | 99 | 100 | writer.close() 101 | 102 | print("ended at ", datetime.datetime.now()) 103 | -------------------------------------------------------------------------------- /old_stuffs/use_freezed_model_rpc.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | from training_helpers import * 4 | from itertools import chain 5 | from clear_text_to_array import * 6 | from xmlrpc.server import SimpleXMLRPCServer # for django app 7 | import sys # handle interrupt 8 | 9 | def softmax(x): 10 | score_math_exp = np.exp(np.asarray(x)) 11 | return score_math_exp / score_math_exp.sum(0) 12 | 13 | frozen_graph = './models/bilstm/pretrained_bilstm-23000.pb' 14 | 15 | with tf.gfile.GFile(frozen_graph, "rb") as f: 16 | restored_graph_def = tf.GraphDef() 17 | restored_graph_def.ParseFromString(f.read()) 18 | 19 | with tf.Graph().as_default() as graph: 20 | input_placeholder, prediction = tf.import_graph_def( 21 | restored_graph_def, 22 | input_map = None, 23 | return_elements = ['input_placeholder', 'prediction_op'], 24 | name = '' 25 | ) 26 | 27 | input_placeholder = graph.get_tensor_by_name("input_placeholder:0") 28 | prediction_op = graph.get_tensor_by_name("prediction_op:0") 29 | 30 | dataset_helper = DataSetHelper() 31 | 32 | 33 | def get_class_name(x): 34 | switcher = { 35 | 0: 
lambda: "economy" , 36 | 1: lambda: "health" , 37 | 2: lambda: "politics" , 38 | 3: lambda: "society" , 39 | 4: lambda: "technology", 40 | 5: lambda: "world" 41 | } 42 | return switcher.get(x, lambda: "UNKNOWN")() 43 | 44 | sess = tf.Session(graph=graph) 45 | 46 | def predict_class(sess, filename): 47 | with open(filename, 'r') as content_file: 48 | content = content_file.read() 49 | 50 | max_seq_length = 500 51 | num_classes = 6 52 | 53 | word_ids = dataset_helper.sentence_to_ids(content, max_seq_length) 54 | x_batch = [] 55 | for i in range(24): 56 | x_batch.append(word_ids) 57 | x_batch = np.array(x_batch) 58 | results = [] 59 | 60 | results_tf = sess.run(prediction_op, feed_dict={input_placeholder: x_batch}) 61 | for i in results_tf: 62 | softmax_result = softmax(i) 63 | argmax = softmax_result.argmax(axis=0) 64 | name = get_class_name(argmax) 65 | results.append([softmax_result, argmax, name]) 66 | 67 | print("result: ", results[0][2]) 68 | print(results[0][0]) 69 | 70 | print('----------------------------') 71 | print("trying to predict world news") 72 | predict_class(sess, "./corpuses_test/world_news_gogo_mn.txt") 73 | 74 | print('----------------------------') 75 | print("trying to predict economy news") 76 | predict_class(sess, "./corpuses_test/economy_news_gogo_mn.txt") 77 | 78 | print('----------------------------') 79 | print("trying to predict technology news") 80 | predict_class(sess, "./corpuses_test/technology_news_gogo_mn.txt") 81 | 82 | print('----------------------------') 83 | print("trying to predict health news") 84 | predict_class(sess, "./corpuses_test/health_news_gogo_mn.txt") 85 | 86 | 87 | print('----------------------------') 88 | print("trying to predict political news") 89 | predict_class(sess, "./corpuses_test/politics_news_ikon_mn.txt") 90 | 91 | def predict_class_from_text(content): 92 | max_seq_length = 500 93 | 94 | word_ids = dataset_helper.sentence_to_ids(content, max_seq_length) 95 | x_batch = [] 96 | for i in range(24): 97 | x_batch.append(word_ids) 98 | x_batch = np.array(x_batch) 99 | results = [] 100 | 101 | results_tf = sess.run(prediction_op, feed_dict={input_placeholder: x_batch}) 102 | for i in results_tf: 103 | softmax_result = softmax(i) 104 | argmax = softmax_result.argmax(axis=0) 105 | name = get_class_name(argmax) 106 | results.append([softmax_result, argmax, name]) 107 | 108 | return str(results[0][2]) 109 | 110 | try: 111 | rpc_server = SimpleXMLRPCServer(("localhost", 50001)) 112 | print("----------------------------") 113 | print("classifier RPC server is listening on port 50001...") 114 | rpc_server.register_function(predict_class_from_text, "predict_class_from_text") 115 | rpc_server.serve_forever() 116 | except KeyboardInterrupt: 117 | sess.close() 118 | sys.exit() 119 | 120 | sess.close() 121 | -------------------------------------------------------------------------------- /old_stuffs/using pretrained word2vec for mongolian text classification.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Download pre trained mongolian word2vec from fasttext\n", 8 | "https://fasttext.cc/docs/en/crawl-vectors.html " 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "!wget --no-check-certificate http://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.bin.gz -P ./pretrained_word2vec" 18 | ] 19 | } 20 | ], 
21 | "metadata": { 22 | "kernelspec": { 23 | "display_name": "Python 3", 24 | "language": "python", 25 | "name": "python3" 26 | }, 27 | "language_info": { 28 | "codemirror_mode": { 29 | "name": "ipython", 30 | "version": 3 31 | }, 32 | "file_extension": ".py", 33 | "mimetype": "text/x-python", 34 | "name": "python", 35 | "nbconvert_exporter": "python", 36 | "pygments_lexer": "ipython3", 37 | "version": "3.6.7" 38 | } 39 | }, 40 | "nbformat": 4, 41 | "nbformat_minor": 2 42 | } 43 | -------------------------------------------------------------------------------- /old_stuffs/wordtoken_to_id.py: -------------------------------------------------------------------------------- 1 | 2 | def wordtoken_to_id(model, word): 3 | token_id = model.wv.vocab[word].index 4 | return token_id -------------------------------------------------------------------------------- /old_stuffs/wordvec_exp.py: -------------------------------------------------------------------------------- 1 | from gensim.models import Word2Vec 2 | 3 | sentences = [ 4 | ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'], 5 | ['this', 'is', 'the', 'second', 'sentence'], 6 | ['yet', 'another', 'sentence'], 7 | ['one', 'more', 'sentence'], 8 | ['and', 'the', 'final', 'sentence'] 9 | ] 10 | 11 | model = Word2Vec(sentences, min_count=1) 12 | print(model) 13 | words = list(model.wv.vocab) 14 | print(words) 15 | print(model['sentence']) 16 | model.save('model.bin') 17 | new_model = Word2Vec.load('model.bin') 18 | print(new_model) 19 | -------------------------------------------------------------------------------- /preprocess_dataset/preprocess_eduge.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc\n", 13 | "syswgetrc = C:\\Program Files (x86)\\GnuWin32/etc/wgetrc\n", 14 | "--2019-04-13 07:30:06-- https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz\n", 15 | "Resolving github.com... 13.250.177.223, 13.229.188.59, 52.74.223.119\n", 16 | "Connecting to github.com|13.250.177.223|:443... 
connected.\n", 17 | "OpenSSL: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version\n", 18 | "Unable to establish SSL connection.\n", 19 | "'gunzip' is not recognized as an internal or external command,\n", 20 | "operable program or batch file.\n" 21 | ] 22 | } 23 | ], 24 | "source": [ 25 | "import os\n", 26 | "if not os.path.exists(\"eduge.csv.gz\"):\n", 27 | " !wget https://github.com/tugstugi/mongolian-nlp/raw/master/datasets/eduge.csv.gz\n", 28 | " !gunzip eduge.csv.gz" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "data": { 38 | "text/plain": [ 39 | "['урлаг соёл',\n", 40 | " 'эдийн засаг',\n", 41 | " 'эрүүл мэнд',\n", 42 | " 'хууль',\n", 43 | " 'улс төр',\n", 44 | " 'спорт',\n", 45 | " 'технологи',\n", 46 | " 'боловсрол',\n", 47 | " 'байгал орчин']" 48 | ] 49 | }, 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "import pandas as pd\n", 57 | "df = pd.read_csv(\"eduge.csv\")\n", 58 | "df = df.rename(columns=lambda x: x.strip())\n", 59 | "labels = df['label'].unique().tolist()\n", 60 | "labels" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 2, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stderr", 70 | "output_type": "stream", 71 | "text": [ 72 | "[nltk_data] Downloading package punkt to C:\\Users\\sharavsambuu-\n", 73 | "[nltk_data] laptop\\AppData\\Roaming\\nltk_data...\n" 74 | ] 75 | }, 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "['Сайн байна уу?', 'Танд энэ өдрийн мэнд хүргье.', 'Монгол текст ангилах гэж байна.']\n", 81 | "['Монгол', 'улсын', 'их', 'хурал']\n" 82 | ] 83 | }, 84 | { 85 | "name": "stderr", 86 | "output_type": "stream", 87 | "text": [ 88 | "[nltk_data] Package punkt is already up-to-date!\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "import nltk\n", 94 | "nltk.download('punkt')\n", 95 | "print(nltk.sent_tokenize(\"Сайн байна уу? Танд энэ өдрийн мэнд хүргье. 
Монгол текст ангилах гэж байна.\"))\n", 96 | "print(nltk.word_tokenize(\"Монгол улсын их хурал\"))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 6, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "import string\n", 106 | "stopwordsmn = ['аа','аанхаа','алив','ба','байдаг','байжээ','байна','байсаар','байсан','байхаа','бас','бишүү','бол','болжээ','болно','болоо','бэ','вэ','гэж','гэжээ','гэлтгүй','гэсэн','гэтэл','за','л','мөн','нь','тэр','уу','харин','хэн','ч','энэ','ээ','юм','үү','?','', '.', ',', '-','ийн','ын','тай','г','ийг','д','н','ний','дээр','юу']\n", 107 | "eduge_preprocessed = []\n", 108 | "eduge_preprocessed_stopwords = []\n", 109 | "word_dict = {}\n", 110 | "for idx, row in df.iterrows():\n", 111 | " news = row['news']\n", 112 | " label = row['label']\n", 113 | " sentences = nltk.sent_tokenize(news)\n", 114 | " news_sentences = []\n", 115 | " news_sentences_stopwords = []\n", 116 | " for sentence in sentences:\n", 117 | " tokens = nltk.word_tokenize(sentence)\n", 118 | " tokens = [w.lower() for w in tokens]\n", 119 | " table = str.maketrans('', '', string.punctuation)\n", 120 | " stripped = [w.translate(table) for w in tokens]\n", 121 | " words = [word for word in stripped if word.isalpha()]\n", 122 | " words_stopwords = [w for w in words if not w in stopwordsmn]\n", 123 | " news_sentences.append(words)\n", 124 | " news_sentences_stopwords.append(words_stopwords)\n", 125 | " for w in words:\n", 126 | " word_dict[w] = 0\n", 127 | " eduge_preprocessed.append([news_sentences, label])\n", 128 | " eduge_preprocessed_stopwords.append([news_sentences_stopwords, label])" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "import pickle\n", 138 | "\n", 139 | "with open('eduge.pickle', 'wb') as handle:\n", 140 | " pickle.dump(eduge_preprocessed, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 141 | " print(\"saved to eduge.pickle\")" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 10, 147 | "metadata": {}, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "saved to eduge_stopwords_removed.pickle\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "with open('eduge_stopwords_removed.pickle', 'wb') as handle:\n", 159 | " pickle.dump(eduge_preprocessed_stopwords, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 160 | " print(\"saved to eduge_stopwords_removed.pickle\")" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 12, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "word_index = {}\n", 170 | "word_index[\"<PAD>\" ] = 0\n", 171 | "word_index[\"<START>\" ] = 1\n", 172 | "word_index[\"<UNK>\" ] = 2\n", 173 | "word_index[\"<UNUSED>\"] = 3\n", 174 | "cnt = 4\n", 175 | "for k, v in word_dict.items():\n", 176 | " word_index[k] = cnt\n", 177 | " cnt += 1\n", 178 | "#print(word_index)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 15, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "reversed_word_index = dict([(value, key) for (key, value) in word_index.items()])" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 16, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "saved to word_index.pickle\n", 200 | "saved to reversed_word_index.pickle\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "with 
open('word_index.pickle', 'wb') as handle:\n", 206 | " pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 207 | " print(\"saved to word_index.pickle\")\n", 208 | " \n", 209 | "with open('reversed_word_index.pickle', 'wb') as handle:\n", 210 | " pickle.dump(reversed_word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 211 | " print(\"saved to reversed_word_index.pickle\")" 212 | ] 213 | } 214 | ], 215 | "metadata": { 216 | "kernelspec": { 217 | "display_name": "Python 3", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.6.7" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 2 236 | } 237 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | jupyter 2 | nltk 3 | pandas --------------------------------------------------------------------------------