├── .gitignore
├── LICENSE
├── README.md
├── Task 1
    └── README.md
├── Task 2: Working with English data
    ├── NER_model.py
    ├── english_NER.ipynb
    ├── get_word_vectors.py
    └── process_data.py
├── Task 3: Hindi data
    ├── Convert SSF data to CoNLL form.ipynb
    ├── Hindi_NER.ipynb
    ├── NER_model.py
    ├── first_hindi_model
    ├── get_word_vectors.py
    ├── hindi_vectors.ipynb
    └── process_data.py
└── data
    ├── CoNLL-2003
        ├── eng.testa
        ├── eng.testa.openNLP
        ├── eng.testb
        ├── eng.testb.openNLP
        ├── eng.train
        └── eng.train.openNLP
    ├── readme.md
    └── training_hindi_NER.utf8


/.gitignore:
--------------------------------------------------------------------------------
1 | # My additions
2 | data/hin*
3 | data/training-hindi
4 |
5 | # Byte-compiled / optimized / DLL files
6 | __pycache__/
7 | *.py[cod]
8 | *$py.class
9 |
10 | # C extensions
11 | *.so
12 |
13 | # Distribution / packaging
14 | .Python
15 | env/
16 | build/
17 | develop-eggs/
18 | dist/
19 | downloads/
20 | eggs/
21 | .eggs/
22 | lib/
23 | lib64/
24 | parts/
25 | sdist/
26 | var/
27 | *.egg-info/
28 | .installed.cfg
29 | *.egg
30 |
31 | # PyInstaller
32 | # Usually these files are written by a python script from a template
33 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
34 | *.manifest
35 | *.spec
36 |
37 | # Installer logs
38 | pip-log.txt
39 | pip-delete-this-directory.txt
40 |
41 | # Unit test / coverage reports
42 | htmlcov/
43 | .tox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *,cover
50 | .hypothesis/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 |
60 | # Flask stuff:
61 | instance/
62 | .webassets-cache
63 |
64 | # Scrapy stuff:
65 | .scrapy
66 |
67 | # Sphinx documentation
68 | docs/_build/
69 |
70 | # PyBuilder
71 | target/
72 |
73 | # IPython Notebook
74 | .ipynb_checkpoints
75 |
76 | # pyenv
77 | .python-version
78 |
79 | # celery beat schedule file
80 | celerybeat-schedule
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | venv/
87 | ENV/
88 |
89 | # Spyder project settings
90 | .spyderproject
91 |
92 | # Rope project settings
93 | .ropeproject
94 |
95 |
96 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Divesh Pandey
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # NER-using-Deep-Learning
2 | A project on achieving Named-Entity Recognition using Deep Learning.
3 |
4 | As the [page on Wikipedia](https://en.wikipedia.org/wiki/Named-entity_recognition) says, **Named-entity recognition (NER)** (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
5 |
6 | Take a look at this example:
7 |
8 | > _Jim bought 300 shares of Acme Corp. in 2006._
9 |
10 | Applying an NER method, we should get:
11 |
12 | > _[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time._
13 |
14 | I am doing this project under the guidance of [Dr. A. K. Singh](http://www.iitbhu.ac.in/cse/index.php/people/faculty/37.html). I will be adding all relevant work I do regarding this project. Check out all the subfolders for my work.
15 |
16 |
17 |
18 |
--------------------------------------------------------------------------------
/Task 1/README.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | ## Tools Selection
3 | In the beginning, I spent some time searching for tools that I might use in this project.
4 |
5 | ### Deep Learning Framework
6 | I started by looking for a suitable deep learning framework for this project. Here are all the options I considered.
7 |
8 | * [**TensorFlow**](https://www.tensorflow.org/)
9 | * [**Theano**](http://deeplearning.net/software/theano/)
10 | * [**Keras**](https://keras.io/)
11 | * [**Caffe**](http://caffe.berkeleyvision.org/)
12 | * [**Torch**](http://torch.ch/)
13 | * [**Dl4j**](https://deeplearning4j.org/)
14 |
15 | I had used TensorFlow before, so it was my first choice. To make things a little easier, though, I chose to go with Keras (with the TensorFlow backend), since I found that it provides an easy implementation of deep neural nets. I installed Keras with TensorFlow as the backend, with GPU support.
16 |
--------------------------------------------------------------------------------
/Task 2: Working with English data/NER_model.py:
--------------------------------------------------------------------------------
1 | # Keras imports
2 | from keras.preprocessing import sequence
3 | from keras.models import Sequential
4 | from keras.layers import Dense
5 | from keras.layers import LSTM
6 | from keras.layers.wrappers import TimeDistributed
7 | from keras.layers.wrappers import Bidirectional
8 | from keras.layers.core import Dropout
9 | from keras.regularizers import l2
10 | from keras import metrics
11 |
12 | import numpy as np
13 | import pandas as pd
14 | from sklearn.metrics import confusion_matrix, classification_report
15 |
16 |
17 | class NER():
18 |     def __init__(self, data_reader):
19 |         self.data_reader = data_reader
20 |         self.x, self.y = data_reader.get_data()
21 |         self.model = None
22 |         self.x_train = None
23 |         self.y_train = None
24 |         self.x_test = None
25 |         self.y_test = None
26 |
27 |     def make_and_compile(self, units = 150, dropout = 0.2, regul_alpha = 0.0):
28 |         self.model = Sequential()
29 |         # Bidirectional LSTM with `units` memory units per direction (default 150)
30 |         self.model.add(Bidirectional(LSTM(units,
31 |                                           return_sequences=True,
32 |                                           W_regularizer=l2(regul_alpha),
33 |                                           b_regularizer=l2(regul_alpha)),
34 |                                      input_shape = [self.data_reader.max_len,
35 |
self.data_reader.LEN_WORD_VECTORS])) 36 | self.model.add(TimeDistributed(Dense(self.data_reader.LEN_NAMED_CLASSES, 37 | activation='softmax', 38 | W_regularizer=l2(regul_alpha), 39 | b_regularizer=l2(regul_alpha)))) 40 | self.model.add(Dropout(dropout)) 41 | self.model.compile(loss='categorical_crossentropy', 42 | optimizer='adam', 43 | metrics=['accuracy']) 44 | print self.model.summary() 45 | 46 | def train(self, train_split = 0.8, epochs = 10, batch_size = 50): 47 | split_mask = np.random.rand(len(self.x)) < (train_split) 48 | self.x_train = self.x[split_mask] 49 | self.y_train = self.y[split_mask] 50 | self.x_test = self.x[~split_mask] 51 | self.y_test = self.y[~split_mask] 52 | 53 | self.model.fit(self.x_train, self.y_train, nb_epoch=epochs, batch_size=batch_size) 54 | 55 | def evaluate(self): 56 | predicted_tags= [] 57 | test_data_tags = [] 58 | 59 | for x,y in zip(self.x_test, self.y_test): 60 | flag = 0 61 | tags = self.model.predict(np.array([x]), batch_size=1)[0] 62 | pred_tags = self.data_reader.decode_result(tags) 63 | test_tags = self.data_reader.decode_result(y) 64 | for i,j in zip(pred_tags, test_tags): 65 | if j != self.data_reader.NULL_CLASS: 66 | flag = 1 67 | if flag == 1: 68 | test_data_tags.append(j) 69 | predicted_tags.append(i) 70 | 71 | 72 | predicted_tags = np.array(predicted_tags) 73 | test_data_tags = np.array(test_data_tags) 74 | print classification_report(test_data_tags, predicted_tags) 75 | 76 | simple_conf_matrix = confusion_matrix(test_data_tags,predicted_tags) 77 | all_tags = sorted(list(set(test_data_tags))) 78 | conf_matrix = pd.DataFrame( 79 | columns = all_tags, 80 | index = all_tags) 81 | for x,y in zip(simple_conf_matrix, all_tags): 82 | conf_matrix[y] = x 83 | conf_matrix = conf_matrix.transpose() 84 | return conf_matrix 85 | 86 | 87 | def predict_tags(self, sentence): 88 | sentence_list = sentence.strip().split() 89 | sent_len = len(sentence_list) 90 | # Get padded word vectors 91 | x = 
self.data_reader.encode_sentence(sentence) 92 | tags = self.model.predict(x, batch_size=1)[0] 93 | 94 | tags = tags[-sent_len:] 95 | pred_tags = self.data_reader.decode_result(tags) 96 | 97 | for s,tag in zip(sentence_list,pred_tags): 98 | print s + "/" + tag 99 | -------------------------------------------------------------------------------- /Task 2: Working with English data/english_NER.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stderr", 12 | "output_type": "stream", 13 | "text": [ 14 | "Using TensorFlow backend.\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "from process_data import DataHandler" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 19, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "s = DataHandler(\"../data/CoNLL-2003/eng.testa\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/plain": [ 43 | "(3466, 109, 60)" 44 | ] 45 | }, 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "output_type": "execute_result" 49 | } 50 | ], 51 | "source": [ 52 | "s.get_data()[0].shape" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": { 59 | "collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from NER_model import NER" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": { 70 | "collapsed": false 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "m = NER(s)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 6, 80 | "metadata": { 81 | "collapsed": false, 82 | "scrolled": true 83 | }, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | 
"output_type": "stream", 88 | "text": [ 89 | "____________________________________________________________________________________________________\n", 90 | "Layer (type) Output Shape Param # Connected to \n", 91 | "====================================================================================================\n", 92 | "bidirectional_1 (Bidirectional) (None, 109, 300) 253200 bidirectional_input_1[0][0] \n", 93 | "____________________________________________________________________________________________________\n", 94 | "timedistributed_1 (TimeDistribut (None, 109, 5) 1505 bidirectional_1[0][0] \n", 95 | "____________________________________________________________________________________________________\n", 96 | "dropout_1 (Dropout) (None, 109, 5) 0 timedistributed_1[0][0] \n", 97 | "====================================================================================================\n", 98 | "Total params: 254,705\n", 99 | "Trainable params: 254,705\n", 100 | "Non-trainable params: 0\n", 101 | "____________________________________________________________________________________________________\n", 102 | "None\n" 103 | ] 104 | } 105 | ], 106 | "source": [ 107 | "m.make_and_compile()" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "collapsed": false 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "# m.train()\n", 119 | "m.train(epochs=10)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 12, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | " precision recall f1-score support\n", 134 | "\n", 135 | " LOC 0.91 0.71 0.80 471\n", 136 | " MISC 0.77 0.47 0.59 276\n", 137 | " O 0.00 0.00 0.00 0\n", 138 | " ORG 0.80 0.49 0.61 362\n", 139 | " PER 0.92 0.81 0.86 635\n", 140 | "\n", 141 | "avg / total 0.87 0.66 0.75 1744\n", 142 | "\n" 143 | ] 144 | }, 145 | { 146 | 
"data": { 147 | "text/html": [ 148 | "
\n", 149 | "\n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | "
      LOC  MISC    O  ORG  PER
LOC   334    23   81   26    7
MISC   15   130  117    9    5
O       0     0    0    0    0
ORG    17    13  124  177   31
PER     0     2  113    8  512
\n", 203 | "
" 204 | ], 205 | "text/plain": [ 206 | " LOC MISC O ORG PER\n", 207 | "LOC 334 23 81 26 7\n", 208 | "MISC 15 130 117 9 5\n", 209 | "O 0 0 0 0 0\n", 210 | "ORG 17 13 124 177 31\n", 211 | "PER 0 2 113 8 512" 212 | ] 213 | }, 214 | "execution_count": 12, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "m.evaluate()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 17, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "(1, 109, 60)\n", 235 | "The/O\n", 236 | "strongest/O\n", 237 | "rain/O\n", 238 | "ever/O\n", 239 | "recorded/O\n", 240 | "in/O\n", 241 | "India/LOC\n", 242 | "shut/O\n", 243 | "down/O\n", 244 | "the/O\n", 245 | "financial/O\n", 246 | "hub/O\n", 247 | "of/O\n", 248 | "Mumbai,/LOC\n", 249 | "snapped/O\n", 250 | "communication/O\n", 251 | "lines,/O\n", 252 | "closed/O\n", 253 | "airports/O\n", 254 | "and/O\n", 255 | "forced/O\n", 256 | "thousands/O\n", 257 | "of/O\n", 258 | "people/O\n", 259 | "to/O\n", 260 | "sleep/O\n", 261 | "in/O\n", 262 | "their/O\n", 263 | "offices/O\n", 264 | "or/O\n", 265 | "walk/O\n", 266 | "home/O\n", 267 | "during/O\n", 268 | "the/O\n", 269 | "night,/O\n", 270 | "officials/O\n", 271 | "said/O\n", 272 | "today./O\n" 273 | ] 274 | } 275 | ], 276 | "source": [ 277 | "# m.predict_tags(\"The strongest man on Earth is Mark Henry\")\n", 278 | "m.predict_tags(\"The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.\")" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "collapsed": false 286 | }, 287 | "outputs": [], 288 | "source": [ 289 | "m.model.save(\"./first_model\")" 290 | ] 291 | }, 292 | { 293 | "cell_type": 
"code", 294 | "execution_count": 8, 295 | "metadata": { 296 | "collapsed": false 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "from keras.models import load_model\n", 301 | "m.model = load_model(\"./first_model\")" 302 | ] 303 | } 304 | ], 305 | "metadata": { 306 | "anaconda-cloud": {}, 307 | "kernelspec": { 308 | "display_name": "Python [conda env:keras_tensorflow]", 309 | "language": "python", 310 | "name": "conda-env-keras_tensorflow-py" 311 | }, 312 | "language_info": { 313 | "codemirror_mode": { 314 | "name": "ipython", 315 | "version": 2 316 | }, 317 | "file_extension": ".py", 318 | "mimetype": "text/x-python", 319 | "name": "python", 320 | "nbconvert_exporter": "python", 321 | "pygments_lexer": "ipython2", 322 | "version": "2.7.13" 323 | } 324 | }, 325 | "nbformat": 4, 326 | "nbformat_minor": 1 327 | } 328 | -------------------------------------------------------------------------------- /Task 2: Working with English data/get_word_vectors.py: -------------------------------------------------------------------------------- 1 | # Impor Spacy and create Word Vector Model (GLOVE Model) 2 | import spacy 3 | # The next step takes some time to execute. 
4 | NLP = spacy.load("en")
5 |
6 | def get_sentence_vectors(sentence):
7 |     """
8 |     Returns the word vectors for a complete sentence as a Python list (one vector per whitespace-separated word)"""
9 |     s = sentence.strip().split()
10 |     vec = [ get_word_vector(word) for word in s ]
11 |     return vec
12 |
13 | def get_word_vector(word):
14 |     """
15 |     Returns the word vector for a single word (the spaCy `vector` attribute, a NumPy array)"""
16 |
17 |     s = NLP(unicode(word))
18 |     return s.vector
19 |
--------------------------------------------------------------------------------
/Task 2: Working with English data/process_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from keras.preprocessing import sequence
3 | # For getting English word vectors
4 | from get_word_vectors import get_word_vector, get_sentence_vectors
5 |
6 |
7 | class DataHandler():
8 |     """
9 |     Class for handling all data processing and preparing training/testing data"""
10 |
11 |     def __init__(self, datapath):
12 |         # Default values
13 |         self.LEN_NAMED_CLASSES = 5 # 4 names and 1 null class
14 |         self.NULL_CLASS = "O"
15 |         self.LEN_WORD_VECTORS = 60
16 |
17 |         self.tags = []
18 |         # Integer ids mapped to string tags, and string tags mapped to one-hot vectors
19 |         self.tag_id_map = {}
20 |         self.tag_to_one_hot_map = {}
21 |
22 |         # All data (to be filled by the read_data method)
23 |         self.x = []
24 |         self.y = []
25 |
26 |         self.read_data(datapath)
27 |
28 |     def read_data(self, datapath):
29 |         _id = 0
30 |         sentence = []
31 |         sentence_tags = []
32 |         all_data = []
33 |
34 |         with open(datapath, 'r') as f:
35 |             for l in f:
36 |                 line = l.strip().split()
37 |                 if line:
38 |                     word, named_tag = line[0], line[3]
39 |                     if named_tag != self.NULL_CLASS:
40 |                         named_tag = self.process_tag(named_tag)
41 |
42 |                     if named_tag not in self.tags:
43 |                         self.tags.append(named_tag)
44 |                         self.tag_id_map[_id] = named_tag
45 |                         one_hot_vec = np.zeros(self.LEN_NAMED_CLASSES, dtype = np.int32)
46 |                         one_hot_vec[_id] = 1
47 |                         self.tag_to_one_hot_map[named_tag] = one_hot_vec
48 |
49 |                         _id += 1
50 |
51 | # Get word vectors for given word 52 | sentence.append(get_word_vector(word)[:self.LEN_WORD_VECTORS]) 53 | sentence_tags.append(self.tag_to_one_hot_map[named_tag]) 54 | else: 55 | all_data.append( (sentence, sentence_tags) ); 56 | sentence_tags = [] 57 | sentence = [] 58 | 59 | #Find length of largest sentence 60 | self.max_len = 0 61 | for pair in all_data: 62 | if self.max_len < len(pair[0]): 63 | self.max_len = len(pair[0]) 64 | 65 | for vectors, one_hot_tags in all_data: 66 | # Pad the sequences and make them all of same length 67 | temp_X = np.zeros(self.LEN_WORD_VECTORS, dtype = np.int32) 68 | temp_Y = np.array(self.tag_to_one_hot_map[self.NULL_CLASS]) 69 | pad_length = self.max_len - len(vectors) 70 | 71 | #Insert into main data list 72 | self.x.append( ((pad_length)*[temp_X]) + vectors) 73 | self.y.append( ((pad_length)*[temp_Y]) + one_hot_tags) 74 | 75 | self.x = np.array(self.x) 76 | self.y = np.array(self.y) 77 | 78 | def process_tag(self, tag): 79 | # For simplicity, removing any initial I- or B- tag 80 | return tag[2:] 81 | 82 | def get_data(self): 83 | # Returns proper data for training/testing 84 | return (self.x, self.y) 85 | 86 | def encode_sentence(self, sentence): 87 | vectors = get_sentence_vectors(sentence) 88 | vectors = [v[:self.LEN_WORD_VECTORS] for v in vectors] 89 | return sequence.pad_sequences([vectors], maxlen=self.max_len, dtype=np.float32) 90 | 91 | def decode_result(self, result_sequence): 92 | pred_named_tags = [] 93 | for pred in result_sequence: 94 | _id = np.argmax(pred) 95 | pred_named_tags.append(self.tag_id_map[_id]) 96 | return pred_named_tags 97 | 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /Task 3: Hindi data/Convert SSF data to CoNLL form.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 5, 6 | "metadata": { 7 | "collapsed": true 8 | }, 
9 | "outputs": [], 10 | "source": [ 11 | "import glob\n", 12 | "import codecs" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 6, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "hindi_filenames = sorted(glob.glob(\"../data/training-hindi/*utf8\"))" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 9, 29 | "metadata": { 30 | "collapsed": true 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "output_file = codecs.open('../data/training_hindi_NER.utf8','w')" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 10, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [ 44 | { 45 | "name": "stderr", 46 | "output_type": "stream", 47 | "text": [ 48 | "/home/divesh_pandey/anaconda2/envs/keras_tensorflow/lib/python2.7/site-packages/ipykernel/__main__.py:12: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal\n" 49 | ] 50 | }, 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "10 शनैः - शनैः\n" 56 | ] 57 | } 58 | ], 59 | "source": [ 60 | "for file_name in hindi_filenames:\n", 61 | " file = codecs.open(file_name,'r')\n", 62 | " status = False\n", 63 | " flag = 0\n", 64 | " for line in file:\n", 65 | " if flag==0:\n", 66 | " if line == \"\\n\":\n", 67 | " flag=1\n", 68 | " continue\n", 69 | " if line[0]==u'<' and line[-2]==u'>':\n", 70 | " pass\n", 71 | " elif len(line)>2 and line[-2]==u')' and line[-3]==u')':\n", 72 | " pass\n", 73 | " elif line[0]==u'0':\n", 74 | " pass\n", 75 | " else:\n", 76 | " line = line.strip().split()\n", 77 | " if len(line) == 2:\n", 78 | " output_file.write(line[1])\n", 79 | " if status:\n", 80 | " output_file.write(\"\\t\" + entity)\n", 81 | " status = False\n", 82 | " else:\n", 83 | " output_file.write(\"\\tO\")\n", 84 | " output_file.write(\"\\n\")\n", 85 | " elif len(line) == 4:\n", 86 | " status = True\n", 87 | " try:\n", 
88 | " entity = line[-1].split(\"=\")[1][:-1]\n", 89 | " except:\n", 90 | " print \" \".join(line)\n", 91 | " else:\n", 92 | " output_file.write(\"\\n\")" 93 | ] 94 | } 95 | ], 96 | "metadata": { 97 | "anaconda-cloud": {}, 98 | "kernelspec": { 99 | "display_name": "Python [conda env:keras_tensorflow]", 100 | "language": "python", 101 | "name": "conda-env-keras_tensorflow-py" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 2 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython2", 113 | "version": "2.7.13" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 1 118 | } 119 | -------------------------------------------------------------------------------- /Task 3: Hindi data/Hindi_NER.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stderr", 12 | "output_type": "stream", 13 | "text": [ 14 | "Using TensorFlow backend.\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "from process_data import DataHandler" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": { 26 | "collapsed": false 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "s = DataHandler(\"../data/training_hindi_NER.utf8\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 3, 36 | "metadata": { 37 | "collapsed": false 38 | }, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/plain": [ 43 | "(2023, 191, 50)" 44 | ] 45 | }, 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "output_type": "execute_result" 49 | } 50 | ], 51 | "source": [ 52 | "s.get_data()[0].shape" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": { 59 | 
"collapsed": false 60 | }, 61 | "outputs": [], 62 | "source": [ 63 | "from NER_model import NER" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "outputs": [], 73 | "source": [ 74 | "m = NER(s)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 6, 80 | "metadata": { 81 | "collapsed": false 82 | }, 83 | "outputs": [ 84 | { 85 | "name": "stdout", 86 | "output_type": "stream", 87 | "text": [ 88 | "____________________________________________________________________________________________________\n", 89 | "Layer (type) Output Shape Param # Connected to \n", 90 | "====================================================================================================\n", 91 | "bidirectional_1 (Bidirectional) (None, 191, 300) 241200 bidirectional_input_1[0][0] \n", 92 | "____________________________________________________________________________________________________\n", 93 | "timedistributed_1 (TimeDistribut (None, 191, 12) 3612 bidirectional_1[0][0] \n", 94 | "____________________________________________________________________________________________________\n", 95 | "dropout_1 (Dropout) (None, 191, 12) 0 timedistributed_1[0][0] \n", 96 | "====================================================================================================\n", 97 | "Total params: 244,812\n", 98 | "Trainable params: 244,812\n", 99 | "Non-trainable params: 0\n", 100 | "____________________________________________________________________________________________________\n", 101 | "None\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "m.make_and_compile()" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 7, 112 | "metadata": { 113 | "collapsed": false 114 | }, 115 | "outputs": [ 116 | { 117 | "name": "stdout", 118 | "output_type": "stream", 119 | "text": [ 120 | "Epoch 1/10\n", 121 | "1613/1613 [==============================] - 29s - loss: 3.6243 - acc: 
0.7633 \n", 122 | "Epoch 2/10\n", 123 | "1613/1613 [==============================] - 20s - loss: 3.2839 - acc: 0.7908 \n", 124 | "Epoch 3/10\n", 125 | "1613/1613 [==============================] - 19s - loss: 3.2750 - acc: 0.7908 \n", 126 | "Epoch 4/10\n", 127 | "1613/1613 [==============================] - 19s - loss: 3.2753 - acc: 0.7906 \n", 128 | "Epoch 5/10\n", 129 | "1613/1613 [==============================] - 18s - loss: 3.2466 - acc: 0.7922 \n", 130 | "Epoch 6/10\n", 131 | "1613/1613 [==============================] - 18s - loss: 3.2592 - acc: 0.7913 \n", 132 | "Epoch 7/10\n", 133 | "1613/1613 [==============================] - 18s - loss: 3.2838 - acc: 0.7898 \n", 134 | "Epoch 8/10\n", 135 | "1613/1613 [==============================] - 19s - loss: 3.2770 - acc: 0.7901 \n", 136 | "Epoch 9/10\n", 137 | "1613/1613 [==============================] - 18s - loss: 3.2569 - acc: 0.7917 \n", 138 | "Epoch 10/10\n", 139 | "1613/1613 [==============================] - 18s - loss: 3.2486 - acc: 0.7922 \n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "# m.train()\n", 145 | "m.train(epochs=10)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 8, 151 | "metadata": { 152 | "collapsed": false 153 | }, 154 | "outputs": [ 155 | { 156 | "name": "stdout", 157 | "output_type": "stream", 158 | "text": [ 159 | " precision recall f1-score support\n", 160 | "\n", 161 | " NEA 0.00 0.00 0.00 7\n", 162 | " NED 0.00 0.00 0.00 48\n", 163 | " NEL 0.56 0.12 0.20 162\n", 164 | " NEM 0.00 0.00 0.00 17\n", 165 | " NEN 0.91 0.40 0.55 246\n", 166 | " NEO 0.00 0.00 0.00 31\n", 167 | " NEP 0.65 0.17 0.28 189\n", 168 | " NETE 0.00 0.00 0.00 160\n", 169 | " NETI 0.00 0.00 0.00 46\n", 170 | " NETO 0.00 0.00 0.00 41\n", 171 | " O 0.90 1.00 0.95 7105\n", 172 | "\n", 173 | "avg / total 0.85 0.90 0.86 8052\n", 174 | "\n" 175 | ] 176 | }, 177 | { 178 | "name": "stderr", 179 | "output_type": "stream", 180 | "text": [ 181 | 
"/home/divesh_pandey/anaconda2/envs/keras_tensorflow/lib/python2.7/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.\n", 182 | " 'precision', 'predicted', average, warn_for)\n" 183 | ] 184 | }, 185 | { 186 | "data": { 187 | "text/html": [ 188 | "
\n", 189 | "\n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | 
" \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | "
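      NEA  NED  NEL  NEM  NEN  NEO  NEP  NETE  NETI  NETO     O
NEA     0    0    0    0    0    0    0     0     0     0     7
NED     0    0    0    0    0    0    1     0     0     0    47
NEL     0    0   20    0    0    0    0     0     0     0   142
NEM     0    0    0    0    1    0    0     0     0     0    16
NEN     0    0    1    0   98    0    0     0     0     0   147
NEO     0    0    0    0    0    0    0     0     0     0    31
NEP     0    0    0    0    0    0   33     0     0     0   156
NETE    0    0    0    0    0    0    3     0     0     0   157
NETI    0    0    1    0    4    0    0     0     0     0    41
NETO    0    0    0    0    1    0    3     0     0     0    37
O       0    0   14    0    4    0   11     1     0     0  7075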
\n", 363 | "
" 364 | ], 365 | "text/plain": [ 366 | " NEA NED NEL NEM NEN NEO NEP NETE NETI NETO O\n", 367 | "NEA 0 0 0 0 0 0 0 0 0 0 7\n", 368 | "NED 0 0 0 0 0 0 1 0 0 0 47\n", 369 | "NEL 0 0 20 0 0 0 0 0 0 0 142\n", 370 | "NEM 0 0 0 0 1 0 0 0 0 0 16\n", 371 | "NEN 0 0 1 0 98 0 0 0 0 0 147\n", 372 | "NEO 0 0 0 0 0 0 0 0 0 0 31\n", 373 | "NEP 0 0 0 0 0 0 33 0 0 0 156\n", 374 | "NETE 0 0 0 0 0 0 3 0 0 0 157\n", 375 | "NETI 0 0 1 0 4 0 0 0 0 0 41\n", 376 | "NETO 0 0 0 0 1 0 3 0 0 0 37\n", 377 | "O 0 0 14 0 4 0 11 1 0 0 7075" 378 | ] 379 | }, 380 | "execution_count": 8, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "m.evaluate()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 9, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [ 396 | { 397 | "name": "stdout", 398 | "output_type": "stream", 399 | "text": [ 400 | "इन्होंने/O\n", 401 | "भारतीय/O\n", 402 | "आर्य/O\n", 403 | "भाषा/O\n", 404 | "तथा/O\n", 405 | "द्रविड़/O\n", 406 | "भाषाओं/O\n", 407 | "का/O\n", 408 | "व्याकरण/O\n", 409 | "नामक/O\n", 410 | "अन्य/O\n", 411 | "महत्वपूर्ण/O\n", 412 | "ग्रन्थ/O\n", 413 | "भी/O\n", 414 | "लिखे/O\n", 415 | "हैं/O\n", 416 | "।/O\n" 417 | ] 418 | } 419 | ], 420 | "source": [ 421 | "m.predict_tags(\"इन्होंने भारतीय आर्य भाषा तथा द्रविड़ भाषाओं का व्याकरण नामक अन्य महत्वपूर्ण ग्रन्थ भी लिखे हैं ।\")" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 10, 427 | "metadata": { 428 | "collapsed": false 429 | }, 430 | "outputs": [], 431 | "source": [ 432 | "m.model.save(\"./first_hindi_model\")" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 11, 438 | "metadata": { 439 | "collapsed": true 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "from keras.models import load_model\n", 444 | "m.model = load_model(\"./first_hindi_model\")" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 12, 450 | "metadata": { 451 | 
"collapsed": false 452 | }, 453 | "outputs": [], 454 | "source": [ 455 | "from get_word_vectors import get_word_vector\n", 456 | "s = get_word_vector(\"द्रविड़\")" 457 | ] 458 | } 459 | ], 460 | "metadata": { 461 | "anaconda-cloud": {}, 462 | "kernelspec": { 463 | "display_name": "Python [conda env:keras_tensorflow]", 464 | "language": "python", 465 | "name": "conda-env-keras_tensorflow-py" 466 | }, 467 | "language_info": { 468 | "codemirror_mode": { 469 | "name": "ipython", 470 | "version": 2 471 | }, 472 | "file_extension": ".py", 473 | "mimetype": "text/x-python", 474 | "name": "python", 475 | "nbconvert_exporter": "python", 476 | "pygments_lexer": "ipython2", 477 | "version": "2.7.13" 478 | } 479 | }, 480 | "nbformat": 4, 481 | "nbformat_minor": 1 482 | } 483 | -------------------------------------------------------------------------------- /Task 3: Hindi data/NER_model.py: -------------------------------------------------------------------------------- 1 | # Keras imports 2 | from keras.preprocessing import sequence 3 | from keras.models import Sequential 4 | from keras.layers import Dense 5 | from keras.layers import LSTM 6 | from keras.layers.wrappers import TimeDistributed 7 | from keras.layers.wrappers import Bidirectional 8 | from keras.layers.core import Dropout 9 | from keras.regularizers import l2 10 | from keras import metrics 11 | 12 | import numpy as np 13 | import pandas as pd 14 | from sklearn.metrics import confusion_matrix, classification_report 15 | 16 | 17 | class NER(): 18 | def __init__(self, data_reader): 19 | self.data_reader = data_reader 20 | self.x, self.y = data_reader.get_data(); 21 | self.model = None 22 | self.x_train = None 23 | self.y_train = None 24 | self.x_test = None 25 | self.y_test = None 26 | 27 | def make_and_compile(self, units = 150, dropout = 0.2, regul_alpha = 0.0): 28 | self.model = Sequential() 29 | # Bidirectional LSTM with 100 outputs/memory units 30 | self.model.add(Bidirectional(LSTM(units, 31 | 
return_sequences=True, 32 | W_regularizer=l2(regul_alpha), 33 | b_regularizer=l2(regul_alpha)), 34 | input_shape = [self.data_reader.max_len, 35 | self.data_reader.LEN_WORD_VECTORS])) 36 | self.model.add(TimeDistributed(Dense(self.data_reader.LEN_NAMED_CLASSES, 37 | activation='softmax', 38 | W_regularizer=l2(regul_alpha), 39 | b_regularizer=l2(regul_alpha)))) 40 | self.model.add(Dropout(dropout)) 41 | self.model.compile(loss='categorical_crossentropy', 42 | optimizer='adam', 43 | metrics=['accuracy']) 44 | self.model.summary() 45 | 46 | def train(self, train_split = 0.8, epochs = 10, batch_size = 50): 47 | split_mask = np.random.rand(len(self.x)) < (train_split) 48 | self.x_train = self.x[split_mask] 49 | self.y_train = self.y[split_mask] 50 | self.x_test = self.x[~split_mask] 51 | self.y_test = self.y[~split_mask] 52 | 53 | self.model.fit(self.x_train, self.y_train, nb_epoch=epochs, batch_size=batch_size) 54 | 55 | def evaluate(self): 56 | predicted_tags = [] 57 | test_data_tags = [] 58 | 59 | for x,y in zip(self.x_test, self.y_test): 60 | flag = 0 61 | tags = self.model.predict(np.array([x]), batch_size=1)[0] 62 | pred_tags = self.data_reader.decode_result(tags) 63 | test_tags = self.data_reader.decode_result(y) 64 | for i,j in zip(pred_tags, test_tags): 65 | if j != self.data_reader.NULL_CLASS: 66 | flag = 1 67 | if flag == 1: 68 | test_data_tags.append(j) 69 | predicted_tags.append(i) 70 | 71 | 72 | predicted_tags = np.array(predicted_tags) 73 | test_data_tags = np.array(test_data_tags) 74 | print classification_report(test_data_tags, predicted_tags) 75 | 76 | simple_conf_matrix = confusion_matrix(test_data_tags,predicted_tags) 77 | all_tags = sorted(list(set(test_data_tags))) 78 | 79 | conf_matrix = pd.DataFrame( 80 | columns = all_tags, 81 | index = all_tags) 82 | for x,y in zip(simple_conf_matrix, all_tags): 83 | conf_matrix[y] = x 84 | conf_matrix = conf_matrix.transpose() 85 | return conf_matrix 86 | 87 | 88 | def predict_tags(self, sentence): 89 |
sentence_list = sentence.strip().split() 90 | sent_len = len(sentence_list) 91 | # Get padded word vectors 92 | x = self.data_reader.encode_sentence(sentence) 93 | tags = self.model.predict(x, batch_size=1)[0] 94 | 95 | tags = tags[-sent_len:] 96 | pred_tags = self.data_reader.decode_result(tags) 97 | 98 | for s,tag in zip(sentence_list,pred_tags): 99 | print s + "/" + tag 100 | -------------------------------------------------------------------------------- /Task 3: Hindi data/first_hindi_model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pandeydivesh15/NER-using-Deep-Learning/56f9248e890b579eb1a93352fb7420fd2a137c83/Task 3: Hindi data/first_hindi_model -------------------------------------------------------------------------------- /Task 3: Hindi data/get_word_vectors.py: -------------------------------------------------------------------------------- 1 | import gensim.models.word2vec as w2v 2 | import numpy as np 3 | import os 4 | 5 | trained_model = w2v.Word2Vec.load(os.path.join("../data/", "hindi_word2Vec_small.w2v")) 6 | 7 | def get_sentence_vectors(sentence): 8 | """ 9 | Returns the word vectors for a complete sentence as a Python list, one vector per word""" 10 | s = sentence.strip().split() 11 | vec = [ get_word_vector(word) for word in s ] 12 | return vec 13 | 14 | def get_word_vector(word): 15 | """ 16 | Returns the word vector for a single word as a NumPy array (all zeros if the word is not in the vocabulary)""" 17 | s = word.decode("utf-8") 18 | try: 19 | vect = trained_model.wv[s] 20 | except KeyError: # word is not in the trained vocabulary 21 | vect = np.zeros(50, dtype = np.float32) 22 | return vect 23 | 24 | -------------------------------------------------------------------------------- /Task 3: Hindi data/hindi_vectors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from __future__ import
absolute_import, division, print_function" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import codecs\n", 23 | "import glob\n", 24 | "import multiprocessing\n", 25 | "import os\n", 26 | "import pprint\n", 27 | "import re" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 3, 33 | "metadata": { 34 | "collapsed": false 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "import nltk\n", 39 | "import gensim.models.word2vec as w2v\n", 40 | "import sklearn.manifold\n", 41 | "import numpy as np\n", 42 | "import matplotlib.pyplot as plt\n", 43 | "import pandas as pd\n", 44 | "import seaborn as sns" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": { 51 | "collapsed": false 52 | }, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "Populating the interactive namespace from numpy and matplotlib\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "%pylab inline" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": { 70 | "collapsed": false 71 | }, 72 | "outputs": [ 73 | { 74 | "name": "stdout", 75 | "output_type": "stream", 76 | "text": [ 77 | "[nltk_data] Downloading package punkt to\n", 78 | "[nltk_data] /home/divesh_pandey/nltk_data...\n", 79 | "[nltk_data] Package punkt is already up-to-date!\n" 80 | ] 81 | }, 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "True" 86 | ] 87 | }, 88 | "execution_count": 5, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "nltk.download(\"punkt\")" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## Getting and cleaning data" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 6, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [], 111 | 
"source": [ 112 | "hindi_filenames = sorted(glob.glob(\"../data/hin_corp_unicode/*txt\"))\n", 113 | "#hindi_filenames" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 7, 119 | "metadata": { 120 | "collapsed": false 121 | }, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "Reading '../data/hin_corp_unicode/1000_utf.txt'...\n", 128 | "Corpus is now 15764 characters long\n", 129 | "\n", 130 | "Reading '../data/hin_corp_unicode/1001_utf.txt'...\n", 131 | "Corpus is now 33663 characters long\n", 132 | "\n", 133 | "Reading '../data/hin_corp_unicode/1002_utf.txt'...\n", 134 | "Corpus is now 48153 characters long\n", 135 | "\n", 136 | "Reading '../data/hin_corp_unicode/1003_utf.txt'...\n", 137 | "Corpus is now 63362 characters long\n", 138 | "\n", 139 | "Reading '../data/hin_corp_unicode/1004_utf.txt'...\n", 140 | "Corpus is now 77899 characters long\n", 141 | "\n", 142 | "Reading '../data/hin_corp_unicode/1005_utf.txt'...\n", 143 | "Corpus is now 95324 characters long\n", 144 | "\n", 145 | "Reading '../data/hin_corp_unicode/1006_utf.txt'...\n", 146 | "Corpus is now 106578 characters long\n", 147 | "\n", 148 | "Reading '../data/hin_corp_unicode/1007_utf.txt'...\n", 149 | "Corpus is now 118142 characters long\n", 150 | "\n", 151 | "Reading '../data/hin_corp_unicode/1008_utf.txt'...\n", 152 | "Corpus is now 132864 characters long\n", 153 | "\n", 154 | "Reading '../data/hin_corp_unicode/1009_utf.txt'...\n", 155 | "Corpus is now 144821 characters long\n", 156 | "\n", 157 | "Reading '../data/hin_corp_unicode/100_utf.txt'...\n", 158 | "Corpus is now 154593 characters long\n", 159 | "\n", 160 | "Reading '../data/hin_corp_unicode/1010_utf.txt'...\n", 161 | "Corpus is now 170781 characters long\n", 162 | "\n", 163 | "Reading '../data/hin_corp_unicode/1011_utf.txt'...\n", 164 | "Corpus is now 185033 characters long\n", 165 | "\n", 166 | "Reading '../data/hin_corp_unicode/1012_utf.txt'...\n", 167 | 
"Corpus is now 201858 characters long\n", 168 | "\n", 169 | "Reading '../data/hin_corp_unicode/1013_utf.txt'...\n", 170 | "Corpus is now 212884 characters long\n", 171 | "\n", 172 | "Reading '../data/hin_corp_unicode/1014_utf.txt'...\n", 173 | "Corpus is now 226149 characters long\n", 174 | "\n", 175 | "Reading '../data/hin_corp_unicode/1015_utf.txt'...\n", 176 | "Corpus is now 237890 characters long\n", 177 | "\n", 178 | "Reading '../data/hin_corp_unicode/1016_utf.txt'...\n", 179 | "Corpus is now 253906 characters long\n", 180 | "\n", 181 | "Reading '../data/hin_corp_unicode/1017_utf.txt'...\n", 182 | "Corpus is now 266044 characters long\n", 183 | "\n", 184 | "Reading '../data/hin_corp_unicode/1018_utf.txt'...\n", 185 | "Corpus is now 289256 characters long\n", 186 | "\n", 187 | "Reading '../data/hin_corp_unicode/1019_utf.txt'...\n", 188 | "Corpus is now 307677 characters long\n", 189 | "\n", 190 | "Reading '../data/hin_corp_unicode/101_utf.txt'...\n", 191 | "Corpus is now 319262 characters long\n", 192 | "\n", 193 | "Reading '../data/hin_corp_unicode/1020_utf.txt'...\n", 194 | "Corpus is now 333035 characters long\n", 195 | "\n", 196 | "Reading '../data/hin_corp_unicode/1021_utf.txt'...\n", 197 | "Corpus is now 345398 characters long\n", 198 | "\n", 199 | "Reading '../data/hin_corp_unicode/1022_utf.txt'...\n", 200 | "Corpus is now 364372 characters long\n", 201 | "\n", 202 | "Reading '../data/hin_corp_unicode/1023_utf.txt'...\n", 203 | "Corpus is now 374942 characters long\n", 204 | "\n", 205 | "Reading '../data/hin_corp_unicode/1024_utf.txt'...\n", 206 | "Corpus is now 389504 characters long\n", 207 | "\n", 208 | "Reading '../data/hin_corp_unicode/1025_utf.txt'...\n", 209 | "Corpus is now 402124 characters long\n", 210 | "\n", 211 | "Reading '../data/hin_corp_unicode/1026_utf.txt'...\n", 212 | "Corpus is now 415294 characters long\n", 213 | "\n", 214 | "Reading '../data/hin_corp_unicode/1027_utf.txt'...\n", 215 | "Corpus is now 429118 characters long\n", 216 | 
"\n", 217 | "Reading '../data/hin_corp_unicode/1028_utf.txt'...\n", 218 | "Corpus is now 441083 characters long\n", 219 | "\n", 220 | "Reading '../data/hin_corp_unicode/1029_utf.txt'...\n", 221 | "Corpus is now 452915 characters long\n", 222 | "\n", 223 | "Reading '../data/hin_corp_unicode/102_utf.txt'...\n", 224 | "Corpus is now 463474 characters long\n", 225 | "\n", 226 | "Reading '../data/hin_corp_unicode/1030_utf.txt'...\n", 227 | "Corpus is now 476956 characters long\n", 228 | "\n", 229 | "Reading '../data/hin_corp_unicode/1031_utf.txt'...\n", 230 | "Corpus is now 490632 characters long\n", 231 | "\n", 232 | "Reading '../data/hin_corp_unicode/1032_utf.txt'...\n", 233 | "Corpus is now 502763 characters long\n", 234 | "\n", 235 | "Reading '../data/hin_corp_unicode/1033_utf.txt'...\n", 236 | "Corpus is now 516883 characters long\n", 237 | "\n", 238 | "Reading '../data/hin_corp_unicode/1034_utf.txt'...\n", 239 | "Corpus is now 529055 characters long\n", 240 | "\n", 241 | "Reading '../data/hin_corp_unicode/1035_utf.txt'...\n", 242 | "Corpus is now 542161 characters long\n", 243 | "\n", 244 | "Reading '../data/hin_corp_unicode/1036_utf.txt'...\n", 245 | "Corpus is now 558328 characters long\n", 246 | "\n", 247 | "Reading '../data/hin_corp_unicode/1037_utf.txt'...\n", 248 | "Corpus is now 577599 characters long\n", 249 | "\n", 250 | "Reading '../data/hin_corp_unicode/1038_utf.txt'...\n", 251 | "Corpus is now 592185 characters long\n", 252 | "\n", 253 | "Reading '../data/hin_corp_unicode/1039_utf.txt'...\n", 254 | "Corpus is now 603482 characters long\n", 255 | "\n", 256 | "Reading '../data/hin_corp_unicode/103_utf.txt'...\n", 257 | "Corpus is now 615778 characters long\n", 258 | "\n", 259 | "Reading '../data/hin_corp_unicode/1040_utf.txt'...\n", 260 | "Corpus is now 627410 characters long\n", 261 | "\n", 262 | "Reading '../data/hin_corp_unicode/1041_utf.txt'...\n", 263 | "Corpus is now 639794 characters long\n", 264 | "\n", 265 | "Reading 
'../data/hin_corp_unicode/1042_utf.txt'...\n", 266 | "Corpus is now 655508 characters long\n", 267 | "\n", 268 | "Reading '../data/hin_corp_unicode/1043_utf.txt'...\n", 269 | "Corpus is now 671222 characters long\n", 270 | "\n", 271 | "Reading '../data/hin_corp_unicode/1044_utf.txt'...\n", 272 | "Corpus is now 680037 characters long\n", 273 | "\n", 274 | "Reading '../data/hin_corp_unicode/1045_utf.txt'...\n", 275 | "Corpus is now 689238 characters long\n", 276 | "\n", 277 | "Reading '../data/hin_corp_unicode/1046_utf.txt'...\n", 278 | "Corpus is now 703815 characters long\n", 279 | "\n", 280 | "Reading '../data/hin_corp_unicode/1047_utf.txt'...\n", 281 | "Corpus is now 714513 characters long\n", 282 | "\n", 283 | "Reading '../data/hin_corp_unicode/1048_utf.txt'...\n", 284 | "Corpus is now 725701 characters long\n", 285 | "\n", 286 | "Reading '../data/hin_corp_unicode/1049_utf.txt'...\n", 287 | "Corpus is now 735080 characters long\n", 288 | "\n", 289 | "Reading '../data/hin_corp_unicode/104_utf.txt'...\n", 290 | "Corpus is now 744998 characters long\n", 291 | "\n", 292 | "Reading '../data/hin_corp_unicode/1050_utf.txt'...\n", 293 | "Corpus is now 755339 characters long\n", 294 | "\n", 295 | "Reading '../data/hin_corp_unicode/1051_utf.txt'...\n", 296 | "Corpus is now 767986 characters long\n", 297 | "\n", 298 | "Reading '../data/hin_corp_unicode/1052_utf.txt'...\n", 299 | "Corpus is now 779144 characters long\n", 300 | "\n", 301 | "Reading '../data/hin_corp_unicode/1053_utf.txt'...\n", 302 | "Corpus is now 789303 characters long\n", 303 | "\n", 304 | "Reading '../data/hin_corp_unicode/1054_utf.txt'...\n", 305 | "Corpus is now 804172 characters long\n", 306 | "\n", 307 | "Reading '../data/hin_corp_unicode/1055_utf.txt'...\n", 308 | "Corpus is now 817688 characters long\n", 309 | "\n", 310 | "Reading '../data/hin_corp_unicode/1056_utf.txt'...\n", 311 | "Corpus is now 833523 characters long\n", 312 | "\n", 313 | "Reading '../data/hin_corp_unicode/1057_utf.txt'...\n", 
314 | "Corpus is now 846478 characters long\n", 315 | "\n", 316 | "Reading '../data/hin_corp_unicode/1058_utf.txt'...\n", 317 | "Corpus is now 859991 characters long\n", 318 | "\n", 319 | "Reading '../data/hin_corp_unicode/1059_utf.txt'...\n", 320 | "Corpus is now 859991 characters long\n", 321 | "\n", 322 | "Reading '../data/hin_corp_unicode/105_utf.txt'...\n", 323 | "Corpus is now 870258 characters long\n", 324 | "\n", 325 | "Reading '../data/hin_corp_unicode/1060_utf.txt'...\n", 326 | "Corpus is now 882560 characters long\n", 327 | "\n", 328 | "Reading '../data/hin_corp_unicode/1061_utf.txt'...\n", 329 | "Corpus is now 889832 characters long\n", 330 | "\n", 331 | "Reading '../data/hin_corp_unicode/1062_utf.txt'...\n", 332 | "Corpus is now 906535 characters long\n", 333 | "\n", 334 | "Reading '../data/hin_corp_unicode/1063_utf.txt'...\n", 335 | "Corpus is now 919811 characters long\n", 336 | "\n", 337 | "Reading '../data/hin_corp_unicode/1064_utf.txt'...\n", 338 | "Corpus is now 933032 characters long\n", 339 | "\n", 340 | "Reading '../data/hin_corp_unicode/1065_utf.txt'...\n", 341 | "Corpus is now 947554 characters long\n", 342 | "\n", 343 | "Reading '../data/hin_corp_unicode/1066_utf.txt'...\n", 344 | "Corpus is now 963606 characters long\n", 345 | "\n", 346 | "Reading '../data/hin_corp_unicode/1067_utf.txt'...\n", 347 | "Corpus is now 975799 characters long\n", 348 | "\n", 349 | "Reading '../data/hin_corp_unicode/1068_utf.txt'...\n", 350 | "Corpus is now 987608 characters long\n", 351 | "\n", 352 | "Reading '../data/hin_corp_unicode/1069_utf.txt'...\n", 353 | "Corpus is now 1000161 characters long\n", 354 | "\n", 355 | "Reading '../data/hin_corp_unicode/106_utf.txt'...\n", 356 | "Corpus is now 1005735 characters long\n", 357 | "\n", 358 | "Reading '../data/hin_corp_unicode/1070_utf.txt'...\n", 359 | "Corpus is now 1017269 characters long\n", 360 | "\n", 361 | "Reading '../data/hin_corp_unicode/1071_utf.txt'...\n", 362 | "Corpus is now 1028839 characters 
long\n", 363 | "\n", 364 | "Reading '../data/hin_corp_unicode/1072_utf.txt'...\n", 365 | "Corpus is now 1045725 characters long\n", 366 | "\n", 367 | "Reading '../data/hin_corp_unicode/1073_utf.txt'...\n", 368 | "Corpus is now 1057589 characters long\n", 369 | "\n", 370 | "Reading '../data/hin_corp_unicode/1074_utf.txt'...\n", 371 | "Corpus is now 1071326 characters long\n", 372 | "\n", 373 | "Reading '../data/hin_corp_unicode/1075_utf.txt'...\n", 374 | "Corpus is now 1084484 characters long\n", 375 | "\n", 376 | "Reading '../data/hin_corp_unicode/1076_utf.txt'...\n", 377 | "Corpus is now 1097388 characters long\n", 378 | "\n", 379 | "Reading '../data/hin_corp_unicode/1077_utf.txt'...\n", 380 | "Corpus is now 1111360 characters long\n", 381 | "\n", 382 | "Reading '../data/hin_corp_unicode/1078_utf.txt'...\n", 383 | "Corpus is now 1245315 characters long\n", 384 | "\n", 385 | "Reading '../data/hin_corp_unicode/1079_utf.txt'...\n", 386 | "Corpus is now 1254499 characters long\n", 387 | "\n", 388 | "Reading '../data/hin_corp_unicode/107_utf.txt'...\n", 389 | "Corpus is now 1262657 characters long\n", 390 | "\n", 391 | "Reading '../data/hin_corp_unicode/1080_utf.txt'...\n", 392 | "Corpus is now 1272779 characters long\n", 393 | "\n", 394 | "Reading '../data/hin_corp_unicode/1081_utf.txt'...\n", 395 | "Corpus is now 1284285 characters long\n", 396 | "\n", 397 | "Reading '../data/hin_corp_unicode/1082_utf.txt'...\n", 398 | "Corpus is now 1294212 characters long\n", 399 | "\n", 400 | "Reading '../data/hin_corp_unicode/1083_utf.txt'...\n", 401 | "Corpus is now 1299936 characters long\n", 402 | "\n", 403 | "Reading '../data/hin_corp_unicode/1084_utf.txt'...\n", 404 | "Corpus is now 1322878 characters long\n", 405 | "\n", 406 | "Reading '../data/hin_corp_unicode/1085_utf.txt'...\n", 407 | "Corpus is now 1339027 characters long\n", 408 | "\n", 409 | "Reading '../data/hin_corp_unicode/1086_utf.txt'...\n", 410 | "Corpus is now 1358000 characters long\n", 411 | "\n", 412 | 
"Reading '../data/hin_corp_unicode/1087_utf.txt'...\n", 413 | "Corpus is now 1371126 characters long\n", 414 | "\n", 415 | "Reading '../data/hin_corp_unicode/1088_utf.txt'...\n", 416 | "Corpus is now 1378603 characters long\n", 417 | "\n", 418 | "Reading '../data/hin_corp_unicode/1089_utf.txt'...\n", 419 | "Corpus is now 1386042 characters long\n", 420 | "\n", 421 | "Reading '../data/hin_corp_unicode/108_utf.txt'...\n", 422 | "Corpus is now 1397305 characters long\n", 423 | "\n", 424 | "Reading '../data/hin_corp_unicode/1090_utf.txt'...\n", 425 | "Corpus is now 1409499 characters long\n", 426 | "\n", 427 | "Reading '../data/hin_corp_unicode/1091_utf.txt'...\n", 428 | "Corpus is now 1422868 characters long\n", 429 | "\n", 430 | "Reading '../data/hin_corp_unicode/1092_utf.txt'...\n", 431 | "Corpus is now 1434318 characters long\n", 432 | "\n", 433 | "Reading '../data/hin_corp_unicode/1093_utf.txt'...\n", 434 | "Corpus is now 1446496 characters long\n", 435 | "\n", 436 | "Reading '../data/hin_corp_unicode/1094_utf.txt'...\n", 437 | "Corpus is now 1458628 characters long\n", 438 | "\n", 439 | "Reading '../data/hin_corp_unicode/1095_utf.txt'...\n", 440 | "Corpus is now 1470424 characters long\n", 441 | "\n", 442 | "Reading '../data/hin_corp_unicode/1096_utf.txt'...\n", 443 | "Corpus is now 1479903 characters long\n", 444 | "\n", 445 | "Reading '../data/hin_corp_unicode/1097_utf.txt'...\n", 446 | "Corpus is now 1489477 characters long\n", 447 | "\n", 448 | "Reading '../data/hin_corp_unicode/1098_utf.txt'...\n", 449 | "Corpus is now 1498965 characters long\n", 450 | "\n", 451 | "Reading '../data/hin_corp_unicode/1099_utf.txt'...\n", 452 | "Corpus is now 1508966 characters long\n", 453 | "\n", 454 | "Reading '../data/hin_corp_unicode/109_utf.txt'...\n", 455 | "Corpus is now 1517681 characters long\n", 456 | "\n", 457 | "Reading '../data/hin_corp_unicode/10_utf8.txt'...\n", 458 | "Corpus is now 1517681 characters long\n", 459 | "\n", 460 | "Reading 
'../data/hin_corp_unicode/1100_utf.txt'...\n", 461 | "Corpus is now 1526753 characters long\n", 462 | "\n", 463 | "Reading '../data/hin_corp_unicode/1101_utf.txt'...\n", 464 | "Corpus is now 1535995 characters long\n", 465 | "\n", 466 | "Reading '../data/hin_corp_unicode/1102_utf.txt'...\n", 467 | "Corpus is now 1548699 characters long\n", 468 | "\n", 469 | "Reading '../data/hin_corp_unicode/1103_utf.txt'...\n", 470 | "Corpus is now 1559247 characters long\n", 471 | "\n", 472 | "Reading '../data/hin_corp_unicode/1104_utf.txt'...\n", 473 | "Corpus is now 1571989 characters long\n", 474 | "\n", 475 | "Reading '../data/hin_corp_unicode/1105_utf.txt'...\n", 476 | "Corpus is now 1581501 characters long\n", 477 | "\n", 478 | "Reading '../data/hin_corp_unicode/1106_utf.txt'...\n", 479 | "Corpus is now 1592379 characters long\n", 480 | "\n", 481 | "Reading '../data/hin_corp_unicode/1107_utf.txt'...\n", 482 | "Corpus is now 1602546 characters long\n", 483 | "\n", 484 | "Reading '../data/hin_corp_unicode/1108_utf.txt'...\n", 485 | "Corpus is now 1614202 characters long\n", 486 | "\n", 487 | "Reading '../data/hin_corp_unicode/1109_utf.txt'...\n", 488 | "Corpus is now 1627371 characters long\n", 489 | "\n", 490 | "Reading '../data/hin_corp_unicode/110_utf.txt'...\n", 491 | "Corpus is now 1637978 characters long\n", 492 | "\n", 493 | "Reading '../data/hin_corp_unicode/1110_utf.txt'...\n", 494 | "Corpus is now 1650751 characters long\n", 495 | "\n", 496 | "Reading '../data/hin_corp_unicode/1111_utf.txt'...\n", 497 | "Corpus is now 1660708 characters long\n", 498 | "\n", 499 | "Reading '../data/hin_corp_unicode/1112_utf.txt'...\n", 500 | "Corpus is now 1672087 characters long\n", 501 | "\n", 502 | "Reading '../data/hin_corp_unicode/1113_utf.txt'...\n", 503 | "Corpus is now 1683636 characters long\n", 504 | "\n", 505 | "Reading '../data/hin_corp_unicode/1114_utf.txt'...\n", 506 | "Corpus is now 1697858 characters long\n", 507 | "\n", 508 | "Reading 
'../data/hin_corp_unicode/1115_utf.txt'...\n", 509 | "Corpus is now 1712990 characters long\n", 510 | "\n", 511 | "Reading '../data/hin_corp_unicode/1116_utf.txt'...\n", 512 | "Corpus is now 1727920 characters long\n", 513 | "\n", 514 | "Reading '../data/hin_corp_unicode/1117_utf.txt'...\n", 515 | "Corpus is now 1738544 characters long\n", 516 | "\n", 517 | "Reading '../data/hin_corp_unicode/1118_utf.txt'...\n", 518 | "Corpus is now 1748863 characters long\n", 519 | "\n", 520 | "Reading '../data/hin_corp_unicode/1119_utf.txt'...\n", 521 | "Corpus is now 1761075 characters long\n", 522 | "\n", 523 | "Reading '../data/hin_corp_unicode/111_utf.txt'...\n", 524 | "Corpus is now 1773923 characters long\n", 525 | "\n", 526 | "Reading '../data/hin_corp_unicode/1120_utf.txt'...\n", 527 | "Corpus is now 1786706 characters long\n", 528 | "\n", 529 | "Reading '../data/hin_corp_unicode/1121_utf.txt'...\n", 530 | "Corpus is now 1803759 characters long\n", 531 | "\n", 532 | "Reading '../data/hin_corp_unicode/1122_utf.txt'...\n", 533 | "Corpus is now 1816351 characters long\n", 534 | "\n", 535 | "Reading '../data/hin_corp_unicode/1123_utf.txt'...\n", 536 | "Corpus is now 1830221 characters long\n", 537 | "\n", 538 | "Reading '../data/hin_corp_unicode/1124_utf.txt'...\n", 539 | "Corpus is now 1845435 characters long\n", 540 | "\n", 541 | "Reading '../data/hin_corp_unicode/1125_utf.txt'...\n", 542 | "Corpus is now 1859781 characters long\n", 543 | "\n", 544 | "Reading '../data/hin_corp_unicode/1126_utf.txt'...\n", 545 | "Corpus is now 1873232 characters long\n", 546 | "\n", 547 | "Reading '../data/hin_corp_unicode/1127_utf.txt'...\n", 548 | "Corpus is now 1886498 characters long\n", 549 | "\n", 550 | "Reading '../data/hin_corp_unicode/1128_utf.txt'...\n", 551 | "Corpus is now 1902306 characters long\n", 552 | "\n", 553 | "Reading '../data/hin_corp_unicode/1129_utf.txt'...\n", 554 | "Corpus is now 1913924 characters long\n", 555 | "\n", 556 | "Reading 
'../data/hin_corp_unicode/112_utf.txt'...\n", 557 | "Corpus is now 1924358 characters long\n", 558 | "\n", 559 | "Reading '../data/hin_corp_unicode/1130_utf.txt'...\n", 560 | "Corpus is now 1935669 characters long\n", 561 | "\n", 562 | "Reading '../data/hin_corp_unicode/1131_utf.txt'...\n", 563 | "Corpus is now 1945838 characters long\n", 564 | "\n", 565 | "Reading '../data/hin_corp_unicode/1132_utf.txt'...\n", 566 | "Corpus is now 1956502 characters long\n", 567 | "\n", 568 | "Reading '../data/hin_corp_unicode/1133_utf.txt'...\n", 569 | "Corpus is now 1966760 characters long\n", 570 | "\n", 571 | "Reading '../data/hin_corp_unicode/1134_utf.txt'...\n", 572 | "Corpus is now 1978041 characters long\n", 573 | "\n", 574 | "Reading '../data/hin_corp_unicode/1135_utf.txt'...\n", 575 | "Corpus is now 1999388 characters long\n", 576 | "\n", 577 | "Reading '../data/hin_corp_unicode/1136_utf.txt'...\n", 578 | "Corpus is now 2009544 characters long\n", 579 | "\n", 580 | "Reading '../data/hin_corp_unicode/1137_utf.txt'...\n", 581 | "Corpus is now 2020061 characters long\n", 582 | "\n", 583 | "Reading '../data/hin_corp_unicode/1138_utf.txt'...\n", 584 | "Corpus is now 2031542 characters long\n", 585 | "\n", 586 | "Reading '../data/hin_corp_unicode/1139_utf.txt'...\n", 587 | "Corpus is now 2041658 characters long\n", 588 | "\n", 589 | "Reading '../data/hin_corp_unicode/113_utf.txt'...\n", 590 | "Corpus is now 2054635 characters long\n", 591 | "\n", 592 | "Reading '../data/hin_corp_unicode/1140_utf.txt'...\n", 593 | "Corpus is now 2064671 characters long\n", 594 | "\n", 595 | "Reading '../data/hin_corp_unicode/1141_utf.txt'...\n", 596 | "Corpus is now 2074453 characters long\n", 597 | "\n", 598 | "Reading '../data/hin_corp_unicode/1142_utf.txt'...\n", 599 | "Corpus is now 2084399 characters long\n", 600 | "\n", 601 | "Reading '../data/hin_corp_unicode/1143_utf.txt'...\n", 602 | "Corpus is now 2097400 characters long\n", 603 | "\n", 604 | "Reading 
'../data/hin_corp_unicode/1144_utf.txt'...\n", 605 | "Corpus is now 2107890 characters long\n", 606 | "\n", 607 | "Reading '../data/hin_corp_unicode/1145_utf.txt'...\n", 608 | "Corpus is now 2117922 characters long\n", 609 | "\n", 610 | "Reading '../data/hin_corp_unicode/1146_utf.txt'...\n", 611 | "Corpus is now 2127483 characters long\n", 612 | "\n", 613 | "Reading '../data/hin_corp_unicode/1147_utf.txt'...\n", 614 | "Corpus is now 2138945 characters long\n", 615 | "\n", 616 | "Reading '../data/hin_corp_unicode/1148_utf.txt'...\n", 617 | "Corpus is now 2149981 characters long\n", 618 | "\n", 619 | "Reading '../data/hin_corp_unicode/1149_utf.txt'...\n", 620 | "Corpus is now 2157493 characters long\n", 621 | "\n", 622 | "Reading '../data/hin_corp_unicode/114_utf.txt'...\n", 623 | "Corpus is now 2168073 characters long\n", 624 | "\n", 625 | "Reading '../data/hin_corp_unicode/1150_utf.txt'...\n", 626 | "Corpus is now 2180893 characters long\n", 627 | "\n", 628 | "Reading '../data/hin_corp_unicode/1151_utf.txt'...\n", 629 | "Corpus is now 2194251 characters long\n", 630 | "\n", 631 | "Reading '../data/hin_corp_unicode/1152_utf.txt'...\n", 632 | "Corpus is now 2204388 characters long\n", 633 | "\n", 634 | "Reading '../data/hin_corp_unicode/1153_utf.txt'...\n", 635 | "Corpus is now 2214759 characters long\n", 636 | "\n", 637 | "Reading '../data/hin_corp_unicode/1154_utf.txt'...\n", 638 | "Corpus is now 2225483 characters long\n", 639 | "\n", 640 | "Reading '../data/hin_corp_unicode/1155_utf.txt'...\n", 641 | "Corpus is now 2235061 characters long\n", 642 | "\n", 643 | "Reading '../data/hin_corp_unicode/1156_utf.txt'...\n", 644 | "Corpus is now 2246873 characters long\n", 645 | "\n", 646 | "Reading '../data/hin_corp_unicode/1157_utf.txt'...\n", 647 | "Corpus is now 2270926 characters long\n", 648 | "\n", 649 | "Reading '../data/hin_corp_unicode/1158_utf.txt'...\n", 650 | "Corpus is now 2278405 characters long\n", 651 | "\n", 652 | "Reading 
'../data/hin_corp_unicode/115_utf.txt'...\n", 653 | "Corpus is now 2288118 characters long\n", 654 | "\n", 655 | "Reading '../data/hin_corp_unicode/1160_utf.txt'...\n", 656 | "Corpus is now 2292344 characters long\n", 657 | "\n", 658 | "Reading '../data/hin_corp_unicode/1161_utf.txt'...\n", 659 | "Corpus is now 2302999 characters long\n", 660 | "\n", 661 | "Reading '../data/hin_corp_unicode/1162_utf.txt'...\n", 662 | "Corpus is now 2312185 characters long\n", 663 | "\n", 664 | "Reading '../data/hin_corp_unicode/1163_utf.txt'...\n", 665 | "Corpus is now 2320191 characters long\n", 666 | "\n", 667 | "Reading '../data/hin_corp_unicode/1164_utf.txt'...\n", 668 | "Corpus is now 2329773 characters long\n", 669 | "\n", 670 | "Reading '../data/hin_corp_unicode/1165_utf.txt'...\n", 671 | "Corpus is now 2337208 characters long\n", 672 | "\n", 673 | "Reading '../data/hin_corp_unicode/1166_utf.txt'...\n", 674 | "Corpus is now 2347525 characters long\n", 675 | "\n", 676 | "Reading '../data/hin_corp_unicode/1167_utf.txt'...\n", 677 | "Corpus is now 2356006 characters long\n", 678 | "\n", 679 | "Reading '../data/hin_corp_unicode/1168_utf.txt'...\n", 680 | "Corpus is now 2365511 characters long\n", 681 | "\n", 682 | "Reading '../data/hin_corp_unicode/1169_utf.txt'...\n", 683 | "Corpus is now 2378250 characters long\n", 684 | "\n", 685 | "Reading '../data/hin_corp_unicode/116_utf.txt'...\n", 686 | "Corpus is now 2388467 characters long\n", 687 | "\n", 688 | "Reading '../data/hin_corp_unicode/1170_utf.txt'...\n", 689 | "Corpus is now 2397501 characters long\n", 690 | "\n", 691 | "Reading '../data/hin_corp_unicode/1171_utf.txt'...\n", 692 | "Corpus is now 2407281 characters long\n", 693 | "\n", 694 | "Reading '../data/hin_corp_unicode/1172_utf.txt'...\n", 695 | "Corpus is now 2416850 characters long\n", 696 | "\n", 697 | "Reading '../data/hin_corp_unicode/1173_utf.txt'...\n", 698 | "Corpus is now 2427732 characters long\n", 699 | "\n", 700 | "Reading 
'../data/hin_corp_unicode/1174_utf.txt'...\n", 701 | "Corpus is now 2436884 characters long\n", 702 | "\n", 703 | "Reading '../data/hin_corp_unicode/1175_utf.txt'...\n", 704 | "Corpus is now 2444844 characters long\n", 705 | "\n", 706 | "Reading '../data/hin_corp_unicode/1176_utf.txt'...\n", 707 | "Corpus is now 2455852 characters long\n", 708 | "\n", 709 | "Reading '../data/hin_corp_unicode/1177_utf.txt'...\n", 710 | "Corpus is now 2463769 characters long\n", 711 | "\n", 712 | "Reading '../data/hin_corp_unicode/1178_utf.txt'...\n", 713 | "Corpus is now 2476557 characters long\n", 714 | "\n", 715 | "Reading '../data/hin_corp_unicode/1179_utf.txt'...\n", 716 | "Corpus is now 2476557 characters long\n", 717 | "\n", 718 | "Reading '../data/hin_corp_unicode/117_utf.txt'...\n", 719 | "Corpus is now 2484876 characters long\n", 720 | "\n", 721 | "Reading '../data/hin_corp_unicode/1180_utf.txt'...\n", 722 | "Corpus is now 2484876 characters long\n", 723 | "\n", 724 | "Reading '../data/hin_corp_unicode/1181_utf.txt'...\n", 725 | "Corpus is now 2492120 characters long\n", 726 | "\n", 727 | "Reading '../data/hin_corp_unicode/1182_utf.txt'...\n", 728 | "Corpus is now 2503852 characters long\n", 729 | "\n", 730 | "Reading '../data/hin_corp_unicode/1183_utf.txt'...\n", 731 | "Corpus is now 2513778 characters long\n", 732 | "\n", 733 | "Reading '../data/hin_corp_unicode/1184_utf.txt'...\n", 734 | "Corpus is now 2522787 characters long\n", 735 | "\n", 736 | "Reading '../data/hin_corp_unicode/1185_utf.txt'...\n", 737 | "Corpus is now 2533197 characters long\n", 738 | "\n", 739 | "Reading '../data/hin_corp_unicode/1186_utf.txt'...\n", 740 | "Corpus is now 2541103 characters long\n", 741 | "\n", 742 | "Reading '../data/hin_corp_unicode/1187_utf.txt'...\n", 743 | "Corpus is now 2552220 characters long\n", 744 | "\n", 745 | "Reading '../data/hin_corp_unicode/1188_utf.txt'...\n", 746 | "Corpus is now 2558627 characters long\n", 747 | "\n", 748 | "Reading 
'../data/hin_corp_unicode/1189_utf.txt'...\n", 749 | "Corpus is now 2568593 characters long\n", 750 | "\n", 751 | "Reading '../data/hin_corp_unicode/118_utf.txt'...\n", 752 | "Corpus is now 2577837 characters long\n", 753 | "\n", 754 | "Reading '../data/hin_corp_unicode/1190_utf.txt'...\n", 755 | "Corpus is now 2577837 characters long\n", 756 | "\n", 757 | "Reading '../data/hin_corp_unicode/1191_utf.txt'...\n", 758 | "Corpus is now 2589740 characters long\n", 759 | "\n", 760 | "Reading '../data/hin_corp_unicode/1192_utf.txt'...\n", 761 | "Corpus is now 2600457 characters long\n", 762 | "\n", 763 | "Reading '../data/hin_corp_unicode/1193_utf.txt'...\n", 764 | "Corpus is now 2612320 characters long\n", 765 | "\n", 766 | "Reading '../data/hin_corp_unicode/1194_utf.txt'...\n", 767 | "Corpus is now 2624351 characters long\n", 768 | "\n", 769 | "Reading '../data/hin_corp_unicode/1195_utf.txt'...\n", 770 | "Corpus is now 2635716 characters long\n", 771 | "\n", 772 | "Reading '../data/hin_corp_unicode/1196_utf.txt'...\n", 773 | "Corpus is now 2648028 characters long\n", 774 | "\n", 775 | "Reading '../data/hin_corp_unicode/1197_utf.txt'...\n", 776 | "Corpus is now 2654171 characters long\n", 777 | "\n", 778 | "Reading '../data/hin_corp_unicode/1198_utf.txt'...\n", 779 | "Corpus is now 2665069 characters long\n", 780 | "\n", 781 | "Reading '../data/hin_corp_unicode/1199_utf.txt'...\n", 782 | "Corpus is now 2675262 characters long\n", 783 | "\n", 784 | "Reading '../data/hin_corp_unicode/119_utf.txt'...\n", 785 | "Corpus is now 2686391 characters long\n", 786 | "\n", 787 | "Reading '../data/hin_corp_unicode/11_utf8.txt'...\n", 788 | "Corpus is now 2686391 characters long\n", 789 | "\n", 790 | "Reading '../data/hin_corp_unicode/1200_utf.txt'...\n", 791 | "Corpus is now 2697083 characters long\n", 792 | "\n", 793 | "Reading '../data/hin_corp_unicode/1201_utf.txt'...\n", 794 | "Corpus is now 2750848 characters long\n", 795 | "\n", 796 | "Reading 
'../data/hin_corp_unicode/1202_utf.txt'...\n", 797 | "Corpus is now 2761018 characters long\n", 798 | "\n", 799 | "Reading '../data/hin_corp_unicode/1203_utf.txt'...\n", 800 | "Corpus is now 2770808 characters long\n", 801 | "\n", 802 | "Reading '../data/hin_corp_unicode/1204_utf.txt'...\n", 803 | "Corpus is now 2783324 characters long\n", 804 | "\n", 805 | "Reading '../data/hin_corp_unicode/1205_utf.txt'...\n", 806 | "Corpus is now 2802845 characters long\n", 807 | "\n", 808 | "Reading '../data/hin_corp_unicode/1206_utf.txt'...\n", 809 | "Corpus is now 2816520 characters long\n", 810 | "\n", 811 | "Reading '../data/hin_corp_unicode/1207_utf.txt'...\n", 812 | "Corpus is now 2827395 characters long\n", 813 | "\n", 814 | "Reading '../data/hin_corp_unicode/1208_utf.txt'...\n", 815 | "Corpus is now 2840445 characters long\n", 816 | "\n", 817 | "Reading '../data/hin_corp_unicode/1209_utf.txt'...\n", 818 | "Corpus is now 2848265 characters long\n", 819 | "\n", 820 | "Reading '../data/hin_corp_unicode/120_utf.txt'...\n", 821 | "Corpus is now 2858811 characters long\n", 822 | "\n", 823 | "Reading '../data/hin_corp_unicode/1210_utf.txt'...\n", 824 | "Corpus is now 2874289 characters long\n", 825 | "\n", 826 | "Reading '../data/hin_corp_unicode/1211_utf.txt'...\n", 827 | "Corpus is now 2885632 characters long\n", 828 | "\n", 829 | "Reading '../data/hin_corp_unicode/1212_utf.txt'...\n", 830 | "Corpus is now 2893405 characters long\n", 831 | "\n", 832 | "Reading '../data/hin_corp_unicode/1213_utf.txt'...\n", 833 | "Corpus is now 2907759 characters long\n", 834 | "\n", 835 | "Reading '../data/hin_corp_unicode/1214_utf.txt'...\n", 836 | "Corpus is now 2918669 characters long\n", 837 | "\n", 838 | "Reading '../data/hin_corp_unicode/1215_utf.txt'...\n", 839 | "Corpus is now 2931345 characters long\n", 840 | "\n", 841 | "Reading '../data/hin_corp_unicode/1216_utf.txt'...\n", 842 | "Corpus is now 2943983 characters long\n", 843 | "\n", 844 | "Reading 
'../data/hin_corp_unicode/1217_utf.txt'...\n", 845 | "Corpus is now 2953782 characters long\n", 846 | "\n", 847 | "Reading '../data/hin_corp_unicode/1218_utf.txt'...\n", 848 | "Corpus is now 2965592 characters long\n", 849 | "\n", 850 | "Reading '../data/hin_corp_unicode/1219_utf.txt'...\n", 851 | "Corpus is now 2977052 characters long\n", 852 | "\n", 853 | "Reading '../data/hin_corp_unicode/121_utf.txt'...\n", 854 | "Corpus is now 2983707 characters long\n", 855 | "\n", 856 | "Reading '../data/hin_corp_unicode/1220_utf.txt'...\n", 857 | "Corpus is now 2994859 characters long\n", 858 | "\n", 859 | "Reading '../data/hin_corp_unicode/1221_utf.txt'...\n", 860 | "Corpus is now 3003901 characters long\n", 861 | "\n", 862 | "Reading '../data/hin_corp_unicode/1222_utf.txt'...\n", 863 | "Corpus is now 3017694 characters long\n", 864 | "\n", 865 | "Reading '../data/hin_corp_unicode/1223_utf.txt'...\n", 866 | "Corpus is now 3030022 characters long\n", 867 | "\n", 868 | "Reading '../data/hin_corp_unicode/1224_utf.txt'...\n", 869 | "Corpus is now 3040170 characters long\n", 870 | "\n", 871 | "Reading '../data/hin_corp_unicode/1225_utf.txt'...\n", 872 | "Corpus is now 3050266 characters long\n", 873 | "\n", 874 | "Reading '../data/hin_corp_unicode/1226_utf.txt'...\n", 875 | "Corpus is now 3059827 characters long\n", 876 | "\n", 877 | "Reading '../data/hin_corp_unicode/1227_utf.txt'...\n", 878 | "Corpus is now 3081787 characters long\n", 879 | "\n", 880 | "Reading '../data/hin_corp_unicode/1228_utf.txt'...\n", 881 | "Corpus is now 3093071 characters long\n", 882 | "\n", 883 | "Reading '../data/hin_corp_unicode/1229_utf.txt'...\n", 884 | "Corpus is now 3103477 characters long\n", 885 | "\n", 886 | "Reading '../data/hin_corp_unicode/122_utf.txt'...\n", 887 | "Corpus is now 3113532 characters long\n", 888 | "\n", 889 | "Reading '../data/hin_corp_unicode/1230_utf.txt'...\n", 890 | "Corpus is now 3126208 characters long\n", 891 | "\n", 892 | "Reading 
'../data/hin_corp_unicode/1231_utf.txt'...\n", 893 | "Corpus is now 3136430 characters long\n", 894 | "\n", 895 | "Reading '../data/hin_corp_unicode/1232_utf.txt'...\n", 896 | "Corpus is now 3146958 characters long\n", 897 | "\n", 898 | "Reading '../data/hin_corp_unicode/1233_utf.txt'...\n", 899 | "Corpus is now 3157456 characters long\n", 900 | "\n", 901 | "Reading '../data/hin_corp_unicode/123_utf.txt'...\n", 902 | "Corpus is now 3169580 characters long\n", 903 | "\n", 904 | "Reading '../data/hin_corp_unicode/124_utf.txt'...\n", 905 | "Corpus is now 3179593 characters long\n", 906 | "\n", 907 | "Reading '../data/hin_corp_unicode/125_utf.txt'...\n", 908 | "Corpus is now 3187626 characters long\n", 909 | "\n", 910 | "Reading '../data/hin_corp_unicode/126_utf.txt'...\n", 911 | "Corpus is now 3199528 characters long\n", 912 | "\n", 913 | "Reading '../data/hin_corp_unicode/127_utf.txt'...\n", 914 | "Corpus is now 3211722 characters long\n", 915 | "\n", 916 | "Reading '../data/hin_corp_unicode/128_utf.txt'...\n", 917 | "Corpus is now 3223003 characters long\n", 918 | "\n", 919 | "Reading '../data/hin_corp_unicode/129_utf.txt'...\n", 920 | "Corpus is now 3234838 characters long\n", 921 | "\n", 922 | "Reading '../data/hin_corp_unicode/12_utf8.txt'...\n", 923 | "Corpus is now 3234838 characters long\n", 924 | "\n", 925 | "Reading '../data/hin_corp_unicode/130_utf.txt'...\n", 926 | "Corpus is now 3246268 characters long\n", 927 | "\n", 928 | "Reading '../data/hin_corp_unicode/131_utf.txt'...\n", 929 | "Corpus is now 3259267 characters long\n", 930 | "\n", 931 | "Reading '../data/hin_corp_unicode/132_utf.txt'...\n", 932 | "Corpus is now 3268394 characters long\n", 933 | "\n", 934 | "Reading '../data/hin_corp_unicode/133_utf.txt'...\n", 935 | "Corpus is now 3279467 characters long\n", 936 | "\n", 937 | "Reading '../data/hin_corp_unicode/134_utf.txt'...\n", 938 | "Corpus is now 3290481 characters long\n", 939 | "\n", 940 | "Reading 
'../data/hin_corp_unicode/135_utf.txt'...\n", 941 | "Corpus is now 3299738 characters long\n", 942 | "\n", 943 | "Reading '../data/hin_corp_unicode/136_utf.txt'...\n", 944 | "Corpus is now 3312265 characters long\n", 945 | "\n", 946 | "Reading '../data/hin_corp_unicode/137_utf.txt'...\n", 947 | "Corpus is now 3323398 characters long\n", 948 | "\n", 949 | "Reading '../data/hin_corp_unicode/138_utf.txt'...\n", 950 | "Corpus is now 3334686 characters long\n", 951 | "\n", 952 | "Reading '../data/hin_corp_unicode/139_utf.txt'...\n", 953 | "Corpus is now 3346195 characters long\n", 954 | "\n", 955 | "Reading '../data/hin_corp_unicode/13_utf8.txt'...\n", 956 | "Corpus is now 3346195 characters long\n", 957 | "\n", 958 | "Reading '../data/hin_corp_unicode/140_utf.txt'...\n", 959 | "Corpus is now 3356806 characters long\n", 960 | "\n", 961 | "Reading '../data/hin_corp_unicode/141_utf.txt'...\n", 962 | "Corpus is now 3368547 characters long\n", 963 | "\n", 964 | "Reading '../data/hin_corp_unicode/142_utf.txt'...\n", 965 | "Corpus is now 3381978 characters long\n", 966 | "\n", 967 | "Reading '../data/hin_corp_unicode/143_utf.txt'...\n", 968 | "Corpus is now 3394604 characters long\n", 969 | "\n", 970 | "Reading '../data/hin_corp_unicode/144_utf.txt'...\n", 971 | "Corpus is now 3406369 characters long\n", 972 | "\n", 973 | "Reading '../data/hin_corp_unicode/145_utf.txt'...\n", 974 | "Corpus is now 3418696 characters long\n", 975 | "\n", 976 | "Reading '../data/hin_corp_unicode/146_utf.txt'...\n", 977 | "Corpus is now 3430965 characters long\n", 978 | "\n", 979 | "Reading '../data/hin_corp_unicode/147_utf.txt'...\n", 980 | "Corpus is now 3442255 characters long\n", 981 | "\n", 982 | "Reading '../data/hin_corp_unicode/148_utf.txt'...\n", 983 | "Corpus is now 3455283 characters long\n", 984 | "\n", 985 | "Reading '../data/hin_corp_unicode/149_utf.txt'...\n", 986 | "Corpus is now 3467130 characters long\n", 987 | "\n", 988 | "Reading '../data/hin_corp_unicode/14_utf8.txt'...\n", 
989 | "Corpus is now 3467130 characters long\n", 990 | "\n", 991 | "Reading '../data/hin_corp_unicode/150_utf.txt'...\n", 992 | "Corpus is now 3480050 characters long\n", 993 | "\n", 994 | "Reading '../data/hin_corp_unicode/151_utf.txt'...\n", 995 | "Corpus is now 3495987 characters long\n", 996 | "\n", 997 | "Reading '../data/hin_corp_unicode/152_utf.txt'...\n", 998 | "Corpus is now 3505838 characters long\n", 999 | "\n", 1000 | "Reading '../data/hin_corp_unicode/153_utf.txt'...\n", 1001 | "Corpus is now 3514622 characters long\n", 1002 | "\n", 1003 | "Reading '../data/hin_corp_unicode/154_utf.txt'...\n", 1004 | "Corpus is now 3522729 characters long\n", 1005 | "\n", 1006 | "Reading '../data/hin_corp_unicode/155_utf.txt'...\n", 1007 | "Corpus is now 3532040 characters long\n", 1008 | "\n", 1009 | "Reading '../data/hin_corp_unicode/156_utf.txt'...\n", 1010 | "Corpus is now 3541837 characters long\n", 1011 | "\n", 1012 | "Reading '../data/hin_corp_unicode/157_utf.txt'...\n", 1013 | "Corpus is now 3550838 characters long\n", 1014 | "\n", 1015 | "Reading '../data/hin_corp_unicode/158_utf.txt'...\n", 1016 | "Corpus is now 3564716 characters long\n", 1017 | "\n", 1018 | "Reading '../data/hin_corp_unicode/15_utf8.txt'...\n", 1019 | "Corpus is now 3564716 characters long\n", 1020 | "\n", 1021 | "Reading '../data/hin_corp_unicode/160_utf.txt'...\n", 1022 | "Corpus is now 3580209 characters long\n", 1023 | "\n", 1024 | "Reading '../data/hin_corp_unicode/161_utf.txt'...\n", 1025 | "Corpus is now 3592993 characters long\n", 1026 | "\n", 1027 | "Reading '../data/hin_corp_unicode/162_utf.txt'...\n", 1028 | "Corpus is now 3602175 characters long\n", 1029 | "\n", 1030 | "Reading '../data/hin_corp_unicode/163_utf.txt'...\n", 1031 | "Corpus is now 3611018 characters long\n", 1032 | "\n", 1033 | "Reading '../data/hin_corp_unicode/164_utf.txt'...\n", 1034 | "Corpus is now 3620033 characters long\n", 1035 | "\n", 1036 | "Reading '../data/hin_corp_unicode/165_utf.txt'...\n", 1037 | 
"Corpus is now 3629194 characters long\n", 1038 | "\n", 1039 | "Reading '../data/hin_corp_unicode/166_utf.txt'...\n", 1040 | "Corpus is now 3636956 characters long\n", 1041 | "\n", 1042 | "Reading '../data/hin_corp_unicode/167_utf.txt'...\n", 1043 | "Corpus is now 3646730 characters long\n", 1044 | "\n", 1045 | "Reading '../data/hin_corp_unicode/168_utf.txt'...\n", 1046 | "Corpus is now 3659282 characters long\n", 1047 | "\n", 1048 | "Reading '../data/hin_corp_unicode/169_utf.txt'...\n", 1049 | "Corpus is now 3668692 characters long\n", 1050 | "\n", 1051 | "Reading '../data/hin_corp_unicode/16_utf8.txt'...\n", 1052 | "Corpus is now 3668692 characters long\n", 1053 | "\n", 1054 | "Reading '../data/hin_corp_unicode/170_utf.txt'...\n", 1055 | "Corpus is now 3676910 characters long\n", 1056 | "\n", 1057 | "Reading '../data/hin_corp_unicode/171_utf.txt'...\n", 1058 | "Corpus is now 3689307 characters long\n", 1059 | "\n", 1060 | "Reading '../data/hin_corp_unicode/172_utf.txt'...\n", 1061 | "Corpus is now 3703359 characters long\n", 1062 | "\n", 1063 | "Reading '../data/hin_corp_unicode/173_utf.txt'...\n", 1064 | "Corpus is now 3714971 characters long\n", 1065 | "\n", 1066 | "Reading '../data/hin_corp_unicode/174_utf.txt'...\n", 1067 | "Corpus is now 3728024 characters long\n", 1068 | "\n", 1069 | "Reading '../data/hin_corp_unicode/175_utf.txt'...\n", 1070 | "Corpus is now 3742170 characters long\n", 1071 | "\n", 1072 | "Reading '../data/hin_corp_unicode/176_utf.txt'...\n", 1073 | "Corpus is now 3754765 characters long\n", 1074 | "\n", 1075 | "Reading '../data/hin_corp_unicode/177_utf.txt'...\n", 1076 | "Corpus is now 3768692 characters long\n", 1077 | "\n", 1078 | "Reading '../data/hin_corp_unicode/178_utf.txt'...\n", 1079 | "Corpus is now 3783267 characters long\n", 1080 | "\n", 1081 | "Reading '../data/hin_corp_unicode/179_utf.txt'...\n", 1082 | "Corpus is now 3798872 characters long\n", 1083 | "\n", 1084 | "Reading '../data/hin_corp_unicode/17_utf8.txt'...\n", 1085 | 
"Corpus is now 3798872 characters long\n", 1086 | "\n", 1087 | "Reading '../data/hin_corp_unicode/180_utf.txt'...\n", 1088 | "Corpus is now 3806384 characters long\n", 1089 | "\n", 1090 | "Reading '../data/hin_corp_unicode/181_utf.txt'...\n", 1091 | "Corpus is now 3817090 characters long\n", 1092 | "\n", 1093 | "Reading '../data/hin_corp_unicode/182_utf.txt'...\n", 1094 | "Corpus is now 3830451 characters long\n", 1095 | "\n", 1096 | "Reading '../data/hin_corp_unicode/183_utf.txt'...\n", 1097 | "Corpus is now 3847490 characters long\n", 1098 | "\n", 1099 | "Reading '../data/hin_corp_unicode/184_utf.txt'...\n", 1100 | "Corpus is now 3859543 characters long\n", 1101 | "\n", 1102 | "Reading '../data/hin_corp_unicode/185_utf.txt'...\n", 1103 | "Corpus is now 3872552 characters long\n", 1104 | "\n", 1105 | "Reading '../data/hin_corp_unicode/186_utf.txt'...\n", 1106 | "Corpus is now 3890828 characters long\n", 1107 | "\n", 1108 | "Reading '../data/hin_corp_unicode/187_utf.txt'...\n", 1109 | "Corpus is now 3898888 characters long\n", 1110 | "\n", 1111 | "Reading '../data/hin_corp_unicode/188_utf.txt'...\n", 1112 | "Corpus is now 3908924 characters long\n", 1113 | "\n", 1114 | "Reading '../data/hin_corp_unicode/189_utf.txt'...\n", 1115 | "Corpus is now 3916870 characters long\n", 1116 | "\n", 1117 | "Reading '../data/hin_corp_unicode/18_utf8.txt'...\n", 1118 | "Corpus is now 3916870 characters long\n", 1119 | "\n", 1120 | "Reading '../data/hin_corp_unicode/190_utf.txt'...\n", 1121 | "Corpus is now 3928991 characters long\n", 1122 | "\n", 1123 | "Reading '../data/hin_corp_unicode/191_utf.txt'...\n", 1124 | "Corpus is now 3936493 characters long\n", 1125 | "\n", 1126 | "Reading '../data/hin_corp_unicode/192_utf.txt'...\n", 1127 | "Corpus is now 3948575 characters long\n", 1128 | "\n", 1129 | "Reading '../data/hin_corp_unicode/193_utf.txt'...\n", 1130 | "Corpus is now 3960820 characters long\n", 1131 | "\n", 1132 | "Reading '../data/hin_corp_unicode/194_utf.txt'...\n", 1133 | 
"Corpus is now 3975794 characters long\n", 1134 | "\n", 1135 | "Reading '../data/hin_corp_unicode/195_utf.txt'...\n", 1136 | "Corpus is now 3987568 characters long\n", 1137 | "\n", 1138 | "Reading '../data/hin_corp_unicode/196_utf.txt'...\n", 1139 | "Corpus is now 4001763 characters long\n", 1140 | "\n", 1141 | "Reading '../data/hin_corp_unicode/197_utf.txt'...\n", 1142 | "Corpus is now 4020747 characters long\n", 1143 | "\n", 1144 | "Reading '../data/hin_corp_unicode/198_utf.txt'...\n", 1145 | "Corpus is now 4032134 characters long\n", 1146 | "\n", 1147 | "Reading '../data/hin_corp_unicode/199_utf.txt'...\n", 1148 | "Corpus is now 4041001 characters long\n", 1149 | "\n", 1150 | "Reading '../data/hin_corp_unicode/19_utf8.txt'...\n", 1151 | "Corpus is now 4041001 characters long\n", 1152 | "\n", 1153 | "Reading '../data/hin_corp_unicode/1_utf8.txt'...\n", 1154 | "Corpus is now 4041001 characters long\n", 1155 | "\n", 1156 | "Reading '../data/hin_corp_unicode/200_utf.txt'...\n", 1157 | "Corpus is now 4041001 characters long\n", 1158 | "\n", 1159 | "Reading '../data/hin_corp_unicode/201_utf.txt'...\n", 1160 | "Corpus is now 4052758 characters long\n", 1161 | "\n", 1162 | "Reading '../data/hin_corp_unicode/202_utf.txt'...\n", 1163 | "Corpus is now 4063136 characters long\n", 1164 | "\n", 1165 | "Reading '../data/hin_corp_unicode/203_utf.txt'...\n", 1166 | "Corpus is now 4075248 characters long\n", 1167 | "\n", 1168 | "Reading '../data/hin_corp_unicode/204_utf.txt'...\n", 1169 | "Corpus is now 4085763 characters long\n", 1170 | "\n", 1171 | "Reading '../data/hin_corp_unicode/205_utf.txt'...\n", 1172 | "Corpus is now 4101854 characters long\n", 1173 | "\n", 1174 | "Reading '../data/hin_corp_unicode/206_utf.txt'...\n", 1175 | "Corpus is now 4114766 characters long\n", 1176 | "\n", 1177 | "Reading '../data/hin_corp_unicode/207_utf.txt'...\n", 1178 | "Corpus is now 4128165 characters long\n", 1179 | "\n", 1180 | "Reading '../data/hin_corp_unicode/208_utf.txt'...\n", 1181 | 
"Corpus is now 4143181 characters long\n", 1182 | "\n", 1183 | "Reading '../data/hin_corp_unicode/209_utf.txt'...\n", 1184 | "Corpus is now 4155646 characters long\n", 1185 | "\n", 1186 | "Reading '../data/hin_corp_unicode/20_utf8.txt'...\n", 1187 | "Corpus is now 4155646 characters long\n", 1188 | "\n", 1189 | "Reading '../data/hin_corp_unicode/210_utf.txt'...\n", 1190 | "Corpus is now 4168357 characters long\n", 1191 | "\n", 1192 | "Reading '../data/hin_corp_unicode/211_utf.txt'...\n", 1193 | "Corpus is now 4182887 characters long\n", 1194 | "\n", 1195 | "Reading '../data/hin_corp_unicode/212_utf.txt'...\n", 1196 | "Corpus is now 4202240 characters long\n", 1197 | "\n", 1198 | "Reading '../data/hin_corp_unicode/213_utf.txt'...\n", 1199 | "Corpus is now 4217115 characters long\n", 1200 | "\n", 1201 | "Reading '../data/hin_corp_unicode/214_utf.txt'...\n", 1202 | "Corpus is now 4230117 characters long\n", 1203 | "\n", 1204 | "Reading '../data/hin_corp_unicode/215_utf.txt'...\n", 1205 | "Corpus is now 4246072 characters long\n", 1206 | "\n", 1207 | "Reading '../data/hin_corp_unicode/216_utf.txt'...\n", 1208 | "Corpus is now 4261517 characters long\n", 1209 | "\n", 1210 | "Reading '../data/hin_corp_unicode/217_utf.txt'...\n", 1211 | "Corpus is now 4275997 characters long\n", 1212 | "\n", 1213 | "Reading '../data/hin_corp_unicode/218_utf.txt'...\n", 1214 | "Corpus is now 4289869 characters long\n", 1215 | "\n", 1216 | "Reading '../data/hin_corp_unicode/219_utf.txt'...\n", 1217 | "Corpus is now 4304987 characters long\n", 1218 | "\n", 1219 | "Reading '../data/hin_corp_unicode/21_utf8.txt'...\n", 1220 | "Corpus is now 4311352 characters long\n", 1221 | "\n", 1222 | "Reading '../data/hin_corp_unicode/220_utf.txt'...\n", 1223 | "Corpus is now 4325463 characters long\n", 1224 | "\n", 1225 | "Reading '../data/hin_corp_unicode/221_utf.txt'...\n", 1226 | "Corpus is now 4338770 characters long\n", 1227 | "\n", 1228 | "Reading '../data/hin_corp_unicode/222_utf.txt'...\n", 1229 | 
"Corpus is now 4352546 characters long\n", 1230 | "\n", 1231 | "Reading '../data/hin_corp_unicode/223_utf.txt'...\n", 1232 | "Corpus is now 4365854 characters long\n", 1233 | "\n", 1234 | "Reading '../data/hin_corp_unicode/224_utf.txt'...\n", 1235 | "Corpus is now 4379088 characters long\n", 1236 | "\n", 1237 | "Reading '../data/hin_corp_unicode/225_utf.txt'...\n", 1238 | "Corpus is now 4392052 characters long\n", 1239 | "\n", 1240 | "Reading '../data/hin_corp_unicode/226_utf.txt'...\n", 1241 | "Corpus is now 4406799 characters long\n", 1242 | "\n", 1243 | "Reading '../data/hin_corp_unicode/227_utf.txt'...\n", 1244 | "Corpus is now 4421189 characters long\n", 1245 | "\n", 1246 | "Reading '../data/hin_corp_unicode/228_utf.txt'...\n", 1247 | "Corpus is now 4435584 characters long\n", 1248 | "\n", 1249 | "Reading '../data/hin_corp_unicode/229_utf.txt'...\n", 1250 | "Corpus is now 4450868 characters long\n", 1251 | "\n", 1252 | "Reading '../data/hin_corp_unicode/22_utf8.txt'...\n", 1253 | "Corpus is now 4461686 characters long\n", 1254 | "\n", 1255 | "Reading '../data/hin_corp_unicode/230_utf.txt'...\n", 1256 | "Corpus is now 4476164 characters long\n", 1257 | "\n", 1258 | "Reading '../data/hin_corp_unicode/231_utf.txt'...\n", 1259 | "Corpus is now 4491097 characters long\n", 1260 | "\n", 1261 | "Reading '../data/hin_corp_unicode/232_utf.txt'...\n", 1262 | "Corpus is now 4508902 characters long\n", 1263 | "\n", 1264 | "Reading '../data/hin_corp_unicode/233_utf.txt'...\n", 1265 | "Corpus is now 4522731 characters long\n", 1266 | "\n", 1267 | "Reading '../data/hin_corp_unicode/234_utf.txt'...\n", 1268 | "Corpus is now 4534967 characters long\n", 1269 | "\n", 1270 | "Reading '../data/hin_corp_unicode/235_utf.txt'...\n", 1271 | "Corpus is now 4550081 characters long\n", 1272 | "\n", 1273 | "Reading '../data/hin_corp_unicode/236_utf.txt'...\n", 1274 | "Corpus is now 4566495 characters long\n", 1275 | "\n", 1276 | "Reading '../data/hin_corp_unicode/237_utf.txt'...\n", 1277 | 
"Corpus is now 4577398 characters long\n", 1278 | "\n", 1279 | "Reading '../data/hin_corp_unicode/238_utf.txt'...\n", 1280 | "Corpus is now 4588329 characters long\n", 1281 | "\n", 1282 | "Reading '../data/hin_corp_unicode/239_utf.txt'...\n", 1283 | "Corpus is now 4599895 characters long\n", 1284 | "\n", 1285 | "Reading '../data/hin_corp_unicode/23_utf8.txt'...\n", 1286 | "Corpus is now 4613192 characters long\n", 1287 | "\n", 1288 | "Reading '../data/hin_corp_unicode/240_utf.txt'...\n", 1289 | "Corpus is now 4624918 characters long\n", 1290 | "\n", 1291 | "Reading '../data/hin_corp_unicode/241_utf.txt'...\n", 1292 | "Corpus is now 4636262 characters long\n", 1293 | "\n", 1294 | "Reading '../data/hin_corp_unicode/242_utf.txt'...\n", 1295 | "Corpus is now 4653258 characters long\n", 1296 | "\n", 1297 | "Reading '../data/hin_corp_unicode/243_utf.txt'...\n", 1298 | "Corpus is now 4664241 characters long\n", 1299 | "\n", 1300 | "Reading '../data/hin_corp_unicode/244_utf.txt'...\n", 1301 | "Corpus is now 4676462 characters long\n", 1302 | "\n", 1303 | "Reading '../data/hin_corp_unicode/245_utf.txt'...\n", 1304 | "Corpus is now 4687854 characters long\n", 1305 | "\n", 1306 | "Reading '../data/hin_corp_unicode/246_utf.txt'...\n", 1307 | "Corpus is now 4700843 characters long\n", 1308 | "\n", 1309 | "Reading '../data/hin_corp_unicode/247_utf.txt'...\n", 1310 | "Corpus is now 4713886 characters long\n", 1311 | "\n", 1312 | "Reading '../data/hin_corp_unicode/248_utf.txt'...\n", 1313 | "Corpus is now 4727388 characters long\n", 1314 | "\n", 1315 | "Reading '../data/hin_corp_unicode/249_utf.txt'...\n", 1316 | "Corpus is now 4742026 characters long\n", 1317 | "\n", 1318 | "Reading '../data/hin_corp_unicode/24_utf8.txt'...\n", 1319 | "Corpus is now 4755433 characters long\n", 1320 | "\n", 1321 | "Reading '../data/hin_corp_unicode/250_utf.txt'...\n", 1322 | "Corpus is now 4768480 characters long\n", 1323 | "\n", 1324 | "Reading '../data/hin_corp_unicode/251_utf.txt'...\n", 1325 | 
"Corpus is now 4783503 characters long\n", 1326 | "\n", 1327 | "Reading '../data/hin_corp_unicode/252_utf.txt'...\n", 1328 | "Corpus is now 4798170 characters long\n", 1329 | "\n", 1330 | "Reading '../data/hin_corp_unicode/253_utf.txt'...\n", 1331 | "Corpus is now 4811340 characters long\n", 1332 | "\n", 1333 | "Reading '../data/hin_corp_unicode/254_utf.txt'...\n", 1334 | "Corpus is now 4824161 characters long\n", 1335 | "\n", 1336 | "Reading '../data/hin_corp_unicode/255_utf.txt'...\n", 1337 | "Corpus is now 4839034 characters long\n", 1338 | "\n", 1339 | "Reading '../data/hin_corp_unicode/256_utf.txt'...\n", 1340 | "Corpus is now 4855757 characters long\n", 1341 | "\n", 1342 | "Reading '../data/hin_corp_unicode/257_utf.txt'...\n", 1343 | "Corpus is now 4872701 characters long\n", 1344 | "\n", 1345 | "Reading '../data/hin_corp_unicode/258_utf.txt'...\n", 1346 | "Corpus is now 4895079 characters long\n", 1347 | "\n", 1348 | "Reading '../data/hin_corp_unicode/25_utf8.txt'...\n", 1349 | "Corpus is now 4906841 characters long\n", 1350 | "\n", 1351 | "Reading '../data/hin_corp_unicode/260_utf.txt'...\n", 1352 | "Corpus is now 4920048 characters long\n", 1353 | "\n", 1354 | "Reading '../data/hin_corp_unicode/261_utf.txt'...\n", 1355 | "Corpus is now 4934702 characters long\n", 1356 | "\n", 1357 | "Reading '../data/hin_corp_unicode/262_utf.txt'...\n", 1358 | "Corpus is now 4949777 characters long\n", 1359 | "\n", 1360 | "Reading '../data/hin_corp_unicode/263_utf.txt'...\n", 1361 | "Corpus is now 4964616 characters long\n", 1362 | "\n", 1363 | "Reading '../data/hin_corp_unicode/264_utf.txt'...\n", 1364 | "Corpus is now 4978252 characters long\n", 1365 | "\n", 1366 | "Reading '../data/hin_corp_unicode/265_utf.txt'...\n", 1367 | "Corpus is now 4993192 characters long\n", 1368 | "\n", 1369 | "Reading '../data/hin_corp_unicode/266_utf.txt'...\n", 1370 | "Corpus is now 5006249 characters long\n", 1371 | "\n", 1372 | "Reading '../data/hin_corp_unicode/267_utf.txt'...\n", 1373 | 
"Corpus is now 5019571 characters long\n", 1374 | "\n", 1375 | "Reading '../data/hin_corp_unicode/268_utf.txt'...\n", 1376 | "Corpus is now 5035939 characters long\n", 1377 | "\n", 1378 | "Reading '../data/hin_corp_unicode/269_utf.txt'...\n", 1379 | "Corpus is now 5049041 characters long\n", 1380 | "\n", 1381 | "Reading '../data/hin_corp_unicode/26_utf8.txt'...\n", 1382 | "Corpus is now 5061318 characters long\n", 1383 | "\n", 1384 | "Reading '../data/hin_corp_unicode/270_utf.txt'...\n", 1385 | "Corpus is now 5074241 characters long\n", 1386 | "\n", 1387 | "Reading '../data/hin_corp_unicode/271_utf.txt'...\n", 1388 | "Corpus is now 5086540 characters long\n", 1389 | "\n", 1390 | "Reading '../data/hin_corp_unicode/272_utf.txt'...\n", 1391 | "Corpus is now 5098601 characters long\n", 1392 | "\n", 1393 | "Reading '../data/hin_corp_unicode/273_utf.txt'...\n", 1394 | "Corpus is now 5112467 characters long\n", 1395 | "\n", 1396 | "Reading '../data/hin_corp_unicode/274_utf.txt'...\n", 1397 | "Corpus is now 5126436 characters long\n", 1398 | "\n", 1399 | "Reading '../data/hin_corp_unicode/275_utf.txt'...\n", 1400 | "Corpus is now 5141559 characters long\n", 1401 | "\n", 1402 | "Reading '../data/hin_corp_unicode/276_utf.txt'...\n", 1403 | "Corpus is now 5157986 characters long\n", 1404 | "\n", 1405 | "Reading '../data/hin_corp_unicode/277_utf.txt'...\n", 1406 | "Corpus is now 5170056 characters long\n", 1407 | "\n", 1408 | "Reading '../data/hin_corp_unicode/278_utf.txt'...\n", 1409 | "Corpus is now 5183801 characters long\n", 1410 | "\n", 1411 | "Reading '../data/hin_corp_unicode/279_utf.txt'...\n", 1412 | "Corpus is now 5196903 characters long\n", 1413 | "\n", 1414 | "Reading '../data/hin_corp_unicode/27_utf8.txt'...\n", 1415 | "Corpus is now 5209273 characters long\n", 1416 | "\n", 1417 | "Reading '../data/hin_corp_unicode/280_utf.txt'...\n", 1418 | "Corpus is now 5223063 characters long\n", 1419 | "\n", 1420 | "Reading '../data/hin_corp_unicode/281_utf.txt'...\n", 1421 | 
"Corpus is now 5237222 characters long\n", 1422 | "\n", 1423 | "Reading '../data/hin_corp_unicode/282_utf.txt'...\n", 1424 | "Corpus is now 5250061 characters long\n", 1425 | "\n", 1426 | "Reading '../data/hin_corp_unicode/283_utf.txt'...\n", 1427 | "Corpus is now 5265063 characters long\n", 1428 | "\n", 1429 | "Reading '../data/hin_corp_unicode/284_utf.txt'...\n", 1430 | "Corpus is now 5276532 characters long\n", 1431 | "\n", 1432 | "Reading '../data/hin_corp_unicode/285_utf.txt'...\n", 1433 | "Corpus is now 5288264 characters long\n", 1434 | "\n", 1435 | "Reading '../data/hin_corp_unicode/286_utf.txt'...\n", 1436 | "Corpus is now 5303228 characters long\n", 1437 | "\n", 1438 | "Reading '../data/hin_corp_unicode/287_utf.txt'...\n", 1439 | "Corpus is now 5315065 characters long\n", 1440 | "\n", 1441 | "Reading '../data/hin_corp_unicode/288_utf.txt'...\n", 1442 | "Corpus is now 5327824 characters long\n", 1443 | "\n", 1444 | "Reading '../data/hin_corp_unicode/289_utf.txt'...\n", 1445 | "Corpus is now 5342528 characters long\n", 1446 | "\n", 1447 | "Reading '../data/hin_corp_unicode/28_utf8.txt'...\n", 1448 | "Corpus is now 5352332 characters long\n", 1449 | "\n", 1450 | "Reading '../data/hin_corp_unicode/290_utf.txt'...\n", 1451 | "Corpus is now 5363925 characters long\n", 1452 | "\n", 1453 | "Reading '../data/hin_corp_unicode/291_utf.txt'...\n", 1454 | "Corpus is now 5379707 characters long\n", 1455 | "\n", 1456 | "Reading '../data/hin_corp_unicode/292_utf.txt'...\n", 1457 | "Corpus is now 5390393 characters long\n", 1458 | "\n", 1459 | "Reading '../data/hin_corp_unicode/293_utf.txt'...\n", 1460 | "Corpus is now 5401551 characters long\n", 1461 | "\n", 1462 | "Reading '../data/hin_corp_unicode/294_utf.txt'...\n", 1463 | "Corpus is now 5414925 characters long\n", 1464 | "\n", 1465 | "Reading '../data/hin_corp_unicode/295_utf.txt'...\n", 1466 | "Corpus is now 5428209 characters long\n", 1467 | "\n", 1468 | "Reading '../data/hin_corp_unicode/296_utf.txt'...\n", 1469 | 
"Corpus is now 5441870 characters long\n", 1470 | "\n", 1471 | "Reading '../data/hin_corp_unicode/297_utf.txt'...\n", 1472 | "Corpus is now 5452633 characters long\n", 1473 | "\n", 1474 | "Reading '../data/hin_corp_unicode/298_utf.txt'...\n", 1475 | "Corpus is now 5466687 characters long\n", 1476 | "\n", 1477 | "Reading '../data/hin_corp_unicode/299_utf.txt'...\n", 1478 | "Corpus is now 5479385 characters long\n", 1479 | "\n", 1480 | "Reading '../data/hin_corp_unicode/29_utf8.txt'...\n", 1481 | "Corpus is now 5493181 characters long\n", 1482 | "\n", 1483 | "Reading '../data/hin_corp_unicode/2_utf8.txt'...\n", 1484 | "Corpus is now 5493181 characters long\n", 1485 | "\n", 1486 | "Reading '../data/hin_corp_unicode/300_utf.txt'...\n", 1487 | "Corpus is now 5505683 characters long\n", 1488 | "\n", 1489 | "Reading '../data/hin_corp_unicode/301_utf.txt'...\n", 1490 | "Corpus is now 5518052 characters long\n", 1491 | "\n", 1492 | "Reading '../data/hin_corp_unicode/302_utf.txt'...\n", 1493 | "Corpus is now 5529675 characters long\n", 1494 | "\n", 1495 | "Reading '../data/hin_corp_unicode/303_utf.txt'...\n", 1496 | "Corpus is now 5544052 characters long\n", 1497 | "\n", 1498 | "Reading '../data/hin_corp_unicode/304_utf.txt'...\n", 1499 | "Corpus is now 5558153 characters long\n", 1500 | "\n", 1501 | "Reading '../data/hin_corp_unicode/305_utf.txt'...\n", 1502 | "Corpus is now 5570521 characters long\n", 1503 | "\n", 1504 | "Reading '../data/hin_corp_unicode/306_utf.txt'...\n", 1505 | "Corpus is now 5584865 characters long\n", 1506 | "\n", 1507 | "Reading '../data/hin_corp_unicode/307_utf.txt'...\n", 1508 | "Corpus is now 5598842 characters long\n", 1509 | "\n", 1510 | "Reading '../data/hin_corp_unicode/308_utf.txt'...\n", 1511 | "Corpus is now 5616143 characters long\n", 1512 | "\n", 1513 | "Reading '../data/hin_corp_unicode/309_utf.txt'...\n", 1514 | "Corpus is now 5630296 characters long\n", 1515 | "\n", 1516 | "Reading '../data/hin_corp_unicode/30_utf8.txt'...\n", 1517 | 
"Corpus is now 5643001 characters long\n", 1518 | "\n", 1519 | "Reading '../data/hin_corp_unicode/310_utf.txt'...\n", 1520 | "Corpus is now 5656845 characters long\n", 1521 | "\n", 1522 | "Reading '../data/hin_corp_unicode/311_utf.txt'...\n", 1523 | "Corpus is now 5670433 characters long\n", 1524 | "\n", 1525 | "Reading '../data/hin_corp_unicode/312_utf.txt'...\n", 1526 | "Corpus is now 5682342 characters long\n", 1527 | "\n", 1528 | "Reading '../data/hin_corp_unicode/313_utf.txt'...\n", 1529 | "Corpus is now 5697294 characters long\n", 1530 | "\n", 1531 | "Reading '../data/hin_corp_unicode/314_utf.txt'...\n", 1532 | "Corpus is now 5708604 characters long\n", 1533 | "\n", 1534 | "Reading '../data/hin_corp_unicode/315_utf.txt'...\n", 1535 | "Corpus is now 5720294 characters long\n", 1536 | "\n", 1537 | "Reading '../data/hin_corp_unicode/316_utf.txt'...\n", 1538 | "Corpus is now 5732291 characters long\n", 1539 | "\n", 1540 | "Reading '../data/hin_corp_unicode/317_utf.txt'...\n", 1541 | "Corpus is now 5747173 characters long\n", 1542 | "\n", 1543 | "Reading '../data/hin_corp_unicode/318_utf.txt'...\n", 1544 | "Corpus is now 5767604 characters long\n", 1545 | "\n", 1546 | "Reading '../data/hin_corp_unicode/319_utf.txt'...\n", 1547 | "Corpus is now 5782702 characters long\n", 1548 | "\n", 1549 | "Reading '../data/hin_corp_unicode/31_utf8.txt'...\n", 1550 | "Corpus is now 5804959 characters long\n", 1551 | "\n", 1552 | "Reading '../data/hin_corp_unicode/320_utf.txt'...\n", 1553 | "Corpus is now 5818954 characters long\n", 1554 | "\n", 1555 | "Reading '../data/hin_corp_unicode/321_utf.txt'...\n", 1556 | "Corpus is now 5836543 characters long\n", 1557 | "\n", 1558 | "Reading '../data/hin_corp_unicode/322_utf.txt'...\n", 1559 | "Corpus is now 5850598 characters long\n", 1560 | "\n", 1561 | "Reading '../data/hin_corp_unicode/323_utf.txt'...\n", 1562 | "Corpus is now 5865063 characters long\n", 1563 | "\n", 1564 | "Reading '../data/hin_corp_unicode/324_utf.txt'...\n", 1565 | 
"Corpus is now 5878976 characters long\n", 1566 | "\n", 1567 | "Reading '../data/hin_corp_unicode/325_utf.txt'...\n", 1568 | "Corpus is now 5892089 characters long\n", 1569 | "\n", 1570 | "Reading '../data/hin_corp_unicode/326_utf.txt'...\n", 1571 | "Corpus is now 5903870 characters long\n", 1572 | "\n", 1573 | "Reading '../data/hin_corp_unicode/327_utf.txt'...\n", 1574 | "Corpus is now 5919368 characters long\n", 1575 | "\n", 1576 | "Reading '../data/hin_corp_unicode/328_utf.txt'...\n", 1577 | "Corpus is now 5936108 characters long\n", 1578 | "\n", 1579 | "Reading '../data/hin_corp_unicode/329_utf.txt'...\n", 1580 | "Corpus is now 5952521 characters long\n", 1581 | "\n", 1582 | "Reading '../data/hin_corp_unicode/32_utf8.txt'...\n", 1583 | "Corpus is now 5962310 characters long\n", 1584 | "\n", 1585 | "Reading '../data/hin_corp_unicode/330_utf.txt'...\n", 1586 | "Corpus is now 5976174 characters long\n", 1587 | "\n", 1588 | "Reading '../data/hin_corp_unicode/331_utf.txt'...\n", 1589 | "Corpus is now 5992148 characters long\n", 1590 | "\n", 1591 | "Reading '../data/hin_corp_unicode/332_utf.txt'...\n", 1592 | "Corpus is now 6006692 characters long\n", 1593 | "\n", 1594 | "Reading '../data/hin_corp_unicode/333_utf.txt'...\n", 1595 | "Corpus is now 6021066 characters long\n", 1596 | "\n", 1597 | "Reading '../data/hin_corp_unicode/334_utf.txt'...\n", 1598 | "Corpus is now 6037471 characters long\n", 1599 | "\n", 1600 | "Reading '../data/hin_corp_unicode/335_utf.txt'...\n", 1601 | "Corpus is now 6053659 characters long\n", 1602 | "\n", 1603 | "Reading '../data/hin_corp_unicode/336_utf.txt'...\n", 1604 | "Corpus is now 6071558 characters long\n", 1605 | "\n", 1606 | "Reading '../data/hin_corp_unicode/337_utf.txt'...\n", 1607 | "Corpus is now 6088938 characters long\n", 1608 | "\n", 1609 | "Reading '../data/hin_corp_unicode/338_utf.txt'...\n", 1610 | "Corpus is now 6108104 characters long\n", 1611 | "\n", 1612 | "Reading '../data/hin_corp_unicode/339_utf.txt'...\n", 1613 | 
"Corpus is now 6118465 characters long\n", 1614 | "\n", 1615 | "Reading '../data/hin_corp_unicode/33_utf8.txt'...\n", 1616 | "Corpus is now 6129532 characters long\n", 1617 | "\n", 1618 | "Reading '../data/hin_corp_unicode/340_utf.txt'...\n", 1619 | "Corpus is now 6148981 characters long\n", 1620 | "\n", 1621 | "Reading '../data/hin_corp_unicode/341_utf.txt'...\n", 1622 | "Corpus is now 6163494 characters long\n", 1623 | "\n", 1624 | "Reading '../data/hin_corp_unicode/342_utf.txt'...\n", 1625 | "Corpus is now 6178613 characters long\n", 1626 | "\n", 1627 | "Reading '../data/hin_corp_unicode/343_utf.txt'...\n", 1628 | "Corpus is now 6194274 characters long\n", 1629 | "\n", 1630 | "Reading '../data/hin_corp_unicode/344_utf.txt'...\n", 1631 | "Corpus is now 6207504 characters long\n", 1632 | "\n", 1633 | "Reading '../data/hin_corp_unicode/345_utf.txt'...\n", 1634 | "Corpus is now 6221186 characters long\n", 1635 | "\n", 1636 | "Reading '../data/hin_corp_unicode/346_utf.txt'...\n", 1637 | "Corpus is now 6232709 characters long\n", 1638 | "\n", 1639 | "Reading '../data/hin_corp_unicode/347_utf.txt'...\n", 1640 | "Corpus is now 6244051 characters long\n", 1641 | "\n", 1642 | "Reading '../data/hin_corp_unicode/348_utf.txt'...\n", 1643 | "Corpus is now 6259045 characters long\n", 1644 | "\n", 1645 | "Reading '../data/hin_corp_unicode/349_utf.txt'...\n", 1646 | "Corpus is now 6272195 characters long\n", 1647 | "\n", 1648 | "Reading '../data/hin_corp_unicode/34_utf8.txt'...\n", 1649 | "Corpus is now 6282579 characters long\n", 1650 | "\n", 1651 | "Reading '../data/hin_corp_unicode/350_utf.txt'...\n", 1652 | "Corpus is now 6295783 characters long\n", 1653 | "\n", 1654 | "Reading '../data/hin_corp_unicode/351_utf.txt'...\n", 1655 | "Corpus is now 6309577 characters long\n", 1656 | "\n", 1657 | "Reading '../data/hin_corp_unicode/352_utf.txt'...\n", 1658 | "Corpus is now 6321745 characters long\n", 1659 | "\n", 1660 | "Reading '../data/hin_corp_unicode/353_utf.txt'...\n", 1661 | 
"Corpus is now 6335630 characters long\n", 1662 | "\n", 1663 | "Reading '../data/hin_corp_unicode/354_utf.txt'...\n", 1664 | "Corpus is now 6350010 characters long\n", 1665 | "\n", 1666 | "Reading '../data/hin_corp_unicode/355_utf.txt'...\n", 1667 | "Corpus is now 6362369 characters long\n", 1668 | "\n", 1669 | "Reading '../data/hin_corp_unicode/356_utf.txt'...\n", 1670 | "Corpus is now 6375050 characters long\n", 1671 | "\n", 1672 | "Reading '../data/hin_corp_unicode/357_utf.txt'...\n", 1673 | "Corpus is now 6386793 characters long\n", 1674 | "\n", 1675 | "Reading '../data/hin_corp_unicode/358_utf.txt'...\n", 1676 | "Corpus is now 6399710 characters long\n", 1677 | "\n", 1678 | "Reading '../data/hin_corp_unicode/35_utf8.txt'...\n", 1679 | "Corpus is now 6412740 characters long\n", 1680 | "\n", 1681 | "Reading '../data/hin_corp_unicode/360_utf.txt'...\n", 1682 | "Corpus is now 6428857 characters long\n", 1683 | "\n", 1684 | "Reading '../data/hin_corp_unicode/361_utf.txt'...\n", 1685 | "Corpus is now 6440934 characters long\n", 1686 | "\n", 1687 | "Reading '../data/hin_corp_unicode/362_utf.txt'...\n", 1688 | "Corpus is now 6450700 characters long\n", 1689 | "\n", 1690 | "Reading '../data/hin_corp_unicode/363_utf.txt'...\n", 1691 | "Corpus is now 6461605 characters long\n", 1692 | "\n", 1693 | "Reading '../data/hin_corp_unicode/364_utf.txt'...\n", 1694 | "Corpus is now 6474160 characters long\n", 1695 | "\n", 1696 | "Reading '../data/hin_corp_unicode/365_utf.txt'...\n", 1697 | "Corpus is now 6488238 characters long\n", 1698 | "\n", 1699 | "Reading '../data/hin_corp_unicode/366_utf.txt'...\n", 1700 | "Corpus is now 6501033 characters long\n", 1701 | "\n", 1702 | "Reading '../data/hin_corp_unicode/367_utf.txt'...\n", 1703 | "Corpus is now 6513610 characters long\n", 1704 | "\n", 1705 | "Reading '../data/hin_corp_unicode/368_utf.txt'...\n", 1706 | "Corpus is now 6527693 characters long\n", 1707 | "\n", 1708 | "Reading '../data/hin_corp_unicode/369_utf.txt'...\n", 1709 | 
"Corpus is now 6540292 characters long\n", 1710 | "\n", 1711 | "Reading '../data/hin_corp_unicode/36_utf8.txt'...\n", 1712 | "Corpus is now 6551583 characters long\n", 1713 | "\n", 1714 | "Reading '../data/hin_corp_unicode/370_utf.txt'...\n", 1715 | "Corpus is now 6564508 characters long\n", 1716 | "\n", 1717 | "Reading '../data/hin_corp_unicode/371_utf.txt'...\n", 1718 | "Corpus is now 6577179 characters long\n", 1719 | "\n", 1720 | "Reading '../data/hin_corp_unicode/372_utf.txt'...\n", 1721 | "Corpus is now 6591352 characters long\n", 1722 | "\n", 1723 | "Reading '../data/hin_corp_unicode/373_utf.txt'...\n", 1724 | "Corpus is now 6606773 characters long\n", 1725 | "\n", 1726 | "Reading '../data/hin_corp_unicode/374_utf.txt'...\n", 1727 | "Corpus is now 6624108 characters long\n", 1728 | "\n", 1729 | "Reading '../data/hin_corp_unicode/375_utf.txt'...\n", 1730 | "Corpus is now 6639503 characters long\n", 1731 | "\n", 1732 | "Reading '../data/hin_corp_unicode/376_utf.txt'...\n", 1733 | "Corpus is now 6651629 characters long\n", 1734 | "\n", 1735 | "Reading '../data/hin_corp_unicode/377_utf.txt'...\n", 1736 | "Corpus is now 6664979 characters long\n", 1737 | "\n", 1738 | "Reading '../data/hin_corp_unicode/378_utf.txt'...\n", 1739 | "Corpus is now 6680668 characters long\n", 1740 | "\n", 1741 | "Reading '../data/hin_corp_unicode/379_utf.txt'...\n", 1742 | "Corpus is now 6693267 characters long\n", 1743 | "\n", 1744 | "Reading '../data/hin_corp_unicode/37_utf8.txt'...\n", 1745 | "Corpus is now 6707011 characters long\n", 1746 | "\n", 1747 | "Reading '../data/hin_corp_unicode/380_utf.txt'...\n", 1748 | "Corpus is now 6719627 characters long\n", 1749 | "\n", 1750 | "Reading '../data/hin_corp_unicode/381_utf.txt'...\n", 1751 | "Corpus is now 6734723 characters long\n", 1752 | "\n", 1753 | "Reading '../data/hin_corp_unicode/382_utf.txt'...\n", 1754 | "Corpus is now 6747607 characters long\n", 1755 | "\n", 1756 | "Reading '../data/hin_corp_unicode/383_utf.txt'...\n", 1757 | 
"Corpus is now 6759683 characters long\n", 1758 | "\n", 1759 | "Reading '../data/hin_corp_unicode/384_utf.txt'...\n", 1760 | "Corpus is now 6772928 characters long\n", 1761 | "\n", 1762 | "Reading '../data/hin_corp_unicode/385_utf.txt'...\n", 1763 | "Corpus is now 6787050 characters long\n", 1764 | "\n", 1765 | "Reading '../data/hin_corp_unicode/386_utf.txt'...\n", 1766 | "Corpus is now 6802286 characters long\n", 1767 | "\n", 1768 | "Reading '../data/hin_corp_unicode/387_utf.txt'...\n", 1769 | "Corpus is now 6814185 characters long\n", 1770 | "\n", 1771 | "Reading '../data/hin_corp_unicode/388_utf.txt'...\n", 1772 | "Corpus is now 6829350 characters long\n", 1773 | "\n", 1774 | "Reading '../data/hin_corp_unicode/389_utf.txt'...\n", 1775 | "Corpus is now 6843262 characters long\n", 1776 | "\n", 1777 | "Reading '../data/hin_corp_unicode/38_utf8.txt'...\n", 1778 | "Corpus is now 6855693 characters long\n", 1779 | "\n", 1780 | "Reading '../data/hin_corp_unicode/390_utf.txt'...\n", 1781 | "Corpus is now 6868591 characters long\n", 1782 | "\n", 1783 | "Reading '../data/hin_corp_unicode/391_utf.txt'...\n", 1784 | "Corpus is now 6882134 characters long\n", 1785 | "\n", 1786 | "Reading '../data/hin_corp_unicode/392_utf.txt'...\n", 1787 | "Corpus is now 6898836 characters long\n", 1788 | "\n", 1789 | "Reading '../data/hin_corp_unicode/393_utf.txt'...\n", 1790 | "Corpus is now 6912639 characters long\n", 1791 | "\n", 1792 | "Reading '../data/hin_corp_unicode/394_utf.txt'...\n", 1793 | "Corpus is now 6924197 characters long\n", 1794 | "\n", 1795 | "Reading '../data/hin_corp_unicode/395_utf.txt'...\n", 1796 | "Corpus is now 6936502 characters long\n", 1797 | "\n", 1798 | "Reading '../data/hin_corp_unicode/396_utf.txt'...\n", 1799 | "Corpus is now 6949813 characters long\n", 1800 | "\n", 1801 | "Reading '../data/hin_corp_unicode/397_utf.txt'...\n", 1802 | "Corpus is now 6960430 characters long\n", 1803 | "\n", 1804 | "Reading '../data/hin_corp_unicode/398_utf.txt'...\n", 1805 | 
"Corpus is now 6975245 characters long\n", 1806 | "\n", 1807 | "Reading '../data/hin_corp_unicode/399_utf.txt'...\n", 1808 | "Corpus is now 6988217 characters long\n", 1809 | "\n", 1810 | "Reading '../data/hin_corp_unicode/39_utf8.txt'...\n", 1811 | "Corpus is now 6999459 characters long\n", 1812 | "\n", 1813 | "Reading '../data/hin_corp_unicode/3_utf8.txt'...\n", 1814 | "Corpus is now 6999459 characters long\n", 1815 | "\n", 1816 | "Reading '../data/hin_corp_unicode/400_utf.txt'...\n", 1817 | "Corpus is now 7013486 characters long\n", 1818 | "\n", 1819 | "Reading '../data/hin_corp_unicode/401_utf.txt'...\n", 1820 | "Corpus is now 7025134 characters long\n", 1821 | "\n", 1822 | "Reading '../data/hin_corp_unicode/402_utf.txt'...\n", 1823 | "Corpus is now 7039469 characters long\n", 1824 | "\n", 1825 | "Reading '../data/hin_corp_unicode/403_utf.txt'...\n", 1826 | "Corpus is now 7054034 characters long\n", 1827 | "\n", 1828 | "Reading '../data/hin_corp_unicode/404_utf.txt'...\n", 1829 | "Corpus is now 7068362 characters long\n", 1830 | "\n", 1831 | "Reading '../data/hin_corp_unicode/405_utf.txt'...\n", 1832 | "Corpus is now 7082339 characters long\n", 1833 | "\n", 1834 | "Reading '../data/hin_corp_unicode/406_utf.txt'...\n", 1835 | "Corpus is now 7099268 characters long\n", 1836 | "\n", 1837 | "Reading '../data/hin_corp_unicode/407_utf.txt'...\n", 1838 | "Corpus is now 7111876 characters long\n", 1839 | "\n", 1840 | "Reading '../data/hin_corp_unicode/408_utf.txt'...\n", 1841 | "Corpus is now 7124652 characters long\n", 1842 | "\n", 1843 | "Reading '../data/hin_corp_unicode/409_utf.txt'...\n", 1844 | "Corpus is now 7136719 characters long\n", 1845 | "\n", 1846 | "Reading '../data/hin_corp_unicode/40_utf8.txt'...\n", 1847 | "Corpus is now 7146638 characters long\n", 1848 | "\n", 1849 | "Reading '../data/hin_corp_unicode/410_utf.txt'...\n", 1850 | "Corpus is now 7163273 characters long\n", 1851 | "\n", 1852 | "Reading '../data/hin_corp_unicode/411_utf.txt'...\n", 1853 | 
"Corpus is now 7176509 characters long\n", 1854 | "\n", 1855 | "Reading '../data/hin_corp_unicode/412_utf.txt'...\n", 1856 | "Corpus is now 7188767 characters long\n", 1857 | "\n", 1858 | "Reading '../data/hin_corp_unicode/413_utf.txt'...\n", 1859 | "Corpus is now 7201834 characters long\n", 1860 | "\n", 1861 | "Reading '../data/hin_corp_unicode/414_utf.txt'...\n", 1862 | "Corpus is now 7217561 characters long\n", 1863 | "\n", 1864 | "Reading '../data/hin_corp_unicode/415_utf.txt'...\n", 1865 | "Corpus is now 7232791 characters long\n", 1866 | "\n", 1867 | "Reading '../data/hin_corp_unicode/416_utf.txt'...\n", 1868 | "Corpus is now 7245283 characters long\n", 1869 | "\n", 1870 | "Reading '../data/hin_corp_unicode/417_utf.txt'...\n", 1871 | "Corpus is now 7266530 characters long\n", 1872 | "\n", 1873 | "Reading '../data/hin_corp_unicode/418_utf.txt'...\n", 1874 | "Corpus is now 7283588 characters long\n", 1875 | "\n", 1876 | "Reading '../data/hin_corp_unicode/419_utf.txt'...\n", 1877 | "Corpus is now 7298947 characters long\n", 1878 | "\n", 1879 | "Reading '../data/hin_corp_unicode/420_utf.txt'...\n", 1880 | "Corpus is now 7315252 characters long\n", 1881 | "\n", 1882 | "Reading '../data/hin_corp_unicode/421_utf.txt'...\n", 1883 | "Corpus is now 7330352 characters long\n", 1884 | "\n", 1885 | "Reading '../data/hin_corp_unicode/422_utf.txt'...\n", 1886 | "Corpus is now 7343230 characters long\n", 1887 | "\n", 1888 | "Reading '../data/hin_corp_unicode/423_utf.txt'...\n", 1889 | "Corpus is now 7357580 characters long\n", 1890 | "\n", 1891 | "Reading '../data/hin_corp_unicode/424_utf.txt'...\n", 1892 | "Corpus is now 7371572 characters long\n", 1893 | "\n", 1894 | "Reading '../data/hin_corp_unicode/425_utf.txt'...\n", 1895 | "Corpus is now 7387240 characters long\n", 1896 | "\n", 1897 | "Reading '../data/hin_corp_unicode/426_utf.txt'...\n", 1898 | "Corpus is now 7400374 characters long\n", 1899 | "\n", 1900 | "Reading '../data/hin_corp_unicode/427_utf.txt'...\n", 1901 | 
"Corpus is now 7414026 characters long\n", 1902 | "\n", 1903 | "Reading '../data/hin_corp_unicode/428_utf.txt'...\n", 1904 | "Corpus is now 7428639 characters long\n", 1905 | "\n", 1906 | "Reading '../data/hin_corp_unicode/429_utf.txt'...\n", 1907 | "Corpus is now 7442850 characters long\n", 1908 | "\n", 1909 | "Reading '../data/hin_corp_unicode/430_utf.txt'...\n", 1910 | "Corpus is now 7457952 characters long\n", 1911 | "\n", 1912 | "Reading '../data/hin_corp_unicode/431_utf.txt'...\n", 1913 | "Corpus is now 7473105 characters long\n", 1914 | "\n", 1915 | "Reading '../data/hin_corp_unicode/432_utf.txt'...\n", 1916 | "Corpus is now 7488689 characters long\n", 1917 | "\n", 1918 | "Reading '../data/hin_corp_unicode/433_utf.txt'...\n", 1919 | "Corpus is now 7503672 characters long\n", 1920 | "\n", 1921 | "Reading '../data/hin_corp_unicode/434_utf.txt'...\n", 1922 | "Corpus is now 7517701 characters long\n", 1923 | "\n", 1924 | "Reading '../data/hin_corp_unicode/435_utf.txt'...\n", 1925 | "Corpus is now 7531879 characters long\n", 1926 | "\n", 1927 | "Reading '../data/hin_corp_unicode/436_utf.txt'...\n", 1928 | "Corpus is now 7547692 characters long\n", 1929 | "\n", 1930 | "Reading '../data/hin_corp_unicode/437_utf.txt'...\n", 1931 | "Corpus is now 7562106 characters long\n", 1932 | "\n", 1933 | "Reading '../data/hin_corp_unicode/438_utf.txt'...\n", 1934 | "Corpus is now 7575694 characters long\n", 1935 | "\n", 1936 | "Reading '../data/hin_corp_unicode/439_utf.txt'...\n", 1937 | "Corpus is now 7594305 characters long\n", 1938 | "\n", 1939 | "Reading '../data/hin_corp_unicode/440_utf.txt'...\n", 1940 | "Corpus is now 7610944 characters long\n", 1941 | "\n", 1942 | "Reading '../data/hin_corp_unicode/441_utf.txt'...\n", 1943 | "Corpus is now 7627367 characters long\n", 1944 | "\n", 1945 | "Reading '../data/hin_corp_unicode/442_utf.txt'...\n", 1946 | "Corpus is now 7645649 characters long\n", 1947 | "\n", 1948 | "Reading '../data/hin_corp_unicode/443_utf.txt'...\n", 1949 | 
"Corpus is now 7662992 characters long\n", 1950 | "\n", 1951 | "Reading '../data/hin_corp_unicode/444_utf.txt'...\n", 1952 | "Corpus is now 7682322 characters long\n", 1953 | "\n", 1954 | "Reading '../data/hin_corp_unicode/445_utf.txt'...\n", 1955 | "Corpus is now 7697636 characters long\n", 1956 | "\n", 1957 | "Reading '../data/hin_corp_unicode/446_utf.txt'...\n", 1958 | "Corpus is now 7711158 characters long\n", 1959 | "\n", 1960 | "Reading '../data/hin_corp_unicode/447_utf.txt'...\n", 1961 | "Corpus is now 7724052 characters long\n", 1962 | "\n", 1963 | "Reading '../data/hin_corp_unicode/448_utf.txt'...\n", 1964 | "Corpus is now 7737567 characters long\n", 1965 | "\n", 1966 | "Reading '../data/hin_corp_unicode/449_utf.txt'...\n", 1967 | "Corpus is now 7749320 characters long\n", 1968 | "\n", 1969 | "Reading '../data/hin_corp_unicode/450_utf.txt'...\n", 1970 | "Corpus is now 7763551 characters long\n", 1971 | "\n", 1972 | "Reading '../data/hin_corp_unicode/451_utf.txt'...\n", 1973 | "Corpus is now 7777173 characters long\n", 1974 | "\n", 1975 | "Reading '../data/hin_corp_unicode/452_utf.txt'...\n", 1976 | "Corpus is now 7790221 characters long\n", 1977 | "\n", 1978 | "Reading '../data/hin_corp_unicode/453_utf.txt'...\n", 1979 | "Corpus is now 7803608 characters long\n", 1980 | "\n", 1981 | "Reading '../data/hin_corp_unicode/454_utf.txt'...\n", 1982 | "Corpus is now 7815934 characters long\n", 1983 | "\n", 1984 | "Reading '../data/hin_corp_unicode/455_utf.txt'...\n", 1985 | "Corpus is now 7829445 characters long\n", 1986 | "\n", 1987 | "Reading '../data/hin_corp_unicode/456_utf.txt'...\n", 1988 | "Corpus is now 7840741 characters long\n", 1989 | "\n", 1990 | "Reading '../data/hin_corp_unicode/457_utf.txt'...\n", 1991 | "Corpus is now 7851962 characters long\n", 1992 | "\n", 1993 | "Reading '../data/hin_corp_unicode/458_utf.txt'...\n", 1994 | "Corpus is now 7869995 characters long\n", 1995 | "\n", 1996 | "Reading '../data/hin_corp_unicode/460_utf.txt'...\n", 1997 | 
"Corpus is now 7887448 characters long\n", 1998 | "\n", 1999 | "Reading '../data/hin_corp_unicode/461_utf.txt'...\n", 2000 | "Corpus is now 7903312 characters long\n", 2001 | "\n", 2002 | "Reading '../data/hin_corp_unicode/462_utf.txt'...\n", 2003 | "Corpus is now 7917839 characters long\n", 2004 | "\n", 2005 | "Reading '../data/hin_corp_unicode/463_utf.txt'...\n", 2006 | "Corpus is now 7934287 characters long\n", 2007 | "\n", 2008 | "Reading '../data/hin_corp_unicode/464_utf.txt'...\n", 2009 | "Corpus is now 7948455 characters long\n", 2010 | "\n", 2011 | "Reading '../data/hin_corp_unicode/465_utf.txt'...\n", 2012 | "Corpus is now 7961474 characters long\n", 2013 | "\n", 2014 | "Reading '../data/hin_corp_unicode/466_utf.txt'...\n", 2015 | "Corpus is now 7975712 characters long\n", 2016 | "\n", 2017 | "Reading '../data/hin_corp_unicode/467_utf.txt'...\n", 2018 | "Corpus is now 7993439 characters long\n", 2019 | "\n", 2020 | "Reading '../data/hin_corp_unicode/468_utf.txt'...\n", 2021 | "Corpus is now 8018423 characters long\n", 2022 | "\n", 2023 | "Reading '../data/hin_corp_unicode/469_utf.txt'...\n", 2024 | "Corpus is now 8031944 characters long\n", 2025 | "\n", 2026 | "Reading '../data/hin_corp_unicode/470_utf.txt'...\n", 2027 | "Corpus is now 8046851 characters long\n", 2028 | "\n", 2029 | "Reading '../data/hin_corp_unicode/471_utf.txt'...\n", 2030 | "Corpus is now 8059473 characters long\n", 2031 | "\n", 2032 | "Reading '../data/hin_corp_unicode/472_utf.txt'...\n", 2033 | "Corpus is now 8074922 characters long\n", 2034 | "\n", 2035 | "Reading '../data/hin_corp_unicode/473_utf.txt'...\n", 2036 | "Corpus is now 8092468 characters long\n", 2037 | "\n", 2038 | "Reading '../data/hin_corp_unicode/474_utf.txt'...\n", 2039 | "Corpus is now 8107816 characters long\n", 2040 | "\n", 2041 | "Reading '../data/hin_corp_unicode/475_utf.txt'...\n", 2042 | "Corpus is now 8123542 characters long\n", 2043 | "\n", 2044 | "Reading '../data/hin_corp_unicode/476_utf.txt'...\n", 2045 | 
"Corpus is now 8139389 characters long\n", 2046 | "\n", 2047 | "Reading '../data/hin_corp_unicode/477_utf.txt'...\n", 2048 | "Corpus is now 8153068 characters long\n", 2049 | "\n", 2050 | "Reading '../data/hin_corp_unicode/478_utf.txt'...\n", 2051 | "Corpus is now 8169661 characters long\n", 2052 | "\n", 2053 | "Reading '../data/hin_corp_unicode/479_utf.txt'...\n", 2054 | "Corpus is now 8183182 characters long\n", 2055 | "\n", 2056 | "Reading '../data/hin_corp_unicode/480_utf.txt'...\n", 2057 | "Corpus is now 8196859 characters long\n", 2058 | "\n", 2059 | "Reading '../data/hin_corp_unicode/481_utf.txt'...\n", 2060 | "Corpus is now 8210888 characters long\n", 2061 | "\n", 2062 | "Reading '../data/hin_corp_unicode/482_utf.txt'...\n", 2063 | "Corpus is now 8227394 characters long\n", 2064 | "\n", 2065 | "Reading '../data/hin_corp_unicode/483_utf.txt'...\n", 2066 | "Corpus is now 8241604 characters long\n", 2067 | "\n", 2068 | "Reading '../data/hin_corp_unicode/484_utf.txt'...\n", 2069 | "Corpus is now 8252810 characters long\n", 2070 | "\n", 2071 | "Reading '../data/hin_corp_unicode/485_utf.txt'...\n", 2072 | "Corpus is now 8266078 characters long\n", 2073 | "\n", 2074 | "Reading '../data/hin_corp_unicode/486_utf.txt'...\n", 2075 | "Corpus is now 8278876 characters long\n", 2076 | "\n", 2077 | "Reading '../data/hin_corp_unicode/487_utf.txt'...\n", 2078 | "Corpus is now 8293353 characters long\n", 2079 | "\n", 2080 | "Reading '../data/hin_corp_unicode/488_utf.txt'...\n", 2081 | "Corpus is now 8305262 characters long\n", 2082 | "\n", 2083 | "Reading '../data/hin_corp_unicode/489_utf.txt'...\n", 2084 | "Corpus is now 8318755 characters long\n", 2085 | "\n", 2086 | "Reading '../data/hin_corp_unicode/490_utf.txt'...\n", 2087 | "Corpus is now 8331629 characters long\n", 2088 | "\n", 2089 | "Reading '../data/hin_corp_unicode/491_utf.txt'...\n", 2090 | "Corpus is now 8345192 characters long\n", 2091 | "\n", 2092 | "Reading '../data/hin_corp_unicode/492_utf.txt'...\n", 2093 | 
"Corpus is now 8358961 characters long\n", 2094 | "\n", 2095 | "Reading '../data/hin_corp_unicode/493_utf.txt'...\n", 2096 | "Corpus is now 8372297 characters long\n", 2097 | "\n", 2098 | "Reading '../data/hin_corp_unicode/494_utf.txt'...\n", 2099 | "Corpus is now 8385846 characters long\n", 2100 | "\n", 2101 | "Reading '../data/hin_corp_unicode/495_utf.txt'...\n", 2102 | "Corpus is now 8402341 characters long\n", 2103 | "\n", 2104 | "Reading '../data/hin_corp_unicode/496_utf.txt'...\n", 2105 | "Corpus is now 8415750 characters long\n", 2106 | "\n", 2107 | "Reading '../data/hin_corp_unicode/497_utf.txt'...\n", 2108 | "Corpus is now 8429965 characters long\n", 2109 | "\n", 2110 | "Reading '../data/hin_corp_unicode/498_utf.txt'...\n", 2111 | "Corpus is now 8441988 characters long\n", 2112 | "\n", 2113 | "Reading '../data/hin_corp_unicode/499_utf.txt'...\n", 2114 | "Corpus is now 8456165 characters long\n", 2115 | "\n", 2116 | "Reading '../data/hin_corp_unicode/4_utf8.txt'...\n", 2117 | "Corpus is now 8456165 characters long\n", 2118 | "\n", 2119 | "Reading '../data/hin_corp_unicode/500_utf.txt'...\n", 2120 | "Corpus is now 8469805 characters long\n", 2121 | "\n", 2122 | "Reading '../data/hin_corp_unicode/501_utf.txt'...\n", 2123 | "Corpus is now 8484940 characters long\n", 2124 | "\n", 2125 | "Reading '../data/hin_corp_unicode/502_utf.txt'...\n", 2126 | "Corpus is now 8497451 characters long\n", 2127 | "\n", 2128 | "Reading '../data/hin_corp_unicode/503_utf.txt'...\n", 2129 | "Corpus is now 8511323 characters long\n", 2130 | "\n", 2131 | "Reading '../data/hin_corp_unicode/504_utf.txt'...\n", 2132 | "Corpus is now 8526302 characters long\n", 2133 | "\n", 2134 | "Reading '../data/hin_corp_unicode/505_utf.txt'...\n", 2135 | "Corpus is now 8540775 characters long\n", 2136 | "\n", 2137 | "Reading '../data/hin_corp_unicode/506_utf.txt'...\n", 2138 | "Corpus is now 8556631 characters long\n", 2139 | "\n", 2140 | "Reading '../data/hin_corp_unicode/507_utf.txt'...\n", 2141 | 
"Corpus is now 8571597 characters long\n", 2142 | "\n", 2143 | "Reading '../data/hin_corp_unicode/508_utf.txt'...\n", 2144 | "Corpus is now 8585683 characters long\n", 2145 | "\n", 2146 | "Reading '../data/hin_corp_unicode/509_utf.txt'...\n", 2147 | "Corpus is now 8598998 characters long\n", 2148 | "\n", 2149 | "Reading '../data/hin_corp_unicode/510_utf.txt'...\n", 2150 | "Corpus is now 8612528 characters long\n", 2151 | "\n", 2152 | "Reading '../data/hin_corp_unicode/511_utf.txt'...\n", 2153 | "Corpus is now 8626003 characters long\n", 2154 | "\n", 2155 | "Reading '../data/hin_corp_unicode/512_utf.txt'...\n", 2156 | "Corpus is now 8639426 characters long\n", 2157 | "\n", 2158 | "Reading '../data/hin_corp_unicode/513_utf.txt'...\n", 2159 | "Corpus is now 8653871 characters long\n", 2160 | "\n", 2161 | "Reading '../data/hin_corp_unicode/514_utf.txt'...\n", 2162 | "Corpus is now 8667520 characters long\n", 2163 | "\n", 2164 | "Reading '../data/hin_corp_unicode/515_utf.txt'...\n", 2165 | "Corpus is now 8681443 characters long\n", 2166 | "\n", 2167 | "Reading '../data/hin_corp_unicode/516_utf.txt'...\n", 2168 | "Corpus is now 8698510 characters long\n", 2169 | "\n", 2170 | "Reading '../data/hin_corp_unicode/517_utf.txt'...\n", 2171 | "Corpus is now 8711713 characters long\n", 2172 | "\n", 2173 | "Reading '../data/hin_corp_unicode/518_utf.txt'...\n", 2174 | "Corpus is now 8727806 characters long\n", 2175 | "\n", 2176 | "Reading '../data/hin_corp_unicode/519_utf.txt'...\n", 2177 | "Corpus is now 8741742 characters long\n", 2178 | "\n", 2179 | "Reading '../data/hin_corp_unicode/520_utf.txt'...\n", 2180 | "Corpus is now 8755491 characters long\n", 2181 | "\n", 2182 | "Reading '../data/hin_corp_unicode/521_utf.txt'...\n", 2183 | "Corpus is now 8767130 characters long\n", 2184 | "\n", 2185 | "Reading '../data/hin_corp_unicode/522_utf.txt'...\n", 2186 | "Corpus is now 8780326 characters long\n", 2187 | "\n", 2188 | "Reading '../data/hin_corp_unicode/523_utf.txt'...\n", 2189 | 
"Corpus is now 8793522 characters long\n", 2190 | "\n", 2191 | "Reading '../data/hin_corp_unicode/524_utf.txt'...\n", 2192 | "Corpus is now 8806431 characters long\n", 2193 | "\n", 2194 | "Reading '../data/hin_corp_unicode/525_utf.txt'...\n", 2195 | "Corpus is now 8820364 characters long\n", 2196 | "\n", 2197 | "Reading '../data/hin_corp_unicode/526_utf.txt'...\n", 2198 | "Corpus is now 8834507 characters long\n", 2199 | "\n", 2200 | "Reading '../data/hin_corp_unicode/527_utf.txt'...\n", 2201 | "Corpus is now 8846627 characters long\n", 2202 | "\n", 2203 | "Reading '../data/hin_corp_unicode/528_utf.txt'...\n", 2204 | "Corpus is now 8858764 characters long\n", 2205 | "\n", 2206 | "Reading '../data/hin_corp_unicode/529_utf.txt'...\n", 2207 | "Corpus is now 8873314 characters long\n", 2208 | "\n", 2209 | "Reading '../data/hin_corp_unicode/530_utf.txt'...\n", 2210 | "Corpus is now 8890897 characters long\n", 2211 | "\n", 2212 | "Reading '../data/hin_corp_unicode/531_utf.txt'...\n", 2213 | "Corpus is now 8905544 characters long\n", 2214 | "\n", 2215 | "Reading '../data/hin_corp_unicode/532_utf.txt'...\n", 2216 | "Corpus is now 8916272 characters long\n", 2217 | "\n", 2218 | "Reading '../data/hin_corp_unicode/533_utf.txt'...\n", 2219 | "Corpus is now 8929231 characters long\n", 2220 | "\n", 2221 | "Reading '../data/hin_corp_unicode/534_utf.txt'...\n", 2222 | "Corpus is now 8943688 characters long\n", 2223 | "\n", 2224 | "Reading '../data/hin_corp_unicode/535_utf.txt'...\n", 2225 | "Corpus is now 8955648 characters long\n", 2226 | "\n", 2227 | "Reading '../data/hin_corp_unicode/536_utf.txt'...\n", 2228 | "Corpus is now 8971843 characters long\n", 2229 | "\n", 2230 | "Reading '../data/hin_corp_unicode/537_utf.txt'...\n", 2231 | "Corpus is now 8983601 characters long\n", 2232 | "\n", 2233 | "Reading '../data/hin_corp_unicode/538_utf.txt'...\n", 2234 | "Corpus is now 8997258 characters long\n", 2235 | "\n", 2236 | "Reading '../data/hin_corp_unicode/539_utf.txt'...\n", 2237 | 
"Corpus is now 9006362 characters long\n", 2238 | "\n", 2239 | "Reading '../data/hin_corp_unicode/540_utf.txt'...\n", 2240 | "Corpus is now 9017498 characters long\n", 2241 | "\n", 2242 | "Reading '../data/hin_corp_unicode/541_utf.txt'...\n", 2243 | "Corpus is now 9026958 characters long\n", 2244 | "\n", 2245 | "Reading '../data/hin_corp_unicode/542_utf.txt'...\n", 2246 | "Corpus is now 9037852 characters long\n", 2247 | "\n", 2248 | "Reading '../data/hin_corp_unicode/543_utf.txt'...\n", 2249 | "Corpus is now 9049878 characters long\n", 2250 | "\n", 2251 | "Reading '../data/hin_corp_unicode/544_utf.txt'...\n", 2252 | "Corpus is now 9064526 characters long\n", 2253 | "\n", 2254 | "Reading '../data/hin_corp_unicode/545_utf.txt'...\n", 2255 | "Corpus is now 9076532 characters long\n", 2256 | "\n", 2257 | "Reading '../data/hin_corp_unicode/546_utf.txt'...\n", 2258 | "Corpus is now 9086723 characters long\n", 2259 | "\n", 2260 | "Reading '../data/hin_corp_unicode/547_utf.txt'...\n", 2261 | "Corpus is now 9098189 characters long\n", 2262 | "\n", 2263 | "Reading '../data/hin_corp_unicode/548_utf.txt'...\n", 2264 | "Corpus is now 9107374 characters long\n", 2265 | "\n", 2266 | "Reading '../data/hin_corp_unicode/549_utf.txt'...\n", 2267 | "Corpus is now 9118824 characters long\n", 2268 | "\n", 2269 | "Reading '../data/hin_corp_unicode/550_utf.txt'...\n", 2270 | "Corpus is now 9131548 characters long\n", 2271 | "\n", 2272 | "Reading '../data/hin_corp_unicode/551_utf.txt'...\n", 2273 | "Corpus is now 9143077 characters long\n", 2274 | "\n", 2275 | "Reading '../data/hin_corp_unicode/552_utf.txt'...\n", 2276 | "Corpus is now 9154784 characters long\n", 2277 | "\n", 2278 | "Reading '../data/hin_corp_unicode/553_utf.txt'...\n", 2279 | "Corpus is now 9169398 characters long\n", 2280 | "\n", 2281 | "Reading '../data/hin_corp_unicode/554_utf.txt'...\n", 2282 | "Corpus is now 9180708 characters long\n", 2283 | "\n", 2284 | "Reading '../data/hin_corp_unicode/555_utf.txt'...\n", 2285 | 
"Corpus is now 9195767 characters long\n", 2286 | "\n", 2287 | "Reading '../data/hin_corp_unicode/556_utf.txt'...\n", 2288 | "Corpus is now 9206663 characters long\n", 2289 | "\n", 2290 | "Reading '../data/hin_corp_unicode/557_utf.txt'...\n", 2291 | "Corpus is now 9219280 characters long\n", 2292 | "\n", 2293 | "Reading '../data/hin_corp_unicode/558_utf.txt'...\n", 2294 | "Corpus is now 9229725 characters long\n", 2295 | "\n", 2296 | "Reading '../data/hin_corp_unicode/560_utf.txt'...\n", 2297 | "Corpus is now 9241615 characters long\n", 2298 | "\n", 2299 | "Reading '../data/hin_corp_unicode/561_utf.txt'...\n", 2300 | "Corpus is now 9254810 characters long\n", 2301 | "\n", 2302 | "Reading '../data/hin_corp_unicode/562_utf.txt'...\n", 2303 | "Corpus is now 9265063 characters long\n", 2304 | "\n", 2305 | "Reading '../data/hin_corp_unicode/563_utf.txt'...\n", 2306 | "Corpus is now 9274364 characters long\n", 2307 | "\n", 2308 | "Reading '../data/hin_corp_unicode/564_utf.txt'...\n", 2309 | "Corpus is now 9285645 characters long\n", 2310 | "\n", 2311 | "Reading '../data/hin_corp_unicode/565_utf.txt'...\n", 2312 | "Corpus is now 9295374 characters long\n", 2313 | "\n", 2314 | "Reading '../data/hin_corp_unicode/566_utf.txt'...\n", 2315 | "Corpus is now 9306851 characters long\n", 2316 | "\n", 2317 | "Reading '../data/hin_corp_unicode/567_utf.txt'...\n", 2318 | "Corpus is now 9316678 characters long\n", 2319 | "\n", 2320 | "Reading '../data/hin_corp_unicode/568_utf.txt'...\n", 2321 | "Corpus is now 9327243 characters long\n", 2322 | "\n", 2323 | "Reading '../data/hin_corp_unicode/569_utf.txt'...\n", 2324 | "Corpus is now 9338184 characters long\n", 2325 | "\n", 2326 | "Reading '../data/hin_corp_unicode/570_utf.txt'...\n", 2327 | "Corpus is now 9351544 characters long\n", 2328 | "\n", 2329 | "Reading '../data/hin_corp_unicode/571_utf.txt'...\n", 2330 | "Corpus is now 9365401 characters long\n", 2331 | "\n", 2332 | "Reading '../data/hin_corp_unicode/572_utf.txt'...\n", 2333 | 
"Corpus is now 9374520 characters long\n", 2334 | "\n", 2335 | "Reading '../data/hin_corp_unicode/573_utf.txt'...\n", 2336 | "Corpus is now 9386805 characters long\n", 2337 | "\n", 2338 | "Reading '../data/hin_corp_unicode/574_utf.txt'...\n", 2339 | "Corpus is now 9395241 characters long\n", 2340 | "\n", 2341 | "Reading '../data/hin_corp_unicode/575_utf.txt'...\n", 2342 | "Corpus is now 9400093 characters long\n", 2343 | "\n", 2344 | "Reading '../data/hin_corp_unicode/576_utf.txt'...\n", 2345 | "Corpus is now 9407934 characters long\n", 2346 | "\n", 2347 | "Reading '../data/hin_corp_unicode/577_utf.txt'...\n", 2348 | "Corpus is now 9418427 characters long\n", 2349 | "\n", 2350 | "Reading '../data/hin_corp_unicode/578_utf.txt'...\n", 2351 | "Corpus is now 9430231 characters long\n", 2352 | "\n", 2353 | "Reading '../data/hin_corp_unicode/579_utf.txt'...\n", 2354 | "Corpus is now 9441172 characters long\n", 2355 | "\n", 2356 | "Reading '../data/hin_corp_unicode/580_utf.txt'...\n", 2357 | "Corpus is now 9453389 characters long\n", 2358 | "\n", 2359 | "Reading '../data/hin_corp_unicode/581_utf.txt'...\n", 2360 | "Corpus is now 9464262 characters long\n", 2361 | "\n", 2362 | "Reading '../data/hin_corp_unicode/582_utf.txt'...\n", 2363 | "Corpus is now 9473832 characters long\n", 2364 | "\n", 2365 | "Reading '../data/hin_corp_unicode/583_utf.txt'...\n", 2366 | "Corpus is now 9485490 characters long\n", 2367 | "\n", 2368 | "Reading '../data/hin_corp_unicode/584_utf.txt'...\n", 2369 | "Corpus is now 9498187 characters long\n", 2370 | "\n", 2371 | "Reading '../data/hin_corp_unicode/585_utf.txt'...\n", 2372 | "Corpus is now 9509190 characters long\n", 2373 | "\n", 2374 | "Reading '../data/hin_corp_unicode/586_utf.txt'...\n", 2375 | "Corpus is now 9519688 characters long\n", 2376 | "\n", 2377 | "Reading '../data/hin_corp_unicode/587_utf.txt'...\n", 2378 | "Corpus is now 9530460 characters long\n", 2379 | "\n", 2380 | "Reading '../data/hin_corp_unicode/588_utf.txt'...\n", 2381 | 
"Corpus is now 9543764 characters long\n", 2382 | "\n", 2383 | "Reading '../data/hin_corp_unicode/589_utf.txt'...\n", 2384 | "Corpus is now 9554299 characters long\n", 2385 | "\n", 2386 | "Reading '../data/hin_corp_unicode/590_utf.txt'...\n", 2387 | "Corpus is now 9565089 characters long\n", 2388 | "\n", 2389 | "Reading '../data/hin_corp_unicode/591_utf.txt'...\n", 2390 | "Corpus is now 9579483 characters long\n", 2391 | "\n", 2392 | "Reading '../data/hin_corp_unicode/592_utf.txt'...\n", 2393 | "Corpus is now 9592761 characters long\n", 2394 | "\n", 2395 | "Reading '../data/hin_corp_unicode/593_utf.txt'...\n", 2396 | "Corpus is now 9601279 characters long\n", 2397 | "\n", 2398 | "Reading '../data/hin_corp_unicode/594_utf.txt'...\n", 2399 | "Corpus is now 9611270 characters long\n", 2400 | "\n", 2401 | "Reading '../data/hin_corp_unicode/595_utf.txt'...\n", 2402 | "Corpus is now 9622449 characters long\n", 2403 | "\n", 2404 | "Reading '../data/hin_corp_unicode/596_utf.txt'...\n", 2405 | "Corpus is now 9636803 characters long\n", 2406 | "\n", 2407 | "Reading '../data/hin_corp_unicode/597_utf.txt'...\n", 2408 | "Corpus is now 9649834 characters long\n", 2409 | "\n", 2410 | "Reading '../data/hin_corp_unicode/598_utf.txt'...\n", 2411 | "Corpus is now 9661818 characters long\n", 2412 | "\n", 2413 | "Reading '../data/hin_corp_unicode/599_utf.txt'...\n", 2414 | "Corpus is now 9669882 characters long\n", 2415 | "\n", 2416 | "Reading '../data/hin_corp_unicode/5_utf8.txt'...\n", 2417 | "Corpus is now 9669882 characters long\n", 2418 | "\n", 2419 | "Reading '../data/hin_corp_unicode/600_utf.txt'...\n", 2420 | "Corpus is now 9678989 characters long\n", 2421 | "\n", 2422 | "Reading '../data/hin_corp_unicode/601_utf.txt'...\n", 2423 | "Corpus is now 9693980 characters long\n", 2424 | "\n", 2425 | "Reading '../data/hin_corp_unicode/602_utf.txt'...\n", 2426 | "Corpus is now 9703015 characters long\n", 2427 | "\n", 2428 | "Reading '../data/hin_corp_unicode/603_utf.txt'...\n", 2429 | 
"Corpus is now 9714949 characters long\n", 2430 | "\n", 2431 | "Reading '../data/hin_corp_unicode/604_utf.txt'...\n", 2432 | "Corpus is now 9724043 characters long\n", 2433 | "\n", 2434 | "Reading '../data/hin_corp_unicode/605_utf.txt'...\n", 2435 | "Corpus is now 9733824 characters long\n", 2436 | "\n", 2437 | "Reading '../data/hin_corp_unicode/606_utf.txt'...\n", 2438 | "Corpus is now 9741902 characters long\n", 2439 | "\n", 2440 | "Reading '../data/hin_corp_unicode/607_utf.txt'...\n", 2441 | "Corpus is now 9754762 characters long\n", 2442 | "\n", 2443 | "Reading '../data/hin_corp_unicode/608_utf.txt'...\n", 2444 | "Corpus is now 9764558 characters long\n", 2445 | "\n", 2446 | "Reading '../data/hin_corp_unicode/609_utf.txt'...\n", 2447 | "Corpus is now 9776955 characters long\n", 2448 | "\n", 2449 | "Reading '../data/hin_corp_unicode/610_utf.txt'...\n", 2450 | "Corpus is now 9790713 characters long\n", 2451 | "\n", 2452 | "Reading '../data/hin_corp_unicode/611_utf.txt'...\n", 2453 | "Corpus is now 9803202 characters long\n", 2454 | "\n", 2455 | "Reading '../data/hin_corp_unicode/612_utf.txt'...\n", 2456 | "Corpus is now 9812263 characters long\n", 2457 | "\n", 2458 | "Reading '../data/hin_corp_unicode/613_utf.txt'...\n", 2459 | "Corpus is now 9822935 characters long\n", 2460 | "\n", 2461 | "Reading '../data/hin_corp_unicode/614_utf.txt'...\n", 2462 | "Corpus is now 9834136 characters long\n", 2463 | "\n", 2464 | "Reading '../data/hin_corp_unicode/615_utf.txt'...\n", 2465 | "Corpus is now 9846079 characters long\n", 2466 | "\n", 2467 | "Reading '../data/hin_corp_unicode/616_utf.txt'...\n", 2468 | "Corpus is now 9858447 characters long\n", 2469 | "\n", 2470 | "Reading '../data/hin_corp_unicode/617_utf.txt'...\n", 2471 | "Corpus is now 9869061 characters long\n", 2472 | "\n", 2473 | "Reading '../data/hin_corp_unicode/618_utf.txt'...\n", 2474 | "Corpus is now 9884364 characters long\n", 2475 | "\n", 2476 | "Reading '../data/hin_corp_unicode/619_utf.txt'...\n", 2477 | 
"Corpus is now 9896932 characters long\n", 2478 | "\n", 2479 | "Reading '../data/hin_corp_unicode/620_utf.txt'...\n", 2480 | "Corpus is now 9907821 characters long\n", 2481 | "\n", 2482 | "Reading '../data/hin_corp_unicode/621_utf.txt'...\n", 2483 | "Corpus is now 9918185 characters long\n", 2484 | "\n", 2485 | "Reading '../data/hin_corp_unicode/622_utf.txt'...\n", 2486 | "Corpus is now 9933619 characters long\n", 2487 | "\n", 2488 | "Reading '../data/hin_corp_unicode/623_utf.txt'...\n", 2489 | "Corpus is now 9945753 characters long\n", 2490 | "\n", 2491 | "Reading '../data/hin_corp_unicode/624_utf.txt'...\n", 2492 | "Corpus is now 9953667 characters long\n", 2493 | "\n", 2494 | "Reading '../data/hin_corp_unicode/625_utf.txt'...\n", 2495 | "Corpus is now 9963913 characters long\n", 2496 | "\n", 2497 | "Reading '../data/hin_corp_unicode/626_utf.txt'...\n", 2498 | "Corpus is now 9975284 characters long\n", 2499 | "\n", 2500 | "Reading '../data/hin_corp_unicode/627_utf.txt'...\n", 2501 | "Corpus is now 9986741 characters long\n", 2502 | "\n", 2503 | "Reading '../data/hin_corp_unicode/628_utf.txt'...\n", 2504 | "Corpus is now 9998917 characters long\n", 2505 | "\n", 2506 | "Reading '../data/hin_corp_unicode/629_utf.txt'...\n", 2507 | "Corpus is now 10011638 characters long\n", 2508 | "\n", 2509 | "Reading '../data/hin_corp_unicode/630_utf.txt'...\n", 2510 | "Corpus is now 10024681 characters long\n", 2511 | "\n", 2512 | "Reading '../data/hin_corp_unicode/631_utf.txt'...\n", 2513 | "Corpus is now 10036216 characters long\n", 2514 | "\n", 2515 | "Reading '../data/hin_corp_unicode/632_utf.txt'...\n", 2516 | "Corpus is now 10046779 characters long\n", 2517 | "\n", 2518 | "Reading '../data/hin_corp_unicode/633_utf.txt'...\n", 2519 | "Corpus is now 10061616 characters long\n", 2520 | "\n", 2521 | "Reading '../data/hin_corp_unicode/634_utf.txt'...\n", 2522 | "Corpus is now 10071378 characters long\n", 2523 | "\n", 2524 | "Reading '../data/hin_corp_unicode/635_utf.txt'...\n", 
2525 | "Corpus is now 10080607 characters long\n", 2526 | "\n", 2527 | "Reading '../data/hin_corp_unicode/636_utf.txt'...\n", 2528 | "Corpus is now 10092553 characters long\n", 2529 | "\n", 2530 | "Reading '../data/hin_corp_unicode/637_utf.txt'...\n", 2531 | "Corpus is now 10101782 characters long\n", 2532 | "\n", 2533 | "Reading '../data/hin_corp_unicode/638_utf.txt'...\n", 2534 | "Corpus is now 10113728 characters long\n", 2535 | "\n", 2536 | "Reading '../data/hin_corp_unicode/639_utf.txt'...\n", 2537 | "Corpus is now 10122437 characters long\n", 2538 | "\n", 2539 | "Reading '../data/hin_corp_unicode/640_utf.txt'...\n", 2540 | "Corpus is now 10134201 characters long\n", 2541 | "\n", 2542 | "Reading '../data/hin_corp_unicode/641_utf.txt'...\n", 2543 | "Corpus is now 10142806 characters long\n", 2544 | "\n", 2545 | "Reading '../data/hin_corp_unicode/642_utf.txt'...\n", 2546 | "Corpus is now 10153390 characters long\n", 2547 | "\n", 2548 | "Reading '../data/hin_corp_unicode/643_utf.txt'...\n", 2549 | "Corpus is now 10165368 characters long\n", 2550 | "\n", 2551 | "Reading '../data/hin_corp_unicode/644_utf.txt'...\n", 2552 | "Corpus is now 10176006 characters long\n", 2553 | "\n", 2554 | "Reading '../data/hin_corp_unicode/645_utf.txt'...\n", 2555 | "Corpus is now 10185999 characters long\n", 2556 | "\n", 2557 | "Reading '../data/hin_corp_unicode/646_utf.txt'...\n", 2558 | "Corpus is now 10196005 characters long\n", 2559 | "\n", 2560 | "Reading '../data/hin_corp_unicode/647_utf.txt'...\n", 2561 | "Corpus is now 10208225 characters long\n", 2562 | "\n", 2563 | "Reading '../data/hin_corp_unicode/648_utf.txt'...\n", 2564 | "Corpus is now 10220254 characters long\n", 2565 | "\n", 2566 | "Reading '../data/hin_corp_unicode/649_utf.txt'...\n", 2567 | "Corpus is now 10231185 characters long\n", 2568 | "\n", 2569 | "Reading '../data/hin_corp_unicode/650_utf.txt'...\n", 2570 | "Corpus is now 10243288 characters long\n", 2571 | "\n", 2572 | "Reading 
'../data/hin_corp_unicode/651_utf.txt'...\n", 2573 | "Corpus is now 10255136 characters long\n", 2574 | "\n", 2575 | "Reading '../data/hin_corp_unicode/652_utf.txt'...\n", 2576 | "Corpus is now 10267009 characters long\n", 2577 | "\n", 2578 | "Reading '../data/hin_corp_unicode/653_utf.txt'...\n", 2579 | "Corpus is now 10277074 characters long\n", 2580 | "\n", 2581 | "Reading '../data/hin_corp_unicode/654_utf.txt'...\n", 2582 | "Corpus is now 10287102 characters long\n", 2583 | "\n", 2584 | "Reading '../data/hin_corp_unicode/655_utf.txt'...\n", 2585 | "Corpus is now 10298868 characters long\n", 2586 | "\n", 2587 | "Reading '../data/hin_corp_unicode/656_utf.txt'...\n", 2588 | "Corpus is now 10310945 characters long\n", 2589 | "\n", 2590 | "Reading '../data/hin_corp_unicode/657_utf.txt'...\n", 2591 | "Corpus is now 10322230 characters long\n", 2592 | "\n", 2593 | "Reading '../data/hin_corp_unicode/658_utf.txt'...\n", 2594 | "Corpus is now 10331965 characters long\n", 2595 | "\n", 2596 | "Reading '../data/hin_corp_unicode/660_utf.txt'...\n", 2597 | "Corpus is now 10340741 characters long\n", 2598 | "\n", 2599 | "Reading '../data/hin_corp_unicode/661_utf.txt'...\n", 2600 | "Corpus is now 10350321 characters long\n", 2601 | "\n", 2602 | "Reading '../data/hin_corp_unicode/662_utf.txt'...\n", 2603 | "Corpus is now 10359306 characters long\n", 2604 | "\n", 2605 | "Reading '../data/hin_corp_unicode/663_utf.txt'...\n", 2606 | "Corpus is now 10370501 characters long\n", 2607 | "\n", 2608 | "Reading '../data/hin_corp_unicode/664_utf.txt'...\n", 2609 | "Corpus is now 10380708 characters long\n", 2610 | "\n", 2611 | "Reading '../data/hin_corp_unicode/665_utf.txt'...\n", 2612 | "Corpus is now 10392480 characters long\n", 2613 | "\n", 2614 | "Reading '../data/hin_corp_unicode/666_utf.txt'...\n", 2615 | "Corpus is now 10404267 characters long\n", 2616 | "\n", 2617 | "Reading '../data/hin_corp_unicode/667_utf.txt'...\n", 2618 | "Corpus is now 10412494 characters long\n", 2619 | "\n", 
2620 | "Reading '../data/hin_corp_unicode/668_utf.txt'...\n", 2621 | "Corpus is now 10421900 characters long\n", 2622 | "\n", 2623 | "Reading '../data/hin_corp_unicode/669_utf.txt'...\n", 2624 | "Corpus is now 10434585 characters long\n", 2625 | "\n", 2626 | "Reading '../data/hin_corp_unicode/670_utf.txt'...\n", 2627 | "Corpus is now 10446139 characters long\n", 2628 | "\n", 2629 | "Reading '../data/hin_corp_unicode/671_utf.txt'...\n", 2630 | "Corpus is now 10456215 characters long\n", 2631 | "\n", 2632 | "Reading '../data/hin_corp_unicode/672_utf.txt'...\n", 2633 | "Corpus is now 10467899 characters long\n", 2634 | "\n", 2635 | "Reading '../data/hin_corp_unicode/673_utf.txt'...\n", 2636 | "Corpus is now 10479130 characters long\n", 2637 | "\n", 2638 | "Reading '../data/hin_corp_unicode/674_utf.txt'...\n", 2639 | "Corpus is now 10490598 characters long\n", 2640 | "\n", 2641 | "Reading '../data/hin_corp_unicode/675_utf.txt'...\n", 2642 | "Corpus is now 10500049 characters long\n", 2643 | "\n", 2644 | "Reading '../data/hin_corp_unicode/676_utf.txt'...\n", 2645 | "Corpus is now 10512812 characters long\n", 2646 | "\n", 2647 | "Reading '../data/hin_corp_unicode/677_utf.txt'...\n", 2648 | "Corpus is now 10523419 characters long\n", 2649 | "\n", 2650 | "Reading '../data/hin_corp_unicode/678_utf.txt'...\n", 2651 | "Corpus is now 10534874 characters long\n", 2652 | "\n", 2653 | "Reading '../data/hin_corp_unicode/679_utf.txt'...\n", 2654 | "Corpus is now 10547559 characters long\n", 2655 | "\n", 2656 | "Reading '../data/hin_corp_unicode/680_utf.txt'...\n", 2657 | "Corpus is now 10555297 characters long\n", 2658 | "\n", 2659 | "Reading '../data/hin_corp_unicode/681_utf.txt'...\n", 2660 | "Corpus is now 10563801 characters long\n", 2661 | "\n", 2662 | "Reading '../data/hin_corp_unicode/682_utf.txt'...\n", 2663 | "Corpus is now 10572900 characters long\n", 2664 | "\n", 2665 | "Reading '../data/hin_corp_unicode/683_utf.txt'...\n", 2666 | "Corpus is now 10587579 characters 
long\n", 2667 | "\n", 2668 | "Reading '../data/hin_corp_unicode/684_utf.txt'...\n", 2669 | "Corpus is now 10602284 characters long\n", 2670 | "\n", 2671 | "Reading '../data/hin_corp_unicode/685_utf.txt'...\n", 2672 | "Corpus is now 10613147 characters long\n", 2673 | "\n", 2674 | "Reading '../data/hin_corp_unicode/686_utf.txt'...\n", 2675 | "Corpus is now 10624427 characters long\n", 2676 | "\n", 2677 | "Reading '../data/hin_corp_unicode/687_utf.txt'...\n", 2678 | "Corpus is now 10636115 characters long\n", 2679 | "\n", 2680 | "Reading '../data/hin_corp_unicode/688_utf.txt'...\n", 2681 | "Corpus is now 10644562 characters long\n", 2682 | "\n", 2683 | "Reading '../data/hin_corp_unicode/689_utf.txt'...\n", 2684 | "Corpus is now 10654681 characters long\n", 2685 | "\n", 2686 | "Reading '../data/hin_corp_unicode/690_utf.txt'...\n", 2687 | "Corpus is now 10664241 characters long\n", 2688 | "\n", 2689 | "Reading '../data/hin_corp_unicode/691_utf.txt'...\n", 2690 | "Corpus is now 10673595 characters long\n", 2691 | "\n", 2692 | "Reading '../data/hin_corp_unicode/692_utf.txt'...\n", 2693 | "Corpus is now 10683870 characters long\n", 2694 | "\n", 2695 | "Reading '../data/hin_corp_unicode/693_utf.txt'...\n", 2696 | "Corpus is now 10693841 characters long\n", 2697 | "\n", 2698 | "Reading '../data/hin_corp_unicode/694_utf.txt'...\n", 2699 | "Corpus is now 10703998 characters long\n", 2700 | "\n", 2701 | "Reading '../data/hin_corp_unicode/695_utf.txt'...\n", 2702 | "Corpus is now 10714222 characters long\n", 2703 | "\n", 2704 | "Reading '../data/hin_corp_unicode/696_utf.txt'...\n", 2705 | "Corpus is now 10725473 characters long\n", 2706 | "\n", 2707 | "Reading '../data/hin_corp_unicode/697_utf.txt'...\n", 2708 | "Corpus is now 10735806 characters long\n", 2709 | "\n", 2710 | "Reading '../data/hin_corp_unicode/698_utf.txt'...\n", 2711 | "Corpus is now 10745474 characters long\n", 2712 | "\n", 2713 | "Reading '../data/hin_corp_unicode/699_utf.txt'...\n", 2714 | "Corpus is now 
10755453 characters long\n", 2715 | "\n", 2716 | "Reading '../data/hin_corp_unicode/6_utf8.txt'...\n", 2717 | "Corpus is now 10755453 characters long\n", 2718 | "\n", 2719 | "Reading '../data/hin_corp_unicode/700_utf.txt'...\n", 2720 | "Corpus is now 10767292 characters long\n", 2721 | "\n", 2722 | "Reading '../data/hin_corp_unicode/701_utf.txt'...\n", 2723 | "Corpus is now 10776689 characters long\n", 2724 | "\n", 2725 | "Reading '../data/hin_corp_unicode/702_utf.txt'...\n", 2726 | "Corpus is now 10784173 characters long\n", 2727 | "\n", 2728 | "Reading '../data/hin_corp_unicode/703_utf.txt'...\n", 2729 | "Corpus is now 10794967 characters long\n", 2730 | "\n", 2731 | "Reading '../data/hin_corp_unicode/704_utf.txt'...\n", 2732 | "Corpus is now 10805778 characters long\n", 2733 | "\n", 2734 | "Reading '../data/hin_corp_unicode/705_utf.txt'...\n", 2735 | "Corpus is now 10815040 characters long\n", 2736 | "\n", 2737 | "Reading '../data/hin_corp_unicode/706_utf.txt'...\n", 2738 | "Corpus is now 10833931 characters long\n", 2739 | "\n", 2740 | "Reading '../data/hin_corp_unicode/707_utf.txt'...\n", 2741 | "Corpus is now 10840264 characters long\n", 2742 | "\n", 2743 | "Reading '../data/hin_corp_unicode/708_utf.txt'...\n", 2744 | "Corpus is now 10852818 characters long\n", 2745 | "\n", 2746 | "Reading '../data/hin_corp_unicode/709_utf.txt'...\n", 2747 | "Corpus is now 10864271 characters long\n", 2748 | "\n", 2749 | "Reading '../data/hin_corp_unicode/70_utf.txt'...\n", 2750 | "Corpus is now 10871041 characters long\n", 2751 | "\n", 2752 | "Reading '../data/hin_corp_unicode/710_utf.txt'...\n", 2753 | "Corpus is now 10878338 characters long\n", 2754 | "\n", 2755 | "Reading '../data/hin_corp_unicode/711_utf.txt'...\n", 2756 | "Corpus is now 10888316 characters long\n", 2757 | "\n", 2758 | "Reading '../data/hin_corp_unicode/712_utf.txt'...\n", 2759 | "Corpus is now 10896764 characters long\n", 2760 | "\n", 2761 | "Reading '../data/hin_corp_unicode/713_utf.txt'...\n", 2762 | 
"Corpus is now 10910132 characters long\n", 2763 | "\n", 2764 | "Reading '../data/hin_corp_unicode/714_utf.txt'...\n", 2765 | "Corpus is now 10915221 characters long\n", 2766 | "\n", 2767 | "Reading '../data/hin_corp_unicode/715_utf.txt'...\n", 2768 | "Corpus is now 10926061 characters long\n", 2769 | "\n", 2770 | "Reading '../data/hin_corp_unicode/716_utf.txt'...\n", 2771 | "Corpus is now 10940265 characters long\n", 2772 | "\n", 2773 | "Reading '../data/hin_corp_unicode/717_utf.txt'...\n", 2774 | "Corpus is now 10950718 characters long\n", 2775 | "\n", 2776 | "Reading '../data/hin_corp_unicode/718_utf.txt'...\n", 2777 | "Corpus is now 10963165 characters long\n", 2778 | "\n", 2779 | "Reading '../data/hin_corp_unicode/719_utf.txt'...\n", 2780 | "Corpus is now 10977428 characters long\n", 2781 | "\n", 2782 | "Reading '../data/hin_corp_unicode/71_utf.txt'...\n", 2783 | "Corpus is now 10988275 characters long\n", 2784 | "\n", 2785 | "Reading '../data/hin_corp_unicode/720_utf.txt'...\n", 2786 | "Corpus is now 10995346 characters long\n", 2787 | "\n", 2788 | "Reading '../data/hin_corp_unicode/721_utf.txt'...\n", 2789 | "Corpus is now 11005123 characters long\n", 2790 | "\n", 2791 | "Reading '../data/hin_corp_unicode/722_utf.txt'...\n", 2792 | "Corpus is now 11015983 characters long\n", 2793 | "\n", 2794 | "Reading '../data/hin_corp_unicode/723_utf.txt'...\n", 2795 | "Corpus is now 11026302 characters long\n", 2796 | "\n", 2797 | "Reading '../data/hin_corp_unicode/724_utf.txt'...\n", 2798 | "Corpus is now 11036455 characters long\n", 2799 | "\n", 2800 | "Reading '../data/hin_corp_unicode/725_utf.txt'...\n", 2801 | "Corpus is now 11046190 characters long\n", 2802 | "\n", 2803 | "Reading '../data/hin_corp_unicode/726_utf.txt'...\n", 2804 | "Corpus is now 11055884 characters long\n", 2805 | "\n", 2806 | "Reading '../data/hin_corp_unicode/727_utf.txt'...\n", 2807 | "Corpus is now 11066846 characters long\n", 2808 | "\n", 2809 | "Reading 
'../data/hin_corp_unicode/728_utf.txt'...\n", 2810 | "Corpus is now 11077324 characters long\n", 2811 | "\n", 2812 | "Reading '../data/hin_corp_unicode/729_utf.txt'...\n", 2813 | "Corpus is now 11086270 characters long\n", 2814 | "\n", 2815 | "Reading '../data/hin_corp_unicode/72_utf.txt'...\n", 2816 | "Corpus is now 11098023 characters long\n", 2817 | "\n", 2818 | "Reading '../data/hin_corp_unicode/730_utf.txt'...\n", 2819 | "Corpus is now 11108387 characters long\n", 2820 | "\n", 2821 | "Reading '../data/hin_corp_unicode/731_utf.txt'...\n", 2822 | "Corpus is now 11118982 characters long\n", 2823 | "\n", 2824 | "Reading '../data/hin_corp_unicode/732_utf.txt'...\n", 2825 | "Corpus is now 11130378 characters long\n", 2826 | "\n", 2827 | "Reading '../data/hin_corp_unicode/733_utf.txt'...\n", 2828 | "Corpus is now 11139469 characters long\n", 2829 | "\n", 2830 | "Reading '../data/hin_corp_unicode/734_utf.txt'...\n", 2831 | "Corpus is now 11148044 characters long\n", 2832 | "\n", 2833 | "Reading '../data/hin_corp_unicode/735_utf.txt'...\n", 2834 | "Corpus is now 11158764 characters long\n", 2835 | "\n", 2836 | "Reading '../data/hin_corp_unicode/736_utf.txt'...\n", 2837 | "Corpus is now 11166300 characters long\n", 2838 | "\n", 2839 | "Reading '../data/hin_corp_unicode/737_utf.txt'...\n", 2840 | "Corpus is now 11174875 characters long\n", 2841 | "\n", 2842 | "Reading '../data/hin_corp_unicode/738_utf.txt'...\n", 2843 | "Corpus is now 11185595 characters long\n", 2844 | "\n", 2845 | "Reading '../data/hin_corp_unicode/739_utf.txt'...\n", 2846 | "Corpus is now 11193131 characters long\n", 2847 | "\n", 2848 | "Reading '../data/hin_corp_unicode/73_utf.txt'...\n", 2849 | "Corpus is now 11203714 characters long\n", 2850 | "\n", 2851 | "Reading '../data/hin_corp_unicode/740_utf.txt'...\n", 2852 | "Corpus is now 11218688 characters long\n", 2853 | "\n", 2854 | "Reading '../data/hin_corp_unicode/741_utf.txt'...\n", 2855 | "Corpus is now 11234001 characters long\n", 2856 | "\n", 
2857 | "Reading '../data/hin_corp_unicode/742_utf.txt'...\n", 2858 | "Corpus is now 11245665 characters long\n", 2859 | "\n", 2860 | "Reading '../data/hin_corp_unicode/743_utf.txt'...\n", 2861 | "Corpus is now 11258925 characters long\n", 2862 | "\n", 2863 | "Reading '../data/hin_corp_unicode/744_utf.txt'...\n", 2864 | "Corpus is now 11272436 characters long\n", 2865 | "\n", 2866 | "Reading '../data/hin_corp_unicode/745_utf.txt'...\n", 2867 | "Corpus is now 11285929 characters long\n", 2868 | "\n", 2869 | "Reading '../data/hin_corp_unicode/746_utf.txt'...\n", 2870 | "Corpus is now 11299524 characters long\n", 2871 | "\n", 2872 | "Reading '../data/hin_corp_unicode/747_utf.txt'...\n", 2873 | "Corpus is now 11312383 characters long\n", 2874 | "\n", 2875 | "Reading '../data/hin_corp_unicode/748_utf.txt'...\n", 2876 | "Corpus is now 11327924 characters long\n", 2877 | "\n", 2878 | "Reading '../data/hin_corp_unicode/749_utf.txt'...\n", 2879 | "Corpus is now 11342558 characters long\n", 2880 | "\n", 2881 | "Reading '../data/hin_corp_unicode/74_utf.txt'...\n", 2882 | "Corpus is now 11354145 characters long\n", 2883 | "\n", 2884 | "Reading '../data/hin_corp_unicode/750_utf.txt'...\n", 2885 | "Corpus is now 11365462 characters long\n", 2886 | "\n", 2887 | "Reading '../data/hin_corp_unicode/751_utf.txt'...\n", 2888 | "Corpus is now 11382783 characters long\n", 2889 | "\n", 2890 | "Reading '../data/hin_corp_unicode/752_utf.txt'...\n", 2891 | "Corpus is now 11396066 characters long\n", 2892 | "\n", 2893 | "Reading '../data/hin_corp_unicode/753_utf.txt'...\n", 2894 | "Corpus is now 11409996 characters long\n", 2895 | "\n", 2896 | "Reading '../data/hin_corp_unicode/754_utf.txt'...\n", 2897 | "Corpus is now 11422730 characters long\n", 2898 | "\n", 2899 | "Reading '../data/hin_corp_unicode/755_utf.txt'...\n", 2900 | "Corpus is now 11435068 characters long\n", 2901 | "\n", 2902 | "Reading '../data/hin_corp_unicode/756_utf.txt'...\n", 2903 | "Corpus is now 11450901 characters 
long\n", 2904 | "\n", 2905 | "Reading '../data/hin_corp_unicode/757_utf.txt'...\n", 2906 | "Corpus is now 11472399 characters long\n", 2907 | "\n", 2908 | "Reading '../data/hin_corp_unicode/758_utf.txt'...\n", 2909 | "Corpus is now 11484637 characters long\n", 2910 | "\n", 2911 | "Reading '../data/hin_corp_unicode/75_utf.txt'...\n", 2912 | "Corpus is now 11494197 characters long\n", 2913 | "\n", 2914 | "Reading '../data/hin_corp_unicode/760_utf.txt'...\n", 2915 | "Corpus is now 11507662 characters long\n", 2916 | "\n", 2917 | "Reading '../data/hin_corp_unicode/761_utf.txt'...\n", 2918 | "Corpus is now 11522556 characters long\n", 2919 | "\n", 2920 | "Reading '../data/hin_corp_unicode/762_utf.txt'...\n", 2921 | "Corpus is now 11535130 characters long\n", 2922 | "\n", 2923 | "Reading '../data/hin_corp_unicode/763_utf.txt'...\n", 2924 | "Corpus is now 11550886 characters long\n", 2925 | "\n", 2926 | "Reading '../data/hin_corp_unicode/764_utf.txt'...\n", 2927 | "Corpus is now 11567135 characters long\n", 2928 | "\n", 2929 | "Reading '../data/hin_corp_unicode/765_utf.txt'...\n", 2930 | "Corpus is now 11583079 characters long\n", 2931 | "\n", 2932 | "Reading '../data/hin_corp_unicode/766_utf.txt'...\n", 2933 | "Corpus is now 11596844 characters long\n", 2934 | "\n", 2935 | "Reading '../data/hin_corp_unicode/767_utf.txt'...\n", 2936 | "Corpus is now 11611648 characters long\n", 2937 | "\n", 2938 | "Reading '../data/hin_corp_unicode/768_utf.txt'...\n", 2939 | "Corpus is now 11628930 characters long\n", 2940 | "\n", 2941 | "Reading '../data/hin_corp_unicode/769_utf.txt'...\n", 2942 | "Corpus is now 11634436 characters long\n", 2943 | "\n", 2944 | "Reading '../data/hin_corp_unicode/76_utf.txt'...\n", 2945 | "Corpus is now 11645323 characters long\n", 2946 | "\n", 2947 | "Reading '../data/hin_corp_unicode/770_utf.txt'...\n", 2948 | "Corpus is now 11658798 characters long\n", 2949 | "\n", 2950 | "Reading '../data/hin_corp_unicode/771_utf.txt'...\n", 2951 | "Corpus is now 
11674530 characters long\n", 2952 | "\n", 2953 | "Reading '../data/hin_corp_unicode/772_utf.txt'...\n", 2954 | "Corpus is now 11687265 characters long\n", 2955 | "\n", 2956 | "Reading '../data/hin_corp_unicode/773_utf.txt'...\n", 2957 | "Corpus is now 11700966 characters long\n", 2958 | "\n", 2959 | "Reading '../data/hin_corp_unicode/774_utf.txt'...\n", 2960 | "Corpus is now 11714036 characters long\n", 2961 | "\n", 2962 | "Reading '../data/hin_corp_unicode/775_utf.txt'...\n", 2963 | "Corpus is now 11727392 characters long\n", 2964 | "\n", 2965 | "Reading '../data/hin_corp_unicode/776_utf.txt'...\n", 2966 | "Corpus is now 11740380 characters long\n", 2967 | "\n", 2968 | "Reading '../data/hin_corp_unicode/777_utf.txt'...\n", 2969 | "Corpus is now 11757955 characters long\n", 2970 | "\n", 2971 | "Reading '../data/hin_corp_unicode/778_utf.txt'...\n", 2972 | "Corpus is now 11769969 characters long\n", 2973 | "\n", 2974 | "Reading '../data/hin_corp_unicode/779_utf.txt'...\n", 2975 | "Corpus is now 11775475 characters long\n", 2976 | "\n", 2977 | "Reading '../data/hin_corp_unicode/77_utf.txt'...\n", 2978 | "Corpus is now 11786361 characters long\n", 2979 | "\n", 2980 | "Reading '../data/hin_corp_unicode/780_utf.txt'...\n", 2981 | "Corpus is now 11802161 characters long\n", 2982 | "\n", 2983 | "Reading '../data/hin_corp_unicode/781_utf.txt'...\n", 2984 | "Corpus is now 11817726 characters long\n", 2985 | "\n", 2986 | "Reading '../data/hin_corp_unicode/782_utf.txt'...\n", 2987 | "Corpus is now 11837079 characters long\n", 2988 | "\n", 2989 | "Reading '../data/hin_corp_unicode/783_utf.txt'...\n", 2990 | "Corpus is now 11848865 characters long\n", 2991 | "\n", 2992 | "Reading '../data/hin_corp_unicode/784_utf.txt'...\n", 2993 | "Corpus is now 11865137 characters long\n", 2994 | "\n", 2995 | "Reading '../data/hin_corp_unicode/785_utf.txt'...\n", 2996 | "Corpus is now 11878090 characters long\n", 2997 | "\n", 2998 | "Reading '../data/hin_corp_unicode/786_utf.txt'...\n", 2999 | 
"Corpus is now 11891018 characters long\n", 3000 | "\n", 3001 | "Reading '../data/hin_corp_unicode/787_utf.txt'...\n", 3002 | "Corpus is now 11917347 characters long\n", 3003 | "\n", 3004 | "Reading '../data/hin_corp_unicode/788_utf.txt'...\n", 3005 | "Corpus is now 11928834 characters long\n", 3006 | "\n", 3007 | "Reading '../data/hin_corp_unicode/789_utf.txt'...\n", 3008 | "Corpus is now 11942916 characters long\n", 3009 | "\n", 3010 | "Reading '../data/hin_corp_unicode/78_utf.txt'...\n", 3011 | "Corpus is now 11952171 characters long\n", 3012 | "\n", 3013 | "Reading '../data/hin_corp_unicode/790_utf.txt'...\n", 3014 | "Corpus is now 11967691 characters long\n", 3015 | "\n", 3016 | "Reading '../data/hin_corp_unicode/791_utf.txt'...\n", 3017 | "Corpus is now 11980744 characters long\n", 3018 | "\n", 3019 | "Reading '../data/hin_corp_unicode/792_utf.txt'...\n", 3020 | "Corpus is now 11993556 characters long\n", 3021 | "\n", 3022 | "Reading '../data/hin_corp_unicode/793_utf.txt'...\n", 3023 | "Corpus is now 12008262 characters long\n", 3024 | "\n", 3025 | "Reading '../data/hin_corp_unicode/794_utf.txt'...\n", 3026 | "Corpus is now 12021853 characters long\n", 3027 | "\n", 3028 | "Reading '../data/hin_corp_unicode/795_utf.txt'...\n", 3029 | "Corpus is now 12035320 characters long\n", 3030 | "\n", 3031 | "Reading '../data/hin_corp_unicode/796_utf.txt'...\n", 3032 | "Corpus is now 12051035 characters long\n", 3033 | "\n", 3034 | "Reading '../data/hin_corp_unicode/797_utf.txt'...\n", 3035 | "Corpus is now 12065971 characters long\n", 3036 | "\n", 3037 | "Reading '../data/hin_corp_unicode/798_utf.txt'...\n", 3038 | "Corpus is now 12078871 characters long\n", 3039 | "\n", 3040 | "Reading '../data/hin_corp_unicode/799_utf.txt'...\n", 3041 | "Corpus is now 12093663 characters long\n", 3042 | "\n", 3043 | "Reading '../data/hin_corp_unicode/79_utf.txt'...\n", 3044 | "Corpus is now 12105687 characters long\n", 3045 | "\n", 3046 | "Reading 
'../data/hin_corp_unicode/7_utf8.txt'...\n", 3047 | "Corpus is now 12105687 characters long\n", 3048 | "\n", 3049 | "Reading '../data/hin_corp_unicode/800_utf.txt'...\n", 3050 | "Corpus is now 12119381 characters long\n", 3051 | "\n", 3052 | "Reading '../data/hin_corp_unicode/801_utf.txt'...\n", 3053 | "Corpus is now 12134767 characters long\n", 3054 | "\n", 3055 | "Reading '../data/hin_corp_unicode/802_utf.txt'...\n", 3056 | "Corpus is now 12151520 characters long\n", 3057 | "\n", 3058 | "Reading '../data/hin_corp_unicode/803_utf.txt'...\n", 3059 | "Corpus is now 12166176 characters long\n", 3060 | "\n", 3061 | "Reading '../data/hin_corp_unicode/804_utf.txt'...\n", 3062 | "Corpus is now 12179776 characters long\n", 3063 | "\n", 3064 | "Reading '../data/hin_corp_unicode/805_utf.txt'...\n", 3065 | "Corpus is now 12196729 characters long\n", 3066 | "\n", 3067 | "Reading '../data/hin_corp_unicode/806_utf.txt'...\n", 3068 | "Corpus is now 12214104 characters long\n", 3069 | "\n", 3070 | "Reading '../data/hin_corp_unicode/807_utf.txt'...\n", 3071 | "Corpus is now 12227821 characters long\n", 3072 | "\n", 3073 | "Reading '../data/hin_corp_unicode/808_utf.txt'...\n", 3074 | "Corpus is now 12242348 characters long\n", 3075 | "\n", 3076 | "Reading '../data/hin_corp_unicode/809_utf.txt'...\n", 3077 | "Corpus is now 12255545 characters long\n", 3078 | "\n", 3079 | "Reading '../data/hin_corp_unicode/80_utf.txt'...\n", 3080 | "Corpus is now 12268916 characters long\n", 3081 | "\n", 3082 | "Reading '../data/hin_corp_unicode/810_utf.txt'...\n", 3083 | "Corpus is now 12283026 characters long\n", 3084 | "\n", 3085 | "Reading '../data/hin_corp_unicode/811_utf.txt'...\n", 3086 | "Corpus is now 12296520 characters long\n", 3087 | "\n", 3088 | "Reading '../data/hin_corp_unicode/812_utf.txt'...\n", 3089 | "Corpus is now 12307587 characters long\n", 3090 | "\n", 3091 | "Reading '../data/hin_corp_unicode/813_utf.txt'...\n", 3092 | "Corpus is now 12320805 characters long\n", 3093 | "\n", 
3094 | "Reading '../data/hin_corp_unicode/814_utf.txt'...\n", 3095 | "Corpus is now 12336656 characters long\n", 3096 | "\n", 3097 | "Reading '../data/hin_corp_unicode/815_utf.txt'...\n", 3098 | "Corpus is now 12345154 characters long\n", 3099 | "\n", 3100 | "Reading '../data/hin_corp_unicode/816_utf.txt'...\n", 3101 | "Corpus is now 12353201 characters long\n", 3102 | "\n", 3103 | "Reading '../data/hin_corp_unicode/817_utf.txt'...\n", 3104 | "Corpus is now 12372928 characters long\n", 3105 | "\n", 3106 | "Reading '../data/hin_corp_unicode/818_utf.txt'...\n", 3107 | "Corpus is now 12379167 characters long\n", 3108 | "\n", 3109 | "Reading '../data/hin_corp_unicode/819_utf.txt'...\n", 3110 | "Corpus is now 12388006 characters long\n", 3111 | "\n", 3112 | "Reading '../data/hin_corp_unicode/81_utf.txt'...\n", 3113 | "Corpus is now 12397316 characters long\n", 3114 | "\n", 3115 | "Reading '../data/hin_corp_unicode/820_utf.txt'...\n", 3116 | "Corpus is now 12414191 characters long\n", 3117 | "\n", 3118 | "Reading '../data/hin_corp_unicode/821_utf.txt'...\n", 3119 | "Corpus is now 12421280 characters long\n", 3120 | "\n", 3121 | "Reading '../data/hin_corp_unicode/822_utf.txt'...\n", 3122 | "Corpus is now 12430393 characters long\n", 3123 | "\n", 3124 | "Reading '../data/hin_corp_unicode/823_utf.txt'...\n", 3125 | "Corpus is now 12439506 characters long\n", 3126 | "\n", 3127 | "Reading '../data/hin_corp_unicode/824_utf.txt'...\n", 3128 | "Corpus is now 12451944 characters long\n", 3129 | "\n", 3130 | "Reading '../data/hin_corp_unicode/825_utf.txt'...\n", 3131 | "Corpus is now 12462393 characters long\n", 3132 | "\n", 3133 | "Reading '../data/hin_corp_unicode/826_utf.txt'...\n", 3134 | "Corpus is now 12474503 characters long\n", 3135 | "\n", 3136 | "Reading '../data/hin_corp_unicode/827_utf.txt'...\n", 3137 | "Corpus is now 12490700 characters long\n", 3138 | "\n", 3139 | "Reading '../data/hin_corp_unicode/828_utf.txt'...\n", 3140 | "Corpus is now 12506058 characters 
long\n", 3141 | "\n", 3142 | "Reading '../data/hin_corp_unicode/829_utf.txt'...\n", 3143 | "Corpus is now 12518199 characters long\n", 3144 | "\n", 3145 | "Reading '../data/hin_corp_unicode/82_utf.txt'...\n", 3146 | "Corpus is now 12530014 characters long\n", 3147 | "\n", 3148 | "Reading '../data/hin_corp_unicode/830_utf.txt'...\n", 3149 | "Corpus is now 12542118 characters long\n", 3150 | "\n", 3151 | "Reading '../data/hin_corp_unicode/831_utf.txt'...\n", 3152 | "Corpus is now 12554677 characters long\n", 3153 | "\n", 3154 | "Reading '../data/hin_corp_unicode/832_utf.txt'...\n", 3155 | "Corpus is now 12564654 characters long\n", 3156 | "\n", 3157 | "Reading '../data/hin_corp_unicode/833_utf.txt'...\n", 3158 | "Corpus is now 12577332 characters long\n", 3159 | "\n", 3160 | "Reading '../data/hin_corp_unicode/834_utf.txt'...\n", 3161 | "Corpus is now 12587062 characters long\n", 3162 | "\n", 3163 | "Reading '../data/hin_corp_unicode/835_utf.txt'...\n", 3164 | "Corpus is now 12598874 characters long\n", 3165 | "\n", 3166 | "Reading '../data/hin_corp_unicode/836_utf.txt'...\n", 3167 | "Corpus is now 12608597 characters long\n", 3168 | "\n", 3169 | "Reading '../data/hin_corp_unicode/837_utf.txt'...\n", 3170 | "Corpus is now 12622538 characters long\n", 3171 | "\n", 3172 | "Reading '../data/hin_corp_unicode/838_utf.txt'...\n", 3173 | "Corpus is now 12642945 characters long\n", 3174 | "\n", 3175 | "Reading '../data/hin_corp_unicode/839_utf.txt'...\n", 3176 | "Corpus is now 12655994 characters long\n", 3177 | "\n", 3178 | "Reading '../data/hin_corp_unicode/83_utf.txt'...\n", 3179 | "Corpus is now 12669233 characters long\n", 3180 | "\n", 3181 | "Reading '../data/hin_corp_unicode/840_utf.txt'...\n", 3182 | "Corpus is now 12673941 characters long\n", 3183 | "\n", 3184 | "Reading '../data/hin_corp_unicode/841_utf.txt'...\n", 3185 | "Corpus is now 12685790 characters long\n", 3186 | "\n", 3187 | "Reading '../data/hin_corp_unicode/842_utf.txt'...\n", 3188 | "Corpus is now 
12699759 characters long\n", 3189 | "\n", 3190 | "Reading '../data/hin_corp_unicode/843_utf.txt'...\n", 3191 | "Corpus is now 12711200 characters long\n", 3192 | "\n", 3193 | "Reading '../data/hin_corp_unicode/844_utf.txt'...\n", 3194 | "Corpus is now 12724008 characters long\n", 3195 | "\n", 3196 | "Reading '../data/hin_corp_unicode/845_utf.txt'...\n", 3197 | "Corpus is now 12735228 characters long\n", 3198 | "\n", 3199 | "Reading '../data/hin_corp_unicode/846_utf.txt'...\n", 3200 | "Corpus is now 12749624 characters long\n", 3201 | "\n", 3202 | "Reading '../data/hin_corp_unicode/847_utf.txt'...\n", 3203 | "Corpus is now 12761403 characters long\n", 3204 | "\n", 3205 | "Reading '../data/hin_corp_unicode/848_utf.txt'...\n", 3206 | "Corpus is now 12773618 characters long\n", 3207 | "\n", 3208 | "Reading '../data/hin_corp_unicode/849_utf.txt'...\n", 3209 | "Corpus is now 12784239 characters long\n", 3210 | "\n", 3211 | "Reading '../data/hin_corp_unicode/84_utf.txt'...\n", 3212 | "Corpus is now 12795135 characters long\n", 3213 | "\n", 3214 | "Reading '../data/hin_corp_unicode/850_utf.txt'...\n", 3215 | "Corpus is now 12807055 characters long\n", 3216 | "\n", 3217 | "Reading '../data/hin_corp_unicode/851_utf.txt'...\n", 3218 | "Corpus is now 12819870 characters long\n", 3219 | "\n", 3220 | "Reading '../data/hin_corp_unicode/852_utf.txt'...\n", 3221 | "Corpus is now 12829972 characters long\n", 3222 | "\n", 3223 | "Reading '../data/hin_corp_unicode/853_utf.txt'...\n", 3224 | "Corpus is now 12841817 characters long\n", 3225 | "\n", 3226 | "Reading '../data/hin_corp_unicode/854_utf.txt'...\n", 3227 | "Corpus is now 12856310 characters long\n", 3228 | "\n", 3229 | "Reading '../data/hin_corp_unicode/855_utf.txt'...\n", 3230 | "Corpus is now 12870238 characters long\n", 3231 | "\n", 3232 | "Reading '../data/hin_corp_unicode/856_utf.txt'...\n", 3233 | "Corpus is now 12882681 characters long\n", 3234 | "\n", 3235 | "Reading '../data/hin_corp_unicode/857_utf.txt'...\n", 3236 | 
"Corpus is now 12895620 characters long\n", 3237 | "\n", 3238 | "Reading '../data/hin_corp_unicode/858_utf.txt'...\n", 3239 | "Corpus is now 12909206 characters long\n", 3240 | "\n", 3241 | "Reading '../data/hin_corp_unicode/859_utf.txt'...\n", 3242 | "Corpus is now 12924180 characters long\n", 3243 | "\n", 3244 | "Reading '../data/hin_corp_unicode/85_utf.txt'...\n", 3245 | "Corpus is now 12937795 characters long\n", 3246 | "\n", 3247 | "Reading '../data/hin_corp_unicode/860_utf.txt'...\n", 3248 | "Corpus is now 12947251 characters long\n", 3249 | "\n", 3250 | "Reading '../data/hin_corp_unicode/861_utf.txt'...\n", 3251 | "Corpus is now 12960607 characters long\n", 3252 | "\n", 3253 | "Reading '../data/hin_corp_unicode/862_utf.txt'...\n", 3254 | "Corpus is now 12972621 characters long\n", 3255 | "\n", 3256 | "Reading '../data/hin_corp_unicode/863_utf.txt'...\n", 3257 | "Corpus is now 12990196 characters long\n", 3258 | "\n", 3259 | "Reading '../data/hin_corp_unicode/864_utf.txt'...\n", 3260 | "Corpus is now 13005997 characters long\n", 3261 | "\n", 3262 | "Reading '../data/hin_corp_unicode/865_utf.txt'...\n", 3263 | "Corpus is now 13018951 characters long\n", 3264 | "\n", 3265 | "Reading '../data/hin_corp_unicode/866_utf.txt'...\n", 3266 | "Corpus is now 13034516 characters long\n", 3267 | "\n", 3268 | "Reading '../data/hin_corp_unicode/867_utf.txt'...\n", 3269 | "Corpus is now 13047444 characters long\n", 3270 | "\n", 3271 | "Reading '../data/hin_corp_unicode/868_utf.txt'...\n", 3272 | "Corpus is now 13059611 characters long\n", 3273 | "\n", 3274 | "Reading '../data/hin_corp_unicode/869_utf.txt'...\n", 3275 | "Corpus is now 13073305 characters long\n", 3276 | "\n", 3277 | "Reading '../data/hin_corp_unicode/86_utf.txt'...\n", 3278 | "Corpus is now 13086265 characters long\n", 3279 | "\n", 3280 | "Reading '../data/hin_corp_unicode/870_utf.txt'...\n", 3281 | "Corpus is now 13097752 characters long\n", 3282 | "\n", 3283 | "Reading 
'../data/hin_corp_unicode/871_utf.txt'...\n", 3284 | "Corpus is now 13113272 characters long\n", 3285 | "\n", 3286 | "Reading '../data/hin_corp_unicode/872_utf.txt'...\n", 3287 | "Corpus is now 13126325 characters long\n", 3288 | "\n", 3289 | "Reading '../data/hin_corp_unicode/873_utf.txt'...\n", 3290 | "Corpus is now 13139137 characters long\n", 3291 | "\n", 3292 | "Reading '../data/hin_corp_unicode/874_utf.txt'...\n", 3293 | "Corpus is now 13153844 characters long\n", 3294 | "\n", 3295 | "Reading '../data/hin_corp_unicode/875_utf.txt'...\n", 3296 | "Corpus is now 13169559 characters long\n", 3297 | "\n", 3298 | "Reading '../data/hin_corp_unicode/876_utf.txt'...\n", 3299 | "Corpus is now 13183026 characters long\n", 3300 | "\n", 3301 | "Reading '../data/hin_corp_unicode/877_utf.txt'...\n", 3302 | "Corpus is now 13195926 characters long\n", 3303 | "\n", 3304 | "Reading '../data/hin_corp_unicode/878_utf.txt'...\n", 3305 | "Corpus is now 13210718 characters long\n", 3306 | "\n", 3307 | "Reading '../data/hin_corp_unicode/879_utf.txt'...\n", 3308 | "Corpus is now 13224412 characters long\n", 3309 | "\n", 3310 | "Reading '../data/hin_corp_unicode/87_utf.txt'...\n", 3311 | "Corpus is now 13236690 characters long\n", 3312 | "\n", 3313 | "Reading '../data/hin_corp_unicode/880_utf.txt'...\n", 3314 | "Corpus is now 13252076 characters long\n", 3315 | "\n", 3316 | "Reading '../data/hin_corp_unicode/881_utf.txt'...\n", 3317 | "Corpus is now 13266732 characters long\n", 3318 | "\n", 3319 | "Reading '../data/hin_corp_unicode/882_utf.txt'...\n", 3320 | "Corpus is now 13275230 characters long\n", 3321 | "\n", 3322 | "Reading '../data/hin_corp_unicode/883_utf.txt'...\n", 3323 | "Corpus is now 13283480 characters long\n", 3324 | "\n", 3325 | "Reading '../data/hin_corp_unicode/884_utf.txt'...\n", 3326 | "Corpus is now 13291370 characters long\n", 3327 | "\n", 3328 | "Reading '../data/hin_corp_unicode/885_utf.txt'...\n", 3329 | "Corpus is now 13297609 characters long\n", 3330 | "\n", 
3331 | "Reading '../data/hin_corp_unicode/886_utf.txt'...\n", 3332 | "Corpus is now 13306448 characters long\n", 3333 | "\n", 3334 | "Reading '../data/hin_corp_unicode/887_utf.txt'...\n", 3335 | "Corpus is now 13314676 characters long\n", 3336 | "\n", 3337 | "Reading '../data/hin_corp_unicode/888_utf.txt'...\n", 3338 | "Corpus is now 13321765 characters long\n", 3339 | "\n", 3340 | "Reading '../data/hin_corp_unicode/889_utf.txt'...\n", 3341 | "Corpus is now 13330878 characters long\n", 3342 | "\n", 3343 | "Reading '../data/hin_corp_unicode/88_utf.txt'...\n", 3344 | "Corpus is now 13343327 characters long\n", 3345 | "\n", 3346 | "Reading '../data/hin_corp_unicode/890_utf.txt'...\n", 3347 | "Corpus is now 13352394 characters long\n", 3348 | "\n", 3349 | "Reading '../data/hin_corp_unicode/891_utf.txt'...\n", 3350 | "Corpus is now 13364504 characters long\n", 3351 | "\n", 3352 | "Reading '../data/hin_corp_unicode/892_utf.txt'...\n", 3353 | "Corpus is now 13380701 characters long\n", 3354 | "\n", 3355 | "Reading '../data/hin_corp_unicode/893_utf.txt'...\n", 3356 | "Corpus is now 13392859 characters long\n", 3357 | "\n", 3358 | "Reading '../data/hin_corp_unicode/894_utf.txt'...\n", 3359 | "Corpus is now 13405418 characters long\n", 3360 | "\n", 3361 | "Reading '../data/hin_corp_unicode/895_utf.txt'...\n", 3362 | "Corpus is now 13415395 characters long\n", 3363 | "\n", 3364 | "Reading '../data/hin_corp_unicode/896_utf.txt'...\n", 3365 | "Corpus is now 13425371 characters long\n", 3366 | "\n", 3367 | "Reading '../data/hin_corp_unicode/897_utf.txt'...\n", 3368 | "Corpus is now 13434827 characters long\n", 3369 | "\n", 3370 | "Reading '../data/hin_corp_unicode/898_utf.txt'...\n", 3371 | "Corpus is now 13446371 characters long\n", 3372 | "\n", 3373 | "Reading '../data/hin_corp_unicode/89_utf.txt'...\n", 3374 | "Corpus is now 13458189 characters long\n", 3375 | "\n", 3376 | "Reading '../data/hin_corp_unicode/8_utf8.txt'...\n", 3377 | "Corpus is now 13458189 characters long\n", 
3378 | "\n", 3379 | "Reading '../data/hin_corp_unicode/900_utf.txt'...\n", 3380 | "Corpus is now 13470020 characters long\n", 3381 | "\n", 3382 | "Reading '../data/hin_corp_unicode/901_utf.txt'...\n", 3383 | "Corpus is now 13479722 characters long\n", 3384 | "\n", 3385 | "Reading '../data/hin_corp_unicode/902_utf.txt'...\n", 3386 | "Corpus is now 13490925 characters long\n", 3387 | "\n", 3388 | "Reading '../data/hin_corp_unicode/903_utf.txt'...\n", 3389 | "Corpus is now 13502844 characters long\n", 3390 | "\n", 3391 | "Reading '../data/hin_corp_unicode/904_utf.txt'...\n", 3392 | "Corpus is now 13513868 characters long\n", 3393 | "\n", 3394 | "Reading '../data/hin_corp_unicode/905_utf.txt'...\n", 3395 | "Corpus is now 13525254 characters long\n", 3396 | "\n", 3397 | "Reading '../data/hin_corp_unicode/906_utf.txt'...\n", 3398 | "Corpus is now 13537859 characters long\n", 3399 | "\n", 3400 | "Reading '../data/hin_corp_unicode/907_utf.txt'...\n", 3401 | "Corpus is now 13549753 characters long\n", 3402 | "\n", 3403 | "Reading '../data/hin_corp_unicode/908_utf.txt'...\n", 3404 | "Corpus is now 13560771 characters long\n", 3405 | "\n", 3406 | "Reading '../data/hin_corp_unicode/909_utf.txt'...\n", 3407 | "Corpus is now 13571730 characters long\n", 3408 | "\n", 3409 | "Reading '../data/hin_corp_unicode/90_utf.txt'...\n", 3410 | "Corpus is now 13584037 characters long\n", 3411 | "\n", 3412 | "Reading '../data/hin_corp_unicode/910_utf.txt'...\n", 3413 | "Corpus is now 13596435 characters long\n", 3414 | "\n", 3415 | "Reading '../data/hin_corp_unicode/911_utf.txt'...\n", 3416 | "Corpus is now 13607979 characters long\n", 3417 | "\n", 3418 | "Reading '../data/hin_corp_unicode/912_utf.txt'...\n", 3419 | "Corpus is now 13622487 characters long\n", 3420 | "\n", 3421 | "Reading '../data/hin_corp_unicode/913_utf.txt'...\n", 3422 | "Corpus is now 13634943 characters long\n", 3423 | "\n", 3424 | "Reading '../data/hin_corp_unicode/914_utf.txt'...\n", 3425 | "Corpus is now 13649187 
characters long\n", 3426 | "\n", 3427 | "Reading '../data/hin_corp_unicode/915_utf.txt'...\n", 3428 | "Corpus is now 13664443 characters long\n", 3429 | "\n", 3430 | "Reading '../data/hin_corp_unicode/916_utf.txt'...\n", 3431 | "Corpus is now 13676849 characters long\n", 3432 | "\n", 3433 | "Reading '../data/hin_corp_unicode/917_utf.txt'...\n", 3434 | "Corpus is now 13692162 characters long\n", 3435 | "\n", 3436 | "Reading '../data/hin_corp_unicode/918_utf.txt'...\n", 3437 | "Corpus is now 13712625 characters long\n", 3438 | "\n", 3439 | "Reading '../data/hin_corp_unicode/919_utf.txt'...\n", 3440 | "Corpus is now 13723017 characters long\n", 3441 | "\n", 3442 | "Reading '../data/hin_corp_unicode/91_utf.txt'...\n", 3443 | "Corpus is now 13738274 characters long\n", 3444 | "\n", 3445 | "Reading '../data/hin_corp_unicode/920_utf.txt'...\n", 3446 | "Corpus is now 13751930 characters long\n", 3447 | "\n", 3448 | "Reading '../data/hin_corp_unicode/921_utf.txt'...\n", 3449 | "Corpus is now 13766649 characters long\n", 3450 | "\n", 3451 | "Reading '../data/hin_corp_unicode/922_utf.txt'...\n", 3452 | "Corpus is now 13778358 characters long\n", 3453 | "\n", 3454 | "Reading '../data/hin_corp_unicode/923_utf.txt'...\n", 3455 | "Corpus is now 13791594 characters long\n", 3456 | "\n", 3457 | "Reading '../data/hin_corp_unicode/924_utf.txt'...\n", 3458 | "Corpus is now 13805381 characters long\n", 3459 | "\n", 3460 | "Reading '../data/hin_corp_unicode/925_utf.txt'...\n", 3461 | "Corpus is now 13825778 characters long\n", 3462 | "\n", 3463 | "Reading '../data/hin_corp_unicode/926_utf.txt'...\n", 3464 | "Corpus is now 13835784 characters long\n", 3465 | "\n", 3466 | "Reading '../data/hin_corp_unicode/927_utf.txt'...\n", 3467 | "Corpus is now 13850183 characters long\n", 3468 | "\n", 3469 | "Reading '../data/hin_corp_unicode/928_utf.txt'...\n", 3470 | "Corpus is now 13861206 characters long\n", 3471 | "\n", 3472 | "Reading '../data/hin_corp_unicode/929_utf.txt'...\n", 3473 | "Corpus 
is now 13872528 characters long\n", 3474 | "\n", 3475 | "Reading '../data/hin_corp_unicode/92_utf.txt'...\n", 3476 | "Corpus is now 13886639 characters long\n", 3477 | "\n", 3478 | "Reading '../data/hin_corp_unicode/930_utf.txt'...\n", 3479 | "Corpus is now 13900966 characters long\n", 3480 | "\n", 3481 | "Reading '../data/hin_corp_unicode/931_utf.txt'...\n", 3482 | "Corpus is now 13913022 characters long\n", 3483 | "\n", 3484 | "Reading '../data/hin_corp_unicode/932_utf.txt'...\n", 3485 | "Corpus is now 13926545 characters long\n", 3486 | "\n", 3487 | "Reading '../data/hin_corp_unicode/933_utf.txt'...\n", 3488 | "Corpus is now 13936506 characters long\n", 3489 | "\n", 3490 | "Reading '../data/hin_corp_unicode/934_utf.txt'...\n", 3491 | "Corpus is now 13951231 characters long\n", 3492 | "\n", 3493 | "Reading '../data/hin_corp_unicode/935_utf.txt'...\n", 3494 | "Corpus is now 13965300 characters long\n", 3495 | "\n", 3496 | "Reading '../data/hin_corp_unicode/936_utf.txt'...\n", 3497 | "Corpus is now 13976427 characters long\n", 3498 | "\n", 3499 | "Reading '../data/hin_corp_unicode/937_utf.txt'...\n", 3500 | "Corpus is now 13994126 characters long\n", 3501 | "\n", 3502 | "Reading '../data/hin_corp_unicode/938_utf.txt'...\n", 3503 | "Corpus is now 14007623 characters long\n", 3504 | "\n", 3505 | "Reading '../data/hin_corp_unicode/939_utf.txt'...\n", 3506 | "Corpus is now 14021720 characters long\n", 3507 | "\n", 3508 | "Reading '../data/hin_corp_unicode/93_utf.txt'...\n", 3509 | "Corpus is now 14032675 characters long\n", 3510 | "\n", 3511 | "Reading '../data/hin_corp_unicode/940_utf.txt'...\n", 3512 | "Corpus is now 14046509 characters long\n", 3513 | "\n", 3514 | "Reading '../data/hin_corp_unicode/941_utf.txt'...\n", 3515 | "Corpus is now 14061198 characters long\n", 3516 | "\n", 3517 | "Reading '../data/hin_corp_unicode/942_utf.txt'...\n", 3518 | "Corpus is now 14078415 characters long\n", 3519 | "\n", 3520 | "Reading '../data/hin_corp_unicode/943_utf.txt'...\n", 
3521 | "Corpus is now 14089263 characters long\n", 3522 | "\n", 3523 | "Reading '../data/hin_corp_unicode/944_utf.txt'...\n", 3524 | "Corpus is now 14109693 characters long\n", 3525 | "\n", 3526 | "Reading '../data/hin_corp_unicode/945_utf.txt'...\n", 3527 | "Corpus is now 14130998 characters long\n", 3528 | "\n", 3529 | "Reading '../data/hin_corp_unicode/946_utf.txt'...\n", 3530 | "Corpus is now 14143492 characters long\n", 3531 | "\n", 3532 | "Reading '../data/hin_corp_unicode/947_utf.txt'...\n", 3533 | "Corpus is now 14160038 characters long\n", 3534 | "\n", 3535 | "Reading '../data/hin_corp_unicode/948_utf.txt'...\n", 3536 | "Corpus is now 14172985 characters long\n", 3537 | "\n", 3538 | "Reading '../data/hin_corp_unicode/949_utf.txt'...\n", 3539 | "Corpus is now 14187279 characters long\n", 3540 | "\n", 3541 | "Reading '../data/hin_corp_unicode/94_utf.txt'...\n", 3542 | "Corpus is now 14200265 characters long\n", 3543 | "\n", 3544 | "Reading '../data/hin_corp_unicode/950_utf.txt'...\n", 3545 | "Corpus is now 14212501 characters long\n", 3546 | "\n", 3547 | "Reading '../data/hin_corp_unicode/951_utf.txt'...\n", 3548 | "Corpus is now 14225940 characters long\n", 3549 | "\n", 3550 | "Reading '../data/hin_corp_unicode/952_utf.txt'...\n", 3551 | "Corpus is now 14240286 characters long\n", 3552 | "\n", 3553 | "Reading '../data/hin_corp_unicode/953_utf.txt'...\n", 3554 | "Corpus is now 14256915 characters long\n", 3555 | "\n", 3556 | "Reading '../data/hin_corp_unicode/954_utf.txt'...\n", 3557 | "Corpus is now 14270685 characters long\n", 3558 | "\n", 3559 | "Reading '../data/hin_corp_unicode/955_utf.txt'...\n", 3560 | "Corpus is now 14287032 characters long\n", 3561 | "\n", 3562 | "Reading '../data/hin_corp_unicode/956_utf.txt'...\n", 3563 | "Corpus is now 14300289 characters long\n", 3564 | "\n", 3565 | "Reading '../data/hin_corp_unicode/957_utf.txt'...\n", 3566 | "Corpus is now 14314940 characters long\n", 3567 | "\n", 3568 | "Reading 
'../data/hin_corp_unicode/958_utf.txt'...\n", 3569 | "Corpus is now 14328271 characters long\n", 3570 | "\n", 3571 | "Reading '../data/hin_corp_unicode/959_utf.txt'...\n", 3572 | "Corpus is now 14354577 characters long\n", 3573 | "\n", 3574 | "Reading '../data/hin_corp_unicode/95_utf.txt'...\n", 3575 | "Corpus is now 14365581 characters long\n", 3576 | "\n", 3577 | "Reading '../data/hin_corp_unicode/960_utf.txt'...\n", 3578 | "Corpus is now 14379365 characters long\n", 3579 | "\n", 3580 | "Reading '../data/hin_corp_unicode/961_utf.txt'...\n", 3581 | "Corpus is now 14392768 characters long\n", 3582 | "\n", 3583 | "Reading '../data/hin_corp_unicode/962_utf.txt'...\n", 3584 | "Corpus is now 14405859 characters long\n", 3585 | "\n", 3586 | "Reading '../data/hin_corp_unicode/963_utf.txt'...\n", 3587 | "Corpus is now 14419954 characters long\n", 3588 | "\n", 3589 | "Reading '../data/hin_corp_unicode/964_utf.txt'...\n", 3590 | "Corpus is now 14431329 characters long\n", 3591 | "\n", 3592 | "Reading '../data/hin_corp_unicode/965_utf.txt'...\n", 3593 | "Corpus is now 14444094 characters long\n", 3594 | "\n", 3595 | "Reading '../data/hin_corp_unicode/966_utf.txt'...\n", 3596 | "Corpus is now 14470019 characters long\n", 3597 | "\n", 3598 | "Reading '../data/hin_corp_unicode/967_utf.txt'...\n", 3599 | "Corpus is now 14482150 characters long\n", 3600 | "\n", 3601 | "Reading '../data/hin_corp_unicode/968_utf.txt'...\n", 3602 | "Corpus is now 14494282 characters long\n", 3603 | "\n", 3604 | "Reading '../data/hin_corp_unicode/969_utf.txt'...\n", 3605 | "Corpus is now 14507627 characters long\n", 3606 | "\n", 3607 | "Reading '../data/hin_corp_unicode/96_utf.txt'...\n", 3608 | "Corpus is now 14515777 characters long\n", 3609 | "\n", 3610 | "Reading '../data/hin_corp_unicode/970_utf.txt'...\n", 3611 | "Corpus is now 14530285 characters long\n", 3612 | "\n", 3613 | "Reading '../data/hin_corp_unicode/971_utf.txt'...\n", 3614 | "Corpus is now 14543879 characters long\n", 3615 | "\n", 
3616 | "Reading '../data/hin_corp_unicode/972_utf.txt'...\n", 3617 | "Corpus is now 14557111 characters long\n", 3618 | "\n", 3619 | "Reading '../data/hin_corp_unicode/973_utf.txt'...\n", 3620 | "Corpus is now 14575834 characters long\n", 3621 | "\n", 3622 | "Reading '../data/hin_corp_unicode/974_utf.txt'...\n", 3623 | "Corpus is now 14598645 characters long\n", 3624 | "\n", 3625 | "Reading '../data/hin_corp_unicode/975_utf.txt'...\n", 3626 | "Corpus is now 14609721 characters long\n", 3627 | "\n", 3628 | "Reading '../data/hin_corp_unicode/976_utf.txt'...\n", 3629 | "Corpus is now 14624215 characters long\n", 3630 | "\n", 3631 | "Reading '../data/hin_corp_unicode/977_utf.txt'...\n", 3632 | "Corpus is now 14635966 characters long\n", 3633 | "\n", 3634 | "Reading '../data/hin_corp_unicode/978_utf.txt'...\n", 3635 | "Corpus is now 14647147 characters long\n", 3636 | "\n", 3637 | "Reading '../data/hin_corp_unicode/979_utf.txt'...\n", 3638 | "Corpus is now 14660492 characters long\n", 3639 | "\n", 3640 | "Reading '../data/hin_corp_unicode/97_utf.txt'...\n", 3641 | "Corpus is now 14671425 characters long\n", 3642 | "\n", 3643 | "Reading '../data/hin_corp_unicode/980_utf.txt'...\n", 3644 | "Corpus is now 14684640 characters long\n", 3645 | "\n", 3646 | "Reading '../data/hin_corp_unicode/981_utf.txt'...\n", 3647 | "Corpus is now 14698733 characters long\n", 3648 | "\n", 3649 | "Reading '../data/hin_corp_unicode/982_utf.txt'...\n", 3650 | "Corpus is now 14725882 characters long\n", 3651 | "\n", 3652 | "Reading '../data/hin_corp_unicode/983_utf.txt'...\n", 3653 | "Corpus is now 14740731 characters long\n", 3654 | "\n", 3655 | "Reading '../data/hin_corp_unicode/984_utf.txt'...\n", 3656 | "Corpus is now 14753392 characters long\n", 3657 | "\n", 3658 | "Reading '../data/hin_corp_unicode/985_utf.txt'...\n", 3659 | "Corpus is now 14765019 characters long\n", 3660 | "\n", 3661 | "Reading '../data/hin_corp_unicode/986_utf.txt'...\n", 3662 | "Corpus is now 14780342 characters 
long\n", 3663 | "\n", 3664 | "Reading '../data/hin_corp_unicode/987_utf.txt'...\n", 3665 | "Corpus is now 14792006 characters long\n", 3666 | "\n", 3667 | "Reading '../data/hin_corp_unicode/988_utf.txt'...\n", 3668 | "Corpus is now 14803765 characters long\n", 3669 | "\n", 3670 | "Reading '../data/hin_corp_unicode/989_utf.txt'...\n", 3671 | "Corpus is now 14813858 characters long\n", 3672 | "\n", 3673 | "Reading '../data/hin_corp_unicode/98_utf.txt'...\n", 3674 | "Corpus is now 14830705 characters long\n", 3675 | "\n", 3676 | "Reading '../data/hin_corp_unicode/990_utf.txt'...\n", 3677 | "Corpus is now 14843384 characters long\n", 3678 | "\n", 3679 | "Reading '../data/hin_corp_unicode/991_utf.txt'...\n", 3680 | "Corpus is now 14856712 characters long\n", 3681 | "\n", 3682 | "Reading '../data/hin_corp_unicode/992_utf.txt'...\n", 3683 | "Corpus is now 14871614 characters long\n", 3684 | "\n", 3685 | "Reading '../data/hin_corp_unicode/993_utf.txt'...\n", 3686 | "Corpus is now 14883496 characters long\n", 3687 | "\n", 3688 | "Reading '../data/hin_corp_unicode/994_utf.txt'...\n", 3689 | "Corpus is now 14898129 characters long\n", 3690 | "\n", 3691 | "Reading '../data/hin_corp_unicode/995_utf.txt'...\n", 3692 | "Corpus is now 14915303 characters long\n", 3693 | "\n", 3694 | "Reading '../data/hin_corp_unicode/996_utf.txt'...\n", 3695 | "Corpus is now 14932263 characters long\n", 3696 | "\n", 3697 | "Reading '../data/hin_corp_unicode/997_utf.txt'...\n", 3698 | "Corpus is now 14944914 characters long\n", 3699 | "\n", 3700 | "Reading '../data/hin_corp_unicode/998_utf.txt'...\n", 3701 | "Corpus is now 14961333 characters long\n", 3702 | "\n", 3703 | "Reading '../data/hin_corp_unicode/999_utf.txt'...\n", 3704 | "Corpus is now 14977613 characters long\n", 3705 | "\n", 3706 | "Reading '../data/hin_corp_unicode/99_utf.txt'...\n", 3707 | "Corpus is now 14990762 characters long\n", 3708 | "\n", 3709 | "Reading '../data/hin_corp_unicode/9_utf8.txt'...\n", 3710 | "Corpus is now 
14990762 characters long\n", 3711 | "\n" 3712 | ] 3713 | } 3714 | ], 3715 | "source": [ 3716 | "corpus_raw = u\"\"\n", 3717 | "for file_name in hindi_filenames:\n", 3718 | " print(\"Reading '{0}'...\".format(file_name))\n", 3719 | " with codecs.open(file_name, \"r\", \"utf-8\") as f:\n", 3720 | " # Skip the first two lines of each file; they are not useful for the corpus\n", 3721 | " temp = f.readline()\n", 3722 | " temp = f.readline()\n", 3723 | " corpus_raw += f.read()\n", 3724 | " print(\"Corpus is now {0} characters long\".format(len(corpus_raw)))\n", 3725 | " print()" 3726 | ] 3727 | }, 3728 | { 3729 | "cell_type": "code", 3730 | "execution_count": 8, 3731 | "metadata": { 3732 | "collapsed": false 3733 | }, 3734 | "outputs": [], 3735 | "source": [ 3736 | "tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')" 3737 | ] 3738 | }, 3739 | { 3740 | "cell_type": "code", 3741 | "execution_count": 9, 3742 | "metadata": { 3743 | "collapsed": true 3744 | }, 3745 | "outputs": [], 3746 | "source": [ 3747 | "raw_sentences = tokenizer.tokenize(corpus_raw)" 3748 | ] 3749 | }, 3750 | { 3751 | "cell_type": "code", 3752 | "execution_count": 10, 3753 | "metadata": { 3754 | "collapsed": false 3755 | }, 3756 | "outputs": [], 3757 | "source": [ 3758 | "def sentence_to_wordlist(raw):\n", 3759 | " clean = re.sub(\"[.\\r\\n]\",\" \", raw)\n", 3760 | " words = clean.split()\n", 3761 | " return words\n", 3762 | "\n", 3763 | "sentences = []\n", 3764 | "for raw_sentence in raw_sentences:\n", 3765 | " if len(raw_sentence) > 0:\n", 3766 | " sentences.append(sentence_to_wordlist(raw_sentence))" 3767 | ] 3768 | }, 3769 | { 3770 | "cell_type": "code", 3771 | "execution_count": 11, 3772 | "metadata": { 3773 | "collapsed": false 3774 | }, 3775 | "outputs": [ 3776 | { 3777 | "name": "stdout", 3778 | "output_type": "stream", 3779 | "text": [ 3780 | "The Hindi corpus contains 2,840,476 tokens\n" 3781 | ] 3782 | } 3783 | ], 3784 | "source": [ 3785 | "token_count = sum([len(sentence) for sentence in sentences])\n", 3786 | 
"print(\"The Hindi corpus contains {0:,} tokens\".format(token_count))" 3787 | ] 3788 | }, 3789 | { 3790 | "cell_type": "code", 3791 | "execution_count": 12, 3792 | "metadata": { 3793 | "collapsed": false 3794 | }, 3795 | "outputs": [ 3796 | { 3797 | "data": { 3798 | "text/plain": [ 3799 | "[u'\\u0915\\u093f\\u0938\\u094d\\u092e\\u094b\\u0902',\n", 3800 | " u'\\u0915\\u0947',\n", 3801 | " u'\\u0935\\u093f\\u0915\\u093e\\u0938',\n", 3802 | " u'\\u092e\\u0947\\u0902',\n", 3803 | " u'\\u0938\\u0902\\u0915\\u0940\\u0930\\u094d\\u0923',\n", 3804 | " u'\\u091c\\u0940\\u0928',\n", 3805 | " u'\\u0906\\u0927\\u093e\\u0930\\u094b\\u0902',\n", 3806 | " u'\\u0915\\u0947',\n", 3807 | " u'\\u092c\\u095d\\u0924\\u0947',\n", 3808 | " u'\\u0909\\u092a\\u092f\\u094b\\u0917',\n", 3809 | " u'\\u0938\\u0947',\n", 3810 | " u'\\u090f\\u0915',\n", 3811 | " u'\\u0914\\u0930',\n", 3812 | " u'\\u0939\\u093e\\u0928\\u093f',\n", 3813 | " u'\\u0939\\u0941\\u0908']" 3814 | ] 3815 | }, 3816 | "execution_count": 12, 3817 | "metadata": {}, 3818 | "output_type": "execute_result" 3819 | } 3820 | ], 3821 | "source": [ 3822 | "sentences[0]" 3823 | ] 3824 | }, 3825 | { 3826 | "cell_type": "markdown", 3827 | "metadata": {}, 3828 | "source": [ 3829 | "## Word Vectors" 3830 | ] 3831 | }, 3832 | { 3833 | "cell_type": "code", 3834 | "execution_count": 13, 3835 | "metadata": { 3836 | "collapsed": false 3837 | }, 3838 | "outputs": [], 3839 | "source": [ 3840 | "# Dimensionality of the resulting word vectors.\n", 3841 | "# More dimensions = more generalized\n", 3842 | "num_features = 50\n", 3843 | "# Minimum word count threshold.\n", 3844 | "min_word_count = 3\n", 3845 | "\n", 3846 | "# Number of threads to run in parallel.\n", 3847 | "num_threads = multiprocessing.cpu_count()\n", 3848 | "\n", 3849 | "# Context window length.\n", 3850 | "context_size = 8\n", 3851 | "\n", 3852 | "# Downsample setting for frequent words.\n", 3853 | "#0 - 1e-5 is good for this\n", 3854 | "downsampling = 1e-3\n", 3855 | "\n", 3856 
| "# Seed for the RNG, to make the results reproducible.\n", 3857 | "# Random Number Generator\n", 3858 | "seed = 1" 3859 | ] 3860 | }, 3861 | { 3862 | "cell_type": "code", 3863 | "execution_count": 14, 3864 | "metadata": { 3865 | "collapsed": true 3866 | }, 3867 | "outputs": [], 3868 | "source": [ 3869 | "# Defining the model\n", 3870 | "model = w2v.Word2Vec(\n", 3871 | " sg=1,\n", 3872 | " seed=seed,\n", 3873 | " workers=num_threads,\n", 3874 | " size=num_features,\n", 3875 | " min_count=min_word_count,\n", 3876 | " window=context_size,\n", 3877 | " sample=downsampling\n", 3878 | ")" 3879 | ] 3880 | }, 3881 | { 3882 | "cell_type": "code", 3883 | "execution_count": 15, 3884 | "metadata": { 3885 | "collapsed": false 3886 | }, 3887 | "outputs": [], 3888 | "source": [ 3889 | "model.build_vocab(sentences)" 3890 | ] 3891 | }, 3892 | { 3893 | "cell_type": "code", 3894 | "execution_count": 16, 3895 | "metadata": { 3896 | "collapsed": false 3897 | }, 3898 | "outputs": [ 3899 | { 3900 | "data": { 3901 | "text/plain": [ 3902 | "10730520" 3903 | ] 3904 | }, 3905 | "execution_count": 16, 3906 | "metadata": {}, 3907 | "output_type": "execute_result" 3908 | } 3909 | ], 3910 | "source": [ 3911 | "model.train(sentences)" 3912 | ] 3913 | }, 3914 | { 3915 | "cell_type": "code", 3916 | "execution_count": 17, 3917 | "metadata": { 3918 | "collapsed": false 3919 | }, 3920 | "outputs": [], 3921 | "source": [ 3922 | "# Save our model\n", 3923 | "model.save(os.path.join(\"../data/\", \"hindi_word2Vec_small.w2v\"))" 3924 | ] 3925 | }, 3926 | { 3927 | "cell_type": "markdown", 3928 | "metadata": {}, 3929 | "source": [ 3930 | "## Explore the model" 3931 | ] 3932 | }, 3933 | { 3934 | "cell_type": "code", 3935 | "execution_count": null, 3936 | "metadata": { 3937 | "collapsed": false 3938 | }, 3939 | "outputs": [], 3940 | "source": [ 3941 | "trained_model = w2v.Word2Vec.load(os.path.join(\"../data/\", \"hindi_word2Vec_small.w2v\"))" 3942 | ] 3943 | }, 3944 | { 3945 | "cell_type": "code", 3946 | 
"execution_count": null, 3947 | "metadata": { 3948 | "collapsed": false 3949 | }, 3950 | "outputs": [], 3951 | "source": [ 3952 | "# For reducing dimensions, to visualize vectors\n", 3953 | "tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)\n", 3954 | "all_word_vectors_matrix = trained_model.syn1neg[:200] # Currently giving memory error for all words\n", 3955 | "# Reduced dimensions\n", 3956 | "all_word_vectors_matrix_2d = tsne.fit_transform(all_word_vectors_matrix)" 3957 | ] 3958 | }, 3959 | { 3960 | "cell_type": "code", 3961 | "execution_count": null, 3962 | "metadata": { 3963 | "collapsed": false 3964 | }, 3965 | "outputs": [], 3966 | "source": [ 3967 | "points = pd.DataFrame(\n", 3968 | " [\n", 3969 | " (word, coords[0], coords[1])\n", 3970 | " for word, coords in [\n", 3971 | " (word, all_word_vectors_matrix_2d[trained_model.wv.vocab[word].index])\n", 3972 | " for word in trained_model.wv.vocab\n", 3973 | " if trained_model.wv.vocab[word].index < 200\n", 3974 | " ]\n", 3975 | " ],\n", 3976 | " columns=[\"word\", \"x\", \"y\"]\n", 3977 | ")" 3978 | ] 3979 | }, 3980 | { 3981 | "cell_type": "code", 3982 | "execution_count": null, 3983 | "metadata": { 3984 | "collapsed": false 3985 | }, 3986 | "outputs": [], 3987 | "source": [ 3988 | "s = trained_model.wv[u\"आधार\"]" 3989 | ] 3990 | }, 3991 | { 3992 | "cell_type": "code", 3993 | "execution_count": null, 3994 | "metadata": { 3995 | "collapsed": true 3996 | }, 3997 | "outputs": [], 3998 | "source": [] 3999 | } 4000 | ], 4001 | "metadata": { 4002 | "anaconda-cloud": {}, 4003 | "kernelspec": { 4004 | "display_name": "Python [conda root]", 4005 | "language": "python", 4006 | "name": "conda-root-py" 4007 | }, 4008 | "language_info": { 4009 | "codemirror_mode": { 4010 | "name": "ipython", 4011 | "version": 2 4012 | }, 4013 | "file_extension": ".py", 4014 | "mimetype": "text/x-python", 4015 | "name": "python", 4016 | "nbconvert_exporter": "python", 4017 | "pygments_lexer": "ipython2", 4018 | "version": 
"2.7.12" 4019 | } 4020 | }, 4021 | "nbformat": 4, 4022 | "nbformat_minor": 1 4023 | } 4024 | -------------------------------------------------------------------------------- /Task 3: Hindi data/process_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from keras.preprocessing import sequence 3 | # For getting word vectors 4 | from get_word_vectors import get_word_vector, get_sentence_vectors 5 | import codecs 6 | 7 | 8 | class DataHandler(): 9 | """ 10 | Class for handling all data processing and preparing training/testing data""" 11 | 12 | def __init__(self, datapath): 13 | # Default values 14 | self.LEN_NAMED_CLASSES = 12 # Total number of tag classes, including the null class 15 | self.NULL_CLASS = "O" 16 | self.LEN_WORD_VECTORS = 50 17 | 18 | self.tags = [] 19 | # string tags mapped to int and one hot vectors 20 | self.tag_id_map = {} 21 | self.tag_to_one_hot_map = {} 22 | 23 | # All data(to be filled by read_data method) 24 | self.x = [] 25 | self.y = [] 26 | 27 | self.read_data(datapath) 28 | 29 | def read_data(self, datapath): 30 | _id = 0 31 | sentence = [] 32 | sentence_tags = [] 33 | all_data = [] 34 | pos = 0 35 | is_entity = False 36 | with codecs.open(datapath, 'r') as f: 37 | for l in f: 38 | line = l.strip().split() 39 | if line: 40 | try: 41 | word, named_tag = line[0], line[1] 42 | except IndexError: 43 | continue 44 | if named_tag != self.NULL_CLASS: 45 | is_entity = True 46 | if named_tag not in self.tags: 47 | self.tags.append(named_tag) 48 | self.tag_id_map[_id] = named_tag 49 | one_hot_vec = np.zeros(self.LEN_NAMED_CLASSES, dtype = np.int32) 50 | one_hot_vec[_id] = 1 51 | self.tag_to_one_hot_map[named_tag] = one_hot_vec 52 | 53 | _id += 1 54 | 55 | # Get word vectors for given word 56 | sentence.append(get_word_vector(word)[:self.LEN_WORD_VECTORS]) 57 | sentence_tags.append(self.tag_to_one_hot_map[named_tag]) 58 | else: 59 | if not is_entity: 60 | is_entity = False 61 | sentence_tags 
= [] 62 | sentence = [] 63 
| continue 64 | all_data.append((sentence, sentence_tags)) 65 | sentence_tags = [] 66 | sentence = [] 67 | is_entity = False 68 | 69 | if pos > 1000000: 70 | break 71 | pos += 1 72 | 73 | # Find the length of the longest sentence 74 | self.max_len = 0 75 | for pair in all_data: 76 | if self.max_len < len(pair[0]): 77 | self.max_len = len(pair[0]) 78 | 79 | for vectors, one_hot_tags in all_data: 80 | # Pad the sequences and make them all of same length 81 | temp_X = np.zeros(self.LEN_WORD_VECTORS, dtype = np.int32) 82 | temp_Y = np.array(self.tag_to_one_hot_map[self.NULL_CLASS]) 83 | pad_length = self.max_len - len(vectors) 84 | 85 | # Insert into main data list 86 | self.x.append( ((pad_length)*[temp_X]) + vectors) 87 | self.y.append( ((pad_length)*[temp_Y]) + one_hot_tags) 88 | 89 | self.x = np.array(self.x) 90 | self.y = np.array(self.y) 91 | 92 | def get_data(self): 93 | # Returns proper data for training/testing 94 | return (self.x, self.y) 95 | 96 | def encode_sentence(self, sentence): 97 | vectors = get_sentence_vectors(sentence) 98 | vectors = [v[:self.LEN_WORD_VECTORS] for v in vectors] 99 | return sequence.pad_sequences([vectors], maxlen=self.max_len, dtype=np.float32) 100 | 101 | def decode_result(self, result_sequence): 102 | pred_named_tags = [] 103 | for pred in result_sequence: 104 | _id = np.argmax(pred) 105 | pred_named_tags.append(self.tag_id_map[_id]) 106 | return pred_named_tags 107 | 108 | 109 | 110 | 111 | 112 | 113 | -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | #### The CoNLL-2003 data is taken from [this repo](https://github.com/synalp/NER). I want to thank the developer for sharing this data on GitHub. --------------------------------------------------------------------------------