├── .gitignore
├── LICENSE
├── README.md
├── README.rst
├── examples
│   ├── __init__.py
│   └── sklearn_fastest_classifier.py
├── fasttextclf.py
├── setup.py
└── skfasttext
    ├── CBOW.py
    ├── FastTextClassifier.py
    ├── SkipGram.py
    └── __init__.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.egg-info
data/
*.pyc
*.so
*.tar.gz
*.DS_Store
*.bin
*.vec

build/
result/
dist/

fasttext/fasttext.cpp
facebookresearch-fasttext-*

# Intellij
.idea/

# pip
.eggs/

# For test
test/*_result.txt
test/dbpedia.train
test/dbpedia_csv/
test/default_params_test

# Misc
TODO
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Bayu Aldi Yansyah
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of fastText.py nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# fasttext [![Build Status](https://travis-ci.org/salestock/fastText.py.svg?branch=master)](https://travis-ci.org/salestock/fastText.py) [![PyPI version](https://badge.fury.io/py/fasttext.svg)](https://badge.fury.io/py/fasttext)

fasttext is a Python interface for
[Facebook fastText](https://github.com/facebookresearch/fastText).

## Requirements

fasttext supports Python 2.6 or newer. It requires
[Cython](https://pypi.python.org/pypi/Cython/) in order to build the C++ extension.

## Installation

```shell
pip install fasttext
```

## Example usage

This package has two main use cases: word representation learning and
text classification.

These use cases are described in the two papers
[1](#enriching-word-vectors-with-subword-information)
and [2](#bag-of-tricks-for-efficient-text-classification).

### Scikit-learn interface

The scikit-learn interface is consistent with the native scikit-learn API.

### Skipgram model
```python
from skfasttext.SkipGram import SkipgramFastText
clf = SkipgramFastText()
clf.fit(train_file)
```

### CBOW model
```python
from skfasttext.CBOW import cbowFastText
clf = cbowFastText()
clf.fit(train_file)
```
### Attributes and methods for the model

The skipgram and CBOW models have the following attributes & methods:

```python
model.model_name      # Model name
model.words           # List of words in the dictionary
model.dim             # Size of word vectors
model.ws              # Size of context window
model.epoch           # Number of epochs
model.min_count       # Minimal number of word occurrences
model.neg             # Number of negatives sampled
model.word_ngrams     # Max length of word ngram
model.loss_name       # Loss function name
model.bucket          # Number of buckets
model.minn            # Min length of char ngram
model.maxn            # Max length of char ngram
model.lr_update_rate  # Rate of updates for the learning rate
model.t               # Value of sampling threshold
model.encoding        # Encoding of the model
model[word]           # Get the vector of the specified word
```

### Fasttext classifier model
```python
from skfasttext.FastTextClassifier import FastTextClassifier
clf = FastTextClassifier()
clf.fit(train_file)
```

The classifier has the following attributes & methods:

```python
classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vectors
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negatives sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely labels
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities
```

### Native API usage
The source code can also be used through the native interface of the original
fasttext package, documented below.

### Word representation learning

In order to learn word vectors, as described in
[1](#enriching-word-vectors-with-subword-information), we can use the
`fasttext.skipgram` and `fasttext.cbow` functions like the following:

```python
import fasttext

# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print model.words  # list of words in dictionary

# CBOW model
model = fasttext.cbow('data.txt', 'model')
print model.words  # list of words in dictionary
```

where `data.txt` is a training file containing `utf-8` encoded text.
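For illustration only, `data.txt` can be any whitespace-tokenized plain text;
a couple of hypothetical lines might look like:

```
the quick brown fox jumps over the lazy dog
word vectors capture similarity between related words
```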
By default the word vectors will take into account character n-grams from
3 to 6 characters.

At the end of optimization the program will save two files:
`model.bin` and `model.vec`.

`model.vec` is a text file containing the word vectors, one per line.
`model.bin` is a binary file containing the parameters of the model
along with the dictionary and all hyper parameters.

The binary file can be used later to compute word vectors or
to restart the optimization.

The following `fasttext(1)` commands are equivalent:

```shell
# Skipgram model
./fasttext skipgram -input data.txt -output model

# CBOW model
./fasttext cbow -input data.txt -output model
```

### Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for
out-of-vocabulary words.

```python
print model['king']  # get the vector of the word 'king'
```

The following `fasttext(1)` command is equivalent:

```shell
echo "king" | ./fasttext print-vectors model.bin
```

This will output the vector of the word `king` to the standard output.

### Load pre-trained model

We can use `fasttext.load_model` to load a pre-trained model:

```python
model = fasttext.load_model('model.bin')
print model.words    # list of words in dictionary
print model['king']  # get the vector of the word 'king'
```

### Text classification

This package can also be used to train supervised text classifiers and load
pre-trained classifiers from fastText.

In order to train a text classifier using the method described in
[2](#bag-of-tricks-for-efficient-text-classification), we can use
the following function:

```python
classifier = fasttext.supervised('data.train.txt', 'model')
```

equivalent to the `fasttext(1)` command:

```shell
./fasttext supervised -input data.train.txt -output model
```

where `data.train.txt` is a text file containing a training sentence per line
along with the labels. By default, we assume that labels are words
that are prefixed by the string `__label__`.

We can specify the label prefix with the `label_prefix` param:

```python
classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')
```

equivalent to the `fasttext(1)` command:

```shell
./fasttext supervised -input data.train.txt -output model -label '__label__'
```

This will output two files: `model.bin` and `model.vec`.
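For reference, each line of `data.train.txt` couples a label with a sentence.
With the default prefix, a hypothetical two-class training file could look like:

```
__label__positive the film was a delight from start to finish
__label__negative the plot was predictable and the acting fell flat
```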

Once the model is trained, we can evaluate it by computing the precision
at 1 (P@1) and the recall on a test set using the `classifier.test` function:

```python
result = classifier.test('test.txt')
print 'P@1:', result.precision
print 'R@1:', result.recall
print 'Number of examples:', result.nexamples
```

This will print the same output to stdout as:

```shell
./fasttext test model.bin test.txt
```

In order to obtain the most likely label for a list of texts, we can
use the `classifier.predict` method:

```python
texts = ['example very long text 1', 'example very longtext 2']
labels = classifier.predict(texts)
print labels

# Or with the probability
labels = classifier.predict_proba(texts)
print labels
```

We can specify the `k` value to get the k best labels from the classifier:

```python
labels = classifier.predict(texts, k=3)
print labels

# Or with the probability
labels = classifier.predict_proba(texts, k=3)
print labels
```

This interface is equivalent to the `fasttext(1)` predict command: the same
model with the same input set will produce the same predictions.

## API documentation

### Skipgram model

Train & load a skipgram model:

```python
model = fasttext.skipgram(params)
```

List of available `params` and their default values:

```
input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]
```

Example usage:

```python
model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)
```

### CBOW model

Train & load a CBOW model:

```python
model = fasttext.cbow(params)
```

List of available `params` and their default values:

```
input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]
```
Example usage:

```python
model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)
```

### Load pre-trained model

A `.bin` file previously trained or generated by fastText can be
loaded using this function:

```python
model = fasttext.load_model('model.bin', encoding='utf-8')
```

### Attributes and methods for the model

The skipgram and CBOW models have the following attributes & methods:

```python
model.model_name      # Model name
model.words           # List of words in the dictionary
model.dim             # Size of word vectors
model.ws              # Size of context window
model.epoch           # Number of epochs
model.min_count       # Minimal number of word occurrences
model.neg             # Number of negatives sampled
model.word_ngrams     # Max length of word ngram
model.loss_name       # Loss function name
model.bucket          # Number of buckets
model.minn            # Min length of char ngram
model.maxn            # Max length of char ngram
model.lr_update_rate  # Rate of updates for the learning rate
model.t               # Value of sampling threshold
model.encoding        # Encoding of the model
model[word]           # Get the vector of the specified word
```

### Supervised model

Train & load the classifier:

```python
classifier = fasttext.supervised(params)
```

List of available `params` and their default values:

```
input_file         training file path (required)
output             output file path (required)
label_prefix       label prefix ['__label__']
lr                 learning rate [0.1]
lr_update_rate     change the rate of updates for the learning rate [100]
dim                size of word vectors [100]
ws                 size of the context window [5]
epoch              number of epochs [5]
min_count          minimal number of word occurrences [1]
neg                number of negatives sampled [5]
word_ngrams        max length of word ngram [1]
loss               loss function {ns, hs, softmax} [softmax]
bucket             number of buckets [0]
minn               min length of char ngram [0]
maxn               max length of char ngram [0]
thread             number of threads [12]
t                  sampling threshold [0.0001]
silent             disable the log output from the C++ extension [1]
encoding           specify input_file encoding [utf-8]
pretrained_vectors pretrained word vectors (.vec file) for supervised learning []
```

Example usage:

```python
classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                 thread=4)
```

### Load pre-trained classifier

A `.bin` file previously trained or generated by fastText can be
loaded using this function:

```shell
./fasttext supervised -input train.txt -output classifier -label 'some_prefix'
```

```python
classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')
```

### Test classifier

This is equivalent to the `fasttext(1)` test command. Testing with the same
model and test set will produce the same value for the precision at one
and the number of examples.

```python
result = classifier.test(params)

# Properties
result.precision  # Precision at one
result.recall     # Recall at one
result.nexamples  # Number of test examples
```

The param `k` is optional, and equal to `1` by default.
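As a quick sketch (assuming `test.txt` follows the same labeled format as the
training file), evaluation at a different `k` looks like:

```python
result = classifier.test('test.txt', k=2)
print 'P@2:', result.precision  # precision at two
print 'R@2:', result.recall     # recall at two
```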

### Predict the most-likely label of texts

This interface is equivalent to the `fasttext(1)` predict command.

`texts` is an array of strings.

```python
labels = classifier.predict(texts, k)

# Or with probability
labels = classifier.predict_proba(texts, k)
```

The param `k` is optional, and equal to `1` by default.

### Attributes and methods for the classifier

The classifier has the following attributes & methods:

```python
classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vectors
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negatives sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely labels
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities
```

The param `k` for `classifier.test`, `classifier.predict` and
`classifier.predict_proba` is optional,
and equal to `1` by default.

## References

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/pdf/1607.04606v1.pdf)

```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/pdf/1607.01759v2.pdf)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
(\* These authors contributed equally.)

### Native Python interface

A huge thank you to [fastText.py](https://github.com/salestock/fastText.py) for
building such an amazing native Python interface, around which this scikit-learn
wrapper is written.

## Join the fastText community

* Facebook page: https://www.facebook.com/groups/1174547215919768
* Google group: https://groups.google.com/forum/#!forum/fasttext-library

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
fasttext |Build Status| |PyPI version|
======================================

fasttext is a Python interface for `Facebook
fastText <https://github.com/facebookresearch/fastText>`__.

Requirements
------------

fasttext supports Python 2.6 or newer. It requires
`Cython <https://pypi.python.org/pypi/Cython/>`__ in order to build the
C++ extension.

Installation
------------

.. code:: shell

    pip install fasttext

Example usage
-------------

This package has two main use cases: word representation learning and
text classification.

These use cases are described in the two papers
`1 <#enriching-word-vectors-with-subword-information>`__ and
`2 <#bag-of-tricks-for-efficient-text-classification>`__.

Word representation learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to learn word vectors, as described in
`1 <#enriching-word-vectors-with-subword-information>`__, we can use the
``fasttext.skipgram`` and ``fasttext.cbow`` functions like the following:

.. code:: python

    import fasttext

    # Skipgram model
    model = fasttext.skipgram('data.txt', 'model')
    print model.words  # list of words in dictionary

    # CBOW model
    model = fasttext.cbow('data.txt', 'model')
    print model.words  # list of words in dictionary

where ``data.txt`` is a training file containing ``utf-8`` encoded text.
By default the word vectors will take into account character n-grams
from 3 to 6 characters.

At the end of optimization the program will save two files:
``model.bin`` and ``model.vec``.

``model.vec`` is a text file containing the word vectors, one per line.
``model.bin`` is a binary file containing the parameters of the model
along with the dictionary and all hyper parameters.

The binary file can be used later to compute word vectors or to restart
the optimization.

The following ``fasttext(1)`` commands are equivalent:

.. code:: shell

    # Skipgram model
    ./fasttext skipgram -input data.txt -output model

    # CBOW model
    ./fasttext cbow -input data.txt -output model

Obtaining word vectors for out-of-vocabulary words
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previously trained model can be used to compute word vectors for
out-of-vocabulary words.

.. code:: python

    print model['king']  # get the vector of the word 'king'

The following ``fasttext(1)`` command is equivalent:

.. code:: shell

    echo "king" | ./fasttext print-vectors model.bin

This will output the vector of the word ``king`` to the standard output.

Load pre-trained model
~~~~~~~~~~~~~~~~~~~~~~

We can use ``fasttext.load_model`` to load a pre-trained model:

.. code:: python

    model = fasttext.load_model('model.bin')
    print model.words    # list of words in dictionary
    print model['king']  # get the vector of the word 'king'

Text classification
~~~~~~~~~~~~~~~~~~~

This package can also be used to train supervised text classifiers and
load pre-trained classifiers from fastText.

In order to train a text classifier using the method described in
`2 <#bag-of-tricks-for-efficient-text-classification>`__, we can use the
following function:

.. code:: python

    classifier = fasttext.supervised('data.train.txt', 'model')

equivalent to the ``fasttext(1)`` command:

.. code:: shell

    ./fasttext supervised -input data.train.txt -output model

where ``data.train.txt`` is a text file containing a training sentence
per line along with the labels. By default, we assume that labels are
words that are prefixed by the string ``__label__``.

We can specify the label prefix with the ``label_prefix`` param:

.. code:: python

    classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')

equivalent to the ``fasttext(1)`` command:

.. code:: shell

    ./fasttext supervised -input data.train.txt -output model -label '__label__'

This will output two files: ``model.bin`` and ``model.vec``.

Once the model is trained, we can evaluate it by computing the
precision at 1 (P@1) and the recall on a test set using the
``classifier.test`` function:

.. code:: python

    result = classifier.test('test.txt')
    print 'P@1:', result.precision
    print 'R@1:', result.recall
    print 'Number of examples:', result.nexamples

This will print the same output to stdout as:

.. code:: shell

    ./fasttext test model.bin test.txt

In order to obtain the most likely label for a list of texts, we can use
the ``classifier.predict`` method:

.. code:: python

    texts = ['example very long text 1', 'example very longtext 2']
    labels = classifier.predict(texts)
    print labels

    # Or with the probability
    labels = classifier.predict_proba(texts)
    print labels

We can specify the ``k`` value to get the k best labels from the classifier:

.. code:: python

    labels = classifier.predict(texts, k=3)
    print labels

    # Or with the probability
    labels = classifier.predict_proba(texts, k=3)
    print labels

This interface is equivalent to the ``fasttext(1)`` predict command. The
same model with the same input set will produce the same predictions.

API documentation
-----------------

Skipgram model
~~~~~~~~~~~~~~

Train & load a skipgram model:

.. code:: python

    model = fasttext.skipgram(params)

List of available ``params`` and their default values:

::

    input_file     training file path (required)
    output         output file path (required)
    lr             learning rate [0.05]
    lr_update_rate change the rate of updates for the learning rate [100]
    dim            size of word vectors [100]
    ws             size of the context window [5]
    epoch          number of epochs [5]
    min_count      minimal number of word occurrences [5]
    neg            number of negatives sampled [5]
    word_ngrams    max length of word ngram [1]
    loss           loss function {ns, hs, softmax} [ns]
    bucket         number of buckets [2000000]
    minn           min length of char ngram [3]
    maxn           max length of char ngram [6]
    thread         number of threads [12]
    t              sampling threshold [0.0001]
    silent         disable the log output from the C++ extension [1]
    encoding       specify input_file encoding [utf-8]

Example usage:

.. code:: python

    model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)

CBOW model
~~~~~~~~~~

Train & load a CBOW model:

.. code:: python

    model = fasttext.cbow(params)

List of available ``params`` and their default values:

::

    input_file     training file path (required)
    output         output file path (required)
    lr             learning rate [0.05]
    lr_update_rate change the rate of updates for the learning rate [100]
    dim            size of word vectors [100]
    ws             size of the context window [5]
    epoch          number of epochs [5]
    min_count      minimal number of word occurrences [5]
    neg            number of negatives sampled [5]
    word_ngrams    max length of word ngram [1]
    loss           loss function {ns, hs, softmax} [ns]
    bucket         number of buckets [2000000]
    minn           min length of char ngram [3]
    maxn           max length of char ngram [6]
    thread         number of threads [12]
    t              sampling threshold [0.0001]
    silent         disable the log output from the C++ extension [1]
    encoding       specify input_file encoding [utf-8]

Example usage:

.. code:: python

    model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)

Load pre-trained model
~~~~~~~~~~~~~~~~~~~~~~

A ``.bin`` file previously trained or generated by fastText can be
loaded using this function:

.. code:: python

    model = fasttext.load_model('model.bin', encoding='utf-8')

Attributes and methods for the model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The skipgram and CBOW models have the following attributes & methods:

.. code:: python

    model.model_name      # Model name
    model.words           # List of words in the dictionary
    model.dim             # Size of word vectors
    model.ws              # Size of context window
    model.epoch           # Number of epochs
    model.min_count       # Minimal number of word occurrences
    model.neg             # Number of negatives sampled
    model.word_ngrams     # Max length of word ngram
    model.loss_name       # Loss function name
    model.bucket          # Number of buckets
    model.minn            # Min length of char ngram
    model.maxn            # Max length of char ngram
    model.lr_update_rate  # Rate of updates for the learning rate
    model.t               # Value of sampling threshold
    model.encoding        # Encoding of the model
    model[word]           # Get the vector of the specified word

Supervised model
~~~~~~~~~~~~~~~~

Train & load the classifier:

.. code:: python

    classifier = fasttext.supervised(params)

List of available ``params`` and their default values:

::

    input_file         training file path (required)
    output             output file path (required)
    label_prefix       label prefix ['__label__']
    lr                 learning rate [0.1]
    lr_update_rate     change the rate of updates for the learning rate [100]
    dim                size of word vectors [100]
    ws                 size of the context window [5]
    epoch              number of epochs [5]
    min_count          minimal number of word occurrences [1]
    neg                number of negatives sampled [5]
    word_ngrams        max length of word ngram [1]
    loss               loss function {ns, hs, softmax} [softmax]
    bucket             number of buckets [0]
    minn               min length of char ngram [0]
    maxn               max length of char ngram [0]
    thread             number of threads [12]
    t                  sampling threshold [0.0001]
    silent             disable the log output from the C++ extension [1]
    encoding           specify input_file encoding [utf-8]
    pretrained_vectors pretrained word vectors (.vec file) for supervised learning []

Example usage:

.. code:: python

    classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                     thread=4)

Load pre-trained classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``.bin`` file previously trained or generated by fastText can be
loaded using this function:

.. code:: shell

    ./fasttext supervised -input train.txt -output classifier -label 'some_prefix'

.. code:: python

    classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')

Test classifier
~~~~~~~~~~~~~~~

This is equivalent to the ``fasttext(1)`` test command. Testing with the
same model and test set will produce the same value for the precision at
one and the number of examples.

.. code:: python

    result = classifier.test(params)

    # Properties
    result.precision  # Precision at one
    result.recall     # Recall at one
    result.nexamples  # Number of test examples

The param ``k`` is optional, and equal to ``1`` by default.

Predict the most-likely label of texts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This interface is equivalent to the ``fasttext(1)`` predict command.

``texts`` is an array of strings.

.. code:: python

    labels = classifier.predict(texts, k)

    # Or with probability
    labels = classifier.predict_proba(texts, k)

The param ``k`` is optional, and equal to ``1`` by default.
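As a rough sketch, the two calls differ only in whether probabilities are
attached to the returned labels. The return shapes shown in the comments are
an assumption for illustration, not part of the documented API:

.. code:: python

    texts = ['example text']
    print classifier.predict(texts, 2)        # e.g. [['label1', 'label2']]
    print classifier.predict_proba(texts, 2)  # e.g. [[('label1', 0.94), ('label2', 0.06)]]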

Attributes and methods for the classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The classifier has the following attributes & methods:

.. code:: python

    classifier.labels                  # List of labels
    classifier.label_prefix            # Prefix of the label
    classifier.dim                     # Size of word vectors
    classifier.ws                      # Size of context window
    classifier.epoch                   # Number of epochs
    classifier.min_count               # Minimal number of word occurrences
    classifier.neg                     # Number of negatives sampled
    classifier.word_ngrams             # Max length of word ngram
    classifier.loss_name               # Loss function name
    classifier.bucket                  # Number of buckets
    classifier.minn                    # Min length of char ngram
    classifier.maxn                    # Max length of char ngram
    classifier.lr_update_rate          # Rate of updates for the learning rate
    classifier.t                       # Value of sampling threshold
    classifier.encoding                # Encoding used by the classifier
    classifier.test(filename, k)       # Test the classifier
    classifier.predict(texts, k)       # Predict the most likely labels
    classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities

The param ``k`` for ``classifier.test``, ``classifier.predict`` and
``classifier.predict_proba`` is optional, and equal to ``1`` by default.

References
----------

Enriching Word Vectors with Subword Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, `*Enriching Word
Vectors with Subword
Information* <https://arxiv.org/pdf/1607.04606v1.pdf>`__

::

    @article{bojanowski2016enriching,
      title={Enriching Word Vectors with Subword Information},
      author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.04606},
      year={2016}
    }

Bag of Tricks for Efficient Text Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, `*Bag of Tricks for
Efficient Text
Classification* <https://arxiv.org/pdf/1607.01759v2.pdf>`__

::

    @article{joulin2016bag,
      title={Bag of Tricks for Efficient Text Classification},
      author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.01759},
      year={2016}
    }

(\* These authors contributed equally.)

Join the fastText community
---------------------------

- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group:
  https://groups.google.com/forum/#!forum/fasttext-library

.. |Build Status| image:: https://travis-ci.org/salestock/fastText.py.svg?branch=master
   :target: https://travis-ci.org/salestock/fastText.py
.. |PyPI version| image:: https://badge.fury.io/py/fasttext.svg
   :target: https://badge.fury.io/py/fasttext
--------------------------------------------------------------------------------
/examples/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vishnumani2009/sklearn-fasttext/f0f40a83974dbebd492f6aecea26c04f0fb577f6/examples/__init__.py
--------------------------------------------------------------------------------
/examples/sklearn_fastest_classifier.py:
--------------------------------------------------------------------------------
import sys
#sys.path.insert(0, '../skfasttext/')
from skfasttext.FastTextClassifier import FastTextClassifier


"""
File paths
"""
train_file = "../data/classifier_test.txt"
test_file = "../data/classifier_test.txt"

"""
Train and test the model
"""

clf = FastTextClassifier()
print("training clf")
clf.fit(train_file)
print("testing clf")
print(clf.predict_proba(test_file, k_best=3))
--------------------------------------------------------------------------------
/fasttextclf.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report

class FastTextClassifier(BaseEstimator, ClassifierMixin):
    """Base classifier of the fastText estimator"""

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1, loss='softmax', nbucket=0, minn=0, maxn=0, thread=2, silent=1, output="model"):
        """
        label_prefix   label prefix ['__label__']
        lr             learning rate [0.1]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [1]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [softmax]
        bucket         number of buckets [0]
        minn           min length of char ngram [0]
        maxn           max length of char ngram [0]
        todo: recheck need of some of the variables, present in default classifier
        """

        self.label_prefix = lpr
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.thread = thread
        self.silent = silent
        self.classifier = None
        self.result = None

        self.output = output

    def fit(self, input_file):
        '''
        Input: takes a training file in fastText's labeled-text format
        returns the trained classifier object
        to do: add option to feed list of X and Y or file
        '''

        self.classifier = ft.supervised(input_file, self.output, dim=self.dim, lr=self.lr, epoch=self.epoch, min_count=self.min_count, word_ngrams=self.word_ngrams, bucket=self.bucket, thread=self.thread, silent=self.silent, label_prefix=self.label_prefix)
        return(self.classifier)

    def predict(self, test_file, csvflag=True, reports=False):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns a results object
        to do: add unit tests using sentiment analysis dataset
        to do: add k best labels option for csvflag = False
        to do: add report option
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict(test_file)
            elif csvflag:
                self.result = self.classifier.test(test_file)
            else:
                print("Error in input")
        except:
            print("Exception in predict call: error in format of test_file/input sentence list")
        return(self.result)

    def report(self, ytrue, ypred):
        '''
        Input: predicted and true labels
        prints a report of the classification
        to do: add label option and unit testing
        '''
        print(classification_report(ytrue, ypred))
        return None

    def predict_proba(self, X):
        '''
        Input: list of sentences
        returns the predicted labels with their probabilities
        to do: check output of classifier predict_proba, add label option and unit testing
        '''
        labels = self.classifier.predict_proba(X)
        return(labels)

    def getlabels(self):
        '''
        Input: None
        returns: class labels in dataset
        to do: check need of this function
        '''
        return(self.classifier.labels)

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''

        print("The model has the following hyperparameters as part of its specification")
        print("Label prefix used: " + str(self.label_prefix))
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        return(None)


    def loadpretrained(self, X):
        'returns the model with pretrained weights'
        pass

class SkipgramFastText(BaseEstimator, ClassifierMixin):

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding  # was `encodings`, which is undefined in this scope
        self.model = None
        self.result = None



    def fit(self, X, modelname='model', csvflag=False):
        '''
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        '''
        try:
            if not csvflag:
                self.model = ft.skipgram(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))
        return None

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)  # list of words in dictionary

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])


class cbowFastText(BaseEstimator, ClassifierMixin):
    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding
        self.model = None

    def fit(self, X, modelname='model', csvflag=False):
        '''
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        '''
        try:
            if not csvflag:
                self.model = ft.cbow(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup

setup(
    name="skfasttext",
    version="0.1",
    packages=["skfasttext"]
)
--------------------------------------------------------------------------------
/skfasttext/CBOW.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report


class cbowFastText(BaseEstimator, ClassifierMixin):
    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding
        self.model = None

    def fit(self, X, modelname='model', csvflag=False):
        """
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        """
        try:
            if not csvflag:
                self.model = ft.cbow(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        """
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        """
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/skfasttext/FastTextClassifier.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
import sys
from sklearn.metrics import classification_report

class FastTextClassifier(BaseEstimator, ClassifierMixin):
    """Base classifier of the fastText estimator"""

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=100, minc=1, neg=5, ngram=1, loss='softmax', nbucket=0, minn=0, maxn=0, thread=4, silent=0, output="model"):
        """
        label_prefix   label prefix ['__label__']
        lr             learning rate [0.1]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [1]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [softmax]
        bucket         number of buckets [0]
        minn           min length of char ngram [0]
        maxn           max length of char ngram [0]
        todo: recheck need of some of the variables, present in default classifier
        """

        self.label_prefix = lpr
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket
        self.minn = minn
        self.maxn = maxn
        self.thread = thread
        self.silent = silent
        self.classifier = None
        self.result = None
        self.output = output
        self.lpr = lpr

    def fit(self, input_file):
        '''
        Input: takes a training file in fastText's labeled-text format
        returns None; the trained classifier is stored on the estimator
        to do: add option to feed list of X and Y or file
        '''
        self.classifier = ft.supervised(input_file, self.output, dim=self.dim, lr=self.lr, epoch=self.epoch, min_count=self.min_count, word_ngrams=self.word_ngrams, bucket=self.bucket, thread=self.thread, silent=self.silent, label_prefix=self.lpr)
        return(None)

    def predict(self, test_file, csvflag=True, k_best=1):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns the predicted labels
        to do: add unit tests using sentiment analysis dataset
        to do: add k best labels option for csvflag = False
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict(test_file, k=k_best)
            if csvflag:
                lines = open(test_file, "r").readlines()
                sentences = [line.split(" , ")[1] for line in lines]
                self.result = self.classifier.predict(sentences, k_best)
        except:
            print("Error in input dataset.. please see if the file/list of sentences is of correct format")
            sys.exit(-1)
        self.result = [int(labels[0]) for labels in self.result]  # note: assumes numeric class labels
        return(self.result)

    def report(self, ytrue, ypred):
        '''
        Input: predicted and true labels
        prints a report of the classification
        to do: add label option and unit testing
        '''
        print(classification_report(ytrue, ypred))
        return None

    def predict_proba(self, test_file, csvflag=True, k_best=1):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns the predicted labels with their probabilities
        to do: check output of classifier predict_proba, add label option and unit testing
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict_proba(test_file, k=k_best)
            if csvflag:
                lines = open(test_file, "r").readlines()
                sentences = [line.split(" , ")[1] for line in lines]
                self.result = self.classifier.predict_proba(sentences, k_best)
        except:
            print("Error in input dataset.. please see if the file/list of sentences is of correct format")
            sys.exit(-1)
        return(self.result)

    def getlabels(self):
        '''
        Input: None
        returns: class labels in dataset
        to do: check need of this function
        '''
        return(self.classifier.labels)

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''

        print("The model has the following hyperparameters as part of its specification")
        print("Label prefix used: " + str(self.label_prefix))
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        return(None)


    def loadpretrained(self, X):
        '''returns the model with pretrained weights'''
        self.classifier = ft.load_model(X, label_prefix=self.lpr)
--------------------------------------------------------------------------------
/skfasttext/SkipGram.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report

class SkipgramFastText(BaseEstimator, ClassifierMixin):

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding  # was `encodings`, which is undefined in this scope
        self.model = None
        self.result = None



    def fit(self, X, modelname='model', csvflag=False):
        """
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        """
        try:
            if not csvflag:
                self.model = ft.skipgram(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        """
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        """
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))
        return None

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)  # list of words in dictionary

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/skfasttext/__init__.py:
--------------------------------------------------------------------------------
from .FastTextClassifier import FastTextClassifier
import os
__VERSION__ = '0.1'
--------------------------------------------------------------------------------