├── .gitignore
├── LICENSE
├── README.md
├── README.rst
├── examples
│   ├── __init__.py
│   └── sklearn_fastest_classifier.py
├── fasttextclf.py
├── setup.py
└── skfasttext
    ├── CBOW.py
    ├── FastTextClassifier.py
    ├── SkipGram.py
    └── __init__.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.egg-info
data/
*.pyc
*.so
*.tar.gz
*.DS_Store
*.bin
*.vec

build/
result/
dist/

fasttext/fasttext.cpp
facebookresearch-fasttext-*

# Intellij
.idea/

# pip
.eggs/

# For test
test/*_result.txt
test/dbpedia.train
test/dbpedia_csv/
test/default_params_test

# Misc
TODO
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Bayu Aldi Yansyah
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of fastText.py nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# fasttext [![Build Status](https://travis-ci.org/salestock/fastText.py.svg?branch=master)](https://travis-ci.org/salestock/fastText.py) [![PyPI version](https://badge.fury.io/py/fasttext.svg)](https://badge.fury.io/py/fasttext)

fasttext is a Python interface for
[Facebook fastText](https://github.com/facebookresearch/fastText).

## Requirements

fasttext supports Python 2.6 or newer. It requires
[Cython](https://pypi.python.org/pypi/Cython/) in order to build the C++ extension.

## Installation

```shell
pip install fasttext
```

## Example usage

This package has two main use cases: word representation learning and
text classification.

These use cases are described in the two papers
[1](#enriching-word-vectors-with-subword-information)
and [2](#bag-of-tricks-for-efficient-text-classification).

### Scikit-learn interface

The scikit-learn interface is consistent with the native scikit-learn API.

### Skipgram model
```python
from skfasttext.SkipGram import SkipgramFastText
clf = SkipgramFastText()
clf.fit(train_file)
```

### CBOW model
```python
from skfasttext.CBOW import cbowFastText
clf = cbowFastText()
clf.fit(train_file)
```
### Attributes and methods for the model

The skipgram and CBOW models have the following attributes & methods:

```python
model.model_name      # Model name
model.words           # List of words in the dictionary
model.dim             # Size of word vectors
model.ws              # Size of context window
model.epoch           # Number of epochs
model.min_count       # Minimal number of word occurrences
model.neg             # Number of negatives sampled
model.word_ngrams     # Max length of word ngram
model.loss_name       # Loss function name
model.bucket          # Number of buckets
model.minn            # Min length of char ngram
model.maxn            # Max length of char ngram
model.lr_update_rate  # Rate of updates for the learning rate
model.t               # Value of sampling threshold
model.encoding        # Encoding of the model
model[word]           # Get the vector of the specified word
```

### Fasttext classifier model
```python
from skfasttext.FastTextClassifier import FastTextClassifier
clf = FastTextClassifier()
clf.fit(train_file)
```

The classifier has the following attributes & methods:

```python
classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vectors
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negatives sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely labels
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities
```

### Native API usage
The source code can also be used through the native interface of the original
fasttext package, documented below.

### Word representation learning

In order to learn word vectors, as described in
[1](#enriching-word-vectors-with-subword-information), we can use the
`fasttext.skipgram` and `fasttext.cbow` functions like the following:

```python
import fasttext

# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print model.words  # list of words in dictionary

# CBOW model
model = fasttext.cbow('data.txt', 'model')
print model.words  # list of words in dictionary
```

where `data.txt` is a training file containing `utf-8` encoded text.
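For illustration only, `data.txt` can be any whitespace-tokenized plain text;
a couple of hypothetical lines might look like:

```
the quick brown fox jumps over the lazy dog
word vectors capture similarity between related words
```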
By default the word vectors will take into account character n-grams from
3 to 6 characters.

At the end of optimization the program will save two files:
`model.bin` and `model.vec`.

`model.vec` is a text file containing the word vectors, one per line.
`model.bin` is a binary file containing the parameters of the model
along with the dictionary and all hyper parameters.

The binary file can be used later to compute word vectors or
to restart the optimization.

The following `fasttext(1)` commands are equivalent:

```shell
# Skipgram model
./fasttext skipgram -input data.txt -output model

# CBOW model
./fasttext cbow -input data.txt -output model
```

### Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for
out-of-vocabulary words.

```python
print model['king']  # get the vector of the word 'king'
```

The following `fasttext(1)` command is equivalent:

```shell
echo "king" | ./fasttext print-vectors model.bin
```

This will output the vector of the word `king` to the standard output.

### Load pre-trained model

We can use `fasttext.load_model` to load a pre-trained model:

```python
model = fasttext.load_model('model.bin')
print model.words    # list of words in dictionary
print model['king']  # get the vector of the word 'king'
```

### Text classification

This package can also be used to train supervised text classifiers and load
pre-trained classifiers from fastText.

In order to train a text classifier using the method described in
[2](#bag-of-tricks-for-efficient-text-classification), we can use
the following function:

```python
classifier = fasttext.supervised('data.train.txt', 'model')
```

equivalent to the `fasttext(1)` command:

```shell
./fasttext supervised -input data.train.txt -output model
```

where `data.train.txt` is a text file containing a training sentence per line
along with the labels. By default, we assume that labels are words
that are prefixed by the string `__label__`.

We can specify the label prefix with the `label_prefix` param:

```python
classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')
```

equivalent to the `fasttext(1)` command:

```shell
./fasttext supervised -input data.train.txt -output model -label '__label__'
```

This will output two files: `model.bin` and `model.vec`.
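For reference, each line of `data.train.txt` couples a label with a sentence.
With the default prefix, a hypothetical two-class training file could look like:

```
__label__positive the film was a delight from start to finish
__label__negative the plot was predictable and the acting fell flat
```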

Once the model is trained, we can evaluate it by computing the precision
at 1 (P@1) and the recall on a test set using the `classifier.test` function:

```python
result = classifier.test('test.txt')
print 'P@1:', result.precision
print 'R@1:', result.recall
print 'Number of examples:', result.nexamples
```

This will print the same output to stdout as:

```shell
./fasttext test model.bin test.txt
```

In order to obtain the most likely label for a list of texts, we can
use the `classifier.predict` method:

```python
texts = ['example very long text 1', 'example very longtext 2']
labels = classifier.predict(texts)
print labels

# Or with the probability
labels = classifier.predict_proba(texts)
print labels
```

We can specify the `k` value to get the k best labels from the classifier:

```python
labels = classifier.predict(texts, k=3)
print labels

# Or with the probability
labels = classifier.predict_proba(texts, k=3)
print labels
```

This interface is equivalent to the `fasttext(1)` predict command: the same
model with the same input set will produce the same predictions.

## API documentation

### Skipgram model

Train & load a skipgram model:

```python
model = fasttext.skipgram(params)
```

List of available `params` and their default values:

```
input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]
```

Example usage:

```python
model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)
```

### CBOW model

Train & load a CBOW model:

```python
model = fasttext.cbow(params)
```

List of available `params` and their default values:

```
input_file     training file path (required)
output         output file path (required)
lr             learning rate [0.05]
lr_update_rate change the rate of updates for the learning rate [100]
dim            size of word vectors [100]
ws             size of the context window [5]
epoch          number of epochs [5]
min_count      minimal number of word occurrences [5]
neg            number of negatives sampled [5]
word_ngrams    max length of word ngram [1]
loss           loss function {ns, hs, softmax} [ns]
bucket         number of buckets [2000000]
minn           min length of char ngram [3]
maxn           max length of char ngram [6]
thread         number of threads [12]
t              sampling threshold [0.0001]
silent         disable the log output from the C++ extension [1]
encoding       specify input_file encoding [utf-8]
```
Example usage:

```python
model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)
```

### Load pre-trained model

A `.bin` file previously trained or generated by fastText can be
loaded using this function:

```python
model = fasttext.load_model('model.bin', encoding='utf-8')
```

### Attributes and methods for the model

The skipgram and CBOW models have the following attributes & methods:

```python
model.model_name      # Model name
model.words           # List of words in the dictionary
model.dim             # Size of word vectors
model.ws              # Size of context window
model.epoch           # Number of epochs
model.min_count       # Minimal number of word occurrences
model.neg             # Number of negatives sampled
model.word_ngrams     # Max length of word ngram
model.loss_name       # Loss function name
model.bucket          # Number of buckets
model.minn            # Min length of char ngram
model.maxn            # Max length of char ngram
model.lr_update_rate  # Rate of updates for the learning rate
model.t               # Value of sampling threshold
model.encoding        # Encoding of the model
model[word]           # Get the vector of the specified word
```

### Supervised model

Train & load the classifier:

```python
classifier = fasttext.supervised(params)
```

List of available `params` and their default values:

```
input_file         training file path (required)
output             output file path (required)
label_prefix       label prefix ['__label__']
lr                 learning rate [0.1]
lr_update_rate     change the rate of updates for the learning rate [100]
dim                size of word vectors [100]
ws                 size of the context window [5]
epoch              number of epochs [5]
min_count          minimal number of word occurrences [1]
neg                number of negatives sampled [5]
word_ngrams        max length of word ngram [1]
loss               loss function {ns, hs, softmax} [softmax]
bucket             number of buckets [0]
minn               min length of char ngram [0]
maxn               max length of char ngram [0]
thread             number of threads [12]
t                  sampling threshold [0.0001]
silent             disable the log output from the C++ extension [1]
encoding           specify input_file encoding [utf-8]
pretrained_vectors pretrained word vectors (.vec file) for supervised learning []
```

Example usage:

```python
classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                 thread=4)
```

### Load pre-trained classifier

A `.bin` file previously trained or generated by fastText can be
loaded using this function:

```shell
./fasttext supervised -input train.txt -output classifier -label 'some_prefix'
```

```python
classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')
```

### Test classifier

This is equivalent to the `fasttext(1)` test command. Testing with the same
model and test set will produce the same value for the precision at one
and the number of examples.

```python
result = classifier.test(params)

# Properties
result.precision  # Precision at one
result.recall     # Recall at one
result.nexamples  # Number of test examples
```

The param `k` is optional, and equal to `1` by default.
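As a quick sketch (assuming `test.txt` follows the same labeled format as the
training file), evaluation at a different `k` looks like:

```python
result = classifier.test('test.txt', k=2)
print 'P@2:', result.precision  # precision at two
print 'R@2:', result.recall     # recall at two
```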

### Predict the most-likely label of texts

This interface is equivalent to the `fasttext(1)` predict command.

`texts` is an array of strings.

```python
labels = classifier.predict(texts, k)

# Or with probability
labels = classifier.predict_proba(texts, k)
```

The param `k` is optional, and equal to `1` by default.

### Attributes and methods for the classifier

The classifier has the following attributes & methods:

```python
classifier.labels                  # List of labels
classifier.label_prefix            # Prefix of the label
classifier.dim                     # Size of word vectors
classifier.ws                      # Size of context window
classifier.epoch                   # Number of epochs
classifier.min_count               # Minimal number of word occurrences
classifier.neg                     # Number of negatives sampled
classifier.word_ngrams             # Max length of word ngram
classifier.loss_name               # Loss function name
classifier.bucket                  # Number of buckets
classifier.minn                    # Min length of char ngram
classifier.maxn                    # Max length of char ngram
classifier.lr_update_rate          # Rate of updates for the learning rate
classifier.t                       # Value of sampling threshold
classifier.encoding                # Encoding used by the classifier
classifier.test(filename, k)       # Test the classifier
classifier.predict(texts, k)       # Predict the most likely labels
classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities
```

The param `k` for `classifier.test`, `classifier.predict` and
`classifier.predict_proba` is optional,
and equal to `1` by default.

## References

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/pdf/1607.04606v1.pdf)

```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/pdf/1607.01759v2.pdf)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
(\* These authors contributed equally.)

### Native Python interface

A huge thank you to [fastText.py](https://github.com/salestock/fastText.py) for
building such an amazing native Python interface, around which this scikit-learn
wrapper is written.

## Join the fastText community

* Facebook page: https://www.facebook.com/groups/1174547215919768
* Google group: https://groups.google.com/forum/#!forum/fasttext-library

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
fasttext |Build Status| |PyPI version|
======================================

fasttext is a Python interface for `Facebook
fastText <https://github.com/facebookresearch/fastText>`__.

Requirements
------------

fasttext supports Python 2.6 or newer. It requires
`Cython <https://pypi.python.org/pypi/Cython/>`__ in order to build the
C++ extension.

Installation
------------

.. code:: shell

    pip install fasttext

Example usage
-------------

This package has two main use cases: word representation learning and
text classification.

These use cases are described in the two papers
`1 <#enriching-word-vectors-with-subword-information>`__ and
`2 <#bag-of-tricks-for-efficient-text-classification>`__.

Word representation learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to learn word vectors, as described in
`1 <#enriching-word-vectors-with-subword-information>`__, we can use the
``fasttext.skipgram`` and ``fasttext.cbow`` functions like the following:

.. code:: python

    import fasttext

    # Skipgram model
    model = fasttext.skipgram('data.txt', 'model')
    print model.words  # list of words in dictionary

    # CBOW model
    model = fasttext.cbow('data.txt', 'model')
    print model.words  # list of words in dictionary

where ``data.txt`` is a training file containing ``utf-8`` encoded text.
By default the word vectors will take into account character n-grams
from 3 to 6 characters.

At the end of optimization the program will save two files:
``model.bin`` and ``model.vec``.

``model.vec`` is a text file containing the word vectors, one per line.
``model.bin`` is a binary file containing the parameters of the model
along with the dictionary and all hyper parameters.

The binary file can be used later to compute word vectors or to restart
the optimization.

The following ``fasttext(1)`` commands are equivalent:

.. code:: shell

    # Skipgram model
    ./fasttext skipgram -input data.txt -output model

    # CBOW model
    ./fasttext cbow -input data.txt -output model

Obtaining word vectors for out-of-vocabulary words
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The previously trained model can be used to compute word vectors for
out-of-vocabulary words.

.. code:: python

    print model['king']  # get the vector of the word 'king'

The following ``fasttext(1)`` command is equivalent:

.. code:: shell

    echo "king" | ./fasttext print-vectors model.bin

This will output the vector of the word ``king`` to the standard output.

Load pre-trained model
~~~~~~~~~~~~~~~~~~~~~~

We can use ``fasttext.load_model`` to load a pre-trained model:

.. code:: python

    model = fasttext.load_model('model.bin')
    print model.words    # list of words in dictionary
    print model['king']  # get the vector of the word 'king'

Text classification
~~~~~~~~~~~~~~~~~~~

This package can also be used to train supervised text classifiers and
load pre-trained classifiers from fastText.

In order to train a text classifier using the method described in
`2 <#bag-of-tricks-for-efficient-text-classification>`__, we can use the
following function:

.. code:: python

    classifier = fasttext.supervised('data.train.txt', 'model')

equivalent to the ``fasttext(1)`` command:

.. code:: shell

    ./fasttext supervised -input data.train.txt -output model

where ``data.train.txt`` is a text file containing a training sentence
per line along with the labels. By default, we assume that labels are
words that are prefixed by the string ``__label__``.

We can specify the label prefix with the ``label_prefix`` param:

.. code:: python

    classifier = fasttext.supervised('data.train.txt', 'model', label_prefix='__label__')

equivalent to the ``fasttext(1)`` command:

.. code:: shell

    ./fasttext supervised -input data.train.txt -output model -label '__label__'

This will output two files: ``model.bin`` and ``model.vec``.

Once the model is trained, we can evaluate it by computing the
precision at 1 (P@1) and the recall on a test set using the
``classifier.test`` function:

.. code:: python

    result = classifier.test('test.txt')
    print 'P@1:', result.precision
    print 'R@1:', result.recall
    print 'Number of examples:', result.nexamples

This will print the same output to stdout as:

.. code:: shell

    ./fasttext test model.bin test.txt

In order to obtain the most likely label for a list of texts, we can use
the ``classifier.predict`` method:

.. code:: python

    texts = ['example very long text 1', 'example very longtext 2']
    labels = classifier.predict(texts)
    print labels

    # Or with the probability
    labels = classifier.predict_proba(texts)
    print labels

We can specify the ``k`` value to get the k best labels from the classifier:

.. code:: python

    labels = classifier.predict(texts, k=3)
    print labels

    # Or with the probability
    labels = classifier.predict_proba(texts, k=3)
    print labels

This interface is equivalent to the ``fasttext(1)`` predict command. The
same model with the same input set will produce the same predictions.

API documentation
-----------------

Skipgram model
~~~~~~~~~~~~~~

Train & load a skipgram model:

.. code:: python

    model = fasttext.skipgram(params)

List of available ``params`` and their default values:

::

    input_file     training file path (required)
    output         output file path (required)
    lr             learning rate [0.05]
    lr_update_rate change the rate of updates for the learning rate [100]
    dim            size of word vectors [100]
    ws             size of the context window [5]
    epoch          number of epochs [5]
    min_count      minimal number of word occurrences [5]
    neg            number of negatives sampled [5]
    word_ngrams    max length of word ngram [1]
    loss           loss function {ns, hs, softmax} [ns]
    bucket         number of buckets [2000000]
    minn           min length of char ngram [3]
    maxn           max length of char ngram [6]
    thread         number of threads [12]
    t              sampling threshold [0.0001]
    silent         disable the log output from the C++ extension [1]
    encoding       specify input_file encoding [utf-8]

Example usage:

.. code:: python

    model = fasttext.skipgram('train.txt', 'model', lr=0.1, dim=300)

CBOW model
~~~~~~~~~~

Train & load a CBOW model:

.. code:: python

    model = fasttext.cbow(params)

List of available ``params`` and their default values:

::

    input_file     training file path (required)
    output         output file path (required)
    lr             learning rate [0.05]
    lr_update_rate change the rate of updates for the learning rate [100]
    dim            size of word vectors [100]
    ws             size of the context window [5]
    epoch          number of epochs [5]
    min_count      minimal number of word occurrences [5]
    neg            number of negatives sampled [5]
    word_ngrams    max length of word ngram [1]
    loss           loss function {ns, hs, softmax} [ns]
    bucket         number of buckets [2000000]
    minn           min length of char ngram [3]
    maxn           max length of char ngram [6]
    thread         number of threads [12]
    t              sampling threshold [0.0001]
    silent         disable the log output from the C++ extension [1]
    encoding       specify input_file encoding [utf-8]

Example usage:

.. code:: python

    model = fasttext.cbow('train.txt', 'model', lr=0.1, dim=300)

Load pre-trained model
~~~~~~~~~~~~~~~~~~~~~~

A ``.bin`` file previously trained or generated by fastText can be
loaded using this function:

.. code:: python

    model = fasttext.load_model('model.bin', encoding='utf-8')

Attributes and methods for the model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The skipgram and CBOW models have the following attributes & methods:

.. code:: python

    model.model_name      # Model name
    model.words           # List of words in the dictionary
    model.dim             # Size of word vectors
    model.ws              # Size of context window
    model.epoch           # Number of epochs
    model.min_count       # Minimal number of word occurrences
    model.neg             # Number of negatives sampled
    model.word_ngrams     # Max length of word ngram
    model.loss_name       # Loss function name
    model.bucket          # Number of buckets
    model.minn            # Min length of char ngram
    model.maxn            # Max length of char ngram
    model.lr_update_rate  # Rate of updates for the learning rate
    model.t               # Value of sampling threshold
    model.encoding        # Encoding of the model
    model[word]           # Get the vector of the specified word

Supervised model
~~~~~~~~~~~~~~~~

Train & load the classifier:

.. code:: python

    classifier = fasttext.supervised(params)

List of available ``params`` and their default values:

::

    input_file         training file path (required)
    output             output file path (required)
    label_prefix       label prefix ['__label__']
    lr                 learning rate [0.1]
    lr_update_rate     change the rate of updates for the learning rate [100]
    dim                size of word vectors [100]
    ws                 size of the context window [5]
    epoch              number of epochs [5]
    min_count          minimal number of word occurrences [1]
    neg                number of negatives sampled [5]
    word_ngrams        max length of word ngram [1]
    loss               loss function {ns, hs, softmax} [softmax]
    bucket             number of buckets [0]
    minn               min length of char ngram [0]
    maxn               max length of char ngram [0]
    thread             number of threads [12]
    t                  sampling threshold [0.0001]
    silent             disable the log output from the C++ extension [1]
    encoding           specify input_file encoding [utf-8]
    pretrained_vectors pretrained word vectors (.vec file) for supervised learning []

Example usage:

.. code:: python

    classifier = fasttext.supervised('train.txt', 'model', label_prefix='__myprefix__',
                                     thread=4)

Load pre-trained classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``.bin`` file previously trained or generated by fastText can be
loaded using this function:

.. code:: shell

    ./fasttext supervised -input train.txt -output classifier -label 'some_prefix'

.. code:: python

    classifier = fasttext.load_model('classifier.bin', label_prefix='some_prefix')

Test classifier
~~~~~~~~~~~~~~~

This is equivalent to the ``fasttext(1)`` test command. Testing with the
same model and test set will produce the same value for the precision at
one and the number of examples.

.. code:: python

    result = classifier.test(params)

    # Properties
    result.precision  # Precision at one
    result.recall     # Recall at one
    result.nexamples  # Number of test examples

The param ``k`` is optional, and equal to ``1`` by default.

Predict the most-likely label of texts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This interface is equivalent to the ``fasttext(1)`` predict command.

``texts`` is an array of strings.

.. code:: python

    labels = classifier.predict(texts, k)

    # Or with probability
    labels = classifier.predict_proba(texts, k)

The param ``k`` is optional, and equal to ``1`` by default.
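As a rough sketch, the two calls differ only in whether probabilities are
attached to the returned labels. The return shapes shown in the comments are
an assumption for illustration, not part of the documented API:

.. code:: python

    texts = ['example text']
    print classifier.predict(texts, 2)        # e.g. [['label1', 'label2']]
    print classifier.predict_proba(texts, 2)  # e.g. [[('label1', 0.94), ('label2', 0.06)]]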

Attributes and methods for the classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The classifier has the following attributes & methods:

.. code:: python

    classifier.labels                  # List of labels
    classifier.label_prefix            # Prefix of the label
    classifier.dim                     # Size of word vectors
    classifier.ws                      # Size of context window
    classifier.epoch                   # Number of epochs
    classifier.min_count               # Minimal number of word occurrences
    classifier.neg                     # Number of negatives sampled
    classifier.word_ngrams             # Max length of word ngram
    classifier.loss_name               # Loss function name
    classifier.bucket                  # Number of buckets
    classifier.minn                    # Min length of char ngram
    classifier.maxn                    # Max length of char ngram
    classifier.lr_update_rate          # Rate of updates for the learning rate
    classifier.t                       # Value of sampling threshold
    classifier.encoding                # Encoding used by the classifier
    classifier.test(filename, k)       # Test the classifier
    classifier.predict(texts, k)       # Predict the most likely labels
    classifier.predict_proba(texts, k) # Predict the most likely labels along with their probabilities

The param ``k`` for ``classifier.test``, ``classifier.predict`` and
``classifier.predict_proba`` is optional, and equal to ``1`` by default.

References
----------

Enriching Word Vectors with Subword Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, `*Enriching Word
Vectors with Subword
Information* <https://arxiv.org/pdf/1607.04606v1.pdf>`__

::

    @article{bojanowski2016enriching,
      title={Enriching Word Vectors with Subword Information},
      author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.04606},
      year={2016}
    }

Bag of Tricks for Efficient Text Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, `*Bag of Tricks for
Efficient Text
Classification* <https://arxiv.org/pdf/1607.01759v2.pdf>`__

::

    @article{joulin2016bag,
      title={Bag of Tricks for Efficient Text Classification},
      author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.01759},
      year={2016}
    }

(\* These authors contributed equally.)

Join the fastText community
---------------------------

- Facebook page: https://www.facebook.com/groups/1174547215919768
- Google group:
  https://groups.google.com/forum/#!forum/fasttext-library

.. |Build Status| image:: https://travis-ci.org/salestock/fastText.py.svg?branch=master
   :target: https://travis-ci.org/salestock/fastText.py
.. |PyPI version| image:: https://badge.fury.io/py/fasttext.svg
   :target: https://badge.fury.io/py/fasttext
--------------------------------------------------------------------------------
/examples/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/vishnumani2009/sklearn-fasttext/f0f40a83974dbebd492f6aecea26c04f0fb577f6/examples/__init__.py
--------------------------------------------------------------------------------
/examples/sklearn_fastest_classifier.py:
--------------------------------------------------------------------------------
import sys
#sys.path.insert(0, '../skfasttext/')
from skfasttext.FastTextClassifier import FastTextClassifier


"""
File paths
"""
train_file = "../data/classifier_test.txt"
test_file = "../data/classifier_test.txt"

"""
Train and test the model
"""

clf = FastTextClassifier()
print("training clf")
clf.fit(train_file)
print("testing clf")
print(clf.predict_proba(test_file, k_best=3))
--------------------------------------------------------------------------------
/fasttextclf.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report

class FastTextClassifier(BaseEstimator, ClassifierMixin):
    """Base classifier of the fastText estimator"""

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1, loss='softmax', nbucket=0, minn=0, maxn=0, thread=2, silent=1, output="model"):
        """
        label_prefix   label prefix ['__label__']
        lr             learning rate [0.1]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [1]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [softmax]
        bucket         number of buckets [0]
        minn           min length of char ngram [0]
        maxn           max length of char ngram [0]
        todo: recheck need of some of the variables, present in default classifier
        """

        self.label_prefix = lpr
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.thread = thread
        self.silent = silent
        self.classifier = None
        self.result = None

        self.output = output

    def fit(self, input_file):
        '''
        Input: takes a training file in fastText's labeled-text format
        returns the trained classifier object
        to do: add option to feed list of X and Y or file
        '''

        self.classifier = ft.supervised(input_file, self.output, dim=self.dim, lr=self.lr, epoch=self.epoch, min_count=self.min_count, word_ngrams=self.word_ngrams, bucket=self.bucket, thread=self.thread, silent=self.silent, label_prefix=self.label_prefix)
        return(self.classifier)

    def predict(self, test_file, csvflag=True, reports=False):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns a results object
        to do: add unit tests using sentiment analysis dataset
        to do: add k best labels option for csvflag = False
        to do: add report option
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict(test_file)
            elif csvflag:
                self.result = self.classifier.test(test_file)
            else:
                print("Error in input")
        except:
            print("Exception in predict call: error in format of test_file/input sentence list")
        return(self.result)

    def report(self, ytrue, ypred):
        '''
        Input: predicted and true labels
        prints a report of the classification
        to do: add label option and unit testing
        '''
        print(classification_report(ytrue, ypred))
        return None

    def predict_proba(self, X):
        '''
        Input: list of sentences
        returns the predicted labels with their probabilities
        to do: check output of classifier predict_proba, add label option and unit testing
        '''
        labels = self.classifier.predict_proba(X)
        return(labels)

    def getlabels(self):
        '''
        Input: None
        returns: class labels in dataset
        to do: check need of this function
        '''
        return(self.classifier.labels)

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''

        print("The model has the following hyperparameters as part of its specification")
        print("Label prefix used: " + str(self.label_prefix))
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        return(None)


    def loadpretrained(self, X):
        'returns the model with pretrained weights'
        pass

class SkipgramFastText(BaseEstimator, ClassifierMixin):

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding  # was `encodings`, which is undefined in this scope
        self.model = None
        self.result = None



    def fit(self, X, modelname='model', csvflag=False):
        '''
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        '''
        try:
            if not csvflag:
                self.model = ft.skipgram(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))
        return None

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)  # list of words in dictionary

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])


class cbowFastText(BaseEstimator, ClassifierMixin):
    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding
        self.model = None

    def fit(self, X, modelname='model', csvflag=False):
        '''
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        '''
        try:
            if not csvflag:
                self.model = ft.cbow(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup

setup(
    name="skfasttext",
    version="0.1",
    packages=["skfasttext"]
)
--------------------------------------------------------------------------------
/skfasttext/CBOW.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report


class cbowFastText(BaseEstimator, ClassifierMixin):
    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding
        self.model = None

    def fit(self, X, modelname='model', csvflag=False):
        """
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        """
        try:
            if not csvflag:
                self.model = ft.cbow(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        """
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        """
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/skfasttext/FastTextClassifier.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
import sys
from sklearn.metrics import classification_report

class FastTextClassifier(BaseEstimator, ClassifierMixin):
    """Base classifier of the fastText estimator"""

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=100, minc=1, neg=5, ngram=1, loss='softmax', nbucket=0, minn=0, maxn=0, thread=4, silent=0, output="model"):
        """
        label_prefix   label prefix ['__label__']
        lr             learning rate [0.1]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [1]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [softmax]
        bucket         number of buckets [0]
        minn           min length of char ngram [0]
        maxn           max length of char ngram [0]
        todo: recheck need of some of the variables, present in default classifier
        """

        self.label_prefix = lpr
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket
        self.minn = minn
        self.maxn = maxn
        self.thread = thread
        self.silent = silent
        self.classifier = None
        self.result = None
        self.output = output
        self.lpr = lpr

    def fit(self, input_file):
        '''
        Input: takes a training file in fastText's labeled-text format
        returns None; the trained classifier is stored on the estimator
        to do: add option to feed list of X and Y or file
        '''
        self.classifier = ft.supervised(input_file, self.output, dim=self.dim, lr=self.lr, epoch=self.epoch, min_count=self.min_count, word_ngrams=self.word_ngrams, bucket=self.bucket, thread=self.thread, silent=self.silent, label_prefix=self.lpr)
        return(None)

    def predict(self, test_file, csvflag=True, k_best=1):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns the predicted labels
        to do: add unit tests using sentiment analysis dataset
        to do: add k best labels option for csvflag = False
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict(test_file, k=k_best)
            if csvflag:
                lines = open(test_file, "r").readlines()
                sentences = [line.split(" , ")[1] for line in lines]
                self.result = self.classifier.predict(sentences, k_best)
        except:
            print("Error in input dataset.. please see if the file/list of sentences is of correct format")
            sys.exit(-1)
        self.result = [int(labels[0]) for labels in self.result]  # note: assumes numeric class labels
        return(self.result)

    def report(self, ytrue, ypred):
        '''
        Input: predicted and true labels
        prints a report of the classification
        to do: add label option and unit testing
        '''
        print(classification_report(ytrue, ypred))
        return None

    def predict_proba(self, test_file, csvflag=True, k_best=1):
        '''
        Input: takes a test file (csvflag=True) or a list of sentences (csvflag=False)
        returns the predicted labels with their probabilities
        to do: check output of classifier predict_proba, add label option and unit testing
        '''
        try:
            if not csvflag and isinstance(test_file, list):
                self.result = self.classifier.predict_proba(test_file, k=k_best)
            if csvflag:
                lines = open(test_file, "r").readlines()
                sentences = [line.split(" , ")[1] for line in lines]
                self.result = self.classifier.predict_proba(sentences, k_best)
        except:
            print("Error in input dataset.. please see if the file/list of sentences is of correct format")
            sys.exit(-1)
        return(self.result)

    def getlabels(self):
        '''
        Input: None
        returns: class labels in dataset
        to do: check need of this function
        '''
        return(self.classifier.labels)

    def getproperties(self):
        '''
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        '''

        print("The model has the following hyperparameters as part of its specification")
        print("Label prefix used: " + str(self.label_prefix))
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        return(None)


    def loadpretrained(self, X):
        '''returns the model with pretrained weights'''
        self.classifier = ft.load_model(X, label_prefix=self.lpr)
--------------------------------------------------------------------------------
/skfasttext/SkipGram.py:
--------------------------------------------------------------------------------
from sklearn.base import BaseEstimator, ClassifierMixin
import fasttext as ft
from sklearn.metrics import classification_report

class SkipgramFastText(BaseEstimator, ClassifierMixin):

    def __init__(self, lpr='__label__', lr=0.1, lru=100, dim=100, ws=5, epoch=5, minc=1, neg=5, ngram=1,
                 loss='softmax', nbucket=0, minn=0, maxn=0, th=12, t=0.0001, verbosec=0, encoding='utf-8'):
        """
        lr             learning rate [0.05]
        lr_update_rate change the rate of updates for the learning rate [100]
        dim            size of word vectors [100]
        ws             size of the context window [5]
        epoch          number of epochs [5]
        min_count      minimal number of word occurrences [5]
        neg            number of negatives sampled [5]
        word_ngrams    max length of word ngram [1]
        loss           loss function {ns, hs, softmax} [ns]
        bucket         number of buckets [2000000]
        minn           min length of char ngram [3]
        maxn           max length of char ngram [6]
        thread         number of threads [12]
        t              sampling threshold [0.0001]
        silent         disable the log output from the C++ extension [1]
        encoding       specify input_file encoding [utf-8]
        """
        self.lr = lr
        self.lr_update_rate = lru
        self.dim = dim
        self.ws = ws
        self.epoch = epoch
        self.min_count = minc
        self.neg = neg
        self.word_ngrams = ngram
        self.loss = loss
        self.bucket = nbucket  # was `bucket`, which is undefined in this scope
        self.minn = minn
        self.maxn = maxn
        self.n_thread = th
        self.samplet = t
        self.silent = verbosec
        self.enc = encoding  # was `encodings`, which is undefined in this scope
        self.model = None
        self.result = None



    def fit(self, X, modelname='model', csvflag=False):
        """
        Input: takes a training file of utf-8 encoded text
        stores the fitted model on the estimator
        to do: add option to feed list of X and Y or file
        to do: check options for the api call
        to do: write unit test
        """
        try:
            if not csvflag:
                self.model = ft.skipgram(X, modelname, lr=self.lr, dim=self.dim, lr_update_rate=self.lr_update_rate, epoch=self.epoch, bucket=self.bucket, loss=self.loss, thread=self.n_thread)
        except:
            print("Error in input dataset format")

    def getproperties(self):
        """
        Input: nothing, other than the object self pointer
        Return: None, prints the descriptions of the model hyperparameters
        """
        print("The model has the following hyperparameters as part of its specification")
        print("Learning rate: " + str(self.lr))
        print("Learning rate updated after " + str(self.lr_update_rate) + " iterations")
        print("Embedding size: " + str(self.dim))
        print("Epochs: " + str(self.epoch))
        print("Minimal number of word occurrences: " + str(self.min_count))
        print("Number of negatives sampled: " + str(self.neg))
        print("Max length of word ngram: " + str(self.word_ngrams))
        print("Loss function: " + str(self.loss))
        print("Number of buckets: " + str(self.bucket))
        print("Min length of char ngram: " + str(self.minn))
        print("Max length of char ngram: " + str(self.maxn))
        print("Number of threads: " + str(self.n_thread))
        print("Sampling threshold: " + str(self.samplet))
        print("Verbose log output from the C++ extension enable=1/disable=0: " + str(self.silent))
        print("input_file encoding: " + str(self.enc))
        return None

    def getwords(self):
        """to do: check words list"""
        return(self.model.words)  # list of words in dictionary

    def getvector(self, word=None):
        """
        to do: add try catch for word type
        to do: add try catch for word existence
        """
        return(self.model[word])
--------------------------------------------------------------------------------
/skfasttext/__init__.py:
--------------------------------------------------------------------------------
from .FastTextClassifier import FastTextClassifier
import os
__VERSION__ = '0.1'
--------------------------------------------------------------------------------