├── .gitignore
├── .nojekyll
├── .travis.yml
├── 404.html
├── CNAME
├── Dockerfile
├── LICENSE
├── README.md
├── SUMMARY.md
├── asset
│   ├── docsify-apachecn-footer.js
│   ├── docsify-baidu-push.js
│   ├── docsify-baidu-stat.js
│   ├── docsify-clicker.js
│   ├── docsify-cnzz.js
│   ├── docsify-copy-code.min.js
│   ├── prism-darcula.css
│   ├── search.min.js
│   └── style.css
├── doc
│   ├── en
│   │   ├── api.md
│   │   ├── cheatsheet.md
│   │   ├── crawl-vectors.md
│   │   ├── dataset.md
│   │   ├── english-vectors.md
│   │   ├── faqs.md
│   │   ├── language-identification.md
│   │   ├── options.md
│   │   ├── pretrained-vectors.md
│   │   ├── references.md
│   │   ├── supervised-models.md
│   │   ├── supervised-tutorial.md
│   │   ├── support.md
│   │   └── unsupervised-tutorials.md
│   └── zh
│       ├── _templates
│       │   └── layout.html
│       ├── api.md
│       ├── cheatsheet.md
│       ├── conf.py
│       ├── crawl-vectors.md
│       ├── dataset.md
│       ├── english-vectors.md
│       ├── faqs.md
│       ├── language-identification.md
│       ├── options.md
│       ├── pretrained-vectors.md
│       ├── references.md
│       ├── supervised-models.md
│       ├── supervised-tutorial.md
│       ├── support.md
│       └── unsupervised-tutorials.md
├── index.html
└── update.sh
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
--------------------------------------------------------------------------------
/.nojekyll:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apachecn/fasttext-doc-zh/31f7dd7cddeedfed76228bed2ad5b6e6c95be2a2/.nojekyll
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python: 3.6

install:
  - ':'

script:
  - ':'

after_script:
  - git config user.name ${GH_UN}
  - git config user.email ${GH_EMAIL}
  - git push "https://${GH_TOKEN}@github.com/${GH_USER}/${GH_REPO}.git" v0.1.0:${GH_BRANCH} -f

env:
  global:
    - GH_UN=jiangzhonglian
    - GH_EMAIL=jiang-s@163.com
    - GH_USER=apachecn
    - GH_REPO=fasttext-doc-zh
    - GH_BRANCH=gh-pages
--------------------------------------------------------------------------------
/404.html:
--------------------------------------------------------------------------------
---
permalink: /404.html
---

--------------------------------------------------------------------------------
/CNAME:
--------------------------------------------------------------------------------
fasttext.apachecn.org
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM httpd:2.4
COPY ./ /usr/local/apache2/htdocs/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# FastText Chinese Documentation

FastText is a library for efficient learning of word representations and sentence classification.

Official site: <https://fasttext.cc/>
Chinese community: <http://www.apachecn.org/>

## Documentation links

* FastText Chinese community: <http://www.apachecn.org/>
* FastText Chinese documentation: <http://fasttext.apachecn.org>
* FastText English documentation: <https://fasttext.cc/>

## Project leads

* [@wnma](https://github.com/wnma3mz)
* [@Lisanaaa](https://github.com/Lisanaaa)

**Maintained by:** [@ApacheCN](https://github.com/apachecn)

## Contributors

### Contributors to the FastText 0.1.0 Chinese documentation

| Title | Translator | Proofreader |
| --- | --- | --- |
| [api](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/api.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [cheatsheet](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/cheatsheet.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [crawl-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/crawl-vectors.md) | [@GMbappe](https://github.com/GMbappe) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [dataset](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/dataset.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [english-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/english-vectors.md) | [@GMbappe](https://github.com/GMbappe) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [faqs](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/faqs.md) | [@Twinkle](https://github.com/kemingzeng) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [language-identification](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/language-identification.md) | [@wnma](https://github.com/wnma3mz) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [options](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/options.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [pretrained-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/pretrained-vectors.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [references](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/references.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [supervised-models](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/supervised-models.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [supervised-tutorial](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/supervised-tutorial.md) | [@Lisanaaa](https://github.com/Lisanaaa) | [@wnma](https://github.com/wnma3mz) |
| [support](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/support.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [unsupervised-tutorials](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/unsupervised-tutorials.md) | [@wnma](https://github.com/wnma3mz) | [@Lisanaaa](https://github.com/Lisanaaa) |

## Join us

If you would like to join us, please see: .
All you show-off gurus are welcome.

## Feedback

- Contact the project leads [@wnma](https://github.com/wnma3mz) or [@Lisanaaa](https://github.com/Lisanaaa).
- Open an issue on our GitHub repo [apachecn/fasttext-doc-zh](https://github.com/apachecn/fasttext-doc-zh).
- Send an email to fasttext#apachecn.org (replace # with @).
- Contact the group owner or an admin in our [organization study group](http://www.apachecn.org/organization/348.html).

## Download

### Docker

```
docker pull apachecn0/fasttext-doc-zh
docker run -tid -p {port}:80 apachecn0/fasttext-doc-zh
# visit http://localhost:{port} to view the docs
```

### PyPI

```
pip install fasttext-doc-zh
fasttext-doc-zh {port}
# visit http://localhost:{port} to view the docs
```

### NPM

```
npm install -g fasttext-doc-zh
fasttext-doc-zh {port}
# visit http://localhost:{port} to view the docs
```
--------------------------------------------------------------------------------
/SUMMARY.md:
--------------------------------------------------------------------------------
+ Introduction
    + [Getting started](doc/zh/support.md)
    + [Cheatsheet](doc/zh/cheatsheet.md)
    + [List of options](doc/zh/options.md)
+ Tutorials
    + [Word vectors for 157 languages](doc/zh/crawl-vectors.md)
    + [Datasets](doc/zh/dataset.md)
    + [English word vectors](doc/zh/english-vectors.md)
    + [Language identification](doc/zh/language-identification.md)
    + [Wiki word vectors](doc/zh/pretrained-vectors.md)
    + [Supervised models](doc/zh/supervised-models.md)
    + [Text classification](doc/zh/supervised-tutorial.md)
    + [Word representations](doc/zh/unsupervised-tutorials.md)
+ Help
    + [FAQ](doc/zh/faqs.md)
    + [API](doc/zh/api.md)
    + [References](doc/zh/references.md)
--------------------------------------------------------------------------------
/asset/docsify-apachecn-footer.js:
--------------------------------------------------------------------------------
(function(){
  var footer = [
    // NOTE: the footer's HTML template strings were mangled during extraction.
    // The surviving text content: "我们一直在努力" ("we have been working hard"),
    // a link to apachecn/fasttext-doc-zh, an "ML | ApacheCN" link, and an
    // AdSense <script> block.
    '<!-- footer markup lost in extraction -->
' 19 | ].join('\n') 20 | var plugin = function(hook) { 21 | hook.afterEach(function(html) { 22 | return html + footer 23 | }) 24 | hook.doneEach(function() { 25 | (adsbygoogle = window.adsbygoogle || []).push({}) 26 | }) 27 | } 28 | var plugins = window.$docsify.plugins || [] 29 | plugins.push(plugin) 30 | window.$docsify.plugins = plugins 31 | })() -------------------------------------------------------------------------------- /asset/docsify-baidu-push.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | new Image().src = 5 | '//api.share.baidu.com/s.gif?r=' + 6 | encodeURIComponent(document.referrer) + 7 | "&l=" + encodeURIComponent(location.href) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-baidu-stat.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | window._hmt = window._hmt || [] 5 | var hm = document.createElement("script") 6 | hm.src = "https://hm.baidu.com/hm.js?" + window.$docsify.bdStatId 7 | document.querySelector("article").appendChild(hm) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-clicker.js: -------------------------------------------------------------------------------- 1 | (function() { 2 | var ids = [ 3 | '109577065', '108852955', '102682374', '100520874', '92400861', '90312982', 4 | '109963325', '109323014', '109301511', '108898970', '108590722', '108538676', 5 | '108503526', '108437109', '108402202', '108292691', '108291153', '108268498', 6 | '108030854', '107867070', '107847299', '107827334', '107825454', '107802131', 7 | '107775320', '107752974', '107735139', '107702571', '107598864', '107584507', 8 | '107568311', '107526159', '107452391', '107437455', '107430050', '107395781', 9 | '107325304', '107283210', '107107145', '107085440', '106995421', '106993460', 10 | '106972215', '106959775', '106766787', '106749609', '106745967', '106634313', 11 | '106451602', '106180097', '106095505', '106077010', '106008089', '106002346', 12 | '105653809', '105647855', '105130705', '104837872', '104706815', '104192620', 13 | '104074941', '104040537', '103962171', '103793502', '103783460', '103774572', 14 | '103547748', '103547703', '103547571', '103490757', '103413481', '103341935', 15 | '103330191', '103246597', '103235808', '103204403', '103075981', '103015105', 16 | '103014899', '103014785', '103014702', '103014540', '102993780', '102993754', 17 | '102993680', '102958443', '102913317', '102903382', '102874766', '102870470', 18 | '102864513', '102811179', '102761237', '102711565', '102645443', '102621845', 19 | '102596167', '102593333', '102585262', '102558427', '102537547', '102530610', 20 | '102527017', '102504698', '102489806', '102372981', '102258897', '102257303', 21 | '102056248', '101920097', '101648638', '101516708', '101350577', '101268149', 22 | '101128167', '101107328', '101053939', '101038866', '100977414', '100945061', 23 | '100932401', '100886407', '100797378', '100634918', '100588305', '100572447', 24 | '100192249', '100153559', '100099032', '100061455', 
'100035392', '100033450', 25 | '99671267', '99624846', '99172551', '98992150', '98989508', '98987516', '98938304', 26 | '98937682', '98725145', '98521688', '98450861', '98306787', '98203342', '98026348', 27 | '97680167', '97492426', '97108940', '96888872', '96568559', '96509100', '96508938', 28 | '96508611', '96508374', '96498314', '96476494', '96333593', '96101522', '95989273', 29 | '95960507', '95771870', '95770611', '95766810', '95727700', '95588929', '95218707', 30 | '95073151', '95054615', '95016540', '94868371', '94839549', '94719281', '94401578', 31 | '93931439', '93853494', '93198026', '92397889', '92063437', '91635930', '91433989', 32 | '91128193', '90915507', '90752423', '90738421', '90725712', '90725083', '90722238', 33 | '90647220', '90604415', '90544478', '90379769', '90288341', '90183695', '90144066', 34 | '90108283', '90021771', '89914471', '89876284', '89852050', '89839033', '89812373', 35 | '89789699', '89786189', '89752620', '89636380', '89632889', '89525811', '89480625', 36 | '89464088', '89464025', '89463984', '89463925', '89445280', '89441793', '89430432', 37 | '89429877', '89416176', '89412750', '89409618', '89409485', '89409365', '89409292', 38 | '89409222', '89399738', '89399674', '89399526', '89355336', '89330241', '89308077', 39 | '89222240', '89140953', '89139942', '89134398', '89069355', '89049266', '89035735', 40 | '89004259', '88925790', '88925049', '88915838', '88912706', '88911548', '88899438', 41 | '88878890', '88837519', '88832555', '88824257', '88777952', '88752158', '88659061', 42 | '88615256', '88551434', '88375675', '88322134', '88322085', '88321996', '88321978', 43 | '88321950', '88321931', '88321919', '88321899', '88321830', '88321756', '88321710', 44 | '88321661', '88321632', '88321566', '88321550', '88321506', '88321475', '88321440', 45 | '88321409', '88321362', '88321321', '88321293', '88321226', '88232699', '88094874', 46 | '88090899', '88090784', '88089091', '88048808', '87938224', '87913318', '87905933', 47 | '87897358', '87856753', '87856461', '87827666', '87822008', '87821456', '87739137', 48 | '87734022', '87643633', '87624617', '87602909', '87548744', '87548689', '87548624', 49 | '87548550', '87548461', '87463201', '87385913', '87344048', '87078109', '87074784', 50 | '87004367', '86997632', '86997466', '86997303', '86997116', '86996474', '86995899', 51 | '86892769', '86892654', '86892569', '86892457', '86892347', '86892239', '86892124', 52 | '86798671', '86777307', '86762845', '86760008', '86759962', '86759944', '86759930', 53 | '86759922', '86759646', '86759638', '86759633', '86759622', '86759611', '86759602', 54 | '86759596', '86759591', '86759580', '86759572', '86759567', '86759558', '86759545', 55 | '86759534', '86749811', '86741502', '86741074', '86741059', '86741020', '86740897', 56 | '86694754', '86670104', '86651882', '86651875', '86651866', '86651828', '86651790', 57 | '86651767', '86651756', '86651735', '86651720', '86651708', '86618534', '86618526', 58 | '86594785', '86590937', '86550497', '86550481', '86550472', '86550453', '86550438', 59 | '86550429', '86550407', '86550381', '86550359', '86536071', '86536035', '86536014', 60 | '86535988', '86535963', '86535953', '86535932', '86535902', '86472491', '86472298', 61 | '86472236', '86472191', '86472108', '86471967', '86471899', '86471822', '86439022', 62 | '86438972', '86438902', '86438887', '86438867', '86438836', '86438818', '85850119', 63 | '85850075', '85850021', '85849945', '85849893', '85849837', '85849790', '85849740', 64 | '85849661', '85849620', '85849550', '85606096', 
'85564441', '85547709', '85471981', 65 | '85471317', '85471136', '85471073', '85470629', '85470456', '85470169', '85469996', 66 | '85469877', '85469775', '85469651', '85469331', '85469033', '85345768', '85345742', 67 | '85337900', '85337879', '85337860', '85337833', '85337797', '85322822', '85322810', 68 | '85322791', '85322745', '85317667', '85265742', '85265696', '85265618', '85265350', 69 | '85098457', '85057670', '85009890', '84755581', '84637437', '84637431', '84637393', 70 | '84637374', '84637355', '84637338', '84637321', '84637305', '84637283', '84637259', 71 | '84629399', '84629314', '84629233', '84629124', '84629065', '84628997', '84628933', 72 | '84628838', '84628777', '84628690', '84591581', '84591553', '84591511', '84591484', 73 | '84591468', '84591416', '84591386', '84591350', '84591308', '84572155', '84572107', 74 | '84503228', '84500221', '84403516', '84403496', '84403473', '84403442', '84075703', 75 | '84029659', '83933480', '83933459', '83933435', '83903298', '83903274', '83903258', 76 | '83752369', '83345186', '83116487', '83116446', '83116402', '83116334', '83116213', 77 | '82944248', '82941023', '82938777', '82936611', '82932735', '82918102', '82911085', 78 | '82888399', '82884263', '82883507', '82880996', '82875334', '82864060', '82831039', 79 | '82823385', '82795277', '82790832', '82775718', '82752022', '82730437', '82718126', 80 | '82661646', '82588279', '82588267', '82588261', '82588192', '82347066', '82056138', 81 | '81978722', '81211571', '81104145', '81069048', '81006768', '80788365', '80767582', 82 | '80759172', '80759144', '80759129', '80736927', '80661288', '80616304', '80602366', 83 | '80584625', '80561364', '80549878', '80549875', '80541470', '80539726', '80531328', 84 | '80513257', '80469816', '80406810', '80356781', '80334130', '80333252', '80332666', 85 | '80332389', '80311244', '80301070', '80295974', '80292252', '80286963', '80279504', 86 | '80278369', '80274371', '80249825', '80247284', '80223054', '80219559', '80209778', 87 | '80200279', '80164236', '80160900', '80153046', '80149560', '80144670', '80061205', 88 | '80046520', '80025644', '80014721', '80005213', '80004664', '80001653', '79990178', 89 | '79989283', '79947873', '79946002', '79941517', '79938786', '79932755', '79921178', 90 | '79911339', '79897603', '79883931', '79872574', '79846509', '79832150', '79828161', 91 | '79828156', '79828149', '79828146', '79828140', '79828139', '79828135', '79828123', 92 | '79820772', '79776809', '79776801', '79776788', '79776782', '79776772', '79776767', 93 | '79776760', '79776753', '79776736', '79776705', '79676183', '79676171', '79676166', 94 | '79676160', '79658242', '79658137', '79658130', '79658123', '79658119', '79658112', 95 | '79658100', '79658092', '79658089', '79658069', '79658054', '79633508', '79587857', 96 | '79587850', '79587842', '79587831', '79587825', '79587819', '79547908', '79477700', 97 | '79477692', '79440956', '79431176', '79428647', '79416896', '79406699', '79350633', 98 | '79350545', '79344765', '79339391', '79339383', '79339157', '79307345', '79293944', 99 | '79292623', '79274443', '79242798', '79184420', '79184386', '79184355', '79184269', 100 | '79183979', '79100314', '79100206', '79100064', '79090813', '79057834', '78967246', 101 | '78941571', '78927340', '78911467', '78909741', '78848006', '78628917', '78628908', 102 | '78628889', '78571306', '78571273', '78571253', '78508837', '78508791', '78448073', 103 | '78430940', '78408150', '78369548', '78323851', '78314301', '78307417', '78300457', 104 | '78287108', '78278945', '78259349', 
'78237192', '78231360', '78141031', '78100357', 105 | '78095793', '78084949', '78073873', '78073833', '78067868', '78067811', '78055014', 106 | '78041555', '78039240', '77948804', '77879624', '77837792', '77824937', '77816459', 107 | '77816208', '77801801', '77801767', '77776636', '77776610', '77505676', '77485156', 108 | '77478296', '77460928', '77327521', '77326428', '77278423', '77258908', '77252370', 109 | '77248841', '77239042', '77233843', '77230880', '77200256', '77198140', '77196405', 110 | '77193456', '77186557', '77185568', '77181823', '77170422', '77164604', '77163389', 111 | '77160103', '77159392', '77150721', '77146204', '77141824', '77129604', '77123259', 112 | '77113014', '77103247', '77101924', '77100165', '77098190', '77094986', '77088637', 113 | '77073399', '77062405', '77044198', '77036923', '77017092', '77007016', '76999924', 114 | '76977678', '76944015', '76923087', '76912696', '76890184', '76862282', '76852434', 115 | '76829683', '76794256', '76780755', '76762181', '76732277', '76718569', '76696048', 116 | '76691568', '76689003', '76674746', '76651230', '76640301', '76615315', '76598528', 117 | '76571947', '76551820', '74178127', '74157245', '74090991', '74012309', '74001789', 118 | '73910511', '73613471', '73605647', '73605082', '73503704', '73380636', '73277303', 119 | '73274683', '73252108', '73252085', '73252070', '73252039', '73252025', '73251974', 120 | '73135779', '73087531', '73044025', '73008658', '72998118', '72997953', '72847091', 121 | '72833384', '72830909', '72828999', '72823633', '72793092', '72757626', '71157154', 122 | '71131579', '71128551', '71122253', '71082760', '71078326', '71075369', '71057216', 123 | '70812997', '70384625', '70347260', '70328937', '70313267', '70312950', '70255825', 124 | '70238893', '70237566', '70237072', '70230665', '70228737', '70228729', '70175557', 125 | '70175401', '70173259', '70172591', '70170835', '70140724', '70139606', '70053923', 126 | '69067886', '69063732', '69055974', '69055708', '69031254', '68960022', '68957926', 127 | '68957556', '68953383', '68952755', '68946828', '68483371', '68120861', '68065606', 128 | '68064545', '68064493', '67646436', '67637525', '67632961', '66984317', '66968934', 129 | '66968328', '66491589', '66475786', '66473308', '65946462', '65635220', '65632553', 130 | '65443309', '65437683', '63260222', '63253665', '63253636', '63253628', '63253610', 131 | '63253572', '63252767', '63252672', '63252636', '63252537', '63252440', '63252329', 132 | '63252155', '62888876', '62238064', '62039365', '62038016', '61925813', '60957024', 133 | '60146286', '59523598', '59489460', '59480461', '59160354', '59109234', '59089006', 134 | '58595549', '57406062', '56678797', '55001342', '55001340', '55001336', '55001330', 135 | '55001328', '55001325', '55001311', '55001305', '55001298', '55001290', '55001283', 136 | '55001278', '55001272', '55001265', '55001262', '55001253', '55001246', '55001242', 137 | '55001236', '54907997', '54798827', '54782693', '54782689', '54782688', '54782676', 138 | '54782673', '54782671', '54782662', '54782649', '54782636', '54782630', '54782628', 139 | '54782627', '54782624', '54782621', '54782620', '54782615', '54782613', '54782608', 140 | '54782604', '54782600', '54767237', '54766779', '54755814', '54755674', '54730253', 141 | '54709338', '54667667', '54667657', '54667639', '54646201', '54407212', '54236114', 142 | '54234220', '54233181', '54232788', '54232407', '54177960', '53991319', '53932970', 143 | '53888106', '53887128', '53885944', '53885094', '53884497', '53819985', 
'53812640', 144 | '53811866', '53790628', '53785053', '53782838', '53768406', '53763191', '53763163', 145 | '53763148', '53763104', '53763092', '53576302', '53576157', '53573472', '53560183', 146 | '53523648', '53516634', '53514474', '53510917', '53502297', '53492224', '53467240', 147 | '53467122', '53437115', '53436579', '53435710', '53415115', '53377875', '53365337', 148 | '53350165', '53337979', '53332925', '53321283', '53318758', '53307049', '53301773', 149 | '53289364', '53286367', '53259948', '53242892', '53239518', '53230890', '53218625', 150 | '53184121', '53148662', '53129280', '53116507', '53116486', '52980893', '52980652', 151 | '52971002', '52950276', '52950259', '52944714', '52934397', '52932994', '52924939', 152 | '52887083', '52877145', '52858258', '52858046', '52840214', '52829673', '52818774', 153 | '52814054', '52805448', '52798019', '52794801', '52786111', '52774750', '52748816', 154 | '52745187', '52739313', '52738109', '52734410', '52734406', '52734401', '52515005', 155 | '52056818', '52039757', '52034057', '50899381', '50738883', '50726018', '50695984', 156 | '50695978', '50695961', '50695931', '50695913', '50695902', '50695898', '50695896', 157 | '50695885', '50695852', '50695843', '50695829', '50643222', '50591997', '50561827', 158 | '50550829', '50541472', '50527581', '50527317', '50527206', '50527094', '50526976', 159 | '50525931', '50525764', '50518363', '50498312', '50493019', '50492927', '50492881', 160 | '50492863', '50492772', '50492741', '50492688', '50492454', '50491686', '50491675', 161 | '50491602', '50491550', '50491467', '50488409', '50485177', '48683433', '48679853', 162 | '48678381', '48626023', '48623059', '48603183', '48599041', '48595555', '48576507', 163 | '48574581', '48574425', '48547849', '48542371', '48518705', '48494395', '48493321', 164 | '48491545', '48471207', '48471161', '48471085', '48468239', '48416035', '48415577', 165 | '48415515', '48297597', '48225865', '48224037', '48223553', '48213383', '48211439', 166 | '48206757', '48195685', '48193981', '48154955', '48128811', '48105995', '48105727', 167 | '48105441', '48105085', '48101717', '48101691', '48101637', '48101569', '48101543', 168 | '48085839', '48085821', '48085797', '48085785', '48085775', '48085765', '48085749', 169 | '48085717', '48085687', '48085377', '48085189', '48085119', '48085043', '48084991', 170 | '48084747', '48084139', '48084075', '48055511', '48055403', '48054259', '48053917', 171 | '47378253', '47359989', '47344793', '47344083', '47336927', '47335827', '47316383', 172 | '47315813', '47312213', '47295745', '47294471', '47259467', '47256015', '47255529', 173 | '47253649', '47207791', '47206309', '47189383', '47172333', '47170495', '47166223', '47149681', '47146967', '47126915', '47126883', '47108297', '47091823', '47084039', 174 | '47080883', '47058549', '47056435', '47054703', '47041395', '47035325', '47035143', 175 | '47027547', '47016851', '47006665', '46854213', '46128743', '45035163', '43053503', 176 | '41968283', '41958265', '40707993', '40706971', '40685165', '40684953', '40684575', 177 | '40683867', '40683021', '39853417', '39806033', '39757139', '38391523', '37595169', 178 | '37584503', '35696501', '29593529', '28100441', '27330071', '26950993', '26011757', 179 | '26010983', '26010603', '26004793', '26003621', '26003575', '26003405', '26003373', 180 | '26003307', '26003225', '26003189', '26002929', '26002863', '26002749', '26001477', 181 | '25641541', '25414671', '25410705', '24973063', '20648491', '20621099', '17802317', 182 | '17171597', '17141619', 
'17141381', '17139321', '17121903', '16898605', '16886449', 183 | '14523439', '14104635', '14054225', '9317965' 184 | ] 185 | var urlb64 = 'aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dpemFyZGZvcmNlbC9hcnRpY2xlL2RldGFpbHMv' 186 | var plugin = function(hook) { 187 | hook.doneEach(function() { 188 | for (var i = 0; i < 5; i++) { 189 | var idx = Math.trunc(Math.random() * ids.length) 190 | new Image().src = atob(urlb64) + ids[idx] 191 | } 192 | }) 193 | } 194 | var plugins = window.$docsify.plugins || [] 195 | plugins.push(plugin) 196 | window.$docsify.plugins = plugins 197 | })() -------------------------------------------------------------------------------- /asset/docsify-cnzz.js: -------------------------------------------------------------------------------- 1 | (function(){ 2 | var plugin = function(hook) { 3 | hook.doneEach(function() { 4 | var sc = document.createElement('script') 5 | sc.src = 'https://s5.cnzz.com/z_stat.php?id=' + 6 | window.$docsify.cnzzId + '&online=1&show=line' 7 | document.querySelector('article').appendChild(sc) 8 | }) 9 | } 10 | var plugins = window.$docsify.plugins || [] 11 | plugins.push(plugin) 12 | window.$docsify.plugins = plugins 13 | })() -------------------------------------------------------------------------------- /asset/docsify-copy-code.min.js: -------------------------------------------------------------------------------- 1 | /*! 2 | * docsify-copy-code 3 | * v2.1.0 4 | * https://github.com/jperasmus/docsify-copy-code 5 | * (c) 2017-2019 JP Erasmus 6 | * MIT license 7 | */ 8 | !function(){"use strict";function r(o){return(r="function"==typeof Symbol&&"symbol"==typeof Symbol.iterator?function(o){return typeof o}:function(o){return o&&"function"==typeof Symbol&&o.constructor===Symbol&&o!==Symbol.prototype?"symbol":typeof o})(o)}!function(o,e){void 0===e&&(e={});var t=e.insertAt;if(o&&"undefined"!=typeof document){var n=document.head||document.getElementsByTagName("head")[0],c=document.createElement("style");c.type="text/css","top"===t&&n.firstChild?n.insertBefore(c,n.firstChild):n.appendChild(c),c.styleSheet?c.styleSheet.cssText=o:c.appendChild(document.createTextNode(o))}}(".docsify-copy-code-button,.docsify-copy-code-button span{cursor:pointer;transition:all .25s ease}.docsify-copy-code-button{position:absolute;z-index:1;top:0;right:0;overflow:visible;padding:.65em .8em;border:0;border-radius:0;outline:0;font-size:1em;background:grey;background:var(--theme-color,grey);color:#fff;opacity:0}.docsify-copy-code-button span{border-radius:3px;background:inherit;pointer-events:none}.docsify-copy-code-button .error,.docsify-copy-code-button .success{position:absolute;z-index:-100;top:50%;left:0;padding:.5em .65em;font-size:.825em;opacity:0;-webkit-transform:translateY(-50%);transform:translateY(-50%)}.docsify-copy-code-button.error .error,.docsify-copy-code-button.success .success{opacity:1;-webkit-transform:translate(-115%,-50%);transform:translate(-115%,-50%)}.docsify-copy-code-button:focus,pre:hover .docsify-copy-code-button{opacity:1}"),document.querySelector('link[href*="docsify-copy-code"]')&&console.warn("[Deprecation] Link to external docsify-copy-code stylesheet is no longer necessary."),window.DocsifyCopyCodePlugin={init:function(){return function(o,e){o.ready(function(){console.warn("[Deprecation] Manually initializing docsify-copy-code using window.DocsifyCopyCodePlugin.init() is no longer necessary.")})}}},window.$docsify=window.$docsify||{},window.$docsify.plugins=[function(o,s){o.doneEach(function(){var 
o=Array.apply(null,document.querySelectorAll("pre[data-lang]")),c={buttonText:"Copy to clipboard",errorText:"Error",successText:"Copied"};s.config.copyCode&&Object.keys(c).forEach(function(t){var n=s.config.copyCode[t];"string"==typeof n?c[t]=n:"object"===r(n)&&Object.keys(n).some(function(o){var e=-1',''.concat(c.buttonText,""),''.concat(c.errorText,""),''.concat(c.successText,""),""].join("");o.forEach(function(o){o.insertAdjacentHTML("beforeend",e)})}),o.mounted(function(){document.querySelector(".content").addEventListener("click",function(o){if(o.target.classList.contains("docsify-copy-code-button")){var e="BUTTON"===o.target.tagName?o.target:o.target.parentNode,t=document.createRange(),n=e.parentNode.querySelector("code"),c=window.getSelection();t.selectNode(n),c.removeAllRanges(),c.addRange(t);try{document.execCommand("copy")&&(e.classList.add("success"),setTimeout(function(){e.classList.remove("success")},1e3))}catch(o){console.error("docsify-copy-code: ".concat(o)),e.classList.add("error"),setTimeout(function(){e.classList.remove("error")},1e3)}"function"==typeof(c=window.getSelection()).removeRange?c.removeRange(t):"function"==typeof c.removeAllRanges&&c.removeAllRanges()}})})}].concat(window.$docsify.plugins||[])}(); 9 | //# sourceMappingURL=docsify-copy-code.min.js.map 10 | -------------------------------------------------------------------------------- /asset/prism-darcula.css: -------------------------------------------------------------------------------- 1 | /** 2 | * Darcula theme 3 | * 4 | * Adapted from a theme based on: 5 | * IntelliJ Darcula Theme (https://github.com/bulenkov/Darcula) 6 | * 7 | * @author Alexandre Paradis 8 | * @version 1.0 9 | */ 10 | 11 | code[class*="lang-"], 12 | pre[data-lang] { 13 | color: #a9b7c6 !important; 14 | background-color: #2b2b2b !important; 15 | font-family: Consolas, Monaco, 'Andale Mono', monospace; 16 | direction: ltr; 17 | text-align: left; 18 | white-space: pre; 19 | word-spacing: normal; 20 | word-break: normal; 21 | line-height: 1.5; 22 | 23 | -moz-tab-size: 4; 24 | -o-tab-size: 4; 25 | tab-size: 4; 26 | 27 | -webkit-hyphens: none; 28 | -moz-hyphens: none; 29 | -ms-hyphens: none; 30 | hyphens: none; 31 | } 32 | 33 | pre[data-lang]::-moz-selection, pre[data-lang] ::-moz-selection, 34 | code[class*="lang-"]::-moz-selection, code[class*="lang-"] ::-moz-selection { 35 | color: inherit; 36 | background: rgba(33, 66, 131, .85); 37 | } 38 | 39 | pre[data-lang]::selection, pre[data-lang] ::selection, 40 | code[class*="lang-"]::selection, code[class*="lang-"] ::selection { 41 | color: inherit; 42 | background: rgba(33, 66, 131, .85); 43 | } 44 | 45 | /* Code blocks */ 46 | pre[data-lang] { 47 | padding: 1em; 48 | margin: .5em 0; 49 | overflow: auto; 50 | } 51 | 52 | :not(pre) > code[class*="lang-"], 53 | pre[data-lang] { 54 | background: #2b2b2b; 55 | } 56 | 57 | /* Inline code */ 58 | :not(pre) > code[class*="lang-"] { 59 | padding: .1em; 60 | border-radius: .3em; 61 | } 62 | 63 | .token.comment, 64 | .token.prolog, 65 | .token.cdata { 66 | color: #808080; 67 | } 68 | 69 | .token.delimiter, 70 | .token.boolean, 71 | .token.keyword, 72 | .token.selector, 73 | .token.important, 74 | .token.atrule { 75 | color: #cc7832; 76 | } 77 | 78 | .token.operator, 79 | .token.punctuation, 80 | .token.attr-name { 81 | color: #a9b7c6; 82 | } 83 | 84 | .token.tag, 85 | .token.tag .punctuation, 86 | .token.doctype, 87 | .token.builtin { 88 | color: #e8bf6a; 89 | } 90 | 91 | .token.entity, 92 | .token.number, 93 | .token.symbol { 94 | color: #6897bb; 95 | 
} 96 | 97 | .token.property, 98 | .token.constant, 99 | .token.variable { 100 | color: #9876aa; 101 | } 102 | 103 | .token.string, 104 | .token.char { 105 | color: #6a8759; 106 | } 107 | 108 | .token.attr-value, 109 | .token.attr-value .punctuation { 110 | color: #a5c261; 111 | } 112 | 113 | .token.attr-value .punctuation:first-child { 114 | color: #a9b7c6; 115 | } 116 | 117 | .token.url { 118 | color: #287bde; 119 | text-decoration: underline; 120 | } 121 | 122 | .token.function { 123 | color: #ffc66d; 124 | } 125 | 126 | .token.regex { 127 | background: #364135; 128 | } 129 | 130 | .token.bold { 131 | font-weight: bold; 132 | } 133 | 134 | .token.italic { 135 | font-style: italic; 136 | } 137 | 138 | .token.inserted { 139 | background: #294436; 140 | } 141 | 142 | .token.deleted { 143 | background: #484a4a; 144 | } 145 | 146 | code.lang-css .token.property, 147 | code.lang-css .token.property + .token.punctuation { 148 | color: #a9b7c6; 149 | } 150 | 151 | code.lang-css .token.id { 152 | color: #ffc66d; 153 | } 154 | 155 | code.lang-css .token.selector > .token.class, 156 | code.lang-css .token.selector > .token.attribute, 157 | code.lang-css .token.selector > .token.pseudo-class, 158 | code.lang-css .token.selector > .token.pseudo-element { 159 | color: #ffc66d; 160 | } -------------------------------------------------------------------------------- /asset/search.min.js: -------------------------------------------------------------------------------- 1 | !function(){"use strict";function e(e){var n={"&":"&","<":"<",">":">",'"':""","'":"'","/":"/"};return String(e).replace(/[&<>"'\/]/g,function(e){return n[e]})}function n(e){var n=[];return h.dom.findAll("a:not([data-nosearch])").map(function(t){var o=t.href,i=t.getAttribute("href"),r=e.parse(o).path;r&&-1===n.indexOf(r)&&!Docsify.util.isAbsolutePath(i)&&n.push(r)}),n}function t(e){localStorage.setItem("docsify.search.expires",Date.now()+e),localStorage.setItem("docsify.search.index",JSON.stringify(g))}function o(e,n,t,o){void 0===n&&(n="");var i,r=window.marked.lexer(n),a=window.Docsify.slugify,s={};return r.forEach(function(n){if("heading"===n.type&&n.depth<=o)i=t.toURL(e,{id:a(n.text)}),s[i]={slug:i,title:n.text,body:""};else{if(!i)return;s[i]?s[i].body?s[i].body+="\n"+(n.text||""):s[i].body=n.text:s[i]={slug:i,title:"",body:""}}}),a.clear(),s}function i(n){var t=[],o=[];Object.keys(g).forEach(function(e){o=o.concat(Object.keys(g[e]).map(function(n){return g[e][n]}))}),n=n.trim();var i=n.split(/[\s\-\,\\\/]+/);1!==i.length&&(i=[].concat(n,i));for(var r=0;rl.length&&(d=l.length);var p="..."+e(l).substring(f,d).replace(o,''+n+"")+"...";s+=p}}),a)){var d={title:e(c),content:s,url:f};t.push(d)}}(r);return t}function r(e,i){h=Docsify;var r="auto"===e.paths,a=localStorage.getItem("docsify.search.expires")
',o=Docsify.dom.create("div",t),i=Docsify.dom.find("aside");Docsify.dom.toggleClass(o,"search"),Docsify.dom.before(i,o)}function c(e){var n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,".results-panel");if(!e)return t.classList.remove("show"),void(t.innerHTML="");var o=i(e),r="";o.forEach(function(e){r+='
\n \n

'+e.title+"

\n

"+e.content+"

\n
\n
"}),t.classList.add("show"),t.innerHTML=r||'

'+y+"

"}function l(){var e,n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,"input");Docsify.dom.on(n,"click",function(e){return"A"!==e.target.tagName&&e.stopPropagation()}),Docsify.dom.on(t,"input",function(n){clearTimeout(e),e=setTimeout(function(e){return c(n.target.value.trim())},100)})}function f(e,n){var t=Docsify.dom.getNode('.search input[type="search"]');if(t)if("string"==typeof e)t.placeholder=e;else{var o=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];t.placeholder=e[o]}}function d(e,n){if("string"==typeof e)y=e;else{var t=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];y=e[t]}}function p(e,n){var t=n.router.parse().query.s;a(),s(e,t),l(),t&&setTimeout(function(e){return c(t)},500)}function u(e,n){f(e.placeholder,n.route.path),d(e.noData,n.route.path)}var h,g={},y="",m={placeholder:"Type to search",noData:"No Results!",paths:"auto",depth:2,maxAge:864e5},v=function(e,n){var t=Docsify.util,o=n.config.search||m;Array.isArray(o)?m.paths=o:"object"==typeof o&&(m.paths=Array.isArray(o.paths)?o.paths:"auto",m.maxAge=t.isPrimitive(o.maxAge)?o.maxAge:m.maxAge,m.placeholder=o.placeholder||m.placeholder,m.noData=o.noData||m.noData,m.depth=o.depth||m.depth);var i="auto"===m.paths;e.mounted(function(e){p(m,n),!i&&r(m,n)}),e.doneEach(function(e){u(m,n),i&&r(m,n)})};$docsify.plugins=[].concat(v,$docsify.plugins)}(); 2 | -------------------------------------------------------------------------------- /asset/style.css: -------------------------------------------------------------------------------- 1 | /*隐藏头部的目录*/ 2 | #main>ul:nth-child(1) { 3 | display: none; 4 | } 5 | 6 | #main>ul:nth-child(2) { 7 | display: none; 8 | } 9 | 10 | .markdown-section h1 { 11 | margin: 3rem 0 2rem 0; 12 | } 13 | 14 | .markdown-section h2 { 15 | margin: 2rem 0 1rem; 16 | } 17 | 18 | img, 19 | pre { 20 | border-radius: 8px; 21 | } 22 | 23 | .content, 24 | .sidebar, 25 | .markdown-section, 26 | body, 27 | .search input { 28 | background-color: rgba(243, 242, 238, 1) !important; 29 | } 30 | 31 | @media (min-width:600px) { 32 | .sidebar-toggle { 33 | background-color: #f3f2ee; 34 | } 35 | } 36 | 37 | .docsify-copy-code-button { 38 | background: #f8f8f8 !important; 39 | color: #7a7a7a !important; 40 | } 41 | 42 | body { 43 | /*font-family: Microsoft YaHei, Source Sans Pro, Helvetica Neue, Arial, sans-serif !important;*/ 44 | } 45 | 46 | .markdown-section>p { 47 | font-size: 16px !important; 48 | } 49 | 50 | .markdown-section pre>code { 51 | font-family: Consolas, Roboto Mono, Monaco, courier, monospace !important; 52 | font-size: .9rem !important; 53 | 54 | } 55 | 56 | /*.anchor span { 57 | color: rgb(66, 185, 131); 58 | }*/ 59 | 60 | section.cover h1 { 61 | margin: 0; 62 | } 63 | 64 | body>section>div.cover-main>ul>li>a { 65 | color: #42b983; 66 | } 67 | 68 | .markdown-section img { 69 | box-shadow: 7px 9px 10px #aaa !important; 70 | } 71 | 72 | 73 | pre { 74 | background-color: #f3f2ee !important; 75 | } 76 | 77 | @media (min-width:600px) { 78 | pre code { 79 | /*box-shadow: 2px 1px 20px 2px #aaa;*/ 80 | /*border-radius: 10px !important;*/ 81 | padding-left: 20px !important; 82 | } 83 | } 84 | 85 | @media (max-width:600px) { 86 | pre { 87 | padding-left: 0px !important; 88 | padding-right: 0px !important; 89 | } 90 | } 91 | 92 | .markdown-section pre { 93 | padding-left: 0 !important; 94 | padding-right: 0px !important; 95 | box-shadow: 2px 1px 20px 2px #aaa; 96 | } -------------------------------------------------------------------------------- /doc/en/api.md: 
--------------------------------------------------------------------------------
---
id: api
title: API
---

We automatically generate our [API documentation](/docs/en/html/index.html) with doxygen.
--------------------------------------------------------------------------------
/doc/en/cheatsheet.md:
--------------------------------------------------------------------------------
---
id: cheatsheet
title: Cheatsheet
---

## Word representation learning

In order to learn word vectors, do:

```bash
$ ./fasttext skipgram -input data.txt -output model
```

## Obtaining word vectors

Print word vectors for a text file `queries.txt` containing words:

```bash
$ ./fasttext print-word-vectors model.bin < queries.txt
```

## Text classification

In order to train a text classifier, do:

```bash
$ ./fasttext supervised -input train.txt -output model
```

Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

```bash
$ ./fasttext test model.bin test.txt 1
```

In order to obtain the k most likely labels for a piece of text, use:

```bash
$ ./fasttext predict model.bin test.txt k
```

In order to obtain the k most likely labels and their associated probabilities for a piece of text, use:

```bash
$ ./fasttext predict-prob model.bin test.txt k
```

If you want to compute vector representations of sentences or paragraphs, please use:

```bash
$ ./fasttext print-sentence-vectors model.bin < text.txt
```

## Quantization

In order to create a `.ftz` file with a smaller memory footprint, do:

```bash
$ ./fasttext quantize -output model
```

All other commands, such as `test`, also work with this model:

```bash
$ ./fasttext test model.ftz test.txt
```
--------------------------------------------------------------------------------
/doc/en/crawl-vectors.md:
--------------------------------------------------------------------------------
---
id: crawl-vectors
title: Word vectors for 157 languages
---

We distribute pre-trained word vectors for 157 languages, trained on [*Common Crawl*](http://commoncrawl.org/) and [*Wikipedia*](https://www.wikipedia.org) using fastText.
These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.
We also distribute three new word analogy datasets, for French, Hindi and Polish.

### Format

The word vectors are available in both binary and text formats.

Using the binary models, vectors for out-of-vocabulary words can be obtained with
```
$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt
```
where the file oov_words.txt contains out-of-vocabulary words.

In the text format, each line contains a word followed by its vector.
Each value is space separated, and words are sorted by frequency in descending order.
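The first line of a text-format file is a header holding the number of words and the vector dimension, which the loader below reads as `n` and `d`. For illustration, a file might begin like this (toy values, not taken from a real model):
```
2000000 300
the 0.0461 -0.0128 0.0019 ... (300 values per word)
of -0.0233 0.0071 0.0845 ... (300 values per word)
```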
These text models can easily be loaded in Python using the following code:
```python
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    # The header line holds the number of words (n) and the vector dimension (d).
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # list() materializes the values; a bare map object is a one-shot
        # iterator in Python 3 and could only be consumed once.
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data
```

### Tokenization

We used the [*Stanford word segmenter*](https://nlp.stanford.edu/software/segmenter.html) for Chinese, [*Mecab*](http://taku910.github.io/mecab/) for Japanese and [*UETsegmenter*](https://github.com/phongnt570/UETsegmenter) for Vietnamese.
For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the [*Europarl*](http://www.statmt.org/europarl/) preprocessing tools.
For the remaining languages, we used the ICU tokenizer.

More information about the training of these models can be found in the article [*Learning Word Vectors for 157 Languages*](https://arxiv.org/abs/1802.06893).

### License

The word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### References

If you use these word vectors, please cite the following paper:

E. Grave\*, P. Bojanowski\*, P. Gupta, A. Joulin, T. Mikolov, [*Learning Word Vectors for 157 Languages*](https://arxiv.org/abs/1802.06893)

```markup
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
```

### Evaluation datasets

The analogy evaluation datasets described in the paper are available here: [French](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-fr.txt), [Hindi](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-hi.txt), [Polish](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-pl.txt).
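As a quick sanity check after downloading one of the models listed below, here is a minimal sketch that loads a text-format file using the `load_vectors` helper from the Format section. The file name `cc.en.300.vec` (the English model from the table below, after decompression) and the probe word are illustrative assumptions:

```python
import io

def load_vectors(fname):
    # Same helper as in the Format section above.
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data

# Assumes cc.en.300.vec has been downloaded from the table below and gunzipped.
vectors = load_vectors('cc.en.300.vec')
print(len(vectors))           # vocabulary size (n from the header line)
print(len(vectors['hello']))  # 300, the vector dimension
```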
66 | 67 | ### Models 68 | 69 | The models can be downloaded from: 70 | 71 | |||| 72 | |-|-|-| 73 | | Afrikaans: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz) | Albanian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sq.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sq.300.vec.gz) | Alemannic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.als.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.als.300.vec.gz) | 74 | | Amharic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.am.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.am.300.vec.gz) | Arabic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ar.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ar.300.vec.gz) | Aragonese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.an.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.an.300.vec.gz) | 75 | | Armenian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hy.300.vec.gz) | Assamese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.as.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.as.300.vec.gz) | Asturian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ast.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ast.300.vec.gz) | 76 | | Azerbaijani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.az.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.az.300.vec.gz) | Bashkir: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ba.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ba.300.vec.gz) | Basque: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eu.300.vec.gz) | 77 | | Bavarian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bar.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bar.300.vec.gz) | Belarusian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.be.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.be.300.vec.gz) | Bengali: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bn.300.vec.gz) | 78 | | Bihari: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bh.300.vec.gz) | Bishnupriya Manipuri: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bpy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bpy.300.vec.gz) | Bosnian: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bs.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bs.300.vec.gz) | 79 | | Breton: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.br.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.br.300.vec.gz) | Bulgarian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bg.300.vec.gz) | Burmese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.my.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.my.300.vec.gz) | 80 | | Catalan: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ca.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ca.300.vec.gz) | Cebuano: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ceb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ceb.300.vec.gz) | Central Bicolano: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bcl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bcl.300.vec.gz) | 81 | | Chechen: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ce.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ce.300.vec.gz) | Chinese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zh.300.vec.gz) | Chuvash: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cv.300.vec.gz) | 82 | | Corsican: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.co.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.co.300.vec.gz) | Croatian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hr.300.vec.gz) | Czech: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cs.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cs.300.vec.gz) | 83 | | Danish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.vec.gz) | Divehi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.dv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.dv.300.vec.gz) | Dutch: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nl.300.vec.gz) | 84 | | Eastern Punjabi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pa.300.vec.gz) | Egyptian Arabic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.arz.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.arz.300.vec.gz) | Emilian-Romagnol: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eml.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eml.300.vec.gz) | 85 | | Erzya: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.myv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.myv.300.vec.gz) | Esperanto: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eo.300.vec.gz) | Estonian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.et.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.et.300.vec.gz) | 86 | | Fiji Hindi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hif.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hif.300.vec.gz) | Finnish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fi.300.vec.gz) | French: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fr.300.vec.gz) | 87 | | Galician: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gl.300.vec.gz) | Georgian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ka.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ka.300.vec.gz) | German: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.vec.gz) | 88 | | Goan Konkani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gom.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gom.300.vec.gz) | Greek: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.el.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.el.300.vec.gz) | Gujarati: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gu.300.vec.gz) | 89 | | Haitian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ht.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ht.300.vec.gz) | Hebrew: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.he.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.he.300.vec.gz) | Hill Mari: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mrj.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mrj.300.vec.gz) | 90 | | Hindi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hi.300.vec.gz) | Hungarian: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hu.300.vec.gz) | Icelandic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.is.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.is.300.vec.gz) | 91 | | Ido: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.io.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.io.300.vec.gz) | Ilokano: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ilo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ilo.300.vec.gz) | Indonesian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.id.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.id.300.vec.gz) | 92 | | Interlingua: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ia.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ia.300.vec.gz) | Irish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ga.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ga.300.vec.gz) | Italian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.it.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.it.300.vec.gz) | 93 | | Japanese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ja.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ja.300.vec.gz) | Javanese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.jv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.jv.300.vec.gz) | Kannada: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kn.300.vec.gz) | 94 | | Kapampangan: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pam.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pam.300.vec.gz) | Kazakh: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kk.300.vec.gz) | Khmer: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.km.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.km.300.vec.gz) | 95 | | Kirghiz: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ky.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ky.300.vec.gz) | Korean: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ko.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ko.300.vec.gz) | Kurdish (Kurmanji): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ku.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ku.300.vec.gz) | 96 | | Kurdish (Sorani): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ckb.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ckb.300.vec.gz) | Latin: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz) | Latvian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lv.300.vec.gz) | 97 | | Limburgish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.li.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.li.300.vec.gz) | Lithuanian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lt.300.vec.gz) | Lombard: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lmo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lmo.300.vec.gz) | 98 | | Low Saxon: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nds.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nds.300.vec.gz) | Luxembourgish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lb.300.vec.gz) | Macedonian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mk.300.vec.gz) | 99 | | Maithili: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mai.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mai.300.vec.gz) | Malagasy: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mg.300.vec.gz) | Malay: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ms.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ms.300.vec.gz) | 100 | | Malayalam: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ml.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ml.300.vec.gz) | Maltese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mt.300.vec.gz) | Manx: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gv.300.vec.gz) | 101 | | Marathi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mr.300.vec.gz) | Mazandarani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mzn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mzn.300.vec.gz) | Meadow Mari: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mhr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mhr.300.vec.gz) | 102 | | Minangkabau: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.min.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.min.300.vec.gz) | Mingrelian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.xmf.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.xmf.300.vec.gz) | Mirandese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mwl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mwl.300.vec.gz) | 103 | | Mongolian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.vec.gz) | Nahuatl: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nah.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nah.300.vec.gz) | Neapolitan: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nap.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nap.300.vec.gz) | 104 | | Nepali: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ne.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ne.300.vec.gz) | Newar: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.new.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.new.300.vec.gz) | North Frisian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.frr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.frr.300.vec.gz) | 105 | | Northern Sotho: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nso.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nso.300.vec.gz) | Norwegian (Bokmål): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.no.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.no.300.vec.gz) | Norwegian (Nynorsk): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nn.300.vec.gz) | 106 | | Occitan: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.oc.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.oc.300.vec.gz) | Oriya: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.or.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.or.300.vec.gz) | Ossetian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.os.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.os.300.vec.gz) | 107 | | Palatinate German: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pfl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pfl.300.vec.gz) | Pashto: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ps.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ps.300.vec.gz) | Persian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fa.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fa.300.vec.gz) | 108 | | Piedmontese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pms.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pms.300.vec.gz) | Polish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pl.300.vec.gz) | Portuguese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pt.300.vec.gz) | 109 | | Quechua: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.qu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.qu.300.vec.gz) | Romanian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ro.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ro.300.vec.gz) | Romansh: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.rm.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.rm.300.vec.gz) | 110 | | Russian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.vec.gz) | Sakha: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sah.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sah.300.vec.gz) | Sanskrit: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sa.300.vec.gz) | 111 | | Sardinian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sc.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sc.300.vec.gz) | Scots: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sco.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sco.300.vec.gz) | Scottish Gaelic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gd.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gd.300.vec.gz) | 112 | | Serbian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sr.300.vec.gz) | Serbo-Croatian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sh.300.vec.gz) | Sicilian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.scn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.scn.300.vec.gz) | 113 | | Sindhi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sd.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sd.300.vec.gz) | Sinhalese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.si.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.si.300.vec.gz) | Slovak: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sk.300.vec.gz) | 114 | | Slovenian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sl.300.vec.gz) | Somali: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.so.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.so.300.vec.gz) | Southern Azerbaijani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.azb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.azb.300.vec.gz) | 115 | | Spanish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.es.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.es.300.vec.gz) | Sundanese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.su.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.su.300.vec.gz) | Swahili: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sw.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sw.300.vec.gz) | 116 | | Swedish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sv.300.vec.gz) | Tagalog: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tl.300.vec.gz) | Tajik: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tg.300.vec.gz) | 117 | | Tamil: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ta.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ta.300.vec.gz) | Tatar: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tt.300.vec.gz) | Telugu: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.te.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.te.300.vec.gz) | 118 | | Thai: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.th.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.th.300.vec.gz) | Tibetan: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bo.300.vec.gz) | Turkish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tr.300.vec.gz) | 119 | | Turkmen: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tk.300.vec.gz) | Ukrainian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uk.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uk.300.vec.gz) | Upper Sorbian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hsb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hsb.300.vec.gz) | 120 | | Urdu: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ur.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ur.300.vec.gz) | Uyghur: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ug.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ug.300.vec.gz) | Uzbek: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uz.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uz.300.vec.gz) | 121 | | Venetian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vec.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vec.300.vec.gz) | Vietnamese: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vi.300.vec.gz) | Volapük: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vo.300.vec.gz) | 122 | | Walloon: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.wa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.wa.300.vec.gz) | Waray: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.war.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.war.300.vec.gz) | Welsh: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cy.300.vec.gz) | 123 | | West Flemish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vls.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vls.300.vec.gz) | West Frisian: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fy.300.vec.gz) | Western Punjabi: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pnb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pnb.300.vec.gz) | 124 | | Yiddish: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yi.300.vec.gz) | Yoruba: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yo.300.vec.gz) | Zazaki: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.diq.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.diq.300.vec.gz) | 125 | | Zeelandic: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zea.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zea.300.vec.gz) | 126 | 
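As a quick usage sketch, any entry in the table can be fetched and unpacked the same way; the Breton text vectors below are just one example taken from the table, and the commands assume a standard Unix toolchain:

```bash
# Download one of the gzipped text-format vector files listed above and
# unpack it; the resulting .vec file is plain text, one vector per line.
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.br.300.vec.gz
gunzip cc.br.300.vec.gz
```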
-------------------------------------------------------------------------------- /doc/en/dataset.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: dataset 3 | title: Datasets 4 | --- 5 | 6 | [Download YFCC100M Dataset](https://fb-public.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn) 7 | -------------------------------------------------------------------------------- /doc/en/english-vectors.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: english-vectors 3 | title: English word vectors 4 | --- 5 | 6 | This page gathers several pre-trained word vectors trained using fastText. 7 | 8 | ### Download pre-trained word vectors 9 | 10 | Pre-trained word vectors learned on different sources can be downloaded below: 11 | 12 | 1. [wiki-news-300d-1M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip): 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens). 13 | 2. [wiki-news-300d-1M-subword.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip): 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens). 14 | 3. [crawl-300d-2M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens). 15 | 16 | ### Format 17 | 18 | The first line of the file contains the number of words in the vocabulary and the size of the vectors. 19 | Each line contains a word followed by its vector, like in the default fastText text format. 20 | Each value is space separated. Words are ordered by descending frequency. 21 | 22 | ### License 23 | 24 | These word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/). 25 | 26 | ### References 27 | 28 | If you use these word vectors, please cite the following paper: 29 | 30 | T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. [*Advances in Pre-Training Distributed Word Representations*](https://arxiv.org/abs/1712.09405) 31 | 32 | ```markup 33 | @inproceedings{mikolov2018advances, 34 | title={Advances in Pre-Training Distributed Word Representations}, 35 | author={Mikolov, Tomas and Grave, Edouard and Bojanowski, Piotr and Puhrsch, Christian and Joulin, Armand}, 36 | booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)}, 37 | year={2018} 38 | } 39 | ``` 40 | -------------------------------------------------------------------------------- /doc/en/faqs.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: faqs 3 | title: FAQ 4 | --- 5 | 6 | ## What is fastText? Are there tutorials? 7 | 8 | FastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language-related task. A few tutorials are available. 9 | 10 | ## Why are my fastText models so big? 11 | 12 | fastText uses a hashtable for either word or character ngrams. The size of the hashtable directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option '-hash'. For example, a good value is 20000. Another option that greatly impacts the size of a model is the size of the vectors (-dim). This dimension can be reduced to save space, but doing so can significantly impact performance.
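As a minimal sketch of these two knobs (the file names are placeholders and the values are purely illustrative; `-bucket` is the hashtable-size option and `-dim` the vector-size option from the options page of these docs):

```bash
# Illustrative only: train with a smaller hashtable (-bucket) and smaller
# vectors (-dim) to shrink the resulting model file.
./fasttext supervised -input data.train -output model -bucket 20000 -dim 50
```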
If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option. 13 | ```bash 14 | ./fasttext quantize -output model 15 | ``` 16 | 17 | ## What would be the best way to represent word phrases rather than words? 18 | 19 | Currently the best approach to represent word phrases or sentences is to take a bag-of-words of the word vectors. Additionally, for phrases like “New York”, preprocessing the data so that it becomes a single token “New_York” can greatly help. 20 | 21 | ## Why does fastText produce vectors even for unknown words? 22 | 23 | One of the key features of fastText word representation is its ability to produce vectors for any word, even made-up ones. 24 | Indeed, fastText word vectors are built from the vectors of the character substrings contained in them. 25 | This makes it possible to build vectors even for misspelled words or concatenations of words. 26 | 27 | ## Why is the hierarchical softmax slightly worse in performance than the full softmax? 28 | 29 | The hierarchical softmax is an approximation of the full softmax loss that makes it possible to train on a large number of classes efficiently. This is often at the cost of a few percent of accuracy. 30 | Note also that this loss is designed for unbalanced classes, that is, when some classes are more frequent than others. If your dataset has a balanced number of examples per class, it is worth trying the negative sampling loss (-loss ns -neg 100). 31 | However, negative sampling will still be very slow at test time, since the full softmax will be computed. 32 | 33 | ## Can we run the fastText program on a GPU? 34 | 35 | FastText only works on CPU, which keeps it broadly accessible. That being said, fastText has been implemented in the caffe2 library, which can be run on GPU. 36 | 37 | ## Can I use fastText with python? Or other languages? 38 | 39 | There are a few unofficial wrappers for Python or Lua available on GitHub. 40 | 41 | ## Can I use fastText with continuous data? 42 | 43 | FastText works on discrete tokens and thus cannot be directly used on continuous tokens. However, one can discretize continuous tokens to use fastText on them, for example by rounding values to a specific digit ("12.3" becomes "12"). 44 | 45 | ## There are misspellings in the dictionary. Should we improve text normalization? 46 | 47 | If the words are infrequent, there is no need to worry. 48 | 49 | ## I'm encountering a NaN, why could this be? 50 | 51 | You'll likely see this behavior because your learning rate is too high. Try reducing it until you don't see this error anymore. 52 | 53 | ## My compiler / architecture can't build fastText. What should I do? 54 | Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms; however, maintaining backwards compatibility sometimes becomes very hard. In general, compilers and tool chains that ship with LTS versions of major Linux distributions should be fair game. In any case, create an issue with your compiler version and architecture and we'll try to implement compatibility.
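For the NaN issue mentioned above, the usual fix is simply to retrain with a smaller learning rate; a sketch with placeholder file names and an illustrative value:

```bash
# If the loss turns into NaN, lower -lr and train again.
./fasttext supervised -input data.train -output model -lr 0.05
```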
55 | 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /doc/en/language-identification.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: language-identification 3 | title: Language identification 4 | --- 5 | 6 | ### Description 7 | 8 | We distribute two models for language identification, which can recognize 176 languages (see the list of ISO codes below). These models were trained on data from [Wikipedia](https://www.wikipedia.org/), [Tatoeba](https://tatoeba.org/eng/) and [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/), used under [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/). 9 | 10 | We distribute two versions of the models: 11 | 12 | * [lid.176.bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin), which is faster and slightly more accurate, but has a file size of 126MB; 13 | * [lid.176.ftz](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz), which is the compressed version of the model, with a file size of 917kB. 14 | 15 | These models were trained on UTF-8 data, and therefore expect UTF-8 as input. 16 | 17 | ### License 18 | 19 | The models are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/). 20 | 21 | ### List of supported languages 22 | ``` 23 | af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh 24 | ``` 25 | 26 | ### References 27 | 28 | If you use these models, please cite the following papers: 29 | 30 | [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759) 31 | ``` 32 | @article{joulin2016bag, 33 | title={Bag of Tricks for Efficient Text Classification}, 34 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas}, 35 | journal={arXiv preprint arXiv:1607.01759}, 36 | year={2016} 37 | } 38 | ``` 39 | [2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651) 40 | ``` 41 | @article{joulin2016fasttext, 42 | title={FastText.zip: Compressing text classification models}, 43 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas}, 44 | journal={arXiv preprint arXiv:1612.03651}, 45 | year={2016} 46 | } 47 | ``` 48 | -------------------------------------------------------------------------------- /doc/en/options.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: options 3 | title: List of options 4 | --- 5 | 6 | Invoke a command without arguments to list available arguments and their default values: 7 | 8 | ```bash 9 | $ ./fasttext supervised 10 | Empty input or output path.
11 | 12 | The following arguments are mandatory: 13 | -input training file path 14 | -output output file path 15 | 16 | The following arguments are optional: 17 | -verbose verbosity level [2] 18 | 19 | The following arguments for the dictionary are optional: 20 | -minCount minimal number of word occurences [5] 21 | -minCountLabel minimal number of label occurences [0] 22 | -wordNgrams max length of word ngram [1] 23 | -bucket number of buckets [2000000] 24 | -minn min length of char ngram [3] 25 | -maxn max length of char ngram [6] 26 | -t sampling threshold [0.0001] 27 | -label labels prefix [__label__] 28 | 29 | The following arguments for training are optional: 30 | -lr learning rate [0.05] 31 | -lrUpdateRate change the rate of updates for the learning rate [100] 32 | -dim size of word vectors [100] 33 | -ws size of the context window [5] 34 | -epoch number of epochs [5] 35 | -neg number of negatives sampled [5] 36 | -loss loss function {ns, hs, softmax} [ns] 37 | -thread number of threads [12] 38 | -pretrainedVectors pretrained word vectors for supervised learning [] 39 | -saveOutput whether output params should be saved [0] 40 | 41 | The following arguments for quantization are optional: 42 | -cutoff number of words and ngrams to retain [0] 43 | -retrain finetune embeddings if a cutoff is applied [0] 44 | -qnorm quantizing the norm separately [0] 45 | -qout quantizing the classifier [0] 46 | -dsub size of each sub-vector [2] 47 | ``` 48 | 49 | Defaults may vary by mode. (Word-representation modes `skipgram` and `cbow` use a default `-minCount` of 5.) 50 | 51 | -------------------------------------------------------------------------------- /doc/en/references.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: references 3 | title: References 4 | --- 5 | 6 | Please cite [1](#enriching-word-vectors-with-subword-information) if using this code for learning word representations or [2](#bag-of-tricks-for-efficient-text-classification) if using for text classification. 7 | 8 | [1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606) 9 | 10 | ```markup 11 | @article{bojanowski2016enriching, 12 | title={Enriching Word Vectors with Subword Information}, 13 | author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas}, 14 | journal={arXiv preprint arXiv:1607.04606}, 15 | year={2016} 16 | } 17 | ``` 18 | 19 | [2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759) 20 | 21 | ```markup 22 | @article{joulin2016bag, 23 | title={Bag of Tricks for Efficient Text Classification}, 24 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas}, 25 | journal={arXiv preprint arXiv:1607.01759}, 26 | year={2016} 27 | } 28 | ``` 29 | 30 | [3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651) 31 | 32 | ```markup 33 | @article{joulin2016fasttext, 34 | title={FastText.zip: Compressing text classification models}, 35 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas}, 36 | journal={arXiv preprint arXiv:1612.03651}, 37 | year={2016} 38 | } 39 | ``` 40 | 41 | (\* These authors contributed equally.) 
42 | -------------------------------------------------------------------------------- /doc/en/supervised-models.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: supervised-models 3 | title: Supervised models 4 | --- 5 | 6 | This page gathers pre-trained supervised models on several datasets. 7 | 8 | ### Description 9 | 10 | The regular models are trained using the procedure described in [1]. They can be reproduced using the classification-results.sh script within our GitHub repository. The quantized models are built by using the respective supervised settings and adding the following flags to the quantize subcommand. 11 | 12 | ```bash 13 | -qnorm -retrain -cutoff 100000 14 | ``` 15 | 16 | ### Table of models 17 | 18 | Each entry describes the test accuracy and size of the model. You can click on a table cell to download the corresponding model. 19 | 20 | | dataset | ag news | amazon review full | amazon review polarity | dbpedia | 21 | |-----------|-----------------------|-----------------------|------------------------|------------------------| 22 | | regular | [0.924 / 387MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.bin) | [0.603 / 462MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.bin) | [0.946 / 471MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.bin) | [0.986 / 427MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.bin) | 23 | | compressed | [0.92 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.ftz) | [0.599 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.ftz) | [0.93 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.ftz) | [0.984 / 1.7MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.ftz) | 24 | 25 | | dataset | sogou news | yahoo answers | yelp review polarity | yelp review full | 26 | |-----------|----------------------|------------------------|----------------------|------------------------| 27 | | regular | [0.969 / 402MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.bin) | [0.724 / 494MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.bin) | [0.957 / 409MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.bin) | [0.639 / 412MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.bin) | 28 | | compressed | [0.968 / 1.4MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.ftz) | [0.717 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.ftz) | [0.957 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.ftz) | [0.636 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.ftz) | 29 | 30 | ### References 31 | 32 | If you use these models, please cite the following paper: 33 | 34 | [1] A. Joulin, E. Grave, P. Bojanowski, T.
Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759) 35 | 36 | ```markup 37 | @article{joulin2016bag, 38 | title={Bag of Tricks for Efficient Text Classification}, 39 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas}, 40 | journal={arXiv preprint arXiv:1607.01759}, 41 | year={2016} 42 | } 43 | ``` 44 | 45 | [2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651) 46 | 47 | ```markup 48 | @article{joulin2016fasttext, 49 | title={FastText.zip: Compressing text classification models}, 50 | author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas}, 51 | journal={arXiv preprint arXiv:1612.03651}, 52 | year={2016} 53 | } 54 | ``` 55 | -------------------------------------------------------------------------------- /doc/en/supervised-tutorial.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: supervised-tutorial 3 | title: Text classification 4 | --- 5 | 6 | Text classification is a core problem in many applications, like spam detection, sentiment analysis or smart replies. In this tutorial, we describe how to build a text classifier with the fastText tool. 7 | 8 | ## What is text classification? 9 | 10 | The goal of text classification is to assign documents (such as emails, posts, text messages, product reviews, etc.) to one or multiple categories. Such categories can be review scores, spam vs. non-spam, or the language in which the document was typed. Nowadays, the dominant approach to building such classifiers is machine learning, that is, learning classification rules from examples. In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels). 11 | 12 | As an example, we build a classifier which automatically classifies Stack Exchange questions about cooking into one of several possible tags, such as `pot`, `bowl` or `baking`. 13 | 14 | ## Installing fastText 15 | 16 | The first step of this tutorial is to install and build fastText. It only requires a C++ compiler with good C++11 support.
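If you are unsure about your toolchain, a quick check of the compiler version can save time; this assumes `g++` (with clang, `clang++ --version` works the same way):

```bash
# fastText needs C++11 support, i.e. gcc-4.6.3 / clang-3.3 or newer.
g++ --version
```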
17 | 18 | Let us start by downloading the [most recent release](https://github.com/facebookresearch/fastText/releases): 19 | 20 | ```bash 21 | $ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip 22 | $ unzip v0.1.0.zip 23 | ``` 24 | 25 | Move to the fastText directory and build it: 26 | 27 | ```bash 28 | $ cd fastText-0.1.0 29 | $ make 30 | ``` 31 | 32 | Running the binary without any argument will print the high-level documentation, showing the different use cases supported by fastText: 33 | 34 | ```bash 35 | >> ./fasttext 36 | usage: fasttext <command> <args> 37 | 38 | The commands supported by fasttext are: 39 | 40 | supervised train a supervised classifier 41 | quantize quantize a model to reduce the memory usage 42 | test evaluate a supervised classifier 43 | predict predict most likely labels 44 | predict-prob predict most likely labels with probabilities 45 | skipgram train a skipgram model 46 | cbow train a cbow model 47 | print-word-vectors print word vectors given a trained model 48 | print-sentence-vectors print sentence vectors given a trained model 49 | nn query for nearest neighbors 50 | analogies query for analogies 51 | 52 | ``` 53 | 54 | In this tutorial, we mainly use the `supervised`, `test` and `predict` subcommands, which correspond to learning (and using) a text classifier. For an introduction to the other functionalities of fastText, please see the [tutorial about learning word vectors](https://fasttext.cc/docs/en/unsupervised-tutorial.html). 55 | 56 | ## Getting and preparing the data 57 | 58 | As mentioned in the introduction, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier to automatically recognize the topic of a Stack Exchange question about cooking. Let's download examples of questions from [the cooking section of Stack Exchange](http://cooking.stackexchange.com/), and their associated tags: 59 | 60 | ```bash 61 | >> wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz 62 | >> head cooking.stackexchange.txt 63 | ``` 64 | 65 | Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start with the `__label__` prefix, which is how fastText recognizes what is a label and what is a word. The model is then trained to predict the labels given the words in the document. 66 | 67 | Before training our first classifier, we need to split the data into training and validation sets. We will use the validation set to evaluate how good the learned classifier is on new data. 68 | 69 | ```bash 70 | >> wc cooking.stackexchange.txt 71 | 15404 169582 1401900 cooking.stackexchange.txt 72 | ``` 73 | 74 | Our full dataset contains 15404 examples. Let's split it into a training set of 12404 examples and a validation set of 3000 examples: 75 | 76 | ```bash 77 | >> head -n 12404 cooking.stackexchange.txt > cooking.train 78 | >> tail -n 3000 cooking.stackexchange.txt > cooking.valid 79 | ``` 80 | 81 | ## Our first classifier 82 | 83 | We are now ready to train our first classifier: 84 | 85 | ```bash 86 | >> ./fasttext supervised -input cooking.train -output model_cooking 87 | Read 0M words 88 | Number of words: 14598 89 | Number of labels: 734 90 | Progress: 100.0% words/sec/thread: 75109 lr: 0.000000 loss: 5.708354 eta: 0h0m 91 | ``` 92 | 93 | The `-input` command line option indicates the file containing the training examples, while the `-output` option indicates where to save the model. At the end of training, a file `model_cooking.bin`, containing the trained classifier, is created in the current directory.
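If you want to confirm that the model file was written, a quick and purely illustrative check:

```bash
# The trained classifier now lives in the current directory.
ls -l model_cooking.bin
```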
94 | 95 | It is possible to directly test our classifier interactively, by running the command: 96 | 97 | ```bash 98 | >> ./fasttext predict model_cooking.bin - 99 | ``` 100 | 101 | and then typing a sentence. Let's first try the sentence: 102 | 103 | *Which baking dish is best to bake a banana bread?* 104 | 105 | The predicted tag is `baking`, which fits this question well. Let us now try a second example: 106 | 107 | *Why not put knives in the dishwasher?* 108 | 109 | The label predicted by the model is `food-safety`, which is not relevant. Somehow, the model seems to fail on simple examples. To get a better sense of its quality, let's test it on the validation data by running: 110 | 111 | ```bash 112 | >> ./fasttext test model_cooking.bin cooking.valid 113 | N 3000 114 | P@1 0.124 115 | R@1 0.0541 116 | Number of examples: 3000 117 | ``` 118 | 119 | The output of fastText is the precision at one (`P@1`) and the recall at one (`R@1`). We can also compute the precision at five and recall at five with: 120 | 121 | ```bash 122 | >> ./fasttext test model_cooking.bin cooking.valid 5 123 | N 3000 124 | P@5 0.0668 125 | R@5 0.146 126 | Number of examples: 3000 127 | ``` 128 | 129 | ## Advanced readers: precision and recall 130 | 131 | The precision is the fraction of correct labels among the labels predicted by fastText. The recall is the fraction of the real labels that were successfully predicted. Let's take an example to make this clearer: 132 | 133 | *Why not put knives in the dishwasher?* 134 | 135 | On Stack Exchange, this sentence is labeled with three tags: `equipment`, `cleaning` and `knives`. The top five labels predicted by the model can be obtained with: 136 | 137 | ```bash 138 | >> ./fasttext predict model_cooking.bin - 5 139 | ``` 140 | 141 | They are `food-safety`, `baking`, `equipment`, `substitutions` and `bread`. 142 | 143 | Thus, one out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33. 144 | 145 | For more details, see [the related Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall). 146 | 147 | ## Making the model better 148 | 149 | The model obtained by running fastText with the default arguments is pretty bad at classifying new questions. Let's try to improve the performance by changing the default parameters. 150 | 151 | ### preprocessing the data 152 | 153 | Looking at the data, we observe that some words contain uppercase letters or punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing.
A crude normalization can be obtained using command line tools such as `sed` and `tr`: 154 | 155 | ```bash 156 | >> cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt 157 | >> head -n 12404 cooking.preprocessed.txt > cooking.train 158 | >> tail -n 3000 cooking.preprocessed.txt > cooking.valid 159 | ``` 160 | 161 | Let's train a new model on the pre-processed data: 162 | 163 | ```bash 164 | >> ./fasttext supervised -input cooking.train -output model_cooking 165 | Read 0M words 166 | Number of words: 9012 167 | Number of labels: 734 168 | Progress: 100.0% words/sec/thread: 82041 lr: 0.000000 loss: 5.671649 eta: 0h0m 169 | 170 | >> ./fasttext test model_cooking.bin cooking.valid 171 | N 3000 172 | P@1 0.164 173 | R@1 0.0717 174 | Number of examples: 3000 175 | ``` 176 | 177 | We observe that thanks to the pre-processing, the vocabulary is smaller (from 14k words to 9k). The precision is also starting to go up, by 4 points! 178 | 179 | ### more epochs and larger learning rate 180 | 181 | By default, fastText sees each training example only five times during training, which is pretty small, given that our training set only has 12k training examples. The number of times each example is seen (also known as the number of epochs) can be increased using the `-epoch` option: 182 | 183 | ```bash 184 | >> ./fasttext supervised -input cooking.train -output model_cooking -epoch 25 185 | Read 0M words 186 | Number of words: 9012 187 | Number of labels: 734 188 | Progress: 100.0% words/sec/thread: 77633 lr: 0.000000 loss: 7.147976 eta: 0h0m 189 | ``` 190 | 191 | Let's test the new model: 192 | 193 | ```bash 194 | >> ./fasttext test model_cooking.bin cooking.valid 195 | N 3000 196 | P@1 0.501 197 | R@1 0.218 198 | Number of examples: 3000 199 | ``` 200 | 201 | This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range `0.1 - 1.0`. 202 | 203 | ```bash 204 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 205 | Read 0M words 206 | Number of words: 9012 207 | Number of labels: 734 208 | Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m 209 | 210 | >> ./fasttext test model_cooking.bin cooking.valid 211 | N 3000 212 | P@1 0.563 213 | R@1 0.245 214 | Number of examples: 3000 215 | ``` 216 | 217 | Even better! Let's try both together: 218 | 219 | ```bash 220 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 221 | Read 0M words 222 | Number of words: 9012 223 | Number of labels: 734 224 | Progress: 100.0% words/sec/thread: 76394 lr: 0.000000 loss: 4.350277 eta: 0h0m 225 | 226 | >> ./fasttext test model_cooking.bin cooking.valid 227 | N 3000 228 | P@1 0.585 229 | R@1 0.255 230 | Number of examples: 3000 231 | ``` 232 | 233 | Let us now add a few more features to improve our performance even further! 234 | 235 | ### word n-grams 236 | 237 | Finally, we can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.
238 | 239 | ```bash 240 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 241 | Read 0M words 242 | Number of words: 9012 243 | Number of labels: 734 244 | Progress: 100.0% words/sec/thread: 75366 lr: 0.000000 loss: 3.226064 eta: 0h0m 245 | 246 | >> ./fasttext test model_cooking.bin cooking.valid 247 | N 3000 248 | P@1 0.599 249 | R@1 0.261 250 | Number of examples: 3000 251 | ``` 252 | 253 | With a few steps, we were able to go from a precision at one of 12.4% to 59.9%. Important steps included: 254 | 255 | * preprocessing the data; 256 | * changing the number of epochs (using the option `-epoch`, standard range `[5 - 50]`); 257 | * changing the learning rate (using the option `-lr`, standard range `[0.1 - 1.0]`); 258 | * using word n-grams (using the option `-wordNgrams`, standard range `[1 - 5]`). 259 | 260 | ## Advanced readers: What is a Bigram? 261 | 262 | A 'unigram' refers to a single undivided unit, or token, usually used as an input to a model. For example, a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words. 263 | 264 | Similarly, we denote by 'bigram' the concatenation of 2 consecutive tokens or words, and we talk about n-grams to refer to the concatenation of any n consecutive tokens. 265 | 266 | For example, in the sentence, 'Last donut of the night', the unigrams are 'last', 'donut', 'of', 'the' and 'night'. The bigrams are: 'Last donut', 'donut of', 'of the' and 'the night'. 267 | 268 | Bigrams are particularly interesting because, for most sentences, you can reconstruct the order of the words just by looking at a bag of n-grams. 269 | 270 | Let us illustrate this with a simple exercise: given the following bigrams, try to reconstruct the original sentence: 'all out', 'I am', 'of bubblegum', 'out of' and 'am all'. 271 | 272 | ## Scaling things up 273 | 274 | Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets, with more labels, can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax instead of the regular softmax (the hierarchical softmax approximates the full softmax by organizing the labels in a binary tree, so that computing a label's probability only requires walking the path from the root to that label's leaf, which is much cheaper when there are many labels). This can be done with the option `-loss hs`: 275 | 276 | ```bash 277 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs 278 | Read 0M words 279 | Number of words: 9012 280 | Number of labels: 734 281 | Progress: 100.0% words/sec/thread: 2199406 lr: 0.000000 loss: 1.718807 eta: 0h0m 282 | ``` 283 | 284 | Training should now take less than a second. 285 | 286 | ## Conclusion 287 | 288 | In this tutorial, we gave a brief overview of how to use fastText to train powerful text classifiers. We also took a quick look at some of the most important options to tune. 289 | -------------------------------------------------------------------------------- /doc/en/support.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: support 3 | title: Get started 4 | --- 5 | 6 | ## What is fastText? 7 | 8 | fastText is a library for efficient learning of word representations and sentence classification. 9 | 10 | ## Requirements 11 | 12 | fastText builds on modern Mac OS and Linux distributions. 13 | Since it uses C++11 features, it requires a compiler with good C++11 support.
14 | These include: 15 | 16 | * (gcc-4.6.3 or newer) or (clang-3.3 or newer) 17 | 18 | Compilation is carried out using a Makefile, so you will need to have a working **make**. 19 | For the word-similarity evaluation script you will need: 20 | 21 | * python 2.6 or newer 22 | * numpy & scipy 23 | 24 | ## Building fastText 25 | 26 | In order to build `fastText`, use the following: 27 | 28 | ```bash 29 | $ git clone https://github.com/facebookresearch/fastText.git 30 | $ cd fastText 31 | $ make 32 | ``` 33 | 34 | This will produce object files for all the classes as well as the main binary `fasttext`. 35 | If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES). 36 | 37 | -------------------------------------------------------------------------------- /doc/en/unsupervised-tutorials.md: -------------------------------------------------------------------------------- 1 | --- 2 | id: unsupervised-tutorial 3 | title: Word representations 4 | --- 5 | A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, like word analogies or semantics. They are also used to improve the performance of text classifiers. 6 | 7 | In this tutorial, we show how to build these word vectors with the fastText tool. To download and install fastText, follow the first steps of [the tutorial on text classification](https://fasttext.cc/docs/en/supervised-tutorial.html). 8 | 9 | ## Getting the data 10 | 11 | In order to compute word vectors, you need a large text corpus. Depending on the corpus, the word vectors will capture different information. In this tutorial, we focus on Wikipedia's articles but other sources could be considered, like news or Webcrawl (more examples [here](http://statmt.org/)). To download a raw dump of Wikipedia, run the following command: 12 | 13 | ```bash 14 | wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 15 | ``` 16 | 17 | Downloading the Wikipedia corpus takes some time. Instead, let's restrict our study to the first 1 billion bytes of English Wikipedia. They can be found on Matt Mahoney's [website](http://mattmahoney.net/): 18 | 19 | ```bash 20 | $ mkdir data 21 | $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data 22 | $ unzip data/enwik9.zip -d data 23 | ``` 24 | 25 | A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/)). 26 | 27 | ```bash 28 | $ perl wikifil.pl data/enwik9 > data/fil9 29 | ``` 30 | 31 | We can check the file by running the following command: 32 | 33 | ```bash 34 | $ head -c 80 data/fil9 35 | anarchism originated as a term of abuse first used against early working class 36 | ``` 37 | 38 | The text is nicely pre-processed and can be used to learn our word vectors. 39 | 40 | ## Training word vectors 41 | 42 | Learning word vectors on this data can now be achieved with a single command: 43 | 44 | ```bash 45 | $ mkdir result 46 | $ ./fasttext skipgram -input data/fil9 -output result/fil9 47 | ``` 48 | 49 | To decompose this command line: `./fasttext` calls the binary fastText executable (see how to install fastText here) with the 'skipgram' model (it can also be 'cbow').
## Advanced readers: playing with the parameters

So far, we have run fastText with the default parameters, but depending on the data, these parameters may not be optimal. Let us give an introduction to some of the key parameters for word vectors.

The most important parameters of the model are its dimension and the range of sizes for the subwords. The dimension (*dim*) controls the size of the vectors: the larger they are, the more information they can capture, but they require more data to learn. And if they are too large, they are harder and slower to train. By default, we use 100 dimensions, but any value in the 100-300 range is popular. The subwords are all the substrings contained in a word between the minimum size (*minn*) and the maximum size (*maxn*).
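To make the notion of subwords concrete, here is a small illustrative sketch (not fastText's actual implementation) that enumerates the character n-grams of a word, including the `<` and `>` word-boundary markers fastText adds:

```python
def char_ngrams(word, minn=3, maxn=6):
    # fastText wraps each word in '<' and '>' before extracting
    # character n-grams, so prefixes and suffixes stay distinguishable.
    token = '<' + word + '>'
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```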
By default, we take all subwords between 3 and 6 characters, but other ranges could be more appropriate for different languages:

```bash
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
```

Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many times the model will loop over your data. By default, we loop over the dataset 5 times. If your dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate (*-lr*). The higher the learning rate, the faster the model converges to a solution, but at the risk of overfitting to the dataset. The default value is 0.05, which is a good compromise. If you want to play with it, we suggest staying in the range [0.01, 1]:

```bash
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
```

Finally, fastText is multi-threaded and uses 12 threads by default. If you have fewer CPU cores (say 4), you can easily set the number of threads using the *-thread* flag:

```bash
$ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
```

## Printing word vectors

Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText.

For example, we can print the word vectors of the words *asparagus*, *pidgey* and *yellow* with the following command:

```bash
$ echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
asparagus 0.46826 -0.20187 -0.29122 -0.17918 0.31289 -0.31679 0.17828 -0.04418 ...
pidgey -0.16065 -0.45867 0.10565 0.036952 -0.11482 0.030053 0.12115 0.39725 ...
yellow -0.39965 -0.41068 0.067086 -0.034611 0.15246 -0.12208 -0.040719 -0.30155 ...
```

A nice feature is that you can also query for words that did not appear in your data! Indeed, words are represented by the sum of their substrings. As long as the unknown word is made of known substrings, there is a representation for it!

As an example, let's try with a misspelled word:

```bash
$ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin
```

You still get a word vector for it! But how good is it? Let's find out in the next sections!

## Nearest neighbor queries

A simple way to check the quality of a word vector is to look at its nearest neighbors. This gives an intuition of the type of semantic information the vectors are able to capture.

This can be achieved with the *nn* functionality. For example, we can query the 10 nearest neighbors of a word by running the following command:

```bash
$ ./fasttext nn result/fil9.bin
Pre-computing word vectors... done.
```

Then we are prompted to type our query word; let us try *asparagus*:

```bash
Query word? asparagus
beetroot 0.812384
tomato 0.806688
horseradish 0.805928
spinach 0.801483
licorice 0.791697
lingonberries 0.781507
asparagales 0.780756
lingonberry 0.778534
celery 0.774529
beets 0.773984
```

Nice! It seems that vegetable vectors are similar.
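The scores printed next to each neighbor are similarity values; as the 'measure of similarity' note below explains, they are the cosines of the angles between the word vectors. A minimal sketch of that computation, assuming two vectors given as NumPy arrays:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 for identical
    # directions, near 0.0 for unrelated ones.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```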
Note that the nearest neighbor is the word *asparagus* itself; this means that this word appeared in the dataset. What about Pokemon?

```bash
Query word? pidgey
pidgeot 0.891801
pidgeotto 0.885109
pidge 0.884739
pidgeon 0.787351
pok 0.781068
pikachu 0.758688
charizard 0.749403
squirtle 0.742582
beedrill 0.741579
charmeleon 0.733625
```

Different evolutions of the same Pokemon have close-by vectors! But what about our misspelled word? Is its vector close to anything reasonable? Let's find out:

```bash
Query word? enviroment
enviromental 0.907951
environ 0.87146
enviro 0.855381
environs 0.803349
environnement 0.772682
enviromission 0.761168
realclimate 0.716746
environment 0.702706
acclimatation 0.697196
ecotourism 0.697081
```

Thanks to the information contained within the word, the vector of our misspelled word matches reasonable words! It is not perfect, but the main information has been captured.

## Advanced reader: measure of similarity

In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors, and we can thus apply simple similarities to them. In particular, we use the cosine of the angle between two vectors. This similarity is computed for all words in the vocabulary, and the 10 most similar words are shown. Of course, if the word appears in the vocabulary, it will appear on top, with a similarity of 1.

## Word analogies

In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France what Berlin is to Germany.

This can be done with the *analogies* functionality. It takes a word triplet (like *Germany Berlin France*) and outputs the analogy:

```bash
$ ./fasttext analogies result/fil9.bin
Pre-computing word vectors... done.
Query triplet (A - B + C)? berlin germany france
paris 0.896462
bourges 0.768954
louveciennes 0.765569
toulouse 0.761916
valenciennes 0.760251
montpellier 0.752747
strasbourg 0.744487
meudon 0.74143
bordeaux 0.740635
pigneaux 0.736122
```

The answer provided by our model is *paris*, which is correct. Let's have a look at a less obvious example:

```bash
Query triplet (A - B + C)? psx sony nintendo
gamecube 0.803352
nintendogs 0.792646
playstation 0.77344
sega 0.772165
gameboy 0.767959
arcade 0.754774
playstationjapan 0.753473
gba 0.752909
dreamcast 0.74907
famicom 0.745298
```

Our model considers that the *nintendo* analogy of a *psx* is the *gamecube*, which seems reasonable. Of course, the quality of the analogies depends on the dataset used to train the model; one can only hope to cover the fields present in the dataset.

## Importance of character n-grams

Using subword-level information is particularly interesting for building vectors for unknown words. For example, the word *gearshift* does not exist on Wikipedia, but we can still query its closest existing words:

```bash
Query word? gearshift
gearing 0.790762
flywheels 0.779804
flywheel 0.777859
gears 0.776133
driveshafts 0.756345
driveshaft 0.755679
daisywheel 0.749998
wheelsets 0.748578
epicycles 0.744268
gearboxes 0.73986
```

Most of the retrieved words share substantial substrings, but a few are actually quite different, like *epicycles*. You can try other words like *sunbathe* or *grandnieces*.

Now that we have seen the interest of subword information for unknown words, let's check how it compares to a model that does not use subword information. To train a model without subwords, just run the following command:

```bash
$ ./fasttext skipgram -input data/fil9 -output result/fil9-none -maxn 0
```

The results are saved in result/fil9-none.vec and result/fil9-none.bin.

To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation*, which is a misspelling of *accommodation*. Here are the nearest neighbors obtained without subwords:

```bash
$ ./fasttext nn result/fil9-none.bin
Query word? accomodation
sunnhordland 0.775057
accomodations 0.769206
administrational 0.753011
laponian 0.752274
ammenities 0.750805
dachas 0.75026
vuosaari 0.74172
hostelling 0.739995
greenbelts 0.733975
asserbo 0.732465
```

The result does not make much sense: most of these words are unrelated. On the other hand, using subword information gives the following list of nearest neighbors:

```bash
Query word? accomodation
accomodations 0.96342
accommodation 0.942124
accommodations 0.915427
accommodative 0.847751
accommodating 0.794353
accomodated 0.740381
amenities 0.729746
catering 0.725975
accomodate 0.703177
hospitality 0.701426
```

The nearest neighbors capture different variations of the word *accommodation*. We also get semantically related words such as *amenities* or *hospitality*.

## Conclusion

In this tutorial, we showed how to obtain word vectors from Wikipedia. This can be done for any language, and we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default settings for 294 of them.

-------------------------------------------------------------------------------- /doc/zh/_templates/layout.html: --------------------------------------------------------------------------------

{% extends "!layout.html" %}

{% block footer %}
{{ super() }}

{% endblock %}

-------------------------------------------------------------------------------- /doc/zh/api.md: --------------------------------------------------------------------------------

# API

We use doxygen to automatically generate our [API documentation](/docs/en/html/index.html).

-------------------------------------------------------------------------------- /doc/zh/cheatsheet.md: --------------------------------------------------------------------------------

# Cheatsheet

## Word representation learning

In order to learn word vectors:

```bash
$ ./fasttext skipgram -input data.txt -output model
```

## Obtaining word vectors

Print word vectors for a text file `queries.txt` containing words:
```bash
$ ./fasttext print-word-vectors model.bin < queries.txt
```

## Text classification

In order to train a text classifier, do:

```bash
$ ./fasttext supervised -input train.txt -output model
```

Once the model has been trained, you can evaluate it by computing precision and recall at k (P@k and R@k) on a test set using:

```bash
$ ./fasttext test model.bin test.txt 1
```

In order to obtain the k most likely labels for a piece of text, use:

```bash
$ ./fasttext predict model.bin test.txt k
```

In order to obtain the k most likely labels and their associated probabilities for a piece of text, use:

```bash
$ ./fasttext predict-prob model.bin test.txt k
```

If you want to compute vector representations of sentences or paragraphs, use:

```bash
$ ./fasttext print-sentence-vectors model.bin < text.txt
```

## Quantization

In order to create a `.ftz` file with a smaller memory footprint, do:

```bash
$ ./fasttext quantize -output model
```

All other commands, such as test, also work with this model:

```bash
$ ./fasttext test model.ftz test.txt
```

-------------------------------------------------------------------------------- /doc/zh/conf.py: --------------------------------------------------------------------------------

# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = u'fasttext 中文文档'
copyright = u'2018, @ApacheCN'
author = u'@ApacheCN'

# The short X.Y version
version = u'v0.1.0'
# The full version, including alpha/beta/rc tags
release = u'v0.1.0'


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx.ext.mathjax',
    'sphinx.ext.ifconfig',
    'sphinx.ext.viewcode',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_parsers = {
    '.md': 'recommonmark.parser.CommonMarkParser',
}

source_suffix = ['.rst', '.md']

# The master toctree document.
master_doc = 'index'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = u'en'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path .
exclude_patterns = [u'_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
html_theme_options = {
    'canonical_url': '',
    'analytics_id': '',
    'logo_only': False,
    'display_version': True,
    'prev_next_buttons_location': 'bottom',
    'style_external_links': False,
    'vcs_pageview_mode': '',
    # Toc options
    'collapse_navigation': False,
    'sticky_navigation': False,
    'navigation_depth': 4,
    'includehidden': True,
    'titles_only': False
}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# The default sidebars (for documents that don't match any pattern) are
# defined by theme itself. Builtin themes are using these templates by
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
# 'searchbox.html']``.
#
# html_sidebars = {}


# -- Options for HTMLHelp output ---------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'fasttextdoc'


# -- Options for LaTeX output ------------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'fasttext.tex', u'fasttext 中文文档 Documentation',
     u'@apachecn', 'manual'),
]


# -- Options for manual page output ------------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'fasttext', u'fasttext 中文文档 Documentation',
     [author], 1)
]


# -- Options for Texinfo output ----------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'fasttext', u'fasttext 中文文档 Documentation',
     author, 'fasttext', 'One line description of project.',
     'Miscellaneous'),
]


# -- Extension configuration -------------------------------------------------

# -- Options for todo extension ----------------------------------------------

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True

-------------------------------------------------------------------------------- /doc/zh/crawl-vectors.md: --------------------------------------------------------------------------------

# Word vectors for 157 languages

We distribute pre-trained word vectors for 157 languages, trained with fastText on [*Common Crawl*](http://commoncrawl.org/) and [*Wikipedia*](https://www.wikipedia.org).

These word vectors were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, and using a window of size 5 with 10 negatives.

We also distribute three new word analogy datasets, for French, Hindi and Polish.

### Format

The word vectors come in both binary and text formats.

Using the binary models, vectors for out-of-vocabulary words can be obtained with:
```
$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt
```

where the file oov_words.txt contains out-of-vocabulary words.

In the text format, each line contains a word followed by its vector.

Each value is space separated, and words are sorted by frequency in descending order.

These text models can easily be loaded in Python using the following code:
```python
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # Materialize the values as a list; in Python 3 a bare map()
        # object would be exhausted after a single iteration.
        data[tokens[0]] = list(map(float, tokens[1:]))
    return data
```

### Tokenization

We used the [*Stanford word segmenter*](https://nlp.stanford.edu/software/segmenter.html) for Chinese, [*Mecab*](http://taku910.github.io/mecab/) for Japanese and [*UETsegmenter*](https://github.com/phongnt570/UETsegmenter) for Vietnamese.

For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the [*Europarl*](http://www.statmt.org/europarl/) preprocessing tools.

For the remaining languages, we used the ICU tokenizer.

More information about the training of these models can be found in the article [*Learning Word Vectors for 157 Languages*](https://arxiv.org/abs/1802.06893).

### License

The word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### References

If you use these word vectors, please cite the following paper:

E. Grave\*, P. Bojanowski\*, P. Gupta, A. Joulin, T. Mikolov, [*Learning Word Vectors for 157 Languages*](https://arxiv.org/abs/1802.06893)

```markup
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
```

### Evaluation datasets

The analogy evaluation datasets described in the paper are available here:
[French](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-fr.txt), [Hindi](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-hi.txt), [Polish](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-analogies/questions-words-pl.txt).
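Before the full list of models below, here is a hedged usage sketch of the `load_vectors` helper from the Format section. The file name `cc.fr.300.vec` and the query word are assumptions, and the brute-force loop is illustrative rather than efficient:

```python
import numpy as np

# Assumes load_vectors() from the Format section above is in scope,
# and that a text model (e.g. cc.fr.300.vec) has been downloaded.
vectors = load_vectors('cc.fr.300.vec')

def nearest(query, k=5):
    # Brute-force cosine-similarity search over the whole vocabulary.
    q = np.asarray(vectors[query])
    q = q / np.linalg.norm(q)
    scores = {}
    for word, vec in vectors.items():
        v = np.asarray(vec)
        scores[word] = float(np.dot(q, v / np.linalg.norm(v)))
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(nearest('paris'))  # the query word itself should come out on top
```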
73 | ### 模型 74 | 75 | 这些词向量可以从如下地址下载 76 | 77 | |||| 78 | |-|-|-| 79 | | 南非荷兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz) | 阿尔巴尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sq.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sq.300.vec.gz) | 阿勒曼尼语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.als.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.als.300.vec.gz) | 80 | | 阿姆哈拉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.am.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.am.300.vec.gz) | 阿拉伯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ar.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ar.300.vec.gz) | 阿拉贡语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.an.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.an.300.vec.gz) | 81 | | 亚美尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hy.300.vec.gz) | 阿萨姆语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.as.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.as.300.vec.gz) | 阿斯图里亚斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ast.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ast.300.vec.gz) | 82 | | 阿塞拜疆语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.az.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.az.300.vec.gz) | 巴什基尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ba.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ba.300.vec.gz) | 巴斯克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eu.300.vec.gz) | 83 | | 巴伐利亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bar.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bar.300.vec.gz) | 白俄罗斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.be.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.be.300.vec.gz) | 孟加拉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bn.300.vec.gz) | 84 | | 比哈里语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bh.300.vec.gz) | Bishnupriya Manipuri: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bpy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bpy.300.vec.gz) | 波斯尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bs.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bs.300.vec.gz) | 85 | | 布列塔尼语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.br.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.br.300.vec.gz) | 保加利亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bg.300.vec.gz) | 缅甸语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.my.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.my.300.vec.gz) | 86 | | 加泰罗尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ca.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ca.300.vec.gz) | 宿务语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ceb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ceb.300.vec.gz) | Central Bicolano: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bcl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bcl.300.vec.gz) | 87 | | 车臣语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ce.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ce.300.vec.gz) | 汉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zh.300.vec.gz) | 楚瓦什语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cv.300.vec.gz) | 88 | | 科西嘉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.co.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.co.300.vec.gz) | 克罗地亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hr.300.vec.gz) | 捷克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cs.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cs.300.vec.gz) | 89 | | 丹麦语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.da.300.vec.gz) | 迪维希语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.dv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.dv.300.vec.gz) | 荷兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nl.300.vec.gz) | 90 | | 东旁遮普邦语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pa.300.vec.gz) | 埃及阿拉伯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.arz.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.arz.300.vec.gz) | 艾米尼亚-Romagnol: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eml.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eml.300.vec.gz) | 91 | | 俄日亚文: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.myv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.myv.300.vec.gz) | 世界语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.eo.300.vec.gz) | 爱沙尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.et.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.et.300.vec.gz) | 92 | | 斐济印地语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hif.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hif.300.vec.gz) | 芬兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fi.300.vec.gz) | 法语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fr.300.vec.gz) | 93 | | 加利西亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gl.300.vec.gz) | 格鲁吉亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ka.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ka.300.vec.gz) | 德语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.vec.gz) | 94 | | Goan Konkani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gom.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gom.300.vec.gz) | 希腊语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.el.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.el.300.vec.gz) | 古吉拉特语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gu.300.vec.gz) | 95 | | 海地语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ht.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ht.300.vec.gz) | 希伯来语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.he.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.he.300.vec.gz) | 希尔马里语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mrj.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mrj.300.vec.gz) | 96 | | 印地语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hi.300.vec.gz) | 匈牙利语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hu.300.vec.gz) | 冰岛语: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.is.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.is.300.vec.gz) | 97 | | 伊多语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.io.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.io.300.vec.gz) | 伊洛卡诺语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ilo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ilo.300.vec.gz) | 印度尼西亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.id.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.id.300.vec.gz) | 98 | | 国际语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ia.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ia.300.vec.gz) | 爱尔兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ga.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ga.300.vec.gz) | 意大利语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.it.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.it.300.vec.gz) | 99 | | 日语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ja.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ja.300.vec.gz) | 爪哇语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.jv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.jv.300.vec.gz) | 卡纳达语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kn.300.vec.gz) | 100 | | 印尼爪哇语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pam.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pam.300.vec.gz) | 哈萨克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.kk.300.vec.gz) | 高棉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.km.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.km.300.vec.gz) | 101 | | 吉尔吉斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ky.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ky.300.vec.gz) | 朝鲜语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ko.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ko.300.vec.gz) | 库尔德语(Kurmanji): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ku.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ku.300.vec.gz) | 102 | | 库尔德语(Sorani): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ckb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ckb.300.vec.gz) | 拉丁语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz) | 
拉脱维亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lv.300.vec.gz) | 103 | | 林堡语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.li.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.li.300.vec.gz) | 立陶宛语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lt.300.vec.gz) | 隆巴德语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lmo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lmo.300.vec.gz) | 104 | | 低撒克逊语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nds.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nds.300.vec.gz) | 卢森堡语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.lb.300.vec.gz) | 马其顿语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mk.300.vec.gz) | 105 | | 迈蒂利语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mai.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mai.300.vec.gz) | 马尔加什语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mg.300.vec.gz) | 马来语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ms.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ms.300.vec.gz) | 106 | | 马拉亚姆语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ml.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ml.300.vec.gz) | 马其他语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mt.300.vec.gz) | 马恩岛语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gv.300.vec.gz) | 107 | | 马拉语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mr.300.vec.gz) | Mazandarani: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mzn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mzn.300.vec.gz) | 东马里语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mhr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mhr.300.vec.gz) | 108 | | 米南加保语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.min.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.min.300.vec.gz) | 明格雷利亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.xmf.300.bin.gz), 
[text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.xmf.300.vec.gz) | 米兰德斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mwl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mwl.300.vec.gz) | 109 | | 蒙古语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.mn.300.vec.gz) | 纳瓦特尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nah.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nah.300.vec.gz) | 那不勒斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nap.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nap.300.vec.gz) | 110 | | 尼泊尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ne.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ne.300.vec.gz) | 尼瓦尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.new.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.new.300.vec.gz) | 北弗里斯兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.frr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.frr.300.vec.gz) | 111 | | 北索托语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nso.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nso.300.vec.gz) | 挪威语 (Bokmål): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.no.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.no.300.vec.gz) | 挪威语 (Nynorsk): [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.nn.300.vec.gz) | 112 | | 奥克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.oc.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.oc.300.vec.gz) | 奥里亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.or.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.or.300.vec.gz) | 南奥塞梯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.os.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.os.300.vec.gz) | 113 | | 普法尔茨德语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pfl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pfl.300.vec.gz) | 普什图语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ps.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ps.300.vec.gz) | 波斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fa.300.vec.gz) | 114 | | 皮埃蒙特语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pms.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pms.300.vec.gz) | 波兰语: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pl.300.vec.gz) | 葡萄牙语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pt.300.vec.gz) | 115 | | 克丘亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.qu.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.qu.300.vec.gz) | 罗马尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ro.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ro.300.vec.gz) | 罗曼什语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.rm.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.rm.300.vec.gz) | 116 | | 俄语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.vec.gz) | 萨哈语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sah.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sah.300.vec.gz) | 梵文: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sa.300.vec.gz) | 117 | | 撒丁岛语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sc.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sc.300.vec.gz) | 苏格兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sco.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sco.300.vec.gz) | 苏格兰盖尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gd.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.gd.300.vec.gz) | 118 | | 塞尔维亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sr.300.vec.gz) | 塞尔维亚 - 克罗地亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sh.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sh.300.vec.gz) | 西西里语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.scn.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.scn.300.vec.gz) | 119 | | 信德语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sd.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sd.300.vec.gz) | 僧伽罗语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.si.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.si.300.vec.gz) | 斯洛伐克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sk.300.vec.gz) | 120 | | 斯洛文尼亚语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sl.300.vec.gz) | 索马里语: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.so.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.so.300.vec.gz) | 南阿塞拜疆: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.azb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.azb.300.vec.gz) | 121 | | 西班牙语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.es.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.es.300.vec.gz) | 巽他语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.su.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.su.300.vec.gz) | 斯瓦希里语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sw.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sw.300.vec.gz) | 122 | | 瑞典语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sv.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.sv.300.vec.gz) | 他加禄语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tl.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tl.300.vec.gz) | 塔吉克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tg.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tg.300.vec.gz) | 123 | | 泰米尔语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ta.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ta.300.vec.gz) | 鞑靼语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tt.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tt.300.vec.gz) | 泰卢固语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.te.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.te.300.vec.gz) | 124 | | 泰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.th.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.th.300.vec.gz) | 藏语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.bo.300.vec.gz) | 土耳其语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tr.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tr.300.vec.gz) | 125 | | 土库曼语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tk.300.vec.gz) | 乌克兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uk.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uk.300.vec.gz) | 上索布族语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hsb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.hsb.300.vec.gz) | 126 | | 乌尔都语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ur.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ur.300.vec.gz) | 维吾尔语: 
[bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ug.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ug.300.vec.gz) | 乌兹别克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uz.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.uz.300.vec.gz) | 127 | | 威尼斯语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vec.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vec.300.vec.gz) | 越南语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vi.300.vec.gz) | 沃拉普克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vo.300.vec.gz) | 128 | | 华隆语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.wa.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.wa.300.vec.gz) | 瓦莱语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.war.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.war.300.vec.gz) | 威尔士语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.cy.300.vec.gz) | 129 | | 西佛兰芒语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vls.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.vls.300.vec.gz) | West 弗里斯兰语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fy.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.fy.300.vec.gz) | 西旁遮普语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pnb.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.pnb.300.vec.gz) | 130 | | 意第绪语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yi.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yi.300.vec.gz) | 约鲁巴语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yo.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.yo.300.vec.gz) | 扎扎其语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.diq.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.diq.300.vec.gz) | 131 | | 泽兰蒂克语: [bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zea.300.bin.gz), [text](https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.zea.300.vec.gz) | 132 | -------------------------------------------------------------------------------- /doc/zh/dataset.md: -------------------------------------------------------------------------------- 1 | # 数据集 2 | 3 | [下载 YFCC100M 数据集](https://fb-public.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn) 4 | -------------------------------------------------------------------------------- /doc/zh/english-vectors.md: -------------------------------------------------------------------------------- 1 | # 英文单词向量 2 | 3 | 这一篇整合了一些之前用 fasttext 训练的词向量。 4 | 5 | ### 下载经过训练的词向量 6 | 7 | 
You can download word vectors pre-trained on different data sources below:

1. [wiki-news-300d-1M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip): 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).
2. [wiki-news-300d-1M-subword.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip): 1 million word vectors with subword information, trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (16B tokens).
3. [crawl-300d-2M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens).

### Format

The first line of the file contains the number of words in the vocabulary and the size of the vectors.
Each line contains a word followed by its vector, like in the default fastText text format.
Each value is space separated.
Words are ordered by descending frequency.

### License

These word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).

### References

If you use these word vectors, please cite the following paper:

T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. [*Advances in Pre-Training Distributed Word Representations*](https://arxiv.org/abs/1712.09405)

```markup
@inproceedings{mikolov2018advances,
  title={Advances in Pre-Training Distributed Word Representations},
  author={Mikolov, Tomas and Grave, Edouard and Bojanowski, Piotr and Puhrsch, Christian and Joulin, Armand},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
```

-------------------------------------------------------------------------------- /doc/zh/faqs.md: --------------------------------------------------------------------------------

# FAQ

## What is fastText? Are there tutorials?

fastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language-related task. A few tutorials are available.

## Why is my fastText model so big?

fastText uses a hash table to represent either words or character ngrams. The size of the hash table directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option '-hash'; for example, a good value is 20000. Another option that greatly impacts the size of a model is the size of the vectors (-dim). This dimension can be reduced to save space, but it can significantly impact performance. If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option:
```bash
./fasttext quantize -output model
```

## What is the best way to represent word phrases rather than words?

Currently, the best approach to represent word phrases or sentences is to take a bag of the phrase's word vectors. Additionally, for phrases like 'New York', preprocessing the data so that they become a single token 'New_York' can greatly help.

## Why does fastText produce vectors even for unknown words?

One of the key features of fastText word representations is the ability to produce vectors for any word, even made-up ones.
Indeed, fastText word vectors are built from the character substrings contained in them.
This even allows creating vectors for misspelled or concatenated words.

## Why is the hierarchical softmax slightly worse in performance than the full softmax?

The hierarchical softmax is an approximation of the full softmax loss that allows training on large numbers of classes efficiently. This often comes at the cost of a little accuracy.
Note also that this loss is designed for the class-imbalanced case, where some classes are more frequent than others. If your dataset has a balanced number of examples per class, it is worth trying the negative sampling loss (-loss ns -neg 100).
However, negative sampling will still be very slow at test time, since the full softmax will be computed.

## Can we run fastText programs on a GPU?

fastText only works on CPU for the sake of accessibility. That said, fastText has been implemented in the caffe2 library, which can run on GPU.

## Can I use fastText from Python? Or other languages?

There are a few unofficial Python and Lua wrappers available on Github.

## Can I use fastText with continuous data?

fastText works on discrete tokens and thus cannot be directly used on continuous tokens. However, one can discretize continuous tokens to use fastText on them, for example by rounding values to a specific digit ("12.3" becomes "12").

## There are misspellings in my dictionary. Should I improve text normalization?

If the words are infrequent, there is no need to worry.

## I'm encountering a NaN. Why could this be?

This is likely caused by a learning rate that is too high. Try reducing the learning rate until you no longer see this error.

## My compiler / architecture can't build fastText. What should I do?

Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms, but sometimes maintaining backward compatibility becomes very hard.
In general, compilers and toolchains that ship with LTS versions of major Linux distributions should be fine. In any case, create an issue on GitHub with your compiler version and architecture, and we will try our best to implement compatibility.
54 | 
55 | 
56 | 
57 | 
58 | 
--------------------------------------------------------------------------------
/doc/zh/language-identification.md:
--------------------------------------------------------------------------------
1 | # 语言识别
2 | 
3 | ### 说明
4 | 
5 | 我们发布了两个语言识别模型,可以识别 176 种语言(请参阅下面的 ISO 代码列表)。 这些模型使用来自 [Wikipedia](https://www.wikipedia.org/)、[Tatoeba](https://tatoeba.org/eng/) 和 [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/) 的数据训练,这些数据依据 [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/) 许可使用。
6 | 
7 | 我们发布两个版本的模型:
8 | 
9 | * [lid.176.bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin),这个模型更快更准确,但文件大小为 126 MB;
10 | * [lid.176.ftz](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz),这是一个只有 917 kB 的压缩版模型。
11 | 
12 | 这些模型都是使用 UTF-8 数据训练的,因此输入也需要使用 UTF-8 编码。
13 | 
14 | ### 许可
15 | 
16 | 这些模型依据 [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/) 发布。
17 | 
18 | ### 支持的语言列表
19 | ```
20 | af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
21 | ```
22 | 
23 | ### 参考
24 | 
25 | 如果您使用这些模型,请引用以下论文:
26 | 
27 | [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
28 | ```
29 | @article{joulin2016bag,
30 |   title={Bag of Tricks for Efficient Text Classification},
31 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
32 |   journal={arXiv preprint arXiv:1607.01759},
33 |   year={2016}
34 | }
35 | ```
36 | [2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)
37 | ```
38 | @article{joulin2016fasttext,
39 |   title={FastText.zip: Compressing text classification models},
40 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
41 |   journal={arXiv preprint arXiv:1612.03651},
42 |   year={2016}
43 | }
44 | ```
45 | 
--------------------------------------------------------------------------------
/doc/zh/options.md:
--------------------------------------------------------------------------------
1 | # 选项列表
2 | 
3 | 不带任何参数地调用命令,可以列出可用参数及其默认值:
4 | 
5 | ```bash
6 | $ ./fasttext supervised
7 | 空的输入或输出路径.
8 | 
9 | 以下参数是强制性的:
10 |   -input              训练文件路径
11 |   -output             输出文件路径
12 | 
13 | 以下参数是可选的:
14 |   -verbose            详细程度 [2]
15 | 
16 | 以下字典参数是可选的:
17 |   -minCount           词出现的最少次数 [5]
18 |   -minCountLabel      标签出现的最少次数 [0]
19 |   -wordNgrams         单词 ngram 的最大长度 [1]
20 |   -bucket             桶的个数 [2000000]
21 |   -minn               char ngram 的最小长度 [3]
22 |   -maxn               char ngram 的最大长度 [6]
23 |   -t                  抽样阈值 [0.0001]
24 |   -label              标签前缀 [__label__]
25 | 
26 | 以下用于训练的参数是可选的:
27 |   -lr                 学习速率 [0.05]
28 |   -lrUpdateRate       更改学习速率的更新速率 [100]
29 |   -dim                词向量的维度 [100]
30 |   -ws                 上下文窗口的大小 [5]
31 |   -epoch              迭代次数 [5]
32 |   -neg                负样本个数 [5]
33 |   -loss               损失函数 {ns, hs, softmax} [ns]
34 |   -thread             线程个数 [12]
35 |   -pretrainedVectors  监督学习的预训练词向量 []
36 |   -saveOutput         是否保存输出参数 [0]
37 | 
38 | 以下量化参数是可选的:
39 |   -cutoff             要保留的词和 ngram 的数量 [0]
40 |   -retrain            微调 embeddings(假如应用了 -cutoff 参数的话) [0]
41 |   -qnorm              单独量化范数 [0]
42 |   -qout               量化分类器 [0]
43 |   -dsub               每个子向量的大小 [2]
44 | ```
45 | 
46 | 默认值可能因模式而异。(词表示模型 `skipgram` 和 `cbow` 使用 5 作为 `-minCount` 的默认值。)
47 | 
--------------------------------------------------------------------------------
/doc/zh/references.md:
--------------------------------------------------------------------------------
1 | # 参考
2 | 
3 | 如果使用此代码学习词语表示, 请引用 [1](#enriching-word-vectors-with-subword-information); 如果用于文本分类, 请引用 [2](#bag-of-tricks-for-efficient-text-classification)。
4 | 
5 | [1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
6 | 
7 | ```markup
8 | @article{bojanowski2016enriching,
9 |   title={Enriching Word Vectors with Subword Information},
10 |   author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
11 |   journal={arXiv preprint arXiv:1607.04606},
12 |   year={2016}
13 | }
14 | ```
15 | 
16 | [2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
17 | 
18 | ```markup
19 | @article{joulin2016bag,
20 |   title={Bag of Tricks for Efficient Text Classification},
21 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
22 |   journal={arXiv preprint arXiv:1607.01759},
23 |   year={2016}
24 | }
25 | ```
26 | 
27 | [3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: 压缩文本分类模型*](https://arxiv.org/abs/1612.03651)
28 | 
29 | ```markup
30 | @article{joulin2016fasttext,
31 |   title={FastText.zip: Compressing text classification models},
32 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
33 |   journal={arXiv preprint arXiv:1612.03651},
34 |   year={2016}
35 | }
36 | ```
37 | 
38 | (\* 这些作者贡献相同。)
--------------------------------------------------------------------------------
/doc/zh/supervised-models.md:
--------------------------------------------------------------------------------
1 | # 监督模型
2 | 
3 | 这个页面收集了几个预先训练好的监督模型,其训练数据来自几个不同的数据集。
4 | 
5 | ### Description(描述)
6 | 
7 | 常规模型使用 [1] 中描述的步骤进行训练, 可以使用我们 github 存储库中的 classification-results.sh 脚本来复现它们。量化模型则是在相应的监督学习设置的基础上, 把以下参数添加到 quantize 子命令中构建的.
8 | 
9 | ```bash
10 | -qnorm -retrain -cutoff 100000
11 | ```
12 | 
13 | ### Table of models(模型表格)
14 | 
15 | 每个条目描述了模型的测试精度和大小。 您可以单击表格单元格来下载相应的模型.
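在浏览下表之前, 这里先给出一个假设性的复现草图(训练文件名 `ag_news.train` 仅为示意, 实际的数据准备请以 classification-results.sh 脚本为准), 展示如何先按监督设置训练常规模型, 再附加上述参数量化得到压缩模型:

```bash
# 先训练常规监督模型, 生成 ag_news.bin
./fasttext supervised -input ag_news.train -output ag_news
# 再附加上文给出的量化参数进行量化, 生成压缩版的 ag_news.ftz
./fasttext quantize -input ag_news.train -output ag_news -qnorm -retrain -cutoff 100000
```

其中 `-retrain` 需要 `-input` 指定训练数据, 以便在裁剪词表后微调 embeddings。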
16 | 
17 | | 数据集 | ag 新闻 | 亚马逊评论全部 | 亚马逊评论极性 | dbpedia |
18 | |-----------|-----------------------|-----------------------|------------------------|------------------------|
19 | | regular | [0.924 / 387MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.bin) | [0.603 / 462MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.bin) | [0.946 / 471MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.bin) | [0.986 / 427MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.bin) |
20 | | compressed | [0.92 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.ftz) | [0.599 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.ftz) | [0.93 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.ftz) | [0.984 / 1.7MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.ftz) |
21 | 
22 | | 数据集 | 搜狗新闻 | 雅虎回答 | yelp 评论极性 | yelp 评论全部 |
23 | |-----------|----------------------|------------------------|----------------------|------------------------|
24 | | regular | [0.969 / 402MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.bin) | [0.724 / 494MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.bin) | [0.957 / 409MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.bin) | [0.639 / 412MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.bin) |
25 | | compressed | [0.968 / 1.4MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.ftz) | [0.717 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.ftz) | [0.957 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.ftz) | [0.636 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.ftz) |
26 | 
27 | ### References(参考)
28 | 
29 | 如果您使用这些模型, 请引用以下文章:
30 | 
31 | [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
32 | 
33 | ```markup
34 | @article{joulin2016bag,
35 |   title={Bag of Tricks for Efficient Text Classification},
36 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
37 |   journal={arXiv preprint arXiv:1607.01759},
38 |   year={2016}
39 | }
40 | ```
41 | 
42 | [2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: 压缩文本分类模型*](https://arxiv.org/abs/1612.03651)
43 | 
44 | ```markup
45 | @article{joulin2016fasttext,
46 |   title={FastText.zip: Compressing text classification models},
47 |   author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
48 |   journal={arXiv preprint arXiv:1612.03651},
49 |   year={2016}
50 | }
51 | ```
52 | 
--------------------------------------------------------------------------------
/doc/zh/supervised-tutorial.md:
--------------------------------------------------------------------------------
1 | # 文本分类
2 | 
3 | 文本分类是许多应用的核心问题,例如垃圾邮件检测、情感分析或智能回复。 在本教程中,我们将介绍如何使用 fastText 工具构建文本分类器。
4 | 
5 | ## 什么是文本分类?
6 | 
7 | 文本分类的目标是将文档(如电子邮件、博文、短信、产品评论等)分为一个或多个类别。 类别可以是评论的打分、垃圾邮件与非垃圾邮件,或者文档所用的语言。 如今,构建这种分类器的主要方法是机器学习,即从样本中学习分类规则。 为了构建这样的分类器,我们需要标注数据,它由文档及其相应的类别(也称为标签或标注)组成。
8 | 
9 | 举个例子,我们将构建一个分类器,把 [stack exchange](https://stackexchange.com/) 网站上有关烹饪的问题自动分类为几个可能的标签之一,例如 `pot`、`bowl` 或 `baking`。
10 | 
11 | ## 安装 fastText
12 | 
13 | 本教程的第一步是安装并构建 fastText。 它只需要一个能够良好支持 c++11 的 c++ 编译器。
14 | 
15 | 让我们从下载[最新版本](https://github.com/facebookresearch/fastText/releases)开始:
16 | 
17 | ```bash
18 | $ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
19 | $ unzip v0.1.0.zip
20 | ```
21 | 
22 | 进入 fastText 目录并构建它:
23 | 
24 | ```bash
25 | $ cd fastText-0.1.0
26 | $ make
27 | ```
28 | 
29 | 运行不带任何参数的二进制文件,会打印概要文档,显示 fastText 支持的不同用例:
30 | 
31 | ```bash
32 | >> ./fasttext
33 | usage: fasttext <command> <args>
34 | 
35 | The commands supported by fasttext are:
36 | 
37 |   supervised              训练一个监督分类器
38 |   quantize                量化模型以减少内存使用量
39 |   test                    评估一个监督分类器
40 |   predict                 预测最有可能的标签
41 |   predict-prob            预测最可能的标签并给出概率
42 |   skipgram                训练一个 skipgram 模型
43 |   cbow                    训练一个 cbow 模型
44 |   print-word-vectors      给定训练好的模型,打印出所有的单词向量
45 |   print-sentence-vectors  给定训练好的模型,打印出所有的句子向量
46 |   nn                      查询最近邻
47 |   analogies               查询词类比
48 | 
49 | ```
50 | 
51 | 在本教程中,我们主要使用 `supervised`、`test` 和 `predict` 子命令,它们对应于学习(和使用)文本分类器。 有关 fastText 其他功能的介绍,请参见[单词向量的教程](https://fasttext.cc/docs/en/unsupervised-tutorial.html)。
52 | 
53 | ## 获取和准备数据
54 | 
55 | 正如上述介绍中所提到的,我们需要标注数据来训练监督分类器。 在本教程中,我们将构建一个分类器来自动识别烹饪问题的类别。 让我们从 [Stack Exchange 网站的烹饪版块](http://cooking.stackexchange.com/)下载问题示例及其相应标签:
56 | 
57 | ```bash
58 | >> wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
59 | >> head cooking.stackexchange.txt
60 | ```
61 | 
62 | 文本文件的每一行都包含一个标签列表,其后是相应的文档。 所有标签都以 `__label__` 前缀开始,这是 fastText 区分标签与普通单词的方式。 然后对模型进行训练,以预测给定文档的标签。
63 | 
64 | 在训练第一个分类器之前,我们需要将数据分为训练集和验证集。 我们将使用验证集来评估学到的分类器在新数据上的表现。
65 | 
66 | ```bash
67 | >> wc cooking.stackexchange.txt
68 | 15404 169582 1401900 cooking.stackexchange.txt
69 | ```
70 | 
71 | 我们的完整数据集包含 15404 个样本。 我们将其分为一个包含 12404 个样本的训练集和一个包含 3000 个样本的验证集:
72 | 
73 | ```bash
74 | >> head -n 12404 cooking.stackexchange.txt > cooking.train
75 | >> tail -n 3000 cooking.stackexchange.txt > cooking.valid
76 | ```
77 | 
78 | ## 我们的第一个分类器
79 | 
80 | 我们现在准备好训练第一个分类器了:
81 | 
82 | ```bash
83 | >> ./fasttext supervised -input cooking.train -output model_cooking
84 | Read 0M words
85 | Number of words: 14598
86 | Number of labels: 734
87 | Progress: 100.0%  words/sec/thread: 75109  lr: 0.000000  loss: 5.708354  eta: 0h0m
88 | ```
89 | 
90 | 命令行选项 `-input` 指示包含训练样本的文件,选项 `-output` 指示模型的保存位置。 训练结束时,当前目录中会生成一个 `model_cooking.bin` 文件,其中包含训练好的模型。
91 | 
92 | 可以通过命令行直接交互式地测试分类器:
93 | 
94 | ```bash
95 | >> ./fasttext predict model_cooking.bin -
96 | ```
97 | 
98 | 然后输入一个句子。我们来试一下:
99 | 
100 | *Which baking dish is best to bake a banana bread ?*
101 | 
102 | 其预测的标签是 `baking`,非常贴切。 现在让我们尝试第二个例子:
103 | 
104 | *Why not put knives in the dishwasher?*
105 | 
106 | 模型预测出的标签是 `food-safety`,这并不相关。 不知何故,模型似乎在简单的例子上失败了。 为了更好地了解其预测质量,让我们在验证数据上运行下面的命令进行测试:
107 | 
108 | ```bash
109 | >> ./fasttext test model_cooking.bin cooking.valid
110 | N 3000
111 | P@1 0.124
112 | R@1 0.0541
113 | Number of examples: 3000
114 | ```
115 | 
116 | 只预测一个最可能的标签时,fastText 输出的是精确度 (`P@1`) 和召回率 (`R@1`)。我们也可以让模型预测五个标签,并计算相应的精确度和召回率:
117 | 
118 | ```bash
119 | >> ./fasttext test model_cooking.bin cooking.valid 5
120 | N 3000
121 | P@5 0.0668
122 | R@5 0.146
123 | Number of examples: 3000
124 | ```
125 | 
126 | ## 高级读者:精确度和召回率
127 | 
128 | 精确度是 fastText 所预测的标签中正确标签所占的比例。 召回率是所有真实标签中被成功预测出来的比例。 我们举一个例子来说明这一点:
129 | 
130 | *Why not put knives in the dishwasher?*
131 | 
132 | 在 Stack Exchange 上,这句话标有三个标签:`equipment`,`cleaning` 和 `knives`。 模型预测出的前五名标签可以通过以下方式获得:
133 | 
134 | ```bash
135 | >> ./fasttext predict model_cooking.bin - 5
136 | ```
137 | 
138 | 前五名是 `food-safety`,`baking`,`equipment`,`substitutions` 和 `bread`。
139 | 
140 | 因此,模型预测的五个标签中有一个是正确的,精确度为 0.20。 在三个真实标签中,只有 `equipment` 标签被该模型预测出,召回率为 0.33。
141 | 
142 | 更多详细信息,请参阅[相关维基百科页面](https://en.wikipedia.org/wiki/Precision_and_recall)。
143 | 
144 | ## 使模型更好
145 | 
146 | 使用默认参数运行 fastText 所获得的模型在分类新问题时表现非常糟糕。 让我们尝试通过更改默认参数来提高性能。
147 | 
148 | ### 预处理数据
149 | 
150 | 观察这些数据,我们发现有些单词包含大写字母或标点符号。 提高模型性能的第一步,就是做一些简单的预处理。 粗略的规范化可以用 `sed` 和 `tr` 这样的命令行工具完成:
151 | 
152 | ```bash
153 | >> cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
154 | >> head -n 12404 cooking.preprocessed.txt > cooking.train
155 | >> tail -n 3000 cooking.preprocessed.txt > cooking.valid
156 | ```
157 | 
158 | 让我们在预处理过的数据上训练一个新的模型:
159 | 
160 | ```bash
161 | >> ./fasttext supervised -input cooking.train -output model_cooking
162 | Read 0M words
163 | Number of words: 9012
164 | Number of labels: 734
165 | Progress: 100.0% words/sec/thread: 82041 lr: 0.000000 loss: 5.671649 eta: 0h0m
166 | 
167 | >> ./fasttext test model_cooking.bin cooking.valid
168 | N 3000
169 | P@1 0.164
170 | R@1 0.0717
171 | Number of examples: 3000
172 | ```
173 | 
174 | 我们观察到,经过预处理,词汇量变小了(从 14k 降到 9k),精确度也随之提高了 4%!
175 | 
176 | ### 更多的迭代和更快的学习速率
177 | 
178 | 我们的训练集只有 12k 个训练样本,而默认情况下 fastText 在训练过程中仅把每个训练样本看五次,这太少了。 可以使用 `-epoch` 选项增加每个样本出现的次数(也称为迭代数):
179 | 
180 | ```bash
181 | >> ./fasttext supervised -input cooking.train -output model_cooking -epoch 25
182 | Read 0M words
183 | Number of words: 9012
184 | Number of labels: 734
185 | Progress: 100.0% words/sec/thread: 77633 lr: 0.000000 loss: 7.147976 eta: 0h0m
186 | ```
187 | 
188 | 让我们测试一下新模型:
189 | 
190 | ```bash
191 | >> ./fasttext test model_cooking.bin cooking.valid
192 | N 3000
193 | P@1 0.501
194 | R@1 0.218
195 | Number of examples: 3000
196 | ```
197 | 
198 | 这好多了! 另一种改变模型学习速度的方法是增大(或减小)算法的学习速率, 它对应于处理每个样本后模型变化的幅度。 学习速率为 0 意味着模型根本不会改变,因此学不到任何东西。好的学习速率在 `0.1 - 1.0` 范围内。
199 | 
200 | ```bash
201 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0
202 | Read 0M words
203 | Number of words: 9012
204 | Number of labels: 734
205 | Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m
206 | 
207 | >> ./fasttext test model_cooking.bin cooking.valid
208 | N 3000
209 | P@1 0.563
210 | R@1 0.245
211 | Number of examples: 3000
212 | ```
213 | 
214 | 更好了!我们试试让学习速率和迭代次数两个参数一起变化:
215 | 
216 | ```bash
217 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25
218 | Read 0M words
219 | Number of words: 9012
220 | Number of labels: 734
221 | Progress: 100.0% words/sec/thread: 76394 lr: 0.000000 loss: 4.350277 eta: 0h0m
222 | 
223 | >> ./fasttext test model_cooking.bin cooking.valid
224 | N 3000
225 | P@1 0.585
226 | R@1 0.255
227 | Number of examples: 3000
228 | ```
229 | 
230 | 现在让我们再添加一些特征来进一步提高性能!
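在进入下一节之前,顺便给出一个假设性的小脚本(非官方工具,仅为示意),用于在验证集上粗略扫描 `-lr` 与 `-epoch` 的组合,帮助挑选较好的超参数:

```bash
# 对若干学习速率与迭代次数的组合分别训练,并在验证集上评估 P@1 / R@1
for lr in 0.1 0.5 1.0; do
  for epoch in 5 25; do
    ./fasttext supervised -input cooking.train -output model_tmp -lr "$lr" -epoch "$epoch"
    echo "== lr=$lr epoch=$epoch =="
    ./fasttext test model_tmp.bin cooking.valid
  done
done
```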
231 | 
232 | ### word n-grams
233 | 
234 | 最后,我们可以使用 word bigrams(而不仅仅是 unigrams)来提高模型的性能。 这对于词序很重要的分类问题尤其关键,例如情感分析。
235 | 
236 | ```bash
237 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2
238 | Read 0M words
239 | Number of words: 9012
240 | Number of labels: 734
241 | Progress: 100.0% words/sec/thread: 75366 lr: 0.000000 loss: 3.226064 eta: 0h0m
242 | 
243 | >> ./fasttext test model_cooking.bin cooking.valid
244 | N 3000
245 | P@1 0.599
246 | R@1 0.261
247 | Number of examples: 3000
248 | ```
249 | 
250 | 只需几个步骤,我们的精确度就从 12.4% 提升到了 59.9%。 重要步骤包括:
251 | 
252 | * 预处理数据 ;
253 | * 改变迭代次数 (使用选项 `-epoch`, 标准范围 `[5 - 50]`) ;
254 | * 改变学习速率 (使用选项 `-lr`, 标准范围 `[0.1 - 1.0]`) ;
255 | * 使用 word n-grams (使用选项 `-wordNgrams`, 标准范围 `[1 - 5]`).
256 | 
257 | ## 高级读者: 什么是 Bigram?
258 | 
259 | 'unigram' 指的是单个不可分割的单位或标记,通常用作模型的输入。 例如,根据模型的不同,'unigram' 可以是一个单词或一个字母。 在 fastText 中,我们在单词级别工作,因此 unigram 就是单词。
260 | 
261 | 类似地,我们用 'bigram' 表示 2 个连续标记或单词的连接。 更一般地,我们常说的 n-gram 指的是任意 n 个连续标记或单词的连接。
262 | 
263 | 例如,在 'Last donut of the night' 这个句子中,unigrams 是 'last','donut','of','the' 和 'night';bigrams 是 'Last donut', 'donut of', 'of the' 和 'the night'。
264 | 
265 | Bigrams 特别有趣,因为对于大多数句子,只需查看 n-gram 的集合即可重建句子中单词的顺序。
266 | 
267 | 让我们通过一个简单的练习来说明这一点:给定以下 bigrams,试着重构原始的句子:'all out','I am','of bubblegum','out of' 和 'am all'。
268 | 
269 | 
270 | ## 扩大规模
271 | 
272 | 由于我们只用几千个样本来训练模型,所以训练只需几秒钟。但是在更大的数据集、更多标签上训练模型可能会太慢。 让训练更快的一个可行方案是使用 hierarchical softmax(分层 softmax)来代替常规 softmax。分层 softmax 用一棵基于标签频率构建的树来近似完全 softmax,把每个样本的计算量从与类别数成线性降低到对数级别。 这可以通过选项 `-loss hs` 完成:
273 | 
274 | ```bash
275 | >> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs
276 | Read 0M words
277 | Number of words: 9012
278 | Number of labels: 734
279 | Progress: 100.0% words/sec/thread: 2199406 lr: 0.000000 loss: 1.718807 eta: 0h0m
280 | ```
281 | 
282 | 现在训练时间应该不到一秒。
283 | 
284 | ## 结论
285 | 
286 | 在本教程中,我们简要介绍了如何使用 fastText 来训练强大的文本分类器, 并对一些最重要的可调选项(如 `-epoch`)做了简要介绍。
287 | 
--------------------------------------------------------------------------------
/doc/zh/support.md:
--------------------------------------------------------------------------------
1 | # 快速入门
2 | 
3 | ## 什么是 fastText?
4 | 
5 | fastText 是一个用于高效学习单词表示和句子分类的库。
6 | 
7 | ## 要求
8 | 
9 | fastText 可以在现代 Mac OS 和 Linux 发行版上构建.
10 | 由于它使用了 C++11 的特性, 因此需要具有良好 C++11 支持的编译器.
11 | 这些包括:
12 | 
13 | * (gcc-4.6.3 or newer) or (clang-3.3 or newer)
14 | 
15 | 编译使用 Makefile 进行, 因此您需要有一个可用的 **make**.
对于单词相似性评估脚本, 您需要:
16 | 
17 | * python 2.6 or newer
18 | * numpy & scipy
19 | 
20 | ## 构建 fastText
21 | 
22 | 
23 | 为了构建 `fastText`, 请使用以下命令:
24 | 
25 | ```bash
26 | $ git clone https://github.com/facebookresearch/fastText.git
27 | $ cd fastText
28 | $ make
29 | ```
30 | 
31 | 这将为所有的类生成目标文件, 并生成主二进制文件 `fasttext`。
32 | 如果您不打算使用默认的系统编译器, 请更新 Makefile 开头定义的两个宏 (CC 和 INCLUDES)。
33 | 
34 | 
--------------------------------------------------------------------------------
/doc/zh/unsupervised-tutorials.md:
--------------------------------------------------------------------------------
1 | # 词表示
2 | 
3 | 现代机器学习中的一个流行做法是用向量表示单词。这些向量捕获了关于语言的隐藏信息,比如词的类比关系或语义,还能用来提高文本分类器的性能。
4 | 
5 | 在本教程中,我们将演示如何使用 fastText 构建这些词向量。要下载并安装 fastText,请按照[文本分类教程](https://fasttext.cc/docs/en/supervised-tutorial.html)的第一步进行操作。
6 | 
7 | ## 获取数据
8 | 
9 | 为了计算词向量,你需要一个大型的文本语料库。语料库不同,词向量捕获的信息也会不同。在本教程中,我们使用维基百科的文章,但也可以考虑其他来源,如新闻或网页抓取(更多示例在[这里](http://statmt.org/))。要下载维基百科的原始转储,请运行下面的命令:
10 | 
11 | ```bash
12 | wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
13 | ```
14 | 
15 | 下载维基百科语料库需要一些时间。作为替代,我们可以只使用英语维基百科的前 10 亿字节(大概 1G 不到)。它可以在 Matt Mahoney 的[网站](http://mattmahoney.net/)上找到:
16 | 
17 | ```bash
18 | $ mkdir data
19 | $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
20 | $ unzip data/enwik9.zip -d data
21 | ```
22 | 
23 | 原始维基百科转储包含大量的 HTML/XML 数据。我们使用与 fastText 一起打包的 `wikifil.pl` 脚本对其进行预处理(该脚本最初由 Matt Mahoney 开发,可以在他的[网站](http://mattmahoney.net/)上找到)
24 | 
25 | ```bash
26 | $ perl wikifil.pl data/enwik9 > data/fil9
27 | ```
28 | 
29 | 我们可以通过运行下面的命令来检查文件:
30 | 
31 | ```bash
32 | $ head -c 80 data/fil9
33 | anarchism originated as a term of abuse first used against early working class
34 | ```
35 | 
36 | 这段文本已经过良好的预处理,可以用来学习我们的词向量。
37 | 
38 | ## 训练词向量
39 | 
40 | 基于这份数据学习词向量,只需要一条命令:
41 | 
42 | ```bash
43 | $ mkdir result
44 | $ ./fasttext skipgram -input data/fil9 -output result/fil9
45 | ```
46 | 
47 | 解释一下这行命令:`./fasttext` 以 **skipgram** 模型(也可以换成 **cbow** 模型)调用 fastText 可执行程序(如何安装 fastText 见前文)。`-input` 选项指定输入数据的位置,`-output` 指定输出要保存的位置。
48 | 
49 | 当 fastText 运行时,屏幕上会显示进度和预计完成时间。一旦程序结束,`result` 目录中应该有两个文件:
50 | 
51 | ```bash
52 | $ ls -l result
53 | -rw-r--r-- 1 bojanowski 1876110778 978480850 Dec 20 11:01 fil9.bin
54 | -rw-r--r-- 1 bojanowski 1876110778 190004182 Dec 20 11:01 fil9.vec
55 | ```
56 | 
57 | `fil9.bin` 是一个二进制文件,存储了整个 fastText 模型,之后可以重新加载。`fil9.vec` 是一个包含词向量的文本文件,词汇表中每个单词对应一行:
58 | 
59 | ```bash
60 | $ head -n 4 result/fil9.vec
61 | 218316 100
62 | the -0.10363 -0.063669 0.032436 -0.040798 0.53749 0.00097867 0.10083 0.24829 ...
63 | of -0.0083724 0.0059414 -0.046618 -0.072735 0.83007 0.038895 -0.13634 0.60063 ...
64 | one 0.32731 0.044409 -0.46484 0.14716 0.7431 0.24684 -0.11301 0.51721 0.73262 ...
65 | ```
66 | 
67 | 第一行说明了单词数量和向量维度。随后各行是词汇表中所有单词的词向量,按词频降序排列。
68 | 
69 | ## 高级读者:skipgram 与 cbow 两种模型
70 | 
71 | fastText 提供了两种用于计算词表示的模型:skipgram 和 cbow('**c**ontinuous-**b**ag-**o**f-**w**ords')。
72 | 
73 | skipgram 模型通过近邻的某个单词来学习预测目标单词;而 cbow 模型则根据目标词的上下文来预测目标词,上下文指目标词左右两侧固定窗口内的单词。
74 | 
75 | 让我们用一个例子来说明这种差异:给定句子 *'Poets have been mysteriously silent on the subject of cheese'* 和目标单词 '*silent*',skipgram 模型随机取一个近邻词来尝试预测目标词,如 '*subject*' 或 '*mysteriously*';cbow 模型则使用目标单词左右固定窗口内的所有单词,如 {*been*, *mysteriously*, *on*, *the*},并用它们的向量之和来预测目标单词。下图用另一个例子总结了这种差异。
76 | 
77 | ![cbow vs skipgram](https://fasttext.cc/img/cbo_vs_skipgram.png)
78 | 要使用 fastText 训练 cbow 模型,请运行下面这个命令:
79 | 
80 | ```bash
81 | ./fasttext cbow -input data/fil9 -output result/fil9
82 | ```
83 | 
84 | 在实践中,我们观察到 skipgram 模型比 cbow 模型在利用子词信息方面效果更好。
85 | 
86 | 
87 | ## 高级读者:调整参数
88 | 
89 | 到目前为止,我们使用默认参数运行 fastText,但根据数据的不同,这些参数可能不是最优的。让我们介绍一下词向量的一些关键参数。
90 | 
91 | 模型最重要的参数是维度和子词的长度范围。维度(*dim*)控制向量的大小:维度越大,能捕获的信息越多,但也需要更多的数据来学习;如果维度太大,模型会越来越难训练。默认情况下,我们使用 100 维,一般取 100 到 300 范围内的值。子词是单词中长度介于最小值(*minn*)和最大值(*maxn*)之间的所有子字符串。默认情况下,我们取 3 到 6 个字符的所有子词,但适用范围可能因语言而异:
92 | 
93 | ```bash
94 | $ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim 300
95 | ```
96 | 
97 | 根据你拥有的数据量,可能需要更改训练参数。*epoch* 参数控制对数据循环训练多少次。默认情况下,我们遍历数据集 5 次。如果数据集非常庞大,可以少遍历几次。另一个重要参数是学习率(*-lr*)。学习率越高,模型收敛到最优解的速度越快,但过拟合数据集的风险也越高。默认值是 0.05,这是一个很好的折中值。如果想调整它,我们建议保持在 [0.01, 1] 的范围内:
98 | 
99 | ```bash
100 | $ ./fasttext skipgram -input data/fil9 -output result/fil9 -epoch 1 -lr 0.5
101 | ```
102 | 
103 | 最后,fastText 是多线程的,默认使用 12 个线程。如果 CPU 核心数较少(比如只有 4 个),可以用 *-thread* 参数轻松设置线程数:
104 | 
105 | ```bash
106 | $ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
107 | ```
108 | 
109 | 
110 | ## 打印词向量
111 | 
112 | 直接从 `fil9.vec` 文件中搜索和打印某个词向量非常麻烦。幸运的是,fastText 提供了 `print-word-vectors` 功能。
113 | 
114 | 例如,我们可以使用以下命令打印单词 *asparagus*,*pidgey* 和 *yellow* 的词向量:
115 | 
116 | ```bash
117 | $ echo "asparagus pidgey yellow" | ./fasttext print-word-vectors result/fil9.bin
118 | asparagus 0.46826 -0.20187 -0.29122 -0.17918 0.31289 -0.31679 0.17828 -0.04418 ...
119 | pidgey -0.16065 -0.45867 0.10565 0.036952 -0.11482 0.030053 0.12115 0.39725 ...
120 | yellow -0.39965 -0.41068 0.067086 -0.034611 0.15246 -0.12208 -0.040719 -0.30155 ...
121 | ```
122 | 
123 | 一个很好的特性是,你还可以查询没有出现在数据中的单词!实际上,单词由其所含子串向量的总和来表示。只要未知单词是由已知的子串组成的,就能得到它的表示!
124 | 
125 | 这里有一个例子,让我们试一个拼写错误的单词:
126 | 
127 | ```bash
128 | $ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin
129 | ```
130 | 
131 | 你仍然得到了一个词向量!但它的质量如何?让我们在接下来的章节中一探究竟!
132 | 
133 | ## 最近邻查询
134 | 
135 | 查看最近邻是检验词向量质量的一种简单方法,它能让我们直观地了解向量捕获到的语义信息类型。
136 | 
137 | 这可以通过 *nn* 功能来实现。例如,我们可以通过运行以下命令来查询单词的最近邻:
138 | 
139 | ```bash
140 | $ ./fasttext nn result/fil9.bin
141 | Pre-computing word vectors... done.
142 | ```
143 | 
144 | 然后程序会提示输入查询词,让我们试试 *asparagus*:
145 | 
146 | ```bash
147 | Query word? asparagus
148 | beetroot 0.812384
149 | tomato 0.806688
150 | horseradish 0.805928
151 | spinach 0.801483
152 | licorice 0.791697
153 | lingonberries 0.781507
154 | asparagales 0.780756
155 | lingonberry 0.778534
156 | celery 0.774529
157 | beets 0.773984
158 | ```
159 | 
160 | 太好了!看来蔬菜(vegetable)类单词的向量彼此相似。请注意,最近邻是 *asparagus* 本身,这意味着这个词出现在了数据集中。那么 pokemons 呢?
161 | 
162 | ```bash
163 | Query word? 
pidgey
164 | pidgeot 0.891801
165 | pidgeotto 0.885109
166 | pidge 0.884739
167 | pidgeon 0.787351
168 | pok 0.781068
169 | pikachu 0.758688
170 | charizard 0.749403
171 | squirtle 0.742582
172 | beedrill 0.741579
173 | charmeleon 0.733625
174 | ```
175 | 
176 | 同一 pokemon 的不同进化形态拥有相近的向量!但我们拼错的那个单词呢,它的向量是否接近某些合理的词?让我们看看:
177 | 
178 | ```bash
179 | Query word? enviroment
180 | enviromental 0.907951
181 | environ 0.87146
182 | enviro 0.855381
183 | environs 0.803349
184 | environnement 0.772682
185 | enviromission 0.761168
186 | realclimate 0.716746
187 | environment 0.702706
188 | acclimatation 0.697196
189 | ecotourism 0.697081
190 | ```
191 | 
192 | 得益于这个词所包含的子词信息,我们拼写错误的单词的向量匹配到了合理的单词!虽然并不完美,但主要的信息已经捕获到了。
193 | 
194 | ## 高级读者:计算相似度
195 | 
196 | 为了找到最近邻,我们需要计算单词之间的相似度分数。我们的单词用连续的词向量表示,因此可以对它们应用简单的相似度度量。具体来说,我们使用两个向量夹角的余弦:计算查询词与词汇表中所有单词的相似度,并显示 10 个最相似的单词。当然,如果查询词出现在词汇表中,它会出现在结果顶部,相似度为 1。
197 | 
198 | ## 词的类比
199 | 
200 | 本着类似的思路,我们还可以做词类比。例如,我们可以看看模型能否根据"柏林是德国的首都"猜出法国的首都是什么。
201 | 
202 | 这可以通过 *analogies* 功能来完成。它接受一个词三元组(如 *berlin germany france*)并输出类比结果:
203 | 
204 | ```bash
205 | $ ./fasttext analogies result/fil9.bin
206 | Pre-computing word vectors... done.
207 | Query triplet (A - B + C)? berlin germany france
208 | paris 0.896462
209 | bourges 0.768954
210 | louveciennes 0.765569
211 | toulouse 0.761916
212 | valenciennes 0.760251
213 | montpellier 0.752747
214 | strasbourg 0.744487
215 | meudon 0.74143
216 | bordeaux 0.740635
217 | pigneaux 0.736122
218 | ```
219 | 
220 | 我们的模型给出了正确的答案 *Paris*。让我们再看一个不太明显的例子:
221 | 
222 | ```bash
223 | Query triplet (A - B + C)? psx sony nintendo
224 | gamecube 0.803352
225 | nintendogs 0.792646
226 | playstation 0.77344
227 | sega 0.772165
228 | gameboy 0.767959
229 | arcade 0.754774
230 | playstationjapan 0.753473
231 | gba 0.752909
232 | dreamcast 0.74907
233 | famicom 0.745298
234 | ```
235 | 
236 | 我们的模型认为,*psx* 在 *nintendo* 阵营中的类比对象是 *gamecube*,这似乎是合理的。当然,类比的质量取决于训练模型所用的数据集,而且只对数据集中出现过的单词有效。
237 | 
238 | ## 字符 n-gram 的重要性
239 | 
240 | 利用子词级信息为未知单词构建向量特别有意思。例如,维基百科上不存在 *gearshift* 这个词,但我们仍然可以查询与它最接近的已有词语:
241 | 
242 | ```bash
243 | Query word? gearshift
244 | gearing 0.790762
245 | flywheels 0.779804
246 | flywheel 0.777859
247 | gears 0.776133
248 | driveshafts 0.756345
249 | driveshaft 0.755679
250 | daisywheel 0.749998
251 | wheelsets 0.748578
252 | epicycles 0.744268
253 | gearboxes 0.73986
254 | ```
255 | 
256 | 检索到的大多数单词与查询词共享大量的子字符串,但少数实际上相当不同,比如 *cogwheel*。你还可以尝试其他单词,如 *sunbathe* 或 *grandnieces*。
257 | 
258 | 现在我们已经看到了子词信息对未知单词的价值,让我们把它与不使用子词信息的模型进行比较。要训练不使用子词的模型,只需运行以下命令:
259 | 
260 | ```bash
261 | $ ./fasttext skipgram -input data/fil9 -output result/fil9-none -maxn 0
262 | ```
263 | 
264 | 结果保存在 `result/fil9-none.vec` 和 `result/fil9-none.bin` 中。
265 | 
266 | 为了说明这种差异,让我们取一个在维基百科中不常见的单词,例如 *accomodation*,它是 *accommodation* 的错误拼写。这里是不使用子词时的最近邻:
267 | 
268 | ```bash
269 | $ ./fasttext nn result/fil9-none.bin
270 | Query word? accomodation
271 | sunnhordland 0.775057
272 | accomodations 0.769206
273 | administrational 0.753011
274 | laponian 0.752274
275 | ammenities 0.750805
276 | dachas 0.75026
277 | vuosaari 0.74172
278 | hostelling 0.739995
279 | greenbelts 0.733975
280 | asserbo 0.732465
281 | ```
282 | 
283 | 结果没有多大意义,大部分词都是不相关的。另一方面,使用子词信息则给出以下最近邻列表:
284 | 
285 | ```bash
286 | Query word? 
accomodation
287 | accomodations 0.96342
288 | accommodation 0.942124
289 | accommodations 0.915427
290 | accommodative 0.847751
291 | accommodating 0.794353
292 | accomodated 0.740381
293 | amenities 0.729746
294 | catering 0.725975
295 | accomodate 0.703177
296 | hospitality 0.701426
297 | ```
298 | 
299 | 这一次,最近邻捕获到了 *accommodation* 一词的各种拼写变体。我们还得到了语义相关的词语,例如 *amenities* 或 *lodging*。
300 | 
301 | ## 结论
302 | 
303 | 在本教程中,我们展示了如何从维基百科数据得到词向量。同样的流程适用于任何语言,我们也为 294 种语言提供了使用默认设置训练的[预训练模型](https://fasttext.cc/docs/en/pretrained-vectors.html)。
304 | 
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | 
4 | 
5 | 
6 | 
7 | 
8 | 
9 | 
10 | 
11 | 
12 | 
13 | 
14 | 
15 | 
16 | 
17 | 
18 | 
19 | 
20 |
now loading...
21 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /update.sh: -------------------------------------------------------------------------------- 1 | git add -A 2 | git commit -am "$(date "+%Y-%m-%d %H:%M:%S")" 3 | git push --------------------------------------------------------------------------------