├── LICENSE ├── README.txt ├── compute-accuracy.c ├── demo-analogy.sh ├── demo-classes.sh ├── demo-phrase-accuracy.sh ├── demo-phrases.sh ├── demo-train-big-model-v1.sh ├── demo-word-accuracy.sh ├── demo-word.sh ├── distance.c ├── makefile ├── questions-phrases.txt ├── questions-words.txt ├── word-analogy.c ├── word2phrase.c └── word2vec.c /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. 
For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.txt: -------------------------------------------------------------------------------- 1 | Tools for computing distributed representations of words 2 | ------------------------------------------------------ 3 | 4 | We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.
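For readers who want a concrete picture of what the training optimizes, the following is a minimal, illustrative C sketch of a single Skip-gram update with negative sampling. It is a simplification of the inner loop of word2vec.c, not a drop-in replacement: the array names syn0 (input vectors) and syn1neg (output vectors) mirror the ones used there, while sample_negative() is a hypothetical stand-in for drawing a word index from the unigram-table noise distribution.

  #include <math.h>

  /* One stochastic update for a (center, context) word pair with `negative` noise samples.
     Assumes syn0 and syn1neg are vocab_size x size row-major float arrays. */
  void skipgram_ns_update(float *syn0, float *syn1neg, long long center, long long context,
                          long long size, int negative, float alpha,
                          long long (*sample_negative)(void)) {
    float neu1e[size];                                /* gradient accumulated for the input vector */
    for (long long c = 0; c < size; c++) neu1e[c] = 0;
    for (int d = 0; d < negative + 1; d++) {
      long long target = (d == 0) ? context : sample_negative();
      int label = (d == 0) ? 1 : 0;                   /* 1 for the observed pair, 0 for noise */
      if (d > 0 && target == context) continue;       /* skip accidental positives */
      float f = 0;
      for (long long c = 0; c < size; c++) f += syn0[center * size + c] * syn1neg[target * size + c];
      float g = (label - 1.0f / (1.0f + expf(-f))) * alpha;   /* (label - sigmoid) * learning rate */
      for (long long c = 0; c < size; c++) {
        neu1e[c] += g * syn1neg[target * size + c];   /* gradient w.r.t. the input vector */
        syn1neg[target * size + c] += g * syn0[center * size + c];
      }
    }
    for (long long c = 0; c < size; c++) syn0[center * size + c] += neu1e[c];
  }

The CBOW variant differs mainly in that the input vector is the average of the context-word vectors rather than a single word vector; word2vec.c additionally clips the dot product and uses a precomputed sigmoid table for speed.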
5 | 6 | Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous 7 | Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following: 8 | - desired vector dimensionality 9 | - the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model 10 | - training algorithm: hierarchical softmax and / or negative sampling 11 | - threshold for downsampling the frequent words 12 | - number of threads to use 13 | - the format of the output word vector file (text or binary) 14 | 15 | Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets. 16 | 17 | The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training 18 | is finished, the user can interactively explore the similarity of the words. 19 | 20 | More information about the scripts is provided at https://code.google.com/p/word2vec/ 21 | 22 | -------------------------------------------------------------------------------- /compute-accuracy.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License.
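//
// Overview: compute-accuracy loads a trained vector file, optionally keeps only the
// first <threshold> entries (the most frequent words, since the model file is sorted
// by frequency) for a faster approximate evaluation, and normalizes every vector to
// unit length. It then reads analogy questions from stdin: lines starting with ":"
// open a new section, and every other line holds four words "a b c d". The predicted
// answer for d is the vocabulary word closest (by cosine similarity) to
// vec(b) - vec(a) + vec(c), and top-1 accuracy is printed per section as well as
// aggregated into semantic (the first five sections) and syntactic totals.
//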
14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | 22 | const long long max_size = 2000; // max length of strings 23 | const long long N = 1; // number of closest words 24 | const long long max_w = 50; // max length of vocabulary entries 25 | 26 | int main(int argc, char **argv) 27 | { 28 | FILE *f; 29 | char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size]; 30 | float dist, len, bestd[N], vec[max_size]; 31 | long long words, size, a, b, c, d, b1, b2, b3, threshold = 0; 32 | float *M; 33 | char *vocab; 34 | int TCN, CCN = 0, TACN = 0, CACN = 0, SECN = 0, SYCN = 0, SEAC = 0, SYAC = 0, QID = 0, TQ = 0, TQS = 0; 35 | if (argc < 2) { 36 | printf("Usage: ./compute-accuracy \nwhere FILE contains word projections, and threshold is used to reduce vocabulary of the model for fast approximate evaluation (0 = off, otherwise typical value is 30000)\n"); 37 | return 0; 38 | } 39 | strcpy(file_name, argv[1]); 40 | if (argc > 2) threshold = atoi(argv[2]); 41 | f = fopen(file_name, "rb"); 42 | if (f == NULL) { 43 | printf("Input file not found\n"); 44 | return -1; 45 | } 46 | fscanf(f, "%lld", &words); 47 | if (threshold) if (words > threshold) words = threshold; 48 | fscanf(f, "%lld", &size); 49 | vocab = (char *)malloc(words * max_w * sizeof(char)); 50 | M = (float *)malloc(words * size * sizeof(float)); 51 | if (M == NULL) { 52 | printf("Cannot allocate memory: %lld MB\n", words * size * sizeof(float) / 1048576); 53 | return -1; 54 | } 55 | for (b = 0; b < words; b++) { 56 | a = 0; 57 | while (1) { 58 | vocab[b * max_w + a] = fgetc(f); 59 | if (feof(f) || (vocab[b * max_w + a] == ' ')) break; 60 | if ((a < max_w) && (vocab[b * max_w + a] != '\n')) a++; 61 | } 62 | vocab[b * max_w + a] = 0; 63 | for (a = 0; a < max_w; a++) vocab[b * max_w + a] = toupper(vocab[b * max_w + a]); 64 | for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f); 65 | len = 0; 66 | for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size]; 67 | len = sqrt(len); 68 | for (a = 0; a < size; a++) M[a + b * size] /= len; 69 | } 70 | fclose(f); 71 | TCN = 0; 72 | while (1) { 73 | for (a = 0; a < N; a++) bestd[a] = 0; 74 | for (a = 0; a < N; a++) bestw[a][0] = 0; 75 | scanf("%s", st1); 76 | for (a = 0; a < strlen(st1); a++) st1[a] = toupper(st1[a]); 77 | if ((!strcmp(st1, ":")) || (!strcmp(st1, "EXIT")) || feof(stdin)) { 78 | if (TCN == 0) TCN = 1; 79 | if (QID != 0) { 80 | printf("ACCURACY TOP1: %.2f %% (%d / %d)\n", CCN / (float)TCN * 100, CCN, TCN); 81 | printf("Total accuracy: %.2f %% Semantic accuracy: %.2f %% Syntactic accuracy: %.2f %% \n", CACN / (float)TACN * 100, SEAC / (float)SECN * 100, SYAC / (float)SYCN * 100); 82 | } 83 | QID++; 84 | scanf("%s", st1); 85 | if (feof(stdin)) break; 86 | printf("%s:\n", st1); 87 | TCN = 0; 88 | CCN = 0; 89 | continue; 90 | } 91 | if (!strcmp(st1, "EXIT")) break; 92 | scanf("%s", st2); 93 | for (a = 0; a < strlen(st2); a++) st2[a] = toupper(st2[a]); 94 | scanf("%s", st3); 95 | for (a = 0; a bestd[a]) { 122 | for (d = N - 1; d > a; d--) { 123 | bestd[d] = bestd[d - 1]; 124 | strcpy(bestw[d], bestw[d - 1]); 125 | } 126 | bestd[a] = dist; 127 | strcpy(bestw[a], &vocab[c * max_w]); 128 | break; 129 | } 130 | } 131 | } 132 | if (!strcmp(st4, bestw[0])) { 133 | CCN++; 134 | CACN++; 135 | if (QID <= 5) SEAC++; else SYAC++; 136 | } 137 | if (QID <= 5) SECN++; else SYCN++; 138 | TCN++; 139 | TACN++; 140 | } 141 | printf("Questions seen / total: %d %d %.2f %% \n", TQS, TQ, 
TQS/(float)TQ*100); 142 | return 0; 143 | } 144 | -------------------------------------------------------------------------------- /demo-analogy.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! -e text8 ]; then 3 | wget http://mattmahoney.net/dc/text8.zip -O text8.gz 4 | gzip -d text8.gz -f 5 | fi 6 | echo --------------------------------------------------------------------------------------------------- 7 | echo Note that for the word analogy to perform well, the model should be trained on much larger data set 8 | echo Example input: paris france berlin 9 | echo --------------------------------------------------------------------------------------------------- 10 | time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 11 | ./word-analogy vectors.bin 12 | -------------------------------------------------------------------------------- /demo-classes.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! -e text8 ]; then 3 | wget http://mattmahoney.net/dc/text8.zip -O text8.gz 4 | gzip -d text8.gz -f 5 | fi 6 | time ./word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500 7 | sort classes.txt -k 2 -n > classes.sorted.txt 8 | echo The word classes were saved to file classes.sorted.txt 9 | -------------------------------------------------------------------------------- /demo-phrase-accuracy.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! -e news.2012.en.shuffled ]; then 3 | wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz 4 | gzip -d news.2012.en.shuffled.gz -f 5 | fi 6 | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0 7 | time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2 8 | time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2 9 | tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1 10 | time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15 11 | ./compute-accuracy vectors-phrase.bin < questions-phrases.txt 12 | -------------------------------------------------------------------------------- /demo-phrases.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! 
-e news.2012.en.shuffled ]; then 3 | wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz 4 | gzip -d news.2012.en.shuffled.gz -f 5 | fi 6 | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0 7 | time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2 8 | time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2 9 | tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1 10 | time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15 11 | ./distance vectors-phrase.bin 12 | -------------------------------------------------------------------------------- /demo-train-big-model-v1.sh: -------------------------------------------------------------------------------- 1 | ############################################################################################### 2 | # 3 | # Script for training good word and phrase vector model using public corpora, version 1.0. 4 | # The training time will be from several hours to about a day. 5 | # 6 | # Downloads about 8 billion words, makes phrases using two runs of word2phrase, trains 7 | # a 500-dimensional vector model and evaluates it on word and phrase analogy tasks. 8 | # 9 | ############################################################################################### 10 | 11 | # This function will convert text to lowercase and remove special characters 12 | normalize_text() { 13 | awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \ 14 | -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/
/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ 15 | -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ 16 | -e 's/«/ /g' | tr 0-9 " " 17 | } 18 | 19 | mkdir word2vec 20 | cd word2vec 21 | 22 | wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz 23 | wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz 24 | gzip -d news.2012.en.shuffled.gz 25 | gzip -d news.2013.en.shuffled.gz 26 | normalize_text < news.2012.en.shuffled > data.txt 27 | normalize_text < news.2013.en.shuffled >> data.txt 28 | 29 | wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz 30 | tar -xvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz 31 | for i in `ls 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled`; do 32 | normalize_text < 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/$i >> data.txt 33 | done 34 | 35 | wget http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus 36 | tar -zxvf umbc_webbase_corpus.tar.gz webbase_all/*.txt 37 | for i in `ls webbase_all`; do 38 | normalize_text < webbase_all/$i >> data.txt 39 | done 40 | 41 | wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 42 | bzip2 -c -d enwiki-latest-pages-articles.xml.bz2 | awk '{print tolower($0);}' | perl -e ' 43 | # Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase 44 | # letters (a-z, converted from A-Z), and spaces (never consecutive)... 45 | # All other characters are converted to spaces. Only text which normally appears. 46 | # in the web browser is displayed. Tables are removed. Image captions are. 47 | # preserved. Links are converted to normal text. Digits are spelled out. 48 | # *** Modified to not spell digits or throw away non-ASCII characters *** 49 | 50 | # Written by Matt Mahoney, June 10, 2006. This program is released to the public domain. 51 | 52 | $/=">"; # input record separator 53 | while (<>) { 54 | if (/ ... 55 | if (/#redirect/i) {$text=0;} # remove #REDIRECT 56 | if ($text) { 57 | 58 | # Remove any text not normally visible 59 | if (/<\/text>/) {$text=0;} 60 | s/<.*>//; # remove xml tags 61 | s/&/&/g; # decode URL encoded chars 62 | s/<//g; 64 | s///g; # remove references ... 
65 | s/<[^>]*>//g; # remove xhtml tags 66 | s/\[http:[^] ]*/[/g; # remove normal url, preserve visible text 67 | s/\|thumb//ig; # remove images links, preserve caption 68 | s/\|left//ig; 69 | s/\|right//ig; 70 | s/\|\d+px//ig; 71 | s/\[\[image:[^\[\]]*\|//ig; 72 | s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig; # show categories without markup 73 | s/\[\[[a-z\-]*:[^\]]*\]\]//g; # remove links to other languages 74 | s/\[\[[^\|\]]*\|/[[/g; # remove wiki url, preserve visible text 75 | s/{{[^}]*}}//g; # remove {{icons}} and {tables} 76 | s/{[^}]*}//g; 77 | s/\[//g; # remove [ and ] 78 | s/\]//g; 79 | s/&[^;]*;/ /g; # remove URL encoded chars 80 | 81 | $_=" $_ "; 82 | chop; 83 | print $_; 84 | } 85 | } 86 | ' | normalize_text | awk '{if (NF>1) print;}' >> data.txt 87 | 88 | wget http://word2vec.googlecode.com/svn/trunk/word2vec.c 89 | wget http://word2vec.googlecode.com/svn/trunk/word2phrase.c 90 | wget http://word2vec.googlecode.com/svn/trunk/compute-accuracy.c 91 | wget http://word2vec.googlecode.com/svn/trunk/questions-words.txt 92 | wget http://word2vec.googlecode.com/svn/trunk/questions-phrases.txt 93 | gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -funroll-loops 94 | gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -funroll-loops 95 | gcc compute-accuracy.c -o compute-accuracy -lm -pthread -O3 -march=native -funroll-loops 96 | ./word2phrase -train data.txt -output data-phrase.txt -threshold 200 -debug 2 97 | ./word2phrase -train data-phrase.txt -output data-phrase2.txt -threshold 100 -debug 2 98 | ./word2vec -train data-phrase2.txt -output vectors.bin -cbow 1 -size 500 -window 10 -negative 10 -hs 0 -sample 1e-5 -threads 40 -binary 1 -iter 3 -min-count 10 99 | ./compute-accuracy vectors.bin 400000 < questions-words.txt # should get to almost 78% accuracy on 99.7% of questions 100 | ./compute-accuracy vectors.bin 1000000 < questions-phrases.txt # about 78% accuracy with 77% coverage 101 | -------------------------------------------------------------------------------- /demo-word-accuracy.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! -e text8 ]; then 3 | wget http://mattmahoney.net/dc/text8.zip -O text8.gz 4 | gzip -d text8.gz -f 5 | fi 6 | time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 7 | ./compute-accuracy vectors.bin 30000 < questions-words.txt 8 | # to compute accuracy with the full vocabulary, use: ./compute-accuracy vectors.bin < questions-words.txt 9 | -------------------------------------------------------------------------------- /demo-word.sh: -------------------------------------------------------------------------------- 1 | make 2 | if [ ! -e text8 ]; then 3 | wget http://mattmahoney.net/dc/text8.zip -O text8.gz 4 | gzip -d text8.gz -f 5 | fi 6 | time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 7 | ./distance vectors.bin 8 | -------------------------------------------------------------------------------- /distance.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 
5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | 20 | const long long max_size = 2000; // max length of strings 21 | const long long N = 40; // number of closest words that will be shown 22 | const long long max_w = 50; // max length of vocabulary entries 23 | 24 | int main(int argc, char **argv) { 25 | FILE *f; 26 | char st1[max_size]; 27 | char *bestw[N]; 28 | char file_name[max_size], st[100][max_size]; 29 | float dist, len, bestd[N], vec[max_size]; 30 | long long words, size, a, b, c, d, cn, bi[100]; 31 | float *M; 32 | char *vocab; 33 | if (argc < 2) { 34 | printf("Usage: ./distance \nwhere FILE contains word projections in the BINARY FORMAT\n"); 35 | return 0; 36 | } 37 | strcpy(file_name, argv[1]); 38 | f = fopen(file_name, "rb"); 39 | if (f == NULL) { 40 | printf("Input file not found\n"); 41 | return -1; 42 | } 43 | fscanf(f, "%lld", &words); 44 | fscanf(f, "%lld", &size); 45 | vocab = (char *)malloc((long long)words * max_w * sizeof(char)); 46 | for (a = 0; a < N; a++) bestw[a] = (char *)malloc(max_size * sizeof(char)); 47 | M = (float *)malloc((long long)words * (long long)size * sizeof(float)); 48 | if (M == NULL) { 49 | printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size); 50 | return -1; 51 | } 52 | for (b = 0; b < words; b++) { 53 | a = 0; 54 | while (1) { 55 | vocab[b * max_w + a] = fgetc(f); 56 | if (feof(f) || (vocab[b * max_w + a] == ' ')) break; 57 | if ((a < max_w) && (vocab[b * max_w + a] != '\n')) a++; 58 | } 59 | vocab[b * max_w + a] = 0; 60 | for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f); 61 | len = 0; 62 | for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size]; 63 | len = sqrt(len); 64 | for (a = 0; a < size; a++) M[a + b * size] /= len; 65 | } 66 | fclose(f); 67 | while (1) { 68 | for (a = 0; a < N; a++) bestd[a] = 0; 69 | for (a = 0; a < N; a++) bestw[a][0] = 0; 70 | printf("Enter word or sentence (EXIT to break): "); 71 | a = 0; 72 | while (1) { 73 | st1[a] = fgetc(stdin); 74 | if ((st1[a] == '\n') || (a >= max_size - 1)) { 75 | st1[a] = 0; 76 | break; 77 | } 78 | a++; 79 | } 80 | if (!strcmp(st1, "EXIT")) break; 81 | cn = 0; 82 | b = 0; 83 | c = 0; 84 | while (1) { 85 | st[cn][b] = st1[c]; 86 | b++; 87 | c++; 88 | st[cn][b] = 0; 89 | if (st1[c] == 0) break; 90 | if (st1[c] == ' ') { 91 | cn++; 92 | b = 0; 93 | c++; 94 | } 95 | } 96 | cn++; 97 | for (a = 0; a < cn; a++) { 98 | for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st[a])) break; 99 | if (b == words) b = -1; 100 | bi[a] = b; 101 | printf("\nWord: %s Position in vocabulary: %lld\n", st[a], bi[a]); 102 | if (b == -1) { 103 | printf("Out of dictionary word!\n"); 104 | break; 105 | } 106 | } 107 | if (b == -1) continue; 108 | printf("\n Word Cosine distance\n------------------------------------------------------------------------\n"); 109 | for (a = 0; a < size; a++) vec[a] = 0; 110 | for (b = 0; b < cn; b++) { 111 | if (bi[b] == -1) continue; 112 | for (a = 0; a < size; a++) vec[a] += M[a + bi[b] * size]; 113 | } 114 | len = 0; 
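    // vec now holds the sum of the vectors of all query words. It is normalized to
    // unit length below; because every row of M was normalized when the file was
    // loaded, the dot product computed for each vocabulary word further down is the
    // cosine similarity, and the N best matches are kept in the sorted bestd/bestw lists.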
115 | for (a = 0; a < size; a++) len += vec[a] * vec[a]; 116 | len = sqrt(len); 117 | for (a = 0; a < size; a++) vec[a] /= len; 118 | for (a = 0; a < N; a++) bestd[a] = -1; 119 | for (a = 0; a < N; a++) bestw[a][0] = 0; 120 | for (c = 0; c < words; c++) { 121 | a = 0; 122 | for (b = 0; b < cn; b++) if (bi[b] == c) a = 1; 123 | if (a == 1) continue; 124 | dist = 0; 125 | for (a = 0; a < size; a++) dist += vec[a] * M[a + c * size]; 126 | for (a = 0; a < N; a++) { 127 | if (dist > bestd[a]) { 128 | for (d = N - 1; d > a; d--) { 129 | bestd[d] = bestd[d - 1]; 130 | strcpy(bestw[d], bestw[d - 1]); 131 | } 132 | bestd[a] = dist; 133 | strcpy(bestw[a], &vocab[c * max_w]); 134 | break; 135 | } 136 | } 137 | } 138 | for (a = 0; a < N; a++) printf("%50s\t\t%f\n", bestw[a], bestd[a]); 139 | } 140 | return 0; 141 | } 142 | -------------------------------------------------------------------------------- /makefile: -------------------------------------------------------------------------------- 1 | CC = gcc 2 | #Using -Ofast instead of -O3 might result in faster code, but is supported only by newer GCC versions 3 | CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result 4 | 5 | all: word2vec word2phrase distance word-analogy compute-accuracy 6 | 7 | word2vec : word2vec.c 8 | $(CC) word2vec.c -o word2vec $(CFLAGS) 9 | word2phrase : word2phrase.c 10 | $(CC) word2phrase.c -o word2phrase $(CFLAGS) 11 | distance : distance.c 12 | $(CC) distance.c -o distance $(CFLAGS) 13 | word-analogy : word-analogy.c 14 | $(CC) word-analogy.c -o word-analogy $(CFLAGS) 15 | compute-accuracy : compute-accuracy.c 16 | $(CC) compute-accuracy.c -o compute-accuracy $(CFLAGS) 17 | chmod +x *.sh 18 | 19 | clean: 20 | rm -rf word2vec word2phrase distance word-analogy compute-accuracy -------------------------------------------------------------------------------- /word-analogy.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
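//
// Overview: the user enters three words "a b c". The program builds the query
// vector vec = v(b) - v(a) + v(c), normalizes it, and prints the N vocabulary words
// with the highest cosine similarity to it, skipping the three input words
// themselves. With the example from demo-analogy.sh, entering "paris france berlin"
// searches for words close to france - paris + berlin, ideally returning germany.
//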
14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | 20 | const long long max_size = 2000; // max length of strings 21 | const long long N = 40; // number of closest words that will be shown 22 | const long long max_w = 50; // max length of vocabulary entries 23 | 24 | int main(int argc, char **argv) { 25 | FILE *f; 26 | char st1[max_size]; 27 | char bestw[N][max_size]; 28 | char file_name[max_size], st[100][max_size]; 29 | float dist, len, bestd[N], vec[max_size]; 30 | long long words, size, a, b, c, d, cn, bi[100]; 31 | float *M; 32 | char *vocab; 33 | if (argc < 2) { 34 | printf("Usage: ./word-analogy \nwhere FILE contains word projections in the BINARY FORMAT\n"); 35 | return 0; 36 | } 37 | strcpy(file_name, argv[1]); 38 | f = fopen(file_name, "rb"); 39 | if (f == NULL) { 40 | printf("Input file not found\n"); 41 | return -1; 42 | } 43 | fscanf(f, "%lld", &words); 44 | fscanf(f, "%lld", &size); 45 | vocab = (char *)malloc((long long)words * max_w * sizeof(char)); 46 | M = (float *)malloc((long long)words * (long long)size * sizeof(float)); 47 | if (M == NULL) { 48 | printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size); 49 | return -1; 50 | } 51 | for (b = 0; b < words; b++) { 52 | a = 0; 53 | while (1) { 54 | vocab[b * max_w + a] = fgetc(f); 55 | if (feof(f) || (vocab[b * max_w + a] == ' ')) break; 56 | if ((a < max_w) && (vocab[b * max_w + a] != '\n')) a++; 57 | } 58 | vocab[b * max_w + a] = 0; 59 | for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f); 60 | len = 0; 61 | for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size]; 62 | len = sqrt(len); 63 | for (a = 0; a < size; a++) M[a + b * size] /= len; 64 | } 65 | fclose(f); 66 | while (1) { 67 | for (a = 0; a < N; a++) bestd[a] = 0; 68 | for (a = 0; a < N; a++) bestw[a][0] = 0; 69 | printf("Enter three words (EXIT to break): "); 70 | a = 0; 71 | while (1) { 72 | st1[a] = fgetc(stdin); 73 | if ((st1[a] == '\n') || (a >= max_size - 1)) { 74 | st1[a] = 0; 75 | break; 76 | } 77 | a++; 78 | } 79 | if (!strcmp(st1, "EXIT")) break; 80 | cn = 0; 81 | b = 0; 82 | c = 0; 83 | while (1) { 84 | st[cn][b] = st1[c]; 85 | b++; 86 | c++; 87 | st[cn][b] = 0; 88 | if (st1[c] == 0) break; 89 | if (st1[c] == ' ') { 90 | cn++; 91 | b = 0; 92 | c++; 93 | } 94 | } 95 | cn++; 96 | if (cn < 3) { 97 | printf("Only %lld words were entered.. 
three words are needed at the input to perform the calculation\n", cn); 98 | continue; 99 | } 100 | for (a = 0; a < cn; a++) { 101 | for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st[a])) break; 102 | if (b == words) b = 0; 103 | bi[a] = b; 104 | printf("\nWord: %s Position in vocabulary: %lld\n", st[a], bi[a]); 105 | if (b == 0) { 106 | printf("Out of dictionary word!\n"); 107 | break; 108 | } 109 | } 110 | if (b == 0) continue; 111 | printf("\n Word Distance\n------------------------------------------------------------------------\n"); 112 | for (a = 0; a < size; a++) vec[a] = M[a + bi[1] * size] - M[a + bi[0] * size] + M[a + bi[2] * size]; 113 | len = 0; 114 | for (a = 0; a < size; a++) len += vec[a] * vec[a]; 115 | len = sqrt(len); 116 | for (a = 0; a < size; a++) vec[a] /= len; 117 | for (a = 0; a < N; a++) bestd[a] = 0; 118 | for (a = 0; a < N; a++) bestw[a][0] = 0; 119 | for (c = 0; c < words; c++) { 120 | if (c == bi[0]) continue; 121 | if (c == bi[1]) continue; 122 | if (c == bi[2]) continue; 123 | a = 0; 124 | for (b = 0; b < cn; b++) if (bi[b] == c) a = 1; 125 | if (a == 1) continue; 126 | dist = 0; 127 | for (a = 0; a < size; a++) dist += vec[a] * M[a + c * size]; 128 | for (a = 0; a < N; a++) { 129 | if (dist > bestd[a]) { 130 | for (d = N - 1; d > a; d--) { 131 | bestd[d] = bestd[d - 1]; 132 | strcpy(bestw[d], bestw[d - 1]); 133 | } 134 | bestd[a] = dist; 135 | strcpy(bestw[a], &vocab[c * max_w]); 136 | break; 137 | } 138 | } 139 | } 140 | for (a = 0; a < N; a++) printf("%50s\t\t%f\n", bestw[a], bestd[a]); 141 | } 142 | return 0; 143 | } 144 | -------------------------------------------------------------------------------- /word2phrase.c: -------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
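//
// Overview: word2phrase makes two passes over the training text. The first pass
// (LearnVocabFromTrainFile) counts unigrams and, for every pair of adjacent words,
// the joined bigram "w1_w2". The second pass (TrainModel) rescans the text and
// scores each adjacent pair as
//
//   score = (count(w1_w2) - min_count) / (count(w1) * count(w2)) * train_words
//
// Pairs whose score exceeds the -threshold value are written out joined by '_', so
// that downstream tools treat them as a single token. Running the tool twice with
// decreasing thresholds, as the demo scripts do, builds up longer phrases.
//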
14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | 21 | #define MAX_STRING 60 22 | 23 | const int vocab_hash_size = 500000000; // Maximum 500M entries in the vocabulary 24 | 25 | typedef float real; // Precision of float numbers 26 | 27 | struct vocab_word { 28 | long long cn; 29 | char *word; 30 | }; 31 | 32 | char train_file[MAX_STRING], output_file[MAX_STRING]; 33 | struct vocab_word *vocab; 34 | int debug_mode = 2, min_count = 5, *vocab_hash, min_reduce = 1; 35 | long long vocab_max_size = 10000, vocab_size = 0; 36 | long long train_words = 0; 37 | real threshold = 100; 38 | 39 | unsigned long long next_random = 1; 40 | 41 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 42 | void ReadWord(char *word, FILE *fin, char *eof) { 43 | int a = 0, ch; 44 | while (1) { 45 | ch = fgetc_unlocked(fin); 46 | if (ch == EOF) { 47 | *eof = 1; 48 | break; 49 | } 50 | if (ch == 13) continue; 51 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 52 | if (a > 0) { 53 | if (ch == '\n') ungetc(ch, fin); 54 | break; 55 | } 56 | if (ch == '\n') { 57 | //strcpy(word, (char *)""); 58 | word[0] = '\n'; 59 | word[1] = 0; 60 | return; 61 | } else continue; 62 | } 63 | word[a] = ch; 64 | a++; 65 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 66 | } 67 | word[a] = 0; 68 | } 69 | 70 | // Returns hash value of a word 71 | int GetWordHash(char *word) { 72 | unsigned long long a, hash = 1; 73 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 74 | hash = hash % vocab_hash_size; 75 | return hash; 76 | } 77 | 78 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 79 | int SearchVocab(char *word) { 80 | unsigned int hash = GetWordHash(word); 81 | while (1) { 82 | if (vocab_hash[hash] == -1) return -1; 83 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 84 | hash = (hash + 1) % vocab_hash_size; 85 | } 86 | return -1; 87 | } 88 | 89 | // Reads a word and returns its index in the vocabulary 90 | int ReadWordIndex(FILE *fin, char *eof) { 91 | char word[MAX_STRING], eof_l = 0; 92 | ReadWord(word, fin, &eof_l); 93 | if (eof_l) { 94 | *eof = 1; 95 | return -1; 96 | } 97 | return SearchVocab(word); 98 | } 99 | 100 | // Adds a word to the vocabulary 101 | int AddWordToVocab(char *word) { 102 | unsigned int hash, length = strlen(word) + 1; 103 | if (length > MAX_STRING) length = MAX_STRING; 104 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 105 | strcpy(vocab[vocab_size].word, word); 106 | vocab[vocab_size].cn = 0; 107 | vocab_size++; 108 | // Reallocate memory if needed 109 | if (vocab_size + 2 >= vocab_max_size) { 110 | vocab_max_size += 10000; 111 | vocab=(struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 112 | } 113 | hash = GetWordHash(word); 114 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 115 | vocab_hash[hash]=vocab_size - 1; 116 | return vocab_size - 1; 117 | } 118 | 119 | // Used later for sorting by word counts 120 | int VocabCompare(const void *a, const void *b) { 121 | return ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 122 | } 123 | 124 | // Sorts the vocabulary by frequency using word counts 125 | void SortVocab() { 126 | int a; 127 | unsigned int hash; 128 | // Sort the vocabulary and keep at the first position 129 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 130 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 131 | for (a = 0; a 
< vocab_size; a++) { 132 | // Words occuring less than min_count times will be discarded from the vocab 133 | if (vocab[a].cn < min_count) { 134 | vocab_size--; 135 | free(vocab[vocab_size].word); 136 | } else { 137 | // Hash will be re-computed, as after the sorting it is not actual 138 | hash = GetWordHash(vocab[a].word); 139 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 140 | vocab_hash[hash] = a; 141 | } 142 | } 143 | vocab = (struct vocab_word *)realloc(vocab, vocab_size * sizeof(struct vocab_word)); 144 | } 145 | 146 | // Reduces the vocabulary by removing infrequent tokens 147 | void ReduceVocab() { 148 | int a, b = 0; 149 | unsigned int hash; 150 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 151 | vocab[b].cn = vocab[a].cn; 152 | vocab[b].word = vocab[a].word; 153 | b++; 154 | } else free(vocab[a].word); 155 | vocab_size = b; 156 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 157 | for (a = 0; a < vocab_size; a++) { 158 | // Hash will be re-computed, as it is not actual 159 | hash = GetWordHash(vocab[a].word); 160 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 161 | vocab_hash[hash] = a; 162 | } 163 | fflush(stdout); 164 | min_reduce++; 165 | } 166 | 167 | void LearnVocabFromTrainFile() { 168 | char word[MAX_STRING], last_word[MAX_STRING], bigram_word[MAX_STRING * 2], eof = 0; 169 | FILE *fin; 170 | long long a, b, i, start = 1; 171 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 172 | fin = fopen(train_file, "rb"); 173 | if (fin == NULL) { 174 | printf("ERROR: training data file not found!\n"); 175 | exit(1); 176 | } 177 | vocab_size = 0; 178 | AddWordToVocab((char *)""); 179 | while (1) { 180 | ReadWord(word, fin, &eof); 181 | if (eof) break; 182 | if (word[0] == '\n') { 183 | start = 1; 184 | continue; 185 | } else start = 0; 186 | train_words++; 187 | if ((debug_mode > 1) && (train_words % 1000000 == 0)) { 188 | printf("Words processed: %lldM Vocab size: %lldK %c", train_words / 1000000, vocab_size / 1000, 13); 189 | fflush(stdout); 190 | } 191 | i = SearchVocab(word); 192 | if (i == -1) { 193 | a = AddWordToVocab(word); 194 | vocab[a].cn = 1; 195 | } else vocab[i].cn++; 196 | if (start) continue; 197 | //sprintf(bigram_word, "%s_%s", last_word, word); 198 | a = 0; 199 | b = 0; 200 | while (last_word[a]) { 201 | bigram_word[b] = last_word[a]; 202 | a++; 203 | b++; 204 | } 205 | bigram_word[b] = '_'; 206 | b++; 207 | a = 0; 208 | while (word[a]) { 209 | bigram_word[b] = word[a]; 210 | a++; 211 | b++; 212 | } 213 | bigram_word[b] = 0; 214 | bigram_word[MAX_STRING - 1] = 0; 215 | // 216 | strcpy(last_word, word); 217 | i = SearchVocab(bigram_word); 218 | if (i == -1) { 219 | a = AddWordToVocab(bigram_word); 220 | vocab[a].cn = 1; 221 | } else vocab[i].cn++; 222 | if (vocab_size > vocab_hash_size * 0.7) ReduceVocab(); 223 | } 224 | SortVocab(); 225 | if (debug_mode > 0) { 226 | printf("\nVocab size (unigrams + bigrams): %lld\n", vocab_size); 227 | printf("Words in train file: %lld\n", train_words); 228 | } 229 | fclose(fin); 230 | } 231 | 232 | void TrainModel() { 233 | long long a, b, pa = 0, pb = 0, pab = 0, oov, i, li = -1, cn = 0; 234 | char word[MAX_STRING], last_word[MAX_STRING], bigram_word[MAX_STRING * 2], eof = 0; 235 | real score; 236 | unsigned long long next_random = 1; 237 | FILE *fo, *fin; 238 | printf("Starting training using file %s\n", train_file); 239 | LearnVocabFromTrainFile(); 240 | fin = fopen(train_file, "rb"); 241 | fo = fopen(output_file, "wb"); 242 | word[0] = 0; 
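  // Streaming pass over the training text: for each adjacent pair (last_word, word),
  // look up the unigram counts pa and pb and the bigram count pab, compute the score
  // described above, and write '_' instead of ' ' before the current word whenever
  // the score exceeds the threshold. pb is reset to 0 after a join, so a word that
  // was just merged cannot immediately start another phrase within the same pass.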
243 | while (1) { 244 | strcpy(last_word, word); 245 | ReadWord(word, fin, &eof); 246 | if (eof) break; 247 | if (word[0] == '\n') { 248 | //fprintf(fo, "\n"); 249 | fputc_unlocked('\n', fo); 250 | continue; 251 | } 252 | cn++; 253 | if ((debug_mode > 1) && (cn % 1000000 == 0)) { 254 | printf("Words written: %lldM%c", cn / 1000000, 13); 255 | fflush(stdout); 256 | } 257 | oov = 0; 258 | i = SearchVocab(word); 259 | if (i == -1) oov = 1; else pb = vocab[i].cn; 260 | if (li == -1) oov = 1; 261 | li = i; 262 | //sprintf(bigram_word, "%s_%s", last_word, word); 263 | a = 0; 264 | b = 0; 265 | while (last_word[a]) { 266 | bigram_word[b] = last_word[a]; 267 | a++; 268 | b++; 269 | } 270 | bigram_word[b] = '_'; 271 | b++; 272 | a = 0; 273 | while (word[a]) { 274 | bigram_word[b] = word[a]; 275 | a++; 276 | b++; 277 | } 278 | bigram_word[b] = 0; 279 | bigram_word[MAX_STRING - 1] = 0; 280 | // 281 | i = SearchVocab(bigram_word); 282 | if (i == -1) oov = 1; else pab = vocab[i].cn; 283 | if (pa < min_count) oov = 1; 284 | if (pb < min_count) oov = 1; 285 | if (oov) score = 0; else score = (pab - min_count) / (real)pa / (real)pb * (real)train_words; 286 | next_random = next_random * (unsigned long long)25214903917 + 11; 287 | //if (next_random & 0x10000) score = 0; 288 | if (score > threshold) { 289 | fputc_unlocked('_', fo); 290 | pb = 0; 291 | } else fputc_unlocked(' ', fo); 292 | a = 0; 293 | while (word[a]) { 294 | fputc_unlocked(word[a], fo); 295 | a++; 296 | } 297 | pa = pb; 298 | } 299 | fclose(fo); 300 | fclose(fin); 301 | } 302 | 303 | int ArgPos(char *str, int argc, char **argv) { 304 | int a; 305 | for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) { 306 | if (a == argc - 1) { 307 | printf("Argument missing for %s\n", str); 308 | exit(1); 309 | } 310 | return a; 311 | } 312 | return -1; 313 | } 314 | 315 | int main(int argc, char **argv) { 316 | int i; 317 | if (argc == 1) { 318 | printf("WORD2PHRASE tool v0.1a\n\n"); 319 | printf("Options:\n"); 320 | printf("Parameters for training:\n"); 321 | printf("\t-train \n"); 322 | printf("\t\tUse text data from to train the model\n"); 323 | printf("\t-output \n"); 324 | printf("\t\tUse to save the resulting word vectors / word clusters / phrases\n"); 325 | printf("\t-min-count \n"); 326 | printf("\t\tThis will discard words that appear less than times; default is 5\n"); 327 | printf("\t-threshold \n"); 328 | printf("\t\t The value represents threshold for forming the phrases (higher means less phrases); default 100\n"); 329 | printf("\t-debug \n"); 330 | printf("\t\tSet the debug mode (default = 2 = more info during training)\n"); 331 | printf("\nExamples:\n"); 332 | printf("./word2phrase -train text.txt -output phrases.txt -threshold 100 -debug 2\n\n"); 333 | return 0; 334 | } 335 | if ((i = ArgPos((char *)"-train", argc, argv)) > 0) strcpy(train_file, argv[i + 1]); 336 | if ((i = ArgPos((char *)"-debug", argc, argv)) > 0) debug_mode = atoi(argv[i + 1]); 337 | if ((i = ArgPos((char *)"-output", argc, argv)) > 0) strcpy(output_file, argv[i + 1]); 338 | if ((i = ArgPos((char *)"-min-count", argc, argv)) > 0) min_count = atoi(argv[i + 1]); 339 | if ((i = ArgPos((char *)"-threshold", argc, argv)) > 0) threshold = atof(argv[i + 1]); 340 | vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word)); 341 | vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int)); 342 | TrainModel(); 343 | return 0; 344 | } 345 | -------------------------------------------------------------------------------- /word2vec.c: 
-------------------------------------------------------------------------------- 1 | // Copyright 2013 Google Inc. All Rights Reserved. 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // http://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | 21 | #define MAX_STRING 100 22 | #define EXP_TABLE_SIZE 1000 23 | #define MAX_EXP 6 24 | #define MAX_SENTENCE_LENGTH 1000 25 | #define MAX_CODE_LENGTH 40 26 | 27 | const int vocab_hash_size = 30000000; // Maximum 30 * 0.7 = 21M words in the vocabulary 28 | 29 | typedef float real; // Precision of float numbers 30 | 31 | struct vocab_word { 32 | long long cn; 33 | int *point; 34 | char *word, *code, codelen; 35 | }; 36 | 37 | char train_file[MAX_STRING], output_file[MAX_STRING]; 38 | char save_vocab_file[MAX_STRING], read_vocab_file[MAX_STRING]; 39 | struct vocab_word *vocab; 40 | int binary = 0, cbow = 1, debug_mode = 2, window = 5, min_count = 5, num_threads = 12, min_reduce = 1; 41 | int *vocab_hash; 42 | long long vocab_max_size = 1000, vocab_size = 0, layer1_size = 100; 43 | long long train_words = 0, word_count_actual = 0, iter = 5, file_size = 0, classes = 0; 44 | real alpha = 0.025, starting_alpha, sample = 1e-3; 45 | real *syn0, *syn1, *syn1neg, *expTable; 46 | clock_t start; 47 | 48 | int hs = 0, negative = 5; 49 | const int table_size = 1e8; 50 | int *table; 51 | 52 | void InitUnigramTable() { 53 | int a, i; 54 | double train_words_pow = 0; 55 | double d1, power = 0.75; 56 | table = (int *)malloc(table_size * sizeof(int)); 57 | for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power); 58 | i = 0; 59 | d1 = pow(vocab[i].cn, power) / train_words_pow; 60 | for (a = 0; a < table_size; a++) { 61 | table[a] = i; 62 | if (a / (double)table_size > d1) { 63 | i++; 64 | d1 += pow(vocab[i].cn, power) / train_words_pow; 65 | } 66 | if (i >= vocab_size) i = vocab_size - 1; 67 | } 68 | } 69 | 70 | // Reads a single word from a file, assuming space + tab + EOL to be word boundaries 71 | void ReadWord(char *word, FILE *fin, char *eof) { 72 | int a = 0, ch; 73 | while (1) { 74 | ch = fgetc_unlocked(fin); 75 | if (ch == EOF) { 76 | *eof = 1; 77 | break; 78 | } 79 | if (ch == 13) continue; 80 | if ((ch == ' ') || (ch == '\t') || (ch == '\n')) { 81 | if (a > 0) { 82 | if (ch == '\n') ungetc(ch, fin); 83 | break; 84 | } 85 | if (ch == '\n') { 86 | strcpy(word, (char *)""); 87 | return; 88 | } else continue; 89 | } 90 | word[a] = ch; 91 | a++; 92 | if (a >= MAX_STRING - 1) a--; // Truncate too long words 93 | } 94 | word[a] = 0; 95 | } 96 | 97 | // Returns hash value of a word 98 | int GetWordHash(char *word) { 99 | unsigned long long a, hash = 0; 100 | for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a]; 101 | hash = hash % vocab_hash_size; 102 | return hash; 103 | } 104 | 105 | // Returns position of a word in the vocabulary; if the word is not found, returns -1 106 | int SearchVocab(char *word) { 107 | unsigned int hash = GetWordHash(word); 108 | while (1) { 109 
| if (vocab_hash[hash] == -1) return -1; 110 | if (!strcmp(word, vocab[vocab_hash[hash]].word)) return vocab_hash[hash]; 111 | hash = (hash + 1) % vocab_hash_size; 112 | } 113 | return -1; 114 | } 115 | 116 | // Reads a word and returns its index in the vocabulary 117 | int ReadWordIndex(FILE *fin, char *eof) { 118 | char word[MAX_STRING], eof_l = 0; 119 | ReadWord(word, fin, &eof_l); 120 | if (eof_l) { 121 | *eof = 1; 122 | return -1; 123 | } 124 | return SearchVocab(word); 125 | } 126 | 127 | // Adds a word to the vocabulary 128 | int AddWordToVocab(char *word) { 129 | unsigned int hash, length = strlen(word) + 1; 130 | if (length > MAX_STRING) length = MAX_STRING; 131 | vocab[vocab_size].word = (char *)calloc(length, sizeof(char)); 132 | strcpy(vocab[vocab_size].word, word); 133 | vocab[vocab_size].cn = 0; 134 | vocab_size++; 135 | // Reallocate memory if needed 136 | if (vocab_size + 2 >= vocab_max_size) { 137 | vocab_max_size += 1000; 138 | vocab = (struct vocab_word *)realloc(vocab, vocab_max_size * sizeof(struct vocab_word)); 139 | } 140 | hash = GetWordHash(word); 141 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 142 | vocab_hash[hash] = vocab_size - 1; 143 | return vocab_size - 1; 144 | } 145 | 146 | // Used later for sorting by word counts 147 | int VocabCompare(const void *a, const void *b) { 148 | long long l = ((struct vocab_word *)b)->cn - ((struct vocab_word *)a)->cn; 149 | if (l > 0) return 1; 150 | if (l < 0) return -1; 151 | return 0; 152 | } 153 | 154 | // Sorts the vocabulary by frequency using word counts 155 | void SortVocab() { 156 | int a, size; 157 | unsigned int hash; 158 | // Sort the vocabulary and keep at the first position 159 | qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); 160 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 161 | size = vocab_size; 162 | train_words = 0; 163 | for (a = 0; a < size; a++) { 164 | // Words occuring less than min_count times will be discarded from the vocab 165 | if ((vocab[a].cn < min_count) && (a != 0)) { 166 | vocab_size--; 167 | free(vocab[a].word); 168 | } else { 169 | // Hash will be re-computed, as after the sorting it is not actual 170 | hash=GetWordHash(vocab[a].word); 171 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 172 | vocab_hash[hash] = a; 173 | train_words += vocab[a].cn; 174 | } 175 | } 176 | vocab = (struct vocab_word *)realloc(vocab, (vocab_size + 1) * sizeof(struct vocab_word)); 177 | // Allocate memory for the binary tree construction 178 | for (a = 0; a < vocab_size; a++) { 179 | vocab[a].code = (char *)calloc(MAX_CODE_LENGTH, sizeof(char)); 180 | vocab[a].point = (int *)calloc(MAX_CODE_LENGTH, sizeof(int)); 181 | } 182 | } 183 | 184 | // Reduces the vocabulary by removing infrequent tokens 185 | void ReduceVocab() { 186 | int a, b = 0; 187 | unsigned int hash; 188 | for (a = 0; a < vocab_size; a++) if (vocab[a].cn > min_reduce) { 189 | vocab[b].cn = vocab[a].cn; 190 | vocab[b].word = vocab[a].word; 191 | b++; 192 | } else free(vocab[a].word); 193 | vocab_size = b; 194 | for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1; 195 | for (a = 0; a < vocab_size; a++) { 196 | // Hash will be re-computed, as it is not actual 197 | hash = GetWordHash(vocab[a].word); 198 | while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size; 199 | vocab_hash[hash] = a; 200 | } 201 | fflush(stdout); 202 | min_reduce++; 203 | } 204 | 205 | // Create binary Huffman tree using the word counts 206 | // Frequent words will have short 
// Create binary Huffman tree using the word counts
// Frequent words will have short unique binary codes
void CreateBinaryTree() {
  long long a, b, i, min1i, min2i, pos1, pos2, point[MAX_CODE_LENGTH];
  char code[MAX_CODE_LENGTH];
  long long *count = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long));
  long long *binary = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long));
  long long *parent_node = (long long *)calloc(vocab_size * 2 + 1, sizeof(long long));
  for (a = 0; a < vocab_size; a++) count[a] = vocab[a].cn;
  for (a = vocab_size; a < vocab_size * 2; a++) count[a] = 1e15;
  pos1 = vocab_size - 1;
  pos2 = vocab_size;
  // Following algorithm constructs the Huffman tree by adding one node at a time
  for (a = 0; a < vocab_size - 1; a++) {
    // First, find two smallest nodes 'min1, min2'
    if (pos1 >= 0) {
      if (count[pos1] < count[pos2]) {
        min1i = pos1;
        pos1--;
      } else {
        min1i = pos2;
        pos2++;
      }
    } else {
      min1i = pos2;
      pos2++;
    }
    if (pos1 >= 0) {
      if (count[pos1] < count[pos2]) {
        min2i = pos1;
        pos1--;
      } else {
        min2i = pos2;
        pos2++;
      }
    } else {
      min2i = pos2;
      pos2++;
    }
    count[vocab_size + a] = count[min1i] + count[min2i];
    parent_node[min1i] = vocab_size + a;
    parent_node[min2i] = vocab_size + a;
    binary[min2i] = 1;
  }
  // Now assign binary code to each vocabulary word
  for (a = 0; a < vocab_size; a++) {
    b = a;
    i = 0;
    while (1) {
      code[i] = binary[b];
      point[i] = b;
      i++;
      b = parent_node[b];
      if (b == vocab_size * 2 - 2) break;
    }
    vocab[a].codelen = i;
    vocab[a].point[0] = vocab_size - 2;
    for (b = 0; b < i; b++) {
      vocab[a].code[i - b - 1] = code[b];
      vocab[a].point[i - b] = point[b] - vocab_size;
    }
  }
  free(count);
  free(binary);
  free(parent_node);
}

void LearnVocabFromTrainFile() {
  char word[MAX_STRING], eof = 0;
  FILE *fin;
  long long a, i, wc = 0;
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  fin = fopen(train_file, "rb");
  if (fin == NULL) {
    printf("ERROR: training data file not found!\n");
    exit(1);
  }
  vocab_size = 0;
  AddWordToVocab((char *)"</s>");
  while (1) {
    ReadWord(word, fin, &eof);
    if (eof) break;
    train_words++;
    wc++;
    if ((debug_mode > 1) && (wc >= 1000000)) {
      printf("%lldM%c", train_words / 1000000, 13);
      fflush(stdout);
      wc = 0;
    }
    i = SearchVocab(word);
    if (i == -1) {
      a = AddWordToVocab(word);
      vocab[a].cn = 1;
    } else vocab[i].cn++;
    if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
  }
  SortVocab();
  if (debug_mode > 0) {
    printf("Vocab size: %lld\n", vocab_size);
    printf("Words in train file: %lld\n", train_words);
  }
  file_size = ftell(fin);
  fclose(fin);
}

void SaveVocab() {
  long long i;
  FILE *fo = fopen(save_vocab_file, "wb");
  for (i = 0; i < vocab_size; i++) fprintf(fo, "%s %lld\n", vocab[i].word, vocab[i].cn);
  fclose(fo);
}
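/*
 * The vocabulary file written by SaveVocab (and consumed by ReadVocab below)
 * is plain text with one "word count" pair per line, in descending count order
 * and with the sentence delimiter </s> on the first line. Illustrative example
 * (the words and counts here are made up):
 *
 *   </s> 123456
 *   the 98765
 *   of 54321
 *
 * ReadVocab parses exactly this layout: ReadWord picks up the token and
 * fscanf("%lld%c") picks up the count plus the trailing newline.
 */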
void ReadVocab() {
  long long a, i = 0;
  char c, eof = 0;
  char word[MAX_STRING];
  FILE *fin = fopen(read_vocab_file, "rb");
  if (fin == NULL) {
    printf("Vocabulary file not found\n");
    exit(1);
  }
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  vocab_size = 0;
  while (1) {
    ReadWord(word, fin, &eof);
    if (eof) break;
    a = AddWordToVocab(word);
    fscanf(fin, "%lld%c", &vocab[a].cn, &c);
    i++;
  }
  SortVocab();
  if (debug_mode > 0) {
    printf("Vocab size: %lld\n", vocab_size);
    printf("Words in train file: %lld\n", train_words);
  }
  fin = fopen(train_file, "rb");
  if (fin == NULL) {
    printf("ERROR: training data file not found!\n");
    exit(1);
  }
  fseek(fin, 0, SEEK_END);
  file_size = ftell(fin);
  fclose(fin);
}

void InitNet() {
  long long a, b;
  unsigned long long next_random = 1;
  a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));
  if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);}
  if (hs) {
    a = posix_memalign((void **)&syn1, 128, (long long)vocab_size * layer1_size * sizeof(real));
    if (syn1 == NULL) {printf("Memory allocation failed\n"); exit(1);}
    for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
      syn1[a * layer1_size + b] = 0;
  }
  if (negative > 0) {
    a = posix_memalign((void **)&syn1neg, 128, (long long)vocab_size * layer1_size * sizeof(real));
    if (syn1neg == NULL) {printf("Memory allocation failed\n"); exit(1);}
    for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++)
      syn1neg[a * layer1_size + b] = 0;
  }
  for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] = (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
  }
  CreateBinaryTree();
}
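/*
 * Initialization details: syn0 (the input embeddings) is filled with small
 * uniform values in [-0.5/layer1_size, 0.5/layer1_size) produced by the same
 * linear congruential recurrence (multiplier 25214903917, increment 11) that
 * drives every random choice during training; syn1 / syn1neg (the output
 * weights for hierarchical softmax / negative sampling) start at zero.
 *
 * In the training loop below, the learning rate decays linearly from the
 * starting alpha down to a floor of 0.01% of it, and frequent words are
 * subsampled: a word with count cn is kept with probability
 *   (sqrt(cn / (sample * train_words)) + 1) * (sample * train_words) / cn,
 * capped at 1. Rough illustrative numbers, assuming sample = 1e-3: a word
 * making up 1% of the corpus is kept about 42% of the time, while a word at
 * 0.01% of the corpus is always kept.
 */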
void *TrainModelThread(void *id) {
  long long a, b, d, cw, word, last_word, sentence_length = 0, sentence_position = 0;
  long long word_count = 0, last_word_count = 0, sen[MAX_SENTENCE_LENGTH + 1];
  long long l1, l2, c, target, label, local_iter = iter;
  unsigned long long next_random = (long long)id;
  char eof = 0;
  real f, g;
  clock_t now;
  real *neu1 = (real *)calloc(layer1_size, sizeof(real));
  real *neu1e = (real *)calloc(layer1_size, sizeof(real));
  FILE *fi = fopen(train_file, "rb");
  fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
  while (1) {
    if (word_count - last_word_count > 10000) {
      word_count_actual += word_count - last_word_count;
      last_word_count = word_count;
      if ((debug_mode > 1)) {
        now = clock();
        printf("%cAlpha: %f Progress: %.2f%% Words/thread/sec: %.2fk ", 13, alpha,
         word_count_actual / (real)(iter * train_words + 1) * 100,
         word_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000));
        fflush(stdout);
      }
      alpha = starting_alpha * (1 - word_count_actual / (real)(iter * train_words + 1));
      if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001;
    }
    if (sentence_length == 0) {
      while (1) {
        word = ReadWordIndex(fi, &eof);
        if (eof) break;
        if (word == -1) continue;
        word_count++;
        if (word == 0) break;
        // The subsampling randomly discards frequent words while keeping the ranking the same
        if (sample > 0) {
          real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
          next_random = next_random * (unsigned long long)25214903917 + 11;
          if (ran < (next_random & 0xFFFF) / (real)65536) continue;
        }
        sen[sentence_length] = word;
        sentence_length++;
        if (sentence_length >= MAX_SENTENCE_LENGTH) break;
      }
      sentence_position = 0;
    }
    if (eof || (word_count > train_words / num_threads)) {
      word_count_actual += word_count - last_word_count;
      local_iter--;
      if (local_iter == 0) break;
      word_count = 0;
      last_word_count = 0;
      sentence_length = 0;
      fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);
      continue;
    }
    word = sen[sentence_position];
    if (word == -1) continue;
    for (c = 0; c < layer1_size; c++) neu1[c] = 0;
    for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
    next_random = next_random * (unsigned long long)25214903917 + 11;
    b = next_random % window;
    if (cbow) {  //train the cbow architecture
      // in -> hidden
      cw = 0;
      for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
        c = sentence_position - window + a;
        if (c < 0) continue;
        if (c >= sentence_length) continue;
        last_word = sen[c];
        if (last_word == -1) continue;
        for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
        cw++;
      }
      if (cw) {
        for (c = 0; c < layer1_size; c++) neu1[c] /= cw;
        if (hs) for (d = 0; d < vocab[word].codelen; d++) {
          f = 0;
          l2 = vocab[word].point[d] * layer1_size;
          // Propagate hidden -> output
          for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2];
          if (f <= -MAX_EXP) continue;
          else if (f >= MAX_EXP) continue;
          else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
          // 'g' is the gradient multiplied by the learning rate
          g = (1 - vocab[word].code[d] - f) * alpha;
          // Propagate errors output -> hidden
          for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
          // Learn weights hidden -> output
          for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
        }
        // NEGATIVE SAMPLING
        if (negative > 0) for (d = 0; d < negative + 1; d++) {
          if (d == 0) {
            target = word;
            label = 1;
          } else {
            next_random = next_random * (unsigned long long)25214903917 + 11;
            target = table[(next_random >> 16) % table_size];
            if (target == 0) target = next_random % (vocab_size - 1) + 1;
            if (target == word) continue;
            label = 0;
          }
          l2 = target * layer1_size;
          f = 0;
          for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2];
          if (f > MAX_EXP) g = (label - 1) * alpha;
          else if (f < -MAX_EXP) g = (label - 0) * alpha;
          else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
          for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
          for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c];
        }
        // hidden -> in
        for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
          c = sentence_position - window + a;
          if (c < 0) continue;
          if (c >= sentence_length) continue;
          last_word = sen[c];
          if (last_word == -1) continue;
          for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c];
        }
      }
    } else {  //train skip-gram
      for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
        c = sentence_position - window + a;
        if (c < 0) continue;
        if (c >= sentence_length) continue;
        last_word = sen[c];
        if (last_word == -1) continue;
        l1 = last_word * layer1_size;
        for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
        // HIERARCHICAL SOFTMAX
        if (hs) for (d = 0; d < vocab[word].codelen; d++) {
          f = 0;
          l2 = vocab[word].point[d] * layer1_size;
          // Propagate hidden -> output
          for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
          if (f <= -MAX_EXP) continue;
          else if (f >= MAX_EXP) continue;
          else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
          // 'g' is the gradient multiplied by the learning rate
          g = (1 - vocab[word].code[d] - f) * alpha;
          // Propagate errors output -> hidden
          for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
          // Learn weights hidden -> output
          for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
        }
        // NEGATIVE SAMPLING
        if (negative > 0) for (d = 0; d < negative + 1; d++) {
          if (d == 0) {
            target = word;
            label = 1;
          } else {
            next_random = next_random * (unsigned long long)25214903917 + 11;
            target = table[(next_random >> 16) % table_size];
            if (target == 0) target = next_random % (vocab_size - 1) + 1;
            if (target == word) continue;
            label = 0;
          }
          l2 = target * layer1_size;
          f = 0;
          for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
          if (f > MAX_EXP) g = (label - 1) * alpha;
          else if (f < -MAX_EXP) g = (label - 0) * alpha;
          else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
          for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
          for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
        }
        // Learn weights input -> hidden
        for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
      }
    }
    sentence_position++;
    if (sentence_position >= sentence_length) {
      sentence_length = 0;
      continue;
    }
  }
  fclose(fi);
  free(neu1);
  free(neu1e);
  pthread_exit(NULL);
}
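/*
 * Threading model: TrainModel below spawns num_threads copies of
 * TrainModelThread, each seeking to its own slice of the training file
 * (file_size / num_threads * id) and processing roughly train_words / num_threads
 * words per iteration. The weight matrices syn0, syn1 and syn1neg are shared and
 * updated without any locking, in the spirit of asynchronous ("Hogwild"-style)
 * SGD; occasional lost updates are accepted in exchange for near-linear speedup.
 */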
void TrainModel() {
  long a, b, c, d;
  FILE *fo;
  pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));
  printf("Starting training using file %s\n", train_file);
  starting_alpha = alpha;
  if (read_vocab_file[0] != 0) ReadVocab(); else LearnVocabFromTrainFile();
  if (save_vocab_file[0] != 0) SaveVocab();
  if (output_file[0] == 0) return;
  InitNet();
  if (negative > 0) InitUnigramTable();
  start = clock();
  for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a);
  for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);
  fo = fopen(output_file, "wb");
  if (classes == 0) {
    // Save the word vectors
    fprintf(fo, "%lld %lld\n", vocab_size, layer1_size);
    for (a = 0; a < vocab_size; a++) {
      fprintf(fo, "%s ", vocab[a].word);
      if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo);
      else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]);
      fprintf(fo, "\n");
    }
  } else {
    // Run K-means on the word vectors
    int clcn = classes, iter = 10, closeid;
    int *centcn = (int *)malloc(classes * sizeof(int));
    int *cl = (int *)calloc(vocab_size, sizeof(int));
    real closev, x;
    real *cent = (real *)calloc(classes * layer1_size, sizeof(real));
    for (a = 0; a < vocab_size; a++) cl[a] = a % clcn;
    for (a = 0; a < iter; a++) {
      for (b = 0; b < clcn * layer1_size; b++) cent[b] = 0;
      for (b = 0; b < clcn; b++) centcn[b] = 1;
      for (c = 0; c < vocab_size; c++) {
        for (d = 0; d < layer1_size; d++) cent[layer1_size * cl[c] + d] += syn0[c * layer1_size + d];
        centcn[cl[c]]++;
      }
      for (b = 0; b < clcn; b++) {
        closev = 0;
        for (c = 0; c < layer1_size; c++) {
          cent[layer1_size * b + c] /= centcn[b];
          closev += cent[layer1_size * b + c] * cent[layer1_size * b + c];
        }
        closev = sqrt(closev);
        for (c = 0; c < layer1_size; c++) cent[layer1_size * b + c] /= closev;
      }
      for (c = 0; c < vocab_size; c++) {
        closev = -10;
        closeid = 0;
        for (d = 0; d < clcn; d++) {
          x = 0;
          for (b = 0; b < layer1_size; b++) x += cent[layer1_size * d + b] * syn0[c * layer1_size + b];
          if (x > closev) {
            closev = x;
            closeid = d;
          }
        }
        cl[c] = closeid;
      }
    }
    // Save the K-means classes
    for (a = 0; a < vocab_size; a++) fprintf(fo, "%s %d\n", vocab[a].word, cl[a]);
    free(centcn);
    free(cent);
    free(cl);
  }
  fclose(fo);
}
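/*
 * Output format written above: the first line holds "vocab_size layer1_size",
 * then each line has the word, a space, and either layer1_size raw floats
 * (binary mode) or layer1_size space-separated decimals (text mode). A minimal
 * sketch of a text-mode reader; this helper is not part of the original tool
 * and loads one vector at a time:
 *
 *   FILE *f = fopen("vec.txt", "r");
 *   long long words, size;
 *   fscanf(f, "%lld %lld", &words, &size);
 *   char w[MAX_STRING];
 *   float *vec = (float *)malloc(size * sizeof(float));
 *   for (long long i = 0; i < words; i++) {
 *     fscanf(f, "%s", w);
 *     for (long long j = 0; j < size; j++) fscanf(f, "%f", &vec[j]);
 *     // ... use w and vec here ...
 *   }
 *   free(vec); fclose(f);
 */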
int ArgPos(char *str, int argc, char **argv) {
  int a;
  for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) {
    if (a == argc - 1) {
      printf("Argument missing for %s\n", str);
      exit(1);
    }
    return a;
  }
  return -1;
}

int main(int argc, char **argv) {
  int i;
  if (argc == 1) {
    printf("WORD VECTOR estimation toolkit v 0.1c\n\n");
    printf("Options:\n");
    printf("Parameters for training:\n");
    printf("\t-train <file>\n");
    printf("\t\tUse text data from <file> to train the model\n");
    printf("\t-output <file>\n");
    printf("\t\tUse <file> to save the resulting word vectors / word clusters\n");
    printf("\t-size <int>\n");
    printf("\t\tSet size of word vectors; default is 100\n");
    printf("\t-window <int>\n");
    printf("\t\tSet max skip length between words; default is 5\n");
    printf("\t-sample <float>\n");
    printf("\t\tSet threshold for occurrence of words. Those that appear with higher frequency in the training data\n");
    printf("\t\twill be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)\n");
    printf("\t-hs <int>\n");
    printf("\t\tUse Hierarchical Softmax; default is 0 (not used)\n");
    printf("\t-negative <int>\n");
    printf("\t\tNumber of negative examples; default is 5, common values are 3 - 10 (0 = not used)\n");
    printf("\t-threads <int>\n");
    printf("\t\tUse <int> threads (default 12)\n");
    printf("\t-iter <int>\n");
    printf("\t\tRun more training iterations (default 5)\n");
    printf("\t-min-count <int>\n");
    printf("\t\tThis will discard words that appear less than <int> times; default is 5\n");
    printf("\t-alpha <float>\n");
    printf("\t\tSet the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW\n");
    printf("\t-classes <int>\n");
    printf("\t\tOutput word classes rather than word vectors; default number of classes is 0 (vectors are written)\n");
    printf("\t-debug <int>\n");
    printf("\t\tSet the debug mode (default = 2 = more info during training)\n");
    printf("\t-binary <int>\n");
    printf("\t\tSave the resulting vectors in binary mode; default is 0 (off)\n");
    printf("\t-save-vocab <file>\n");
    printf("\t\tThe vocabulary will be saved to <file>\n");
    printf("\t-read-vocab <file>\n");
    printf("\t\tThe vocabulary will be read from <file>, not constructed from the training data\n");
    printf("\t-cbow <int>\n");
    printf("\t\tUse the continuous bag of words model; default is 1 (use 0 for skip-gram model)\n");
    printf("\nExamples:\n");
    printf("./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3\n\n");
    return 0;
  }
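  /*
   * Argument handling below is order-sensitive in one respect: -cbow is read
   * first and, when non-zero, resets the default learning rate to 0.05; -alpha
   * is read afterwards, so an explicit -alpha on the command line still wins.
   * Every other flag simply overwrites its global default via ArgPos + atoi/atof.
   */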
  output_file[0] = 0;
  save_vocab_file[0] = 0;
  read_vocab_file[0] = 0;
  if ((i = ArgPos((char *)"-size", argc, argv)) > 0) layer1_size = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-train", argc, argv)) > 0) strcpy(train_file, argv[i + 1]);
  if ((i = ArgPos((char *)"-save-vocab", argc, argv)) > 0) strcpy(save_vocab_file, argv[i + 1]);
  if ((i = ArgPos((char *)"-read-vocab", argc, argv)) > 0) strcpy(read_vocab_file, argv[i + 1]);
  if ((i = ArgPos((char *)"-debug", argc, argv)) > 0) debug_mode = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-binary", argc, argv)) > 0) binary = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-cbow", argc, argv)) > 0) cbow = atoi(argv[i + 1]);
  if (cbow) alpha = 0.05;
  if ((i = ArgPos((char *)"-alpha", argc, argv)) > 0) alpha = atof(argv[i + 1]);
  if ((i = ArgPos((char *)"-output", argc, argv)) > 0) strcpy(output_file, argv[i + 1]);
  if ((i = ArgPos((char *)"-window", argc, argv)) > 0) window = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-sample", argc, argv)) > 0) sample = atof(argv[i + 1]);
  if ((i = ArgPos((char *)"-hs", argc, argv)) > 0) hs = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-negative", argc, argv)) > 0) negative = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-threads", argc, argv)) > 0) num_threads = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-iter", argc, argv)) > 0) iter = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-min-count", argc, argv)) > 0) min_count = atoi(argv[i + 1]);
  if ((i = ArgPos((char *)"-classes", argc, argv)) > 0) classes = atoi(argv[i + 1]);
  vocab = (struct vocab_word *)calloc(vocab_max_size, sizeof(struct vocab_word));
  vocab_hash = (int *)calloc(vocab_hash_size, sizeof(int));
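  /*
   * The table below precomputes the logistic function so training never calls
   * exp() in the inner loops: entry i stores sigma(x) = e^x / (e^x + 1) for x
   * spread uniformly over (-MAX_EXP, MAX_EXP), and a dot product f is later
   * looked up via the index (f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2).
   * For example, with the usual constants EXP_TABLE_SIZE = 1000 and MAX_EXP = 6
   * (defined near the top of this file), f = 0 maps to index (0 + 6) * 83 = 498,
   * roughly the middle of the table, where sigma is about 0.5.
   */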
  expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
  for (i = 0; i < EXP_TABLE_SIZE; i++) {
    expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table
    expTable[i] = expTable[i] / (expTable[i] + 1);                   // Precompute f(x) = x / (x + 1)
  }
  TrainModel();
  return 0;
}
--------------------------------------------------------------------------------