├── README.md
├── glove
    ├── glove.c
    └── makefile
├── ngram2vec
    ├── analogy_eval.py
    ├── corpus2pairs.py
    ├── corpus2vocab.py
    ├── corpus2vocab_multiproc.py
    ├── counts2glove.py
    ├── counts2ppmi.py
    ├── distance.py
    ├── eval
    │   ├── __init__.py
    │   ├── recast.py
    │   ├── similarity.py
    │   └── testset.py
    ├── line2pairs.py
    ├── line2vocab.py
    ├── pairs2counts.py
    ├── pairs2sgns.py
    ├── pairs2vocab.py
    ├── ppmi2svd.py
    ├── shuffle.py
    ├── similarity_eval.py
    └── utils
    │   ├── __init__.py
    │   ├── constants.py
    │   ├── matrix.py
    │   ├── misc.py
    │   └── vocabulary.py
├── ngram_example.sh
├── scripts
    ├── clean_corpus.sh
    └── compile_c.py
├── testsets
    ├── analogy
    │   ├── google.txt
    │   ├── msr.txt
    │   ├── semantic.txt
    │   └── syntactic.txt
    ├── analogy_zh
    │   ├── ca-translated
    │   │   ├── capital-common-countries.txt
    │   │   ├── city-in-state.txt
    │   │   └── family.txt
    │   └── ca8
    │   │   ├── morphological
    │   │       ├── A.txt
    │   │       ├── AB.txt
    │   │       ├── prefix.txt
    │   │       └── suffix.txt
    │   │   └── semantic
    │   │       ├── geography.txt
    │   │       ├── history.txt
    │   │       ├── nature.txt
    │   │       └── people.txt
    └── similarity
    │   ├── bruni_men.txt
    │   ├── luong_rare.txt
    │   ├── radinsky_mturk.txt
    │   ├── sim999.txt
    │   ├── ws353.txt
    │   ├── ws353_relatedness.txt
    │   └── ws353_similarity.txt
├── word2vec
    ├── makefile
    └── word2vec.c
├── word_example.sh
└── workflow.jpg


/README.md:
--------------------------------------------------------------------------------
 1 | # Ngram2vec
 2 | The ngram2vec toolkit was originally built to reproduce the results of the paper
 3 | <a href="http://www.aclweb.org/anthology/D17-1023"><em>Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics</em></a>,
 4 | which aims to learn high-quality word embeddings and ngram embeddings.
 5 | 
 6 | Thanks to its well-designed architecture (described under Features below), the ngram2vec toolkit provides a general and powerful framework that covers the methods of many papers and popular toolkits such as word2vec. It lets researchers learn representations from co-occurrence statistics easily, and it can generate embeddings at granularities beyond the word. For example, the toolkit can be used to learn text embeddings. Text embeddings trained by ngram2vec are very competitive: they outperform many deep and complex neural networks and achieve state-of-the-art results on a range of datasets. More details will be released later.
 7 | 
 8 | Ngram2vec has been successfully applied in many projects. For example, <a href="https://github.com/Embedding/Chinese-Word-Vectors"><em>Chinese-Word-Vectors</em></a> provides over 100 Chinese word embeddings with different properties, all trained with the ngram2vec toolkit.
 9 | 
10 | The original version (v0.0.0) of ngram2vec can be downloaded from the GitHub releases page. Python 2 is recommended. Download ngram2vec v0.0.0 to reproduce the results.
11 | 
12 | ## Features
13 | Ngram2vec features a decoupled architecture: the process from raw corpus to final embeddings is split into multiple modules. This brings several advantages over other toolkits.
14 | * Well-organized: The ngram2vec toolkit is easy to read and understand.
15 | * Extensibility: One can add co-occurrence statistics and embedding models with little effort.
16 | * Reuse of intermediate results: Intermediate results are written to disk and reused later, which greatly improves efficiency in both time and space.
17 | * Comprehensive: Ngram2vec covers a large body of work related to word embeddings.
18 | * Embeddings of different linguistic units: Ngram2vec can learn embeddings of different linguistic units. For example, ngram2vec is able to produce high-quality text embeddings which achieve SOTA results on a range of datasets.
19 | 
20 | 
21 | ## Requirements
22 | * Python (both Python2 and 3 are supported)
23 | * numpy
24 | * scipy
25 | * sparsesvd
26 | 
27 | ## Example use cases
28 | 
29 | First, run the following commands to make the shell scripts executable and compile the C code.<br>
30 | `chmod +x *.sh`<br>
31 | `chmod +x scripts/clean_corpus.sh`<br>
32 | `python scripts/compile_c.py`<br>
33 | 
34 | A corpus is also needed. We recommend downloading<br> 
35 | http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2 , a wiki corpus without XML tags. `scripts/clean_corpus.sh` cleans an English corpus.<br> For example: `scripts/clean_corpus.sh WestburyLab.wikicorp.201004.txt > wiki2010.clean`<br>
36 | A pre-processed (including segmentation) Chinese wiki corpus is available at https://pan.baidu.com/s/1kURV0rl and can be used directly as input to this toolkit.
37 | 
38 | Run `./word_example.sh` to see the baselines.<br>
39 | Run `./ngram_example.sh` to introduce ngrams into recent word representation methods, an idea inspired by traditional language modeling.<br>
40 | 
41 | ## Workflow
42 | 
43 | <img src="https://github.com/zhezhaoa/ngram2vec/blob/master/workflow.jpg" width = "600" align=center />
44 | 
45 | ## Testsets
46 | 
47 | Besides English word analogy and similarity datasets, we provide several **Chinese** analogy datasets that contain comprehensive analogy questions. Some are constructed by directly translating English analogy datasets; others are unique to Chinese. I hope they will become useful resources for evaluating Chinese word embeddings.
48 | 
49 | 
50 | ## References
51 | 
52 |     @inproceedings{DBLP:conf/emnlp/ZhaoLLLD17,
53 |          author = {Zhe Zhao and Tao Liu and Shen Li and Bofang Li and Xiaoyong Du},
54 |          title = {Ngram2vec: Learning Improved Word Representations from Ngram Co-occurrence Statistics},   
55 |          booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2017, Copenhagen, Denmark, September 9-11, 2017},      
56 |          year = {2017}
57 |      }
58 | 
59 | 
60 | ## Acknowledgments
61 | 
62 | This toolkit is inspired by Omer Levy's work http://bitbucket.org/omerlevy/hyperwords<br>
63 | We reuse part of his code in this toolkit. We also thank him for his kind suggestions.<br>
64 | I also received help from Bofang Li, [Prof. Ju Fan](http://info.ruc.edu.cn/academic_professor.php?teacher_id=162), and Jianwei Cui at Xiaomi.<br>
65 | My tutors are [Tao Liu](http://info.ruc.edu.cn/academic_professor.php?teacher_id=46) and [Xiaoyong Du](http://info.ruc.edu.cn/academic_professor.php?teacher_id=57)
66 | 
67 | ## Contact us
68 | 
69 | We look forward to receiving your questions and advice on this toolkit and will reply as soon as possible. We will continue to improve the toolkit.<br>  
70 | [Zhe Zhao](https://zhezhaoa.github.io/), helloworld@ruc.edu.cn, from [DBIIR lab](http://iir.ruc.edu.cn/index.jsp)<br>
71 | Shen Li, shen@mail.bnu.edu.cn<br>
72 | Renfen Hu, irishere@mail.bnu.edu.cn
73 | 
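
The decoupled design described under Features in the README can be illustrated with a toy, in-memory version of the first stages (vocabulary, pairs, counts). This is only a sketch under stated assumptions: the function names `build_vocab`, `extract_pairs`, and `count_pairs` are hypothetical and are not the toolkit's API, and the real scripts write each intermediate result to disk rather than keeping it in memory.

```python
from collections import Counter


def build_vocab(lines, min_count=1):
    """Stage 1 (hypothetical): count tokens, keep those seen at least min_count times."""
    counts = Counter(tok for line in lines for tok in line.split())
    return {w: c for w, c in counts.items() if c >= min_count}


def extract_pairs(lines, vocab, win=2):
    """Stage 2 (hypothetical): yield (word, context) pairs inside a symmetric window."""
    for line in lines:
        toks = [t for t in line.split() if t in vocab]
        for i, word in enumerate(toks):
            lo, hi = max(0, i - win), min(len(toks), i + win + 1)
            for j in range(lo, hi):
                if j != i:
                    yield word, toks[j]


def count_pairs(pairs):
    """Stage 3 (hypothetical): aggregate pairs into co-occurrence counts."""
    return Counter(pairs)


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab = build_vocab(corpus)
counts = count_pairs(extract_pairs(corpus, vocab, win=2))
print(counts[("sat", "on")])
```

Because each stage consumes only the output of the previous one, a different embedding model could reuse the same `counts` without re-reading the corpus, which is the kind of reuse the README lists as an advantage.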


--------------------------------------------------------------------------------
/glove/glove.c:
--------------------------------------------------------------------------------
  1 | //  GloVe: Global Vectors for Word Representation
  2 | //
  3 | //  Copyright (c) 2014 The Board of Trustees of
  4 | //  The Leland Stanford Junior University. All Rights Reserved.
  5 | //
  6 | //  Licensed under the Apache License, Version 2.0 (the "License");
  7 | //  you may not use this file except in compliance with the License.
  8 | //  You may obtain a copy of the License at
  9 | //
 10 | //      http://www.apache.org/licenses/LICENSE-2.0
 11 | //
 12 | //  Unless required by applicable law or agreed to in writing, software
 13 | //  distributed under the License is distributed on an "AS IS" BASIS,
 14 | //  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 15 | //  See the License for the specific language governing permissions and
 16 | //  limitations under the License.
 17 | //
 18 | //
 19 | //  For more information, bug reports, fixes, contact:
 20 | //    Jeffrey Pennington (jpennin@stanford.edu)
 21 | //    GlobalVectors@googlegroups.com
 22 | //    http://nlp.stanford.edu/projects/glove/
 23 | 
 24 | 
 25 | #include <stdint.h>
 26 | #include <stdio.h>
 27 | #include <stdlib.h>
 28 | #include <string.h>
 29 | #include <math.h>
 30 | #include <pthread.h>
 31 | #include <time.h>
 32 | 
 33 | #define MAX_STRING_LENGTH 1000
 34 | 
 35 | typedef double real;
 36 | 
 37 | typedef struct cooccur_rec {
 38 |     int word1;
 39 |     int word2;
 40 |     real val;
 41 | } CREC;
 42 | 
 43 | int write_header=1; //0=no, 1=yes; writes vocab_size/vector_size as first line for use with some libraries, such as gensim.
 44 | int verbose = 2; // 0, 1, or 2
 45 | int use_unk_vec = 0; // 0 or 1
 46 | int num_threads = 8; // pthreads
 47 | int num_iter = 25; // Number of full passes through cooccurrence matrix
 48 | //!@#$%^&*
 49 | int vector_size = 300; // Word vector size
 50 | int use_binary = 0; // 0: save as text files; 1: save as binary; 2: both. For binary, save both word and context word vectors.
 51 | int model = 1; // For text file output only. 0: concatenate word and context vectors (and biases) i.e. save everything; 1: Just save word vectors (no bias); 2: Save (word + context word) vectors (no biases)
 52 | int checkpoint_every = 0; // checkpoint the model for every checkpoint_every iterations. Do nothing if checkpoint_every <= 0
 53 | real eta = 0.05; // Initial learning rate
 54 | real alpha = 0.75, x_max = 100.0; // Weighting function parameters, not extremely sensitive to corpus, though may need adjustment for very small or very large corpora
 55 | real *W, *gradsq, *cost;
 56 | long long input_vocab_size, output_vocab_size, counts_num; //!@#$%^&*
 57 | char *counts_file, *input_vector_file, *output_vector_file, *input_vocab_file, *output_vocab_file; //!@#$%^&*
 58 | long long file_size = 0; //!@#$%^&*
 59 | /* Efficient string comparison */
 60 | int scmp( char *s1, char *s2 ) {
 61 |     while (*s1 != '\0' && *s1 == *s2) {s1++; s2++;}
 62 |     return(*s1 - *s2);
 63 | }
 64 | 
 65 | void initialize_parameters() {
 66 |     long long a, b;
 67 |     vector_size++; // Temporarily increment to allocate space for bias
 68 | 
 69 |     /* Allocate space for word vectors and context word vectors, and corresponding gradsq */
 70 |     a = posix_memalign((void **)&W, 128, (input_vocab_size + output_vocab_size) * vector_size * sizeof(real)); // Might perform better than malloc
 71 |     if (a != 0 || W == NULL) { // posix_memalign reports failure through its return value
 72 |         fprintf(stderr, "Error allocating memory for W\n");
 73 |         exit(1);
 74 |     }
 75 |     a = posix_memalign((void **)&gradsq, 128, (input_vocab_size + output_vocab_size) * vector_size * sizeof(real)); // Might perform better than malloc
 76 |     if (a != 0 || gradsq == NULL) {
 77 |         fprintf(stderr, "Error allocating memory for gradsq\n");
 78 |         exit(1);
 79 |     }
 80 |     for (b = 0; b < vector_size; b++) for (a = 0; a < input_vocab_size + output_vocab_size; a++) W[a * vector_size + b] = (rand() / (real)RAND_MAX - 0.5) / vector_size;
 81 |     for (b = 0; b < vector_size; b++) for (a = 0; a < input_vocab_size + output_vocab_size; a++) gradsq[a * vector_size + b] = 1.0; // So initial value of eta is equal to initial learning rate
 82 |     vector_size--;
 83 | }
 84 | 
 85 | static inline real check_nan(real update) { // static: a plain C99 inline definition may fail to link when not inlined
 86 |     if (isnan(update) || isinf(update)) {
 87 |         fprintf(stderr,"\ncaught NaN in update");
 88 |         return 0.;
 89 |     } else {
 90 |         return update;
 91 |     }
 92 | }
 93 | 
 94 | /* Train the GloVe model */
 95 | void *glove_thread(void *vid) {
 96 |     long long a, l1, l2;
 97 |     long long id = *(long long*)vid;
 98 |     CREC cr;
 99 |     real diff, fdiff, temp1, temp2;
100 |     FILE *fin;
101 |     fin = fopen(counts_file, "rb");
102 |     cost[id] = 0;
103 |     char c1, c2;
104 |     
105 |     long long start_offset = file_size / (long long)num_threads * (long long)id;
106 |     long long end_offset = file_size / (long long)num_threads * (long long)(id+1);
107 | 
108 |     real* W_updates1 = (real*)malloc(vector_size * sizeof(real));
109 |     real* W_updates2 = (real*)malloc(vector_size * sizeof(real));
110 |     fseek(fin, start_offset, SEEK_SET);
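111 |     { int ch; while ((ch = fgetc(fin)) != '\n' && ch != EOF) { } } // skip to the next record boundary; stop at EOF instead of looping forever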
112 | 
113 |     while (1) {
114 |         if (feof(fin) || ftell(fin) > end_offset) break;
115 | 
116 |         fscanf(fin, "%d%c%d%c%lf", &cr.word1, &c1, &cr.word2, &c2, &cr.val);
117 | 
118 |         if (feof(fin)) break;
119 |         
120 |         /* Get location of words in W & gradsq */
121 |         l1 = cr.word1 * (vector_size + 1); // word index is used directly as the row offset (no 1-based shift is applied here)
122 |         l2 = (cr.word2 + input_vocab_size) * (vector_size + 1); // shift by input_vocab_size to get separate vectors for context words
123 |         
124 |         /* Calculate cost, save diff for gradients */
125 |         diff = 0;
126 |         for (a = 0; a < vector_size; a++) diff += W[a + l1] * W[a + l2]; // dot product of word and context word vector
127 |         diff += W[vector_size + l1] + W[vector_size + l2] - log(cr.val); // add separate bias for each word
128 |         fdiff = (cr.val > x_max) ? diff : pow(cr.val / x_max, alpha) * diff; // multiply weighting function (f) with diff
129 | 
130 |         // Check for NaN and inf() in the diffs.
131 |         if (isnan(diff) || isnan(fdiff) || isinf(diff) || isinf(fdiff)) {
132 |             fprintf(stderr,"Caught NaN in diff for kdiff for thread. Skipping update");
133 |             continue;
134 |         }
135 | 
136 |         cost[id] += 0.5 * fdiff * diff; // weighted squared error
137 |         
138 |         /* Adaptive gradient updates */
139 |         fdiff *= eta; // for ease in calculating gradient
140 |         real W_updates1_sum = 0;
141 |         real W_updates2_sum = 0;
142 |         for (a = 0; a < vector_size; a++) {
143 |             // learning rate times gradient for word vectors
144 |             temp1 = fdiff * W[a + l2];
145 |             temp2 = fdiff * W[a + l1];
146 |             // adaptive updates
147 |             W_updates1[a] = temp1 / sqrt(gradsq[a + l1]);
148 |             W_updates2[a] = temp2 / sqrt(gradsq[a + l2]);
149 |             W_updates1_sum += W_updates1[a];
150 |             W_updates2_sum += W_updates2[a];
151 |             gradsq[a + l1] += temp1 * temp1;
152 |             gradsq[a + l2] += temp2 * temp2;
153 |         }
154 |         if (!isnan(W_updates1_sum) && !isinf(W_updates1_sum) && !isnan(W_updates2_sum) && !isinf(W_updates2_sum)) {
155 |             for (a = 0; a < vector_size; a++) {
156 |                 W[a + l1] -= W_updates1[a];
157 |                 W[a + l2] -= W_updates2[a];
158 |             }
159 |         }
160 |         // updates for bias terms
161 |         W[vector_size + l1] -= check_nan(fdiff / sqrt(gradsq[vector_size + l1]));
162 |         W[vector_size + l2] -= check_nan(fdiff / sqrt(gradsq[vector_size + l2]));
163 |         fdiff *= fdiff;
164 |         gradsq[vector_size + l1] += fdiff;
165 |         gradsq[vector_size + l2] += fdiff;
166 |     }
167 |     free(W_updates1);
168 |     free(W_updates2);
169 |     
170 |     fclose(fin);
171 |     pthread_exit(NULL);
172 | }
173 | 
174 | long long GetFileSize(char *fname) {
175 |   long long fsize;
176 |   FILE *fin = fopen(fname, "rb");
177 |   if (fin == NULL) {
178 |     printf("ERROR: file not found! %s\n", fname);
179 |     exit(1);
180 |   }
181 |   fseek(fin, 0, SEEK_END);
182 |   fsize = ftell(fin);
183 |   fclose(fin);
184 |   return fsize;
185 | }
186 | 
187 | /* Save params to file */
188 | int save_params() {
189 |     long long a, b;
190 |     char format[20];
191 |     char *word = malloc(sizeof(char) * MAX_STRING_LENGTH + 1);
192 |     FILE *fo_input, *fo_output, *fi_input_vocab, *fi_output_vocab;
193 | 
194 |     fo_input = fopen(input_vector_file, "wb");
195 |     fo_output = fopen(output_vector_file, "wb");   
196 |     fi_input_vocab = fopen(input_vocab_file, "r");
197 |     fi_output_vocab = fopen(output_vocab_file, "r");
198 | 
199 |     sprintf(format,"%%%ds",MAX_STRING_LENGTH);
200 |     fprintf(fo_input, "%lld %d\n", input_vocab_size, vector_size);
201 | 
202 |     for (a = 0; a < input_vocab_size; a++) { //!@#$%^&*
203 |         if (fscanf(fi_input_vocab,format,word) == 0) return 1;
204 |         fprintf(fo_input, "%s",word);
205 |         for (b = 0; b < vector_size; b++) fprintf(fo_input," %lf", W[a * (vector_size + 1) + b]);
206 |         fprintf(fo_input,"\n");
207 |         if (fscanf(fi_input_vocab,format,word) == 0) return 1; // Eat irrelevant frequency entry
208 |     }
209 | 
210 |     fprintf(fo_output, "%lld %d\n", output_vocab_size, vector_size);
211 |     for (a = 0; a < output_vocab_size; a++) { //!@#$%^&*
212 |         if (fscanf(fi_output_vocab,format,word) == 0) return 1;
213 |         fprintf(fo_output, "%s",word);
214 |         for (b = 0; b < vector_size; b++) fprintf(fo_output," %lf", W[(input_vocab_size + a) * (vector_size + 1) + b]);
215 |         fprintf(fo_output,"\n");
216 |         if (fscanf(fi_output_vocab,format,word) == 0) return 1; // Eat irrelevant frequency entry
217 |     }
218 | 
219 |     fclose(fo_input);
220 |     fclose(fo_output);
221 |     fclose(fi_input_vocab);
222 |     fclose(fi_output_vocab);
223 |     return 0;
224 | }
225 | 
226 | /* Train model */
227 | int train_glove() {
228 |     long long a;
229 |     int b;
230 |     real total_cost = 0;
231 |     fprintf(stderr, "TRAINING MODEL\n");
232 |     file_size = GetFileSize(counts_file);
233 |     fprintf(stderr,"Read %lld counts.\n", counts_num);
234 |     if (verbose > 1) fprintf(stderr,"Initializing parameters...");
235 |     initialize_parameters();
236 |     if (verbose > 1) fprintf(stderr,"done.\n");
237 |     if (verbose > 0) fprintf(stderr,"vector size: %d\n", vector_size);
238 |     if (verbose > 0) fprintf(stderr,"words size: %lld\n", input_vocab_size);
239 |     if (verbose > 0) fprintf(stderr,"contexts size: %lld\n", output_vocab_size);
240 |     if (verbose > 0) fprintf(stderr,"x_max: %lf\n", x_max);
241 |     if (verbose > 0) fprintf(stderr,"alpha: %lf\n", alpha);
242 |     pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));
243 |     
244 |     time_t rawtime;
245 |     struct tm *info;
246 |     char time_buffer[80];
247 |     // Lock-free asynchronous SGD
248 |     for (b = 0; b < num_iter; b++) {
249 |         total_cost = 0;
250 |         long long *thread_ids = (long long*)malloc(sizeof(long long) * num_threads);
251 |         for (a = 0; a < num_threads; a++) thread_ids[a] = a;
252 |         for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, glove_thread, (void *)&thread_ids[a]);
253 |         for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);
254 |         for (a = 0; a < num_threads; a++) total_cost += cost[a];
255 |         free(thread_ids);
256 | 
257 |         time(&rawtime);
258 |         info = localtime(&rawtime);
259 |         strftime(time_buffer,80,"%x - %I:%M.%S%p", info);
260 |         fprintf(stderr, "%s, iter: %03d, cost: %lf\n", time_buffer,  b+1, total_cost/counts_num);
261 | 
262 |     }
263 |     free(pt);
264 |     return save_params();
265 | 
266 | }
267 | 
268 | int find_arg(char *str, int argc, char **argv) {
269 |     int i;
270 |     for (i = 1; i < argc; i++) {
271 |         if (!scmp(str, argv[i])) {
272 |             if (i == argc - 1) {
273 |                 printf("No argument given for %s\n", str);
274 |                 exit(1);
275 |             }
276 |             return i;
277 |         }
278 |     }
279 |     return -1;
280 | }
281 | 
282 | int main(int argc, char **argv) {
283 |     int i;
284 |     FILE *fid;
285 |     //!@#$%^&*
286 |     counts_file = malloc(sizeof(char) * MAX_STRING_LENGTH);
287 |     input_vocab_file = malloc(sizeof(char) * MAX_STRING_LENGTH);
288 |     output_vocab_file = malloc(sizeof(char) * MAX_STRING_LENGTH);
289 |     input_vector_file = malloc(sizeof(char) * MAX_STRING_LENGTH);
290 |     output_vector_file = malloc(sizeof(char) * MAX_STRING_LENGTH);
291 |     int result = 0;
292 |     
293 |     if (argc == 1) {
294 |         printf("GloVe: Global Vectors for Word Representation, v0.2\n");
295 |         printf("Author: Jeffrey Pennington (jpennin@stanford.edu)\n\n");
296 |         printf("Usage options:\n");
297 |         printf("\t--verbose <int>\n");
298 |         printf("\t\tSet verbosity: 0, 1, or 2 (default)\n");
299 |         printf("\t--size <int>\n");
300 |         printf("\t\tDimension of word vector representations (excluding bias term); default 300\n");//!@#$%^&*
301 |         printf("\t--threads_num <int>\n");
302 |         printf("\t\tNumber of threads; default 8\n");
303 |         printf("\t--iter <int>\n");
304 |         printf("\t\tNumber of training iterations; default 25\n");
305 |         printf("\t--eta <float>\n");
306 |         printf("\t\tInitial learning rate; default 0.05\n");
307 |         printf("\t--alpha <float>\n");
308 |         printf("\t\tParameter in exponent of weighting function; default 0.75\n");
309 |         printf("\t--x-max <float>\n");
310 |         printf("\t\tParameter specifying cutoff in weighting function; default 100.0\n");
311 |         printf("\t--counts_file <file>\n");
312 |         printf("\t\tText file of co-occurrence counts, one '<word id> <context id> <count>' record per line\n");
313 |         printf("\t--input_vocab_file <file>, --output_vocab_file <file>\n");
314 |         printf("\t\tVocabulary files for words and contexts, one '<entry> <count>' pair per line\n");
315 |         printf("\t--input_vector_file <file>, --output_vector_file <file>\n");
316 |         printf("\t\tOutput files for the word vectors and the context vectors (text format)\n");
317 |         printf("\nExample usage:\n");
318 |         printf("./glove --counts_file counts.shuf --input_vocab_file words.vocab --output_vocab_file contexts.vocab --input_vector_file glove.input --output_vector_file glove.output --verbose 2 --size 100 --threads_num 16 --alpha 0.75 --x-max 100.0 --eta 0.05\n\n");
319 |         result = 0;
320 |     } else {
321 |         if ((i = find_arg((char *)"-write-header", argc, argv)) > 0) write_header = atoi(argv[i + 1]);
322 |         if ((i = find_arg((char *)"--verbose", argc, argv)) > 0) verbose = atoi(argv[i + 1]);
323 |         if ((i = find_arg((char *)"--size", argc, argv)) > 0) vector_size = atoi(argv[i + 1]);
324 |         if ((i = find_arg((char *)"--iter", argc, argv)) > 0) num_iter = atoi(argv[i + 1]);
325 |         if ((i = find_arg((char *)"--threads_num", argc, argv)) > 0) num_threads = atoi(argv[i + 1]);
326 |         cost = malloc(sizeof(real) * num_threads);
327 |         if ((i = find_arg((char *)"--alpha", argc, argv)) > 0) alpha = atof(argv[i + 1]);
328 |         if ((i = find_arg((char *)"--x-max", argc, argv)) > 0) x_max = atof(argv[i + 1]);
329 |         if ((i = find_arg((char *)"--eta", argc, argv)) > 0) eta = atof(argv[i + 1]);
330 | 
331 |         if ((i = find_arg((char *)"--counts_file", argc, argv)) > 0) strcpy(counts_file, argv[i + 1]);
332 |         if ((i = find_arg((char *)"--input_vocab_file", argc, argv)) > 0) strcpy(input_vocab_file, argv[i + 1]);
333 |         if ((i = find_arg((char *)"--output_vocab_file", argc, argv)) > 0) strcpy(output_vocab_file, argv[i + 1]);
334 |         if ((i = find_arg((char *)"--input_vector_file", argc, argv)) > 0) strcpy(input_vector_file, argv[i + 1]);
335 |         if ((i = find_arg((char *)"--output_vector_file", argc, argv)) > 0) strcpy(output_vector_file, argv[i + 1]);
336 | 
337 |         input_vocab_size = 0;
338 |         fid = fopen(input_vocab_file, "r");
339 |         if (fid == NULL) {fprintf(stderr, "Unable to open words file %s.\n",input_vocab_file); return 1;}
340 |         while ((i = getc(fid)) != EOF) if (i == '\n') input_vocab_size++; // Count number of entries in vocab_file
341 |         fclose(fid);
342 | 
343 |         output_vocab_size = 0;
344 |         fid = fopen(output_vocab_file, "r");
345 |         if (fid == NULL) {fprintf(stderr, "Unable to open output_vocab file %s.\n",output_vocab_file); return 1;}
346 |         while ((i = getc(fid)) != EOF) if (i == '\n') output_vocab_size++; // Count number of entries in vocab_file
347 |         fclose(fid);
348 | 
349 |         counts_num = 0;
350 |         fid = fopen(counts_file, "r");
351 |         if (fid == NULL) {fprintf(stderr, "Unable to open counts file %s.\n",counts_file); return 1;}
352 |         while ((i = getc(fid)) != EOF) if (i == '\n') counts_num++; // Count number of records in counts_file
353 |         fclose(fid);
354 |         
355 |         result = train_glove();
356 |         free(cost);
357 |     }
358 |     free(counts_file);
359 |     free(input_vocab_file);
360 |     free(output_vocab_file);
361 |     free(input_vector_file);
362 |     free(output_vector_file);
363 |     return result;
364 | }
365 | 
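
The per-record update performed in `glove_thread` above can be restated compactly in NumPy. The sketch below follows the same arithmetic (weighting function, weighted squared error, AdaGrad step with the bias stored in the last slot), but it is an illustration only: the names `glove_weight` and `glove_step` are hypothetical, it handles one record at a time without threads, and it omits the NaN/inf guards of the C code.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): (x / x_max) ** alpha up to the cutoff, 1 above it."""
    return 1.0 if x > x_max else (x / x_max) ** alpha


def glove_step(w, c, gw, gc, x, eta=0.05, x_max=100.0, alpha=0.75):
    """One AdaGrad update for a single (word, context, count) record.

    w, c   : float arrays of length d + 1; the last slot holds the bias.
    gw, gc : accumulated squared gradients, same shape, initialised to 1.
    All arrays are updated in place; the record's weighted squared error is returned.
    """
    d = len(w) - 1
    diff = w[:d] @ c[:d] + w[d] + c[d] - np.log(x)
    fdiff = glove_weight(x, x_max, alpha) * diff
    cost = 0.5 * fdiff * diff

    fdiff *= eta                      # learning rate times weighted error
    grad_w = fdiff * c[:d]            # both gradients use the pre-update vectors
    grad_c = fdiff * w[:d]
    w[:d] -= grad_w / np.sqrt(gw[:d])
    c[:d] -= grad_c / np.sqrt(gc[:d])
    gw[:d] += grad_w ** 2
    gc[:d] += grad_c ** 2

    # Bias terms: their gradient is fdiff itself.
    w[d] -= fdiff / np.sqrt(gw[d])
    c[d] -= fdiff / np.sqrt(gc[d])
    gw[d] += fdiff ** 2
    gc[d] += fdiff ** 2
    return cost


rng = np.random.default_rng(0)
d = 5
w = (rng.random(d + 1) - 0.5) / (d + 1)   # same scale as initialize_parameters()
c = (rng.random(d + 1) - 0.5) / (d + 1)
gw, gc = np.ones(d + 1), np.ones(d + 1)
costs = [glove_step(w, c, gw, gc, x=20.0) for _ in range(200)]
print(costs[0], costs[-1])
```

Repeating the step on a single record drives its cost toward zero, which is a quick sanity check that the signs of the updates are right.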


--------------------------------------------------------------------------------
/glove/makefile:
--------------------------------------------------------------------------------
 1 | CC = gcc
 2 | CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 3 | 
 4 | all: glove
 5 | 
 6 | glove : glove.c
 7 | 	$(CC) glove.c -o glove $(CFLAGS)
 8 | 
 9 | clean:
10 | 	rm -rf glove
11 | 
12 | 


--------------------------------------------------------------------------------
/ngram2vec/analogy_eval.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | from __future__ import print_function, unicode_literals
  4 | import argparse
  5 | import codecs
  6 | import numpy as np
  7 | from scipy.stats import spearmanr
  8 | import sys
  9 | from utils.misc import normalize
 10 | from utils.matrix import load_dense, load_sparse
 11 | from eval.testset import load_analogy, get_ana_vocab
 12 | from eval.similarity import prepare_similarities
 13 | from eval.recast import retain_words, align_matrix
 14 | 
 15 | 
 16 | def main():
 17 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
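 18 |     parser.add_argument("--input_vector_file", type=str, required=True,
 19 |                         help="Path to the input (word) vector file.")
 20 |     parser.add_argument("--output_vector_file", type=str,
 21 |                         help="Path to the output (context) vector file; needed when --ensemble is output, add, or concat.")
 22 |     parser.add_argument("--test_file", type=str, required=True,
 23 |                         help="Path to the analogy test file.")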
 24 |     parser.add_argument('--sparse', action='store_true',
 25 |                         help="Load sparse representation.")
 26 |     parser.add_argument('--normalize', action='store_true',
 27 |                         help="If set, vector is normalized.")
 28 |     parser.add_argument("--ensemble", type=str, default="input",
 29 |                         choices=["input", "output", "add", "concat"],
 30 |                         help="""Strategies for using input/output vectors.
 31 |                         One can use only input, only output, the addition of input and output,
 32 |                         or their concatenation. Options are
 33 |                         [input|output|add|concat].""")
 34 | 
 35 |     args = parser.parse_args()
 36 |     
 37 |     testset = load_analogy(args.test_file)
 38 |     ana_vocab, vocab = {}, {}
 39 |     ana_vocab["i2w"], ana_vocab["w2i"] = get_ana_vocab(testset)
 40 |     if args.sparse:
 41 |         matrix, vocab, _ = load_sparse(args.input_vector_file)
 42 |     else:
 43 |         matrix, vocab, _ = load_dense(args.input_vector_file)
 44 | 
 45 |     if not args.sparse:
 46 |         if args.ensemble == "add":
 47 |             output_matrix, output_vocab, _ = load_dense(args.output_vector_file)
 48 |             output_matrix = align_matrix(matrix, output_matrix, vocab, output_vocab)
 49 |             matrix = matrix + output_matrix
 50 |         elif args.ensemble == "concat":
 51 |             output_matrix, output_vocab, _ = load_dense(args.output_vector_file)
 52 |             output_matrix = align_matrix(matrix, output_matrix, vocab, output_vocab)
 53 |             matrix = np.concatenate([matrix, output_matrix], axis=1)
 54 |         elif args.ensemble == "output":
 55 |             matrix, vocab, _ = load_dense(args.output_vector_file)
 56 |         else: # args.ensemble == "input"
 57 |             pass
 58 | 
 59 |     if args.normalize:
 60 |         matrix = normalize(matrix, args.sparse)
 61 | 
 62 |     matrix, vocab["i2w"], vocab["w2i"] = retain_words(matrix, vocab["i2w"], vocab["w2i"])
 63 |     sim_matrix = prepare_similarities(matrix, ana_vocab, vocab, sparse=args.sparse)
 64 | 
 65 |     seen, correct_add, correct_mul = 0, 0, 0
 66 |     for a, a_, b, b_ in testset:
 67 |         if a not in vocab["i2w"] or a_ not in vocab["i2w"] or b not in vocab["i2w"]:
 68 |             continue
 69 |         seen += 1
 70 |         guess_add, guess_mul = guess(sim_matrix, ana_vocab, vocab, a, a_, b)
 71 |         if guess_add == b_:
 72 |             correct_add += 1
 73 |         if guess_mul == b_:
 74 |             correct_mul += 1
 75 |     accuracy_add = float(correct_add) / seen if seen else 0.0  # avoid ZeroDivisionError when no question is covered
 76 |     accuracy_mul = float(correct_mul) / seen if seen else 0.0
 77 |     print("seen/total: {}/{}".format(seen, len(testset)))
 78 |     print("{}: {:.3f} {:.3f}".format(args.test_file, accuracy_add, accuracy_mul))
 79 | 
 80 | 
 81 | def guess(sim_matrix, ana_vocab, vocab, a, a_, b):
 82 |     sa = sim_matrix[ana_vocab["w2i"][a]]
 83 |     sa_ = sim_matrix[ana_vocab["w2i"][a_]]
 84 |     sb = sim_matrix[ana_vocab["w2i"][b]]
 85 |     
 86 |     sim_add = sa_ + sb - sa
 87 |     sim_add[vocab["w2i"][a]] = 0
 88 |     sim_add[vocab["w2i"][a_]] = 0
 89 |     sim_add[vocab["w2i"][b]] = 0
 90 |     guess_add = vocab["i2w"][np.nanargmax(sim_add)]
 91 |     
 92 |     sim_mul = sa_ * sb * np.reciprocal(sa+0.01)
 93 |     sim_mul[vocab["w2i"][a]] = 0
 94 |     sim_mul[vocab["w2i"][a_]] = 0
 95 |     sim_mul[vocab["w2i"][b]] = 0
 96 |     guess_mul = vocab["i2w"][np.nanargmax(sim_mul)]
 97 |     
 98 |     return guess_add, guess_mul
 99 | 
100 | 
101 | if __name__ == '__main__':
102 |     main()
103 | 
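
The two objectives computed in `guess` above, additive (`sa_ + sb - sa`) and multiplicative (`sa_ * sb / (sa + 0.01)`), can be tried on a toy vocabulary. The following is a minimal sketch, not the script's code path: `analogy_guess` and the hand-made vectors are hypothetical, similarities are cosine values shifted to [0, 1] (an assumption of this sketch, not a statement about `prepare_similarities`), and query words are excluded with `-inf` where the script sets their scores to 0.

```python
import numpy as np


def analogy_guess(vectors, words, a, a_, b):
    """Answer 'a is to a_ as b is to ?' with the additive and multiplicative objectives."""
    w2i = {w: i for i, w in enumerate(words)}
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Shift cosine similarity from [-1, 1] to [0, 1] so that products stay non-negative.
    sims = (unit @ unit.T + 1.0) / 2.0
    sa, sa_, sb = sims[w2i[a]], sims[w2i[a_]], sims[w2i[b]]

    sim_add = sa_ + sb - sa
    sim_mul = sa_ * sb / (sa + 0.01)   # 0.01 avoids division by zero
    for q in (a, a_, b):               # never return one of the query words
        sim_add[w2i[q]] = -np.inf
        sim_mul[w2i[q]] = -np.inf
    return words[int(np.argmax(sim_add))], words[int(np.argmax(sim_mul))]


words = ["man", "woman", "king", "queen", "apple"]
vectors = np.array([
    [1.0, 0.0, 0.1],   # man
    [0.0, 1.0, 0.1],   # woman
    [1.0, 0.0, 1.0],   # king
    [0.0, 1.0, 1.0],   # queen
    [0.3, 0.3, 0.0],   # apple
])
print(analogy_guess(vectors, words, "man", "woman", "king"))
```

On these hand-made vectors both objectives pick `queen`; on real embeddings the two can disagree, which is why the script reports both accuracies.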


--------------------------------------------------------------------------------
/ngram2vec/corpus2pairs.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | from __future__ import print_function, unicode_literals, division
  4 | import argparse
  5 | import multiprocessing
  6 | import codecs
  7 | import random
  8 | from math import sqrt
  9 | import six
 10 | import sys
 11 | from utils.vocabulary import load_count_vocabulary
 12 | import line2pairs
 13 | from utils.misc import is_word
 14 | 
 15 | 
 16 | def main():
 17 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 18 |     parser.add_argument("--corpus_file", type=str, required=True,
 19 |                         help="Path to the corpus file.")
 20 |     parser.add_argument("--pairs_file", type=str, required=True,
 21 |                         help="Path to the pairs file.")
 22 |     parser.add_argument("--vocab_file", type=str, required=True,
 23 |                         help="Path to the vocab file.")
 24 | 
 25 |     parser.add_argument("--cooccur", type=str, default="word_word",
 26 |                         choices=["word_word", "ngram_ngram"],
 27 |                         help="""Type of co-occurrence to use.
 28 |                         More types will be added. Options are
 29 |                         [word_word|ngram_ngram].""")
 30 |     parser.add_argument("--win", type=int, default=2,
 31 |                         help="Local window size.")
 32 |     parser.add_argument("--sub", type=float, default=1e-5,
 33 |                         help="Subsampling for filtering high-frequency features.")
 34 |     parser.add_argument("--processes_num", type=int, default=4,
 35 |                         help="Number of processes.")
 36 |     parser.add_argument('--dynamic_win', action='store_true',
 37 |                         help="If set, local window size is sampled from [1, win].")
 38 |     parser.add_argument('--dirty', action='store_true',
 39 |                         help="If set, removed features are dropped from the line before the local window is applied.")
 40 |     parser.add_argument('--seed', type=int, default=7,
 41 |                         help="Seed.")
 42 | 
 43 |     parser.add_argument("--input_order", type=int, default=1,
 44 |                         help="Order of input word-level ngram if --cooccur is set to ngram_ngram.")
 45 |     parser.add_argument("--output_order", type=int, default=2,
 46 |                         help="Order of output word-level ngram if --cooccur is set to ngram_ngram.")
 47 |     parser.add_argument('--overlap', action='store_true',
 48 |                         help="If set, overlap of ngram is allowed.")
 49 | 
 50 |     args = parser.parse_args()
 51 | 
 52 |     print("Corpus2pairs")
 53 |     processes_list = []
 54 |     # Start multiple processes.
 55 |     for i in range(0, args.processes_num):
 56 |         p = multiprocessing.Process(target=corpus2pairs_process, args=(args, i))
 57 |         p.start()
 58 |         processes_list.append(p)
 59 |     for p in processes_list:
 60 |         p.join()
 61 |     print()
 62 |     print("Corpus2pairs finished")
 63 | 
 64 | 
 65 | def corpus2pairs_process(args, pid):
 66 |     pairs_file = args.pairs_file + "_" + str(pid)
 67 |     pairs = codecs.open(pairs_file, "w", "utf-8")
 68 |     processes_num = args.processes_num
 69 |     sub = args.sub
 70 | 
 71 |     vocab = load_count_vocabulary(args.vocab_file)
 72 |     tokens_num = 0
 73 |     # Subsampling strategy.
 74 |     for w, c in six.iteritems(vocab): 
 75 |         if is_word(w):
 76 |             tokens_num += c
 77 |     sub *= tokens_num
 78 |     if sub != 0:
 79 |         subsampler = dict([(w, 1 - sqrt(sub / c)) for w, c in six.iteritems(vocab) if c > sub])
 80 |     else:
 81 |         subsampler = None
 82 |     if pid == 0:
 83 |         print("Vocabulary size: {}".format(len(vocab)))
 84 |     random.seed(args.seed)
 85 |     with codecs.open(args.corpus_file, "r", "utf-8") as f:
 86 |         lines_num = 0
 87 |         for line in f:
 88 |             lines_num += 1
 89 |             if lines_num % 1000 == 0 and pid == 0:
 90 |                 print("\r{}K lines processed.".format(int(lines_num/1000)), end="")
 91 |             if lines_num % processes_num != pid:
 92 |                 continue
 93 |             pairs_list = getattr(line2pairs, args.cooccur)(line, vocab, subsampler, random, args)
 94 |             for input, output in pairs_list:
 95 |                 pairs.write("{} {}\n".format(input, output)) 
 96 | 
 97 |     pairs.close()
 98 | 
 99 | 
100 | if __name__ == '__main__':
101 |     main()
102 | 
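
The subsampling step in `corpus2pairs_process` scales `--sub` by the token count and assigns each feature whose count exceeds that threshold a discard probability of `1 - sqrt(threshold / count)`. The sketch below reproduces the arithmetic on made-up counts. `build_subsampler` is a hypothetical helper, and it sums all counts, whereas the script only sums counts of features for which `is_word` is true.

```python
from math import sqrt


def build_subsampler(vocab, sub):
    """Map each frequent feature to its discard probability.

    vocab: dict of feature -> count; sub: relative threshold (like --sub).
    """
    tokens_num = sum(vocab.values())   # simplification: no is_word filter
    threshold = sub * tokens_num
    if threshold == 0:
        return None
    return {w: 1 - sqrt(threshold / c) for w, c in vocab.items() if c > threshold}


toy_vocab = {"the": 9050, "cat": 900, "sat": 50}   # made-up counts, 10000 tokens
subsampler = build_subsampler(toy_vocab, sub=0.01)  # threshold = 100 occurrences
for word, p in sorted(subsampler.items()):
    print("{}: dropped with probability {:.3f}".format(word, p))
```

Features at or below the threshold (here `sat`) get no entry and are always kept.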


--------------------------------------------------------------------------------
/ngram2vec/corpus2vocab.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals, division
 4 | import argparse
 5 | import codecs
 6 | from sys import getsizeof
 7 | import six
 8 | from utils.vocabulary import save_count_vocabulary
 9 | import line2vocab
10 | 
11 | 
12 | def main():
13 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
14 |     parser.add_argument("--corpus_file", type=str, required=True,
15 |                         help="Path to the corpus file.")
16 |     parser.add_argument("--vocab_file", type=str, required=True,
17 |                         help="Path to the vocab file.")
18 | 
19 |     parser.add_argument("--feature", type=str, default="word",
20 |                         choices=["word", "ngram"],
21 |                         help="""Type of linguistic units (features) to use.
22 |                         More types will be added. Options are
23 |                         [word|ngram].""")
24 |     parser.add_argument("--min_count", type=int, default=100,
25 |                         help="Threshold for removing low-frequency features (units).")
26 |     parser.add_argument("--max_length", type=int, default=50,
27 |                         help="Threshold for removing features (units) with too many characters.")
28 |     parser.add_argument("--memory_size", type=float, default=4.0,
29 |                         help="""Memory size. Sometimes the vocabulary is large. 
30 |                         We remove low-frequency features to ensure that vocabulary could be stored in memory.""")
31 | 
32 |     parser.add_argument("--order", type=int, default=2,
33 |                         help="Order of word-level ngram if --feature is set to ngram")
34 | 
35 |     args = parser.parse_args()
36 | 
37 |     print("Corpus2vocab")
38 |     vocab = {}
39 |     memory_size = args.memory_size * 1000**3
40 |     memory_size_used = 0.0
41 |     reduce_thr = 1
42 |     tokens_num = 0
43 |     with codecs.open(args.corpus_file, "r", "utf-8") as f:
44 |         for line in f:
45 |             print("\r{}M tokens processed.".format(int(tokens_num/1000**2)), end="")
46 |             # Extract linguistic features (units) from a line.
47 |             line_vocab = getattr(line2vocab, args.feature)(line, args)
48 |             tokens_num += len(line.strip().split())
49 |             for word, count in line_vocab.items():
50 |                 if len(word) > args.max_length:
51 |                     continue
52 |                 if word not in vocab:
53 |                     vocab[word] = count
54 |                     memory_size_used += getsizeof(word)
55 |                     # Remove low-frequency features when 30% memory is used up.
56 |                     # One can use any value between 0 and 1. 0.3 is a safe value.
57 |                     if memory_size_used + getsizeof(vocab) > memory_size * 0.3:
58 |                         reduce_thr += 1
59 |                         vocab_size = len(vocab)
60 |                         vocab = {w: c for w, c in six.iteritems(vocab) if c >= reduce_thr}
61 |                         # Reestimate memory size used.
62 |                         memory_size_used *= len(vocab) / vocab_size
63 |                 else:
64 |                     vocab[word] += count
65 | 
66 |     # Reduce vocabulary.
67 |     vocab = {w: c for w, c in six.iteritems(vocab) if c >= args.min_count}
68 |     # Sort vocabulary.
69 |     vocab = sorted(six.iteritems(vocab), key=lambda item: item[1], reverse=True)
70 |     save_count_vocabulary(args.vocab_file, vocab)
71 | 
72 |     print()
73 |     print("Number of tokens: {}".format(tokens_num))
74 |     print("Size of vocabulary: {}".format(len(vocab)))
75 |     print("Low-frequency threshold: {}".format(max(args.min_count, reduce_thr)))
76 |     print("Corpus2vocab finished")
77 | 
78 | 
79 | if __name__ == '__main__':
80 |     main()
81 | 


--------------------------------------------------------------------------------
/ngram2vec/corpus2vocab_multiproc.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | from __future__ import print_function, unicode_literals, division
  4 | import argparse
  5 | import codecs
  6 | from sys import getsizeof
  7 | import six
  8 | from multiprocessing import Pool
  9 | from utils.vocabulary import save_count_vocabulary
 10 | from utils.misc import merge_vocabularies
 11 | import line2vocab
 12 | 
 13 | 
 14 | def corpus2vocab_process(corpus_file, proc_id, proc_num, args):
 15 |     """
 16 |     Count features on the lines assigned to this process (line_id % proc_num == proc_id).
 17 |     """
 18 |     vocab = {}
 19 | 
 20 |     memory_size = args.memory_size * 1000**3 / proc_num
 21 |     memory_size_used = 0.0
 22 |     reduce_thr = 1
 23 |     tokens_num = 0
 24 |     
 25 |     with codecs.open(corpus_file, "r", "utf-8") as f:
 26 |         for line_id, line in enumerate(f):
 27 |             if (line_id % proc_num) == proc_id:
 28 |                 if proc_id == 0:
 29 |                     print("\r{}M tokens processed (approximately).".format(int(tokens_num*proc_num/1000**2)), end="")
 30 |                 line_vocab = getattr(line2vocab, args.feature)(line, args)
 31 |                 tokens_num += len(line.strip().split())
 32 |                 for word, count in line_vocab.items():
 33 |                     if len(word) > args.max_length:
 34 |                         continue
 35 |                     if word not in vocab:
 36 |                         vocab[word] = count
 37 |                         memory_size_used += getsizeof(word)
 38 |                         if memory_size_used + getsizeof(vocab) > memory_size * 0.7:
 39 |                             reduce_thr += 1
 40 |                             vocab_size = len(vocab)
 41 |                             vocab = {w: c for w, c in six.iteritems(vocab) if c >= reduce_thr}
 42 |                             memory_size_used *= len(vocab) / vocab_size
 43 |                     else:
 44 |                         vocab[word] += count
 45 | 
 46 |     return (vocab,)
 47 | 
 48 | 
 49 | def main():
 50 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 51 |     parser.add_argument("--corpus_file", type=str, required=True,
 52 |                         help="Path to the corpus file.")
 53 |     parser.add_argument("--vocab_file", type=str, required=True,
 54 |                         help="Path to the vocab file.")
 55 | 
 56 |     parser.add_argument("--feature", type=str, default="word",
 57 |                         choices=["word", "ngram"],
 58 |                         help="""Type of linguistic units (features) to use.
 59 |                         More types will be added. Options are
 60 |                         [word|ngram].""")
 61 |     parser.add_argument("--min_count", type=int, default=100,
 62 |                         help="Threshold for removing low-frequency features (units).")
 63 |     parser.add_argument("--max_length", type=int, default=50,
 64 |                         help="Threshold for removing features (units) with too many characters.")
 65 | 
 66 |     parser.add_argument("--processes_num", type=int, default=4,
 67 |                         help="Number of processes.")
 68 | 
 69 |     parser.add_argument("--order", type=int, default=2,
 70 |                         help="Order of word-level ngram if --feature is set to ngram")
 71 |     
 72 |     parser.add_argument("--memory_size", type=float, default=4.0,
 73 |                         help="""Memory size. Sometimes the vocabulary is large. 
 74 |                         We remove low-frequency features to ensure that vocabulary could be stored in memory.""")
 75 | 
 76 |     args = parser.parse_args()
 77 | 
 78 |     print("Corpus2vocab")
 79 | 
 80 |     pool = Pool(args.processes_num)
 81 |     vocab_list = []
 82 |     for i in range(args.processes_num):
 83 |         vocab_list.append((pool.apply_async(func=corpus2vocab_process, args=[args.corpus_file, i, args.processes_num, args])))
 84 |     pool.close()
 85 |     pool.join()
 86 | 
 87 |     vocab_list = [v.get()[0] for v in vocab_list]
 88 |     
 89 |     vocab = merge_vocabularies(vocab_list)
 90 | 
 91 |     # Reduce vocabulary.
 92 |     vocab = {w: c for w, c in six.iteritems(vocab) if c >= args.min_count}
 93 |     # Sort vocabulary.
 94 |     vocab = sorted(six.iteritems(vocab), key=lambda item: item[1], reverse=True)
 95 |     save_count_vocabulary(args.vocab_file, vocab)
 96 | 
 97 |     print()
 98 |     print("Size of vocabulary: {}".format(len(vocab)))
 99 |     print("Corpus2vocab finished")
100 | 
101 | 
102 | if __name__ == '__main__':
103 |     main()
104 | 
105 | 


--------------------------------------------------------------------------------
/ngram2vec/counts2glove.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import argparse
 5 | import subprocess
 6 | import os
 7 | 
 8 | 
 9 | def main():
10 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
11 |     parser.add_argument("--counts_file", type=str, required=True,
12 |                         help="Path to the counts file.")
13 |     parser.add_argument("--input_vocab_file", type=str, required=True,
14 |                         help="Path to the input vocabulary file.")
15 |     parser.add_argument("--output_vocab_file", type=str, required=True,
16 |                         help="Path to the output vocabulary file.")
17 |     parser.add_argument("--input_vector_file", type=str, required=True,
18 |                         help="Path to the input vector file.")
19 |     parser.add_argument("--output_vector_file", type=str, required=True,
20 |                         help="Path to the output vector file.")
21 | 
22 |     parser.add_argument("--size", type=int, default=100,
23 |                         help="Dimensionality of the vectors.")
24 |     parser.add_argument("--threads_num", type=int, default=4,
25 |                         help="Number of threads.")
26 |     parser.add_argument("--iter", type=int, default=10,
27 |                         help="Number of training iterations.")
28 | 
29 |     args = parser.parse_args()
30 | 
31 |     print("Counts2glove")
32 |     command = ["./glove/glove"]
33 |     command.extend(["--counts_file", args.counts_file])
34 |     command.extend(["--input_vocab_file", args.input_vocab_file])
35 |     command.extend(["--output_vocab_file", args.output_vocab_file])
36 |     command.extend(["--input_vector_file", args.input_vector_file])
37 |     command.extend(["--output_vector_file", args.output_vector_file])
38 | 
39 |     command.extend(["--size", str(args.size)])
40 |     command.extend(["--threads_num", str(args.threads_num)])
41 |     command.extend(["--iter", str(args.iter)])
42 |     
43 |     return_code = subprocess.call(command)
44 |     print()
45 |     print("Counts2glove finished")
46 | 
47 | 
48 | if __name__ == '__main__':
49 |     main()
50 | 


--------------------------------------------------------------------------------
/ngram2vec/counts2ppmi.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals, division
 4 | import argparse
 5 | import codecs
 6 | from scipy.sparse import csr_matrix
 7 | import numpy as np
 8 | from utils.vocabulary import load_vocabulary
 9 | from utils.matrix import save_sparse
10 | from utils.misc import normalize
11 | 
12 | 
13 | def main():
14 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
15 |     parser.add_argument("--counts_file", type=str, required=True,
16 |                         help="Path to the counts (matrix) file.")
17 |     parser.add_argument("--input_vocab_file", type=str, required=True,
18 |                         help="Path to the input vocabulary file.")
19 |     parser.add_argument("--output_vocab_file", type=str, required=True,
20 |                         help="Path to the output vocabulary file.")
21 |     parser.add_argument("--ppmi_file", type=str, required=True,
22 |                         help="Path to the PPMI file.")
23 | 
24 |     parser.add_argument("--cds", type=float, default=1.0,
25 |                         help="Context distribution smoothing.")
26 |     parser.add_argument("--neg", type=float, default=1.0,
27 |                         help="Negative sampling, shifted value on PPMI.")
28 |     
29 |     args = parser.parse_args()
30 | 
31 |     print("Counts2ppmi")
32 |     input_vocab = {}
33 |     output_vocab = {}
34 |     input_vocab["i2w"], input_vocab["w2i"] = load_vocabulary(args.input_vocab_file)
35 |     output_vocab["i2w"], output_vocab["w2i"] = load_vocabulary(args.output_vocab_file)
36 |     matrix = load_sparse_from_counts(args.counts_file, input_vocab["w2i"], output_vocab["w2i"], is_id=True)
37 |     pmi = calc_pmi(matrix, args.cds)
38 | 
39 |     # From PMI to Shifted PPMI (SPPMI).
40 |     pmi.data = np.log(pmi.data)
41 |     pmi.data -= np.log(args.neg)
42 |     pmi.data[pmi.data < 0] = 0
43 |     pmi.eliminate_zeros()
44 |     ppmi = pmi
45 |     # Save PPMI in txt format.
46 |     save_sparse(args.ppmi_file, ppmi, input_vocab["i2w"])
47 |     print()
48 |     print("Counts2ppmi finished")
49 | 
50 | 
51 | def load_sparse_from_counts(counts_file, input_vocab, output_vocab, is_id=False):
52 |     counts_num = 0
53 |     row, col, data = [], [], []
54 |     with codecs.open(counts_file, "r", "utf-8") as f:
55 |         for line in f:
56 |             counts_num += 1
57 |             if counts_num % 1000**2 == 0:
58 |                 print("\r{}M counts processed.".format(int(counts_num/1000**2)), end="")
59 |             input, output, count = line.strip().split()
60 |             if is_id:
61 |                 row.append(int(input))
62 |                 col.append(int(output))
63 |             else:
64 |                 row.append(input_vocab[input])
65 |                 col.append(output_vocab[output])
66 |             data.append(float(count))
67 |     matrix = csr_matrix((data, (row, col)), shape=(len(input_vocab), len(output_vocab)), dtype=np.float32)
68 |     return matrix
69 | 
70 | 
71 | def calc_pmi(matrix, cds):
72 |     row_sum = np.array(matrix.sum(axis=1))[:, 0]
73 |     col_sum = np.array(matrix.sum(axis=0))[0, :]
74 |     if cds != 1:
75 |         col_sum = col_sum ** cds
76 |     sum_total = col_sum.sum()
77 | 
78 |     pmi = multiply_by_rows(matrix, np.reciprocal(row_sum))
79 |     pmi = multiply_by_columns(pmi, np.reciprocal(col_sum))
80 |     pmi = pmi * sum_total
81 |     return pmi
82 | 
83 | 
84 | def multiply_by_rows(matrix, row_sum):
85 |     normalizer = csr_matrix((row_sum, (range(len(row_sum)), range(len(row_sum)))), dtype=np.float32)
86 |     return normalizer.dot(matrix)
87 | 
88 | 
89 | def multiply_by_columns(matrix, col_sum):
90 |     normalizer = csr_matrix((col_sum, (range(len(col_sum)), range(len(col_sum)))), dtype=np.float32)
91 |     return matrix.dot(normalizer)
92 | 
93 | 
94 | if __name__ == '__main__':
95 |     main()
96 | 
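
`calc_pmi` and the lines that follow it in `main` compute shifted positive PMI on a sparse matrix: counts are divided by row sums and (optionally smoothed) column sums, scaled by the total, log-transformed, shifted by `log(neg)`, and clipped at zero. The dense sketch below follows the same steps on a tiny made-up matrix so the numbers can be checked by hand. `dense_sppmi` is a hypothetical name and is not meant for real vocabularies.

```python
import numpy as np


def dense_sppmi(counts, cds=1.0, neg=1.0):
    """Shifted positive PMI for a small dense count matrix (illustrative only)."""
    counts = np.asarray(counts, dtype=np.float64)
    row_sum = counts.sum(axis=1)
    col_sum = counts.sum(axis=0) ** cds       # context distribution smoothing
    total = col_sum.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / np.outer(row_sum, col_sum))
    sppmi = pmi - np.log(neg)                 # shift by log(neg)
    sppmi[~np.isfinite(sppmi)] = 0.0          # zero counts stay zero
    return np.maximum(sppmi, 0.0)             # keep positive values only


toy_counts = [[2, 0], [0, 2]]                 # made-up co-occurrence counts
print(dense_sppmi(toy_counts))
print(dense_sppmi(toy_counts, neg=2.0))
```

In the toy matrix each diagonal cell has PMI `log(2 * 4 / (2 * 2)) = log 2`, so a shift of `neg=2` removes it entirely.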


--------------------------------------------------------------------------------
/ngram2vec/distance.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import argparse
 5 | import codecs
 6 | import numpy as np
 7 | from scipy.stats import spearmanr
 8 | import sys
 9 | from six.moves import input
10 | from utils.misc import normalize
11 | from utils.matrix import load_dense, load_sparse
12 | from eval.testset import load_analogy, get_ana_vocab
13 | from eval.similarity import prepare_similarities
14 | from eval.recast import retain_words, align_matrix
15 | 
16 | 
17 | def main():
18 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
19 |     parser.add_argument("--vector_file", type=str, required=True,
20 |                         help="Path to the vector file.")
21 |     parser.add_argument('--sparse', action='store_true',
22 |                         help="Load sparse representation.")
23 |     parser.add_argument('--normalize', action='store_true',
24 |                         help="If set, vector is normalized.")
25 |     parser.add_argument("--top_num", type=int, default=10,
26 |                         help="The number of neighbours returned.")
27 | 
28 |     args = parser.parse_args()
29 |     
30 |     if args.sparse:
31 |         matrix, vocab, _ = load_sparse(args.vector_file)
32 |     else:
33 |         matrix, vocab, _ = load_dense(args.vector_file)
34 |     
35 |     if args.normalize:
36 |         matrix = normalize(matrix, args.sparse)
37 |     top_num = args.top_num
38 | 
39 |     while True:
40 |         target = input("Enter a word (EXIT to break): ")
41 |         if target == "EXIT":
42 |             break
43 |         if target not in vocab["w2i"]:
44 |             print("Out of vocabulary")
45 |             continue
46 |         target_vocab = {}
47 |         target_vocab["i2w"], target_vocab["w2i"] = [target],  {target: 0}
48 |         sim_matrix = prepare_similarities(matrix, target_vocab, vocab, args.sparse)
49 |         neighbours = []
50 |         for i, w in enumerate(vocab["i2w"]):
51 |             sim = sim_matrix[0, i]
52 |             if target == w:
53 |                 continue
54 |             if neighbours and len(neighbours) >= top_num and sim <= neighbours[-1][1]:
55 |                 continue
56 |             for j in range(len(neighbours)):
57 |                 if sim > neighbours[j][1]:
58 |                     neighbours.insert(j, (w, sim))
59 |                     break
60 |             else:
61 |                 # Not larger than any kept neighbour: append at the end.
62 |                 neighbours.append((w, sim))
63 |             if len(neighbours) > top_num:
64 |                 neighbours.pop(-1)
65 | 
66 |         print("{0: <20} {1: <20}".format("word", "similarity"))
67 |         for w, sim in neighbours:
68 |             print("{0: <20} {1: <20}".format(w, sim))
69 | 
70 | 
71 | if __name__ == '__main__':
72 |     main()
73 | 
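
The neighbour loop in `main` keeps a sorted list of the `top_num` most similar words by hand. The same selection can be sketched with the standard library's `heapq.nlargest`. `top_neighbours` and the toy similarity values are hypothetical and only illustrate the selection step, not the script's loading or similarity code.

```python
import heapq


def top_neighbours(words, sims, target, top_num):
    """Return the top_num (word, similarity) pairs, most similar first."""
    candidates = ((w, s) for w, s in zip(words, sims) if w != target)
    return heapq.nlargest(top_num, candidates, key=lambda item: item[1])


# Made-up similarities of every word to the target "king".
words = ["king", "queen", "man", "apple"]
sims = [1.0, 0.8, 0.6, 0.1]
for w, sim in top_neighbours(words, sims, "king", top_num=2):
    print("{0: <20} {1: <20}".format(w, sim))
```

`heapq.nlargest` returns its result sorted from largest to smallest, so no extra sort is needed before printing.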


--------------------------------------------------------------------------------
/ngram2vec/eval/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-


--------------------------------------------------------------------------------
/ngram2vec/eval/recast.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import unicode_literals
 4 | import argparse
 5 | import codecs
 6 | import numpy as np
 7 | from scipy.stats import spearmanr
 8 | import sys
 9 | from utils.misc import is_word
10 | 
11 | 
12 | def retain_words(matrix, i2w, w2i):
13 |     i2w_word = []
14 |     retain_index = []
15 |     for i, w in enumerate(i2w):
16 |         if is_word(w):
17 |             i2w_word.append(w)
18 |             retain_index.append(i)
19 |     matrix = matrix[retain_index]
20 |     w2i_word = dict([(w, i) for i, w in enumerate(i2w_word)])
21 |     return matrix, i2w_word, w2i_word
22 | 
23 | 
24 | def align_matrix(input_matrix, output_matrix, input_vocab, output_vocab):
25 |     output_matrix_align = np.zeros(input_matrix.shape)
26 |     for i, w in enumerate(input_vocab["i2w"]):
27 |         if w not in output_vocab["w2i"]:
28 |             continue
29 |         output_matrix_align[i] = output_matrix[output_vocab["w2i"][w]]
30 |     return output_matrix_align
31 | 


--------------------------------------------------------------------------------
/ngram2vec/eval/similarity.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import unicode_literals
 4 | import argparse
 5 | import codecs
 6 | import numpy as np
 7 | from scipy.stats import spearmanr
 8 | import sys
 9 | from utils.misc import is_word
10 | 
11 | 
12 | def similarity(matrix, w2i, w1, w2, sparse=False):
13 |     if w1 not in w2i or w2 not in w2i:
14 |         return None
15 |     if sparse:
16 |         v1 = matrix[w2i[w1],:].toarray()[0]
17 |         v2 = matrix[w2i[w2],:].toarray()[0]
18 |     else:
19 |         v1 = matrix[w2i[w1],:]
20 |         v2 = matrix[w2i[w2],:]
21 |     sim = v1.dot(v2)
22 |     return sim
23 | 
24 | 
25 | def prepare_similarities(matrix, ana_vocab, vocab, sparse=False):
26 |     ana_matrix = matrix[[vocab["w2i"][w] if w in vocab["w2i"] else 0 for w in ana_vocab["i2w"]]]
27 |     if sparse:
28 |         sim_matrix = ana_matrix.dot(matrix.T).toarray()
29 |     else:
30 |         sim_matrix = ana_matrix.dot(matrix.T)
31 |         sim_matrix = (sim_matrix+1)/2
32 |     return sim_matrix
33 | 


--------------------------------------------------------------------------------
/ngram2vec/eval/testset.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import unicode_literals
 4 | import codecs
 5 | 
 6 | 
 7 | def load_analogy(test_file):
 8 |     testset = []
 9 |     with codecs.open(test_file, "r", "utf-8") as f:
10 |         for line in f:
11 |             analogy = line.strip().lower().split()
12 |             testset.append(analogy)
13 |     return testset
14 | 
15 | 
16 | def load_similarity(test_file):
17 |     testset = []
18 |     with codecs.open(test_file, "r", "utf-8") as f:
19 |         for line in f:
20 |             w1, w2, sim = line.strip().lower().split()
21 |             testset.append(((w1, w2), float(sim)))
22 |     return testset
23 | 
24 | 
25 | def get_ana_vocab(testset):
26 |     vocab = set()
27 |     for analogy in testset:
28 |         vocab.update(analogy)
29 |     i2w = list(vocab)
30 |     return i2w, dict([(w, i) for i, w in enumerate(i2w)])
31 | 


--------------------------------------------------------------------------------
/ngram2vec/line2pairs.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals, division
 4 | from utils.misc import get_ngram, check_feature
 5 | 
 6 | 
 7 | def word_word(line, vocab, subsampler, random, args):
 8 |     pairs_list = []
 9 |     if args.dynamic_win:
10 |         win = random.randint(1, args.win)
11 |     else:
12 |         win = args.win
13 |     line = [t if t in vocab else None for t in line.strip().split()]
14 |     if subsampler:
15 |         line = [t if t not in subsampler or random.random() > subsampler[t] else None for t in line]
16 |     if args.dirty:
17 |         line = [t for t in line if t is not None]
18 |     for i in range(len(line)):
19 |         input = get_ngram(line, i, 1)
20 |         if input is None:
21 |             continue
22 |         start, end = i - win, i + win
23 |         for j in range(start, end + 1):
24 |             if i == j:
25 |                 continue
26 |             output = get_ngram(line, j, 1)
27 |             if output is None:
28 |                 continue
29 |             pairs_list.append((input, output))
30 |     return pairs_list
31 | 
32 | 
33 | def ngram_ngram(line, vocab, subsampler, random, args):
34 |     pairs_list = []
35 |     if args.dynamic_win:
36 |         win = random.randint(1, args.win)
37 |     else:
38 |         win = args.win
39 |     input_order, output_order = args.input_order, args.output_order
40 |     overlap = args.overlap
41 |     tokens = line.strip().split()
42 |     for i in range(len(tokens)):
43 |         for j in range(1, input_order+1):
44 |             input = get_ngram(tokens, i, j)
45 |             input = check_feature(input, vocab, subsampler, random)
46 |             if input is None:
47 |                 continue
48 |             for k in range(1, output_order+1):
49 |                 start = i - win + j - 1
50 |                 end = i + win - k + 1
51 |                 for l in range(start, end + 1):
52 |                     if overlap:
53 |                         if i == l and j == k:
54 |                             continue
55 |                     else:
56 |                         if len(set(range(i, i + j)) & set(range(l, l + k))) > 0:
57 |                             continue
58 |                     output = get_ngram(tokens, l, k)
59 |                     output = check_feature(output, vocab, subsampler, random)
60 |                     if output is None:
61 |                         continue
62 |                     pairs_list.append((input, output))
63 |     return pairs_list
64 | 
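
`word_word` replaces removed tokens with `None` and, unless `--dirty` is set, leaves them in place so they still occupy a window position. The sketch below shows the effect of that choice on a toy line. `window_pairs` is a hypothetical stand-in that assumes `get_ngram(line, i, 1)` returns `None` for out-of-range positions and for `None` tokens, since `get_ngram` itself is not shown in this listing.

```python
def window_pairs(tokens, win):
    """Emit (input, output) pairs for every token within win positions."""
    pairs = []
    for i, center in enumerate(tokens):
        if center is None:
            continue
        for j in range(max(0, i - win), min(len(tokens), i + win + 1)):
            if j == i or tokens[j] is None:
                continue
            pairs.append((center, tokens[j]))
    return pairs


line = ["a", None, "b", "c"]        # None marks a removed (e.g. subsampled) token

# Default: the removed token still occupies a slot, so "a" has no neighbour.
clean = window_pairs(line, win=1)

# --dirty: removed tokens are dropped first, so "a" and "b" become adjacent.
dirty = window_pairs([t for t in line if t is not None], win=1)

print(clean)
print(dirty)
```

Dropping tokens first effectively widens the window over the original text, which is why the two modes yield different pair sets.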


--------------------------------------------------------------------------------
/ngram2vec/line2vocab.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | from collections import Counter
 5 | from utils.constants import JOIN_OF_NGRAM
 6 | from utils.misc import get_ngram
 7 | 
 8 | 
 9 | def word(line, args):
10 |     return Counter(line.strip().split())
11 | 
12 | 
13 | def ngram(line, args):
14 |     order = args.order
15 |     line = line.strip().split()
16 |     vocab = Counter()
17 |     for i in range(len(line)):
18 |         for j in range(1, order+1):
19 |             ngram = get_ngram(line, i, j)
20 |             if ngram is None:
21 |                 continue
22 |             vocab.update([ngram])
23 |     return vocab
24 | 


--------------------------------------------------------------------------------
/ngram2vec/pairs2counts.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | from __future__ import print_function, unicode_literals, division
  4 | import codecs
  5 | import argparse
  6 | from collections import Counter
  7 | from math import sqrt
  8 | from sys import getsizeof
  9 | import numpy as np
 10 | import sys
 11 | from six.moves import cPickle as pickle
 12 | import os
 13 | from utils.vocabulary import load_vocabulary
 14 | 
 15 | 
 16 | def aggregate(tmpcounts, input, output, memory_size_used, args):
 17 |     if args.aggregate == "stripes":
 18 |         if input in tmpcounts:
 19 |             tmp_size = getsizeof(tmpcounts[input])
 20 |             tmpcounts[input].update({output: 1})
 21 |             memory_size_used += getsizeof(tmpcounts[input]) - tmp_size
 22 |         else:
 23 |             tmpcounts[input] = Counter({output: 1})
 24 |             memory_size_used += getsizeof(tmpcounts[input])
 25 |     else: # args.aggregate == "pairs"
 26 |         if (input, output) in tmpcounts:
 27 |             tmpcounts[(input, output)] += 1
 28 |         else:
 29 |             tmpcounts[(input, output)] = 1
 30 |             memory_size_used += getsizeof(int(0))
 31 |     return memory_size_used
 32 | 
 33 | 
 34 | def write_tmpfiles(tmpcounts, tmpfile_id, args):
 35 |     with codecs.open("{}_{}".format(args.counts_file, tmpfile_id), 'wb') as f:
 36 |         if args.aggregate == "stripes":
 37 |             sorted_input = sorted(tmpcounts.keys())
 38 |             for i in sorted_input:
 39 |                 pickle.dump((i, tmpcounts[i]), f)
 40 |         else: # args.aggregate == "pairs"
 41 |             sorted_pairs = sorted(tmpcounts.keys())
 42 |             for input, output in sorted_pairs:
 43 |                 pickle.dump(((input, output), tmpcounts[(input, output)]), f)
 44 | 
 45 | 
 46 | def merge(out, buffer_min, args):
 47 |     if args.aggregate == "stripes":
 48 |         out[1].update(buffer_min[1])
 49 |     else: # args.aggregate == "pairs"
 50 |         out[1] += buffer_min[1]
 51 |     return out
 52 | 
 53 | 
 54 | def write_buffer(counts, out, counts_num, input_i2w, output_i2w, args):
 55 |     if args.aggregate == "stripes":
 56 |         sorted_output = sorted(out[1].keys())
 57 |         for w in sorted_output:
 58 |             counts_num += 1
 59 |             if counts_num % 1000**2 == 0:
 60 |                 print("\r{}M counts processed.".format(int(counts_num/1000**2)), end="")
 61 |             if args.output_id:
 62 |                 counts.write("{} {} {}\n".format(out[0], w, out[1][w]))
 63 |             else:
 64 |                 counts.write("{} {} {}\n".format(input_i2w[out[0]], output_i2w[w], out[1][w]))
 65 |     else: # args.aggregate == "pairs"
 66 |         counts_num += 1
 67 |         if counts_num % 1000**2 == 0:
 68 |             print("\r{}M counts processed.".format(int(counts_num/1000**2)), end="")
 69 |         if args.output_id:
 70 |             counts.write("{} {} {}\n".format(out[0][0], out[0][1], out[1]))
 71 |         else:
 72 |             counts.write("{} {} {}\n".format(input_i2w[out[0][0]], output_i2w[out[0][1]], out[1]))
 73 |     return counts_num
 74 | 
 75 | 
 76 | def main():
 77 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 78 |     parser.add_argument("--pairs_file", type=str, required=True,
 79 |                         help="Path to the pairs file.")
 80 |     parser.add_argument("--input_vocab_file", type=str, required=True,
 81 |                         help="Path to the input vocabulary file.")
 82 |     parser.add_argument("--output_vocab_file", type=str, required=True,
 83 |                         help="Path to the output vocabulary file.")
 84 |     parser.add_argument("--counts_file", type=str, required=True,
 85 |                         help="Path to the counts (matrix) file.")
 86 |     parser.add_argument('--output_id', action='store_true',
 87 |                         help="If set, output (id id count) instead of (word word count).")
 88 |     parser.add_argument("--memory_size", type=float, default=4.0,
 89 |                         help="Memory size in GB (1 GB = 1000**3 bytes).")
 90 |     parser.add_argument("--aggregate", type=str, default="stripes",
 91 |                         choices=["pairs", "stripes"],
 92 |                         help="""Different strategies of building counts (matrix).
 93 |                         Options are
 94 |                         [pairs|stripes].""")
 95 |     
 96 |     args = parser.parse_args()
 97 | 
 98 |     print("Pairs2counts")
 99 | 
100 |     input_vocab, output_vocab = {}, {}
101 |     input_vocab["i2w"], input_vocab["w2i"] = load_vocabulary(args.input_vocab_file)
102 |     output_vocab["i2w"], output_vocab["w2i"] = load_vocabulary(args.output_vocab_file)
103 |     memory_size = args.memory_size * 1000**3
104 |     tmpcounts = {} #store co-occurrence matrix in dictionary
105 |     tmpfiles_num = 0
106 |     memory_size_used = 0
107 |     pairs_num = 0
108 |     with codecs.open(args.pairs_file, "r", "utf-8") as f:
109 |         for line in f:
110 |             pairs_num += 1
111 |             if pairs_num % 1000**2 == 0:
112 |                 print("\r{}M pairs processed.".format(int(pairs_num/1000**2)), end="")
113 |             if getsizeof(tmpcounts) + memory_size_used > memory_size * 0.8:
114 |                 write_tmpfiles(tmpcounts, tmpfiles_num, args)
115 |                 tmpfiles_num += 1
116 |                 tmpcounts.clear()
117 |                 memory_size_used = 0
118 |             pair = line.strip().split()
119 |             input = input_vocab["w2i"][pair[0]]
120 |             output = output_vocab["w2i"][pair[1]]
121 |             memory_size_used = aggregate(tmpcounts, input, output, memory_size_used, args)
122 |     write_tmpfiles(tmpcounts, tmpfiles_num, args)
123 |     tmpcounts.clear()
124 |     tmpfiles_num += 1
125 | 
126 |     print()
127 |     tmpfiles = []
128 |     top_buffer = []
129 |     counts_num = 0
130 |     counts = codecs.open(args.counts_file, "w", "utf-8")
131 |     for i in range(tmpfiles_num):
132 |         tmpfiles.append(codecs.open("{}_{}".format(args.counts_file, i), "rb"))
133 |         top_buffer.append(pickle.load(tmpfiles[i]))
134 |     top_buffer_keys = [c[0] for c in top_buffer]
135 |     min_index = top_buffer_keys.index(min(top_buffer_keys))
136 |     out = list(top_buffer[min_index])
137 |     top_buffer[min_index] = pickle.load(tmpfiles[min_index])
138 | 
139 |     while True:
140 |         top_buffer_keys = [c[0] for c in top_buffer]
141 |         min_index = top_buffer_keys.index(min(top_buffer_keys))
142 |         if top_buffer[min_index][0] == out[0]:
143 |             out = merge(out, top_buffer[min_index], args)
144 |         else:
145 |             counts_num = write_buffer(counts, out, counts_num, input_vocab["i2w"], output_vocab["i2w"], args)
146 |             out = list(top_buffer[min_index])
147 |         try:
148 |             top_buffer[min_index] = pickle.load(tmpfiles[min_index])
149 |         except EOFError:
150 |             if args.aggregate == "stripes":
151 |                 top_buffer[min_index] = (sys.maxsize, Counter())
152 |             else: # args.aggregate == "pairs"
153 |                 top_buffer[min_index] = ((sys.maxsize, sys.maxsize), 0)
154 |             tmpfiles_num -= 1
155 |         if tmpfiles_num == 0:
156 |             counts_num = write_buffer(counts, out, counts_num, input_vocab["i2w"], output_vocab["i2w"], args)
157 |             break
158 |     counts.close()
159 |     print("Number of counts: {}".format(counts_num))
160 |     for i in range(len(top_buffer)):
161 |         os.remove("{}_{}".format(args.counts_file, i))
162 | 
163 |     print("Pairs2counts finished")
164 | 
165 | 
166 | if __name__ == '__main__':
167 |     main()
168 | 
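
The second half of `main()` above merges the sorted temporary files by repeatedly picking the smallest key across all of them and summing counts for equal keys. A minimal sketch of the same k-way merge idea follows, using the standard library's `heapq.merge` on in-memory sorted runs rather than pickled files. The function `merge_sorted_runs` and the sample runs are hypothetical names for illustration and are not part of the toolkit.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_runs(runs):
    """Merge sorted runs of ((input_id, output_id), count) and sum equal keys."""
    # heapq.merge lazily yields items from all runs in key order.
    merged = heapq.merge(*runs, key=itemgetter(0))
    # Equal keys are now adjacent, so groupby can sum their counts.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


# Two sorted runs, standing in for two temporary files in "pairs" mode.
run_a = [((0, 1), 2), ((0, 3), 1), ((2, 2), 4)]
run_b = [((0, 1), 1), ((1, 0), 5), ((2, 2), 1)]

counts = list(merge_sorted_runs([run_a, run_b]))
for (input_id, output_id), count in counts:
    print(input_id, output_id, count)
```

Each run must already be sorted by key, which is what `write_tmpfiles` guarantees for the temporary files.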


--------------------------------------------------------------------------------
/ngram2vec/pairs2sgns.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import argparse
 5 | import subprocess
 6 | import os
 7 | 
 8 | 
 9 | def main():
10 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
11 |     parser.add_argument("--pairs_file", type=str, required=True,
12 |                         help="Path to the pairs file.")
13 |     parser.add_argument("--input_vocab_file", type=str, required=True,
14 |                         help="Path to the input vocabulary file.")
15 |     parser.add_argument("--output_vocab_file", type=str, required=True,
16 |                         help="Path to the output vocabulary file.")
17 |     parser.add_argument("--input_vector_file", type=str, required=True,
18 |                         help="Path to the input vector file.")
19 |     parser.add_argument("--output_vector_file", type=str, required=True,
20 |                         help="Path to the output vector file.")
21 | 
22 |     parser.add_argument("--negative", type=int, default=5,
23 |                         help="Number of negative samples.")
24 |     parser.add_argument("--size", type=int, default=100,
25 |                         help="Vector size.")
26 |     parser.add_argument("--threads_num", type=int, default=4,
27 |                         help="Number of threads.")
28 |     parser.add_argument("--iter", type=int, default=3,
29 |                         help="Number of training iterations.")
30 | 
31 |     args = parser.parse_args()
32 | 
33 |     print("Pairs2sgns")
34 |     command = ["./word2vec/word2vec"]
35 |     command.extend(["--pairs_file", args.pairs_file])
36 |     command.extend(["--input_vocab_file", args.input_vocab_file])
37 |     command.extend(["--output_vocab_file", args.output_vocab_file])
38 |     command.extend(["--input_vector_file", args.input_vector_file])
39 |     command.extend(["--output_vector_file", args.output_vector_file])
40 | 
41 |     command.extend(["--negative", str(args.negative)])
42 |     command.extend(["--size", str(args.size)])
43 |     command.extend(["--threads_num", str(args.threads_num)])
44 |     command.extend(["--iter", str(args.iter)])
45 | 
46 |     return_code = subprocess.call(command)
47 |     print()
48 |     print("Pairs2sgns finished")
49 | 
50 | 
51 | if __name__ == '__main__':
52 |     main()
53 | 


--------------------------------------------------------------------------------
/ngram2vec/pairs2vocab.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import argparse
 5 | import codecs
 6 | from multiprocessing import Pool
 7 | from utils.vocabulary import save_count_vocabulary
 8 | from utils.misc import merge_vocabularies
 9 | import six
10 | 
11 | 
12 | def pairs2vocab_process(pairs_file, proc_id, proc_num, args):
13 |     """
14 |     Count input and output words over the lines of pairs_file with line_id % proc_num == proc_id.
15 |     """
16 |     input_vocab = {}
17 |     output_vocab = {}
18 |     pairs_num = 0
19 |     with codecs.open(pairs_file, "r", "utf-8") as f:
20 |         for line_id, line in enumerate(f):
21 |             if (line_id % proc_num) == proc_id:
22 |                 if proc_id == 0:
23 |                     if pairs_num % 1000 == 0:
24 |                         print("\r{}M pairs processed.".format(int(pairs_num*proc_num/1000**2)), end="")
25 |                 pair = line.strip().split()
26 |                 if pair[0] not in input_vocab:
27 |                     input_vocab[pair[0]] = 1
28 |                 else:
29 |                     input_vocab[pair[0]] += 1      
30 |                 if pair[1] not in output_vocab:
31 |                     output_vocab[pair[1]] = 1
32 |                 else:
33 |                     output_vocab[pair[1]] += 1
34 |                 pairs_num += 1
35 |     return (input_vocab, output_vocab)
36 | 
37 | 
38 | def main():
39 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
40 |     parser.add_argument("--pairs_file", type=str, required=True,
41 |                         help="Path to the pairs file.")
42 |     parser.add_argument("--input_vocab_file", type=str, required=True,
43 |                         help="Path to the input vocabulary file.")
44 |     parser.add_argument("--output_vocab_file", type=str, required=True,
45 |                         help="Path to the output vocabulary file.")
46 | 
47 |     parser.add_argument("--processes_num", type=int, default=4,
48 |                         help="Number of processes.")
49 |     
50 |     args = parser.parse_args()
51 |     
52 |     print("Pairs2vocab")
53 | 
54 |     pool = Pool(args.processes_num)
55 |     vocab_list = []
56 |     for i in range(args.processes_num):
57 |         vocab_list.append((pool.apply_async(func=pairs2vocab_process, args=[args.pairs_file, i, args.processes_num, args])))
58 |     
59 |     pool.close()
60 |     pool.join()
61 | 
62 |     input_vocab_list = [v.get()[0] for v in vocab_list]
63 |     input_vocab = merge_vocabularies(input_vocab_list)
64 | 
65 |     output_vocab_list = [v.get()[1] for v in vocab_list]
66 |     output_vocab = merge_vocabularies(output_vocab_list)
67 | 
68 |     input_vocab = sorted(six.iteritems(input_vocab), key=lambda item: item[1], reverse=True)
69 |     output_vocab = sorted(six.iteritems(output_vocab), key=lambda item: item[1], reverse=True)
70 |     save_count_vocabulary(args.input_vocab_file, input_vocab)
71 |     save_count_vocabulary(args.output_vocab_file, output_vocab)
72 |     print("Input vocab size: {}".format(len(input_vocab)))
73 |     print("Output vocab size: {}".format(len(output_vocab)))
74 |     print("Pairs2vocab finished")
75 | 
76 | 
77 | if __name__ == '__main__':
78 |     main()
79 | 
80 | 


--------------------------------------------------------------------------------
/ngram2vec/ppmi2svd.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals, division
 4 | import argparse
 5 | import codecs
 6 | from sparsesvd import sparsesvd
 7 | import numpy as np
 8 | from utils.matrix import load_sparse, save_dense
 9 | from utils.vocabulary import load_vocabulary
10 | from utils.misc import normalize
11 | 
12 | 
13 | def main():
14 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
15 |     parser.add_argument("--ppmi_file", type=str, required=True,
16 |                         help="Path to the PPMI (matrix) file.")
17 |     parser.add_argument("--svd_file", type=str, required=True,
18 |                         help="Path to the SVD file.")
19 |     parser.add_argument("--input_vocab_file", type=str, required=True,
20 |                         help="Path to the input vocabulary file.")
21 |     parser.add_argument("--output_vocab_file", type=str, required=True,
22 |                         help="Path to the output vocabulary file.")
23 | 
24 |     parser.add_argument("--size", type=int, default=100,
25 |                         help="Vector size.")
26 |     parser.add_argument("--normalize", action="store_true",
27 |                         help="If set, factorize the row-normalized PPMI matrix.")
28 | 
29 |     args = parser.parse_args()
30 | 
31 |     print("Ppmi2svd")
32 |     input_vocab, _ = load_vocabulary(args.input_vocab_file)
33 |     output_vocab, _ = load_vocabulary(args.output_vocab_file)
34 |     ppmi, _, _ = load_sparse(args.ppmi_file)
35 |     if args.normalize:
36 |         ppmi = normalize(ppmi, sparse=True)
37 |     ut, s, vt = sparsesvd(ppmi.tocsc(), args.size)    
38 | 
39 |     np.save(args.svd_file + ".ut.npy", ut)
40 |     np.save(args.svd_file + ".s.npy", s)
41 |     np.save(args.svd_file + ".vt.npy", vt)
42 | 
43 |     save_dense(args.svd_file + ".input", ut.T, input_vocab)
44 |     save_dense(args.svd_file + ".output", vt.T, output_vocab)
45 |     print("Ppmi2svd finished")
46 | 
47 | 
48 | if __name__ == '__main__':
49 |     main()
50 | 


--------------------------------------------------------------------------------
/ngram2vec/shuffle.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | from __future__ import print_function, unicode_literals, division
  4 | import argparse
  5 | import codecs
  6 | import os
  7 | import random
  8 | from sys import getsizeof
  9 | 
 10 | 
 11 | def main():
 12 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
 13 |     parser.add_argument("--input_file", type=str, required=True,
 14 |                         help="Path to the original file.")
 15 |     parser.add_argument("--output_file", type=str, required=True,
 16 |                         help="Path to the shuffled file.")
 17 |     parser.add_argument("--memory_size", type=float, default=4.0,
 18 |                         help="Memory size in GB. Shuffle and write out data when memory is used up.")
 19 |     
 20 |     args = parser.parse_args()
 21 | 
 22 |     memory_size = args.memory_size * 1000**3
 23 |     memory_size_used = 0
 24 |     lines = []
 25 |     lines_num_per_file = []
 26 |     tmpfiles_num = 0
 27 |     lines_num = 0
 28 |     # Shuffle step 1.
 29 |     with codecs.open(args.input_file, "r", "utf-8") as f:
 30 |         for line in f:
 31 |             lines_num += 1
 32 |             if lines_num % 1000 == 0:
 33 |                 print("\r{}M lines processed.".format(int(lines_num/1000**2)), end="")
 34 |             lines.append(line)
 35 |             memory_size_used += getsizeof(line)
 36 |             if getsizeof(lines) + memory_size_used > memory_size * 0.8:
 37 |                 random.shuffle(lines)
 38 |                 with codecs.open(args.output_file + str(tmpfiles_num), "w", "utf-8") as tmp:
 39 |                     for l in lines:
 40 |                         tmp.write(l)
 41 |                 if len(lines_num_per_file) == 0:
 42 |                     lines_num_per_file.append(lines_num)
 43 |                 else:
 44 |                     lines_num_per_file.append(lines_num-sum(lines_num_per_file))
 45 |                 lines = []
 46 |                 tmpfiles_num += 1
 47 |                 memory_size_used = 0
 48 | 
 49 |     random.shuffle(lines)
 50 |     with codecs.open(args.output_file + str(tmpfiles_num), "w", "utf-8") as f:
 51 |         for l in lines:
 52 |             f.write(l)
 53 |         if len(lines_num_per_file) == 0:
 54 |             lines_num_per_file.append(lines_num)
 55 |         else:
 56 |             lines_num_per_file.append(lines_num-sum(lines_num_per_file))
 57 |         lines = []
 58 |         tmpfiles_num += 1
 59 |     print()
 60 |     print("Number of tmpfiles: {}".format(tmpfiles_num))
 61 | 
 62 | 
 63 |     # Shuffle step 2.
 64 |     lines_num = 0
 65 |     output = codecs.open(args.output_file, "w", "utf-8")
 66 |     tmpfiles = []
 67 |     for i in range(tmpfiles_num):
 68 |         tmpfiles.append(codecs.open(args.output_file + str(i), "r", "utf-8"))
 69 |     
 70 |     limit = int(lines_num_per_file[0] / tmpfiles_num)
 71 |     for i in range(tmpfiles_num-1):
 72 |         lines = []
 73 |         for f in tmpfiles:
 74 |             for j in range(limit):
 75 |                 line = f.readline()
 76 |                 if len(line) > 0:
 77 |                     lines_num += 1
 78 |                     lines.append(line)
 79 |                     if lines_num % 1000 == 0:
 80 |                         print("\r{}M lines processed.".format(int(lines_num/1000**2)), end="")
 81 |         random.shuffle(lines)
 82 |         for line in lines:
 83 |             output.write(line)
 84 |     lines = []
 85 |     for f in tmpfiles:
 86 |         for line in f:
 87 |             lines_num += 1
 88 |             if lines_num % 1000 == 0:
 89 |                 print("\r{}M lines processed.".format(int(lines_num/1000**2)), end="")        
 90 |             lines.append(line)
 91 |     random.shuffle(lines)
 92 |     for line in lines:
 93 |         output.write(line)
 94 | 
 95 |     for i in range(tmpfiles_num):
 96 |         tmpfiles[i].close()
 97 |     for i in range(tmpfiles_num):
 98 |         os.remove(args.output_file + str(i))
 99 |     output.close()
100 |     print()
101 |     print("Number of lines: {}".format(lines_num))
102 | 
103 | 
104 | if __name__ == '__main__':
105 |     main()
106 | 
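
The script above shuffles in two passes: it first writes independently shuffled chunks, then repeatedly draws a slice from every chunk, shuffles that batch, and writes it out. A simplified in-memory sketch of this idea is shown below. It is not the script's exact schedule: `chunked_shuffle`, `chunk_size`, and `rng` are hypothetical names, and lists stand in for the temporary files.

```python
import random


def chunked_shuffle(lines, chunk_size, rng):
    """Shuffle `lines` in two passes, mixing one slice per chunk at a time."""
    # Pass 1: cut the input into chunks and shuffle each chunk on its own.
    chunks = []
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        rng.shuffle(chunk)
        chunks.append(chunk)

    # Pass 2: take an equal slice from every chunk, shuffle the batch, emit it.
    limit = max(1, chunk_size // max(1, len(chunks)))
    output = []
    offset = 0
    while any(offset < len(chunk) for chunk in chunks):
        batch = []
        for chunk in chunks:
            batch.extend(chunk[offset:offset + limit])
        rng.shuffle(batch)
        output.extend(batch)
        offset += limit
    return output


lines = ["line {}\n".format(i) for i in range(10)]
shuffled = chunked_shuffle(lines, chunk_size=4, rng=random.Random(0))
print(len(shuffled))
```

The result is a permutation of the input, although, as with the script, it is not guaranteed to be a uniformly random one.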


--------------------------------------------------------------------------------
/ngram2vec/similarity_eval.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import argparse
 5 | import codecs
 6 | import numpy as np
 7 | from scipy.stats import spearmanr
 8 | import sys
 9 | from utils.misc import normalize
10 | from utils.matrix import load_dense, load_sparse
11 | from eval.testset import load_similarity
12 | from eval.similarity import similarity
13 | from eval.recast import align_matrix
14 | 
15 | 
16 | def main():
17 |     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
18 |     parser.add_argument("--input_vector_file", type=str, required=True,
19 |                         help="Path to the input vector file.")
20 |     parser.add_argument("--output_vector_file", type=str,
21 |                         help="Path to the output vector file.")
22 |     parser.add_argument("--test_file", type=str, required=True,
23 |                         help="Path to the similarity task.")
24 |     parser.add_argument('--sparse', action='store_true',
25 |                         help="Load sparse representation.")
26 |     parser.add_argument('--normalize', action='store_true',
27 |                         help="If set, vector is normalized.")
28 |     parser.add_argument("--ensemble", type=str, default="input",
29 |                         choices=["input", "output", "add", "concat"],
30 |                         help="""Strategies for using input/output vectors.
31 |                         One can use only input, only output, the addition of input and output,
32 |                         or their concatenation. Options are
33 |                         [input|output|add|concat].""")
34 | 
35 |     args = parser.parse_args()
36 |     
37 |     testset = load_similarity(args.test_file)
38 |     if args.sparse:
39 |         matrix, vocab, _ = load_sparse(args.input_vector_file)
40 |     else:
41 |         matrix, vocab, _ = load_dense(args.input_vector_file)
42 | 
43 |     if not args.sparse:
44 |         if args.ensemble == "add":
45 |             output_matrix, output_vocab, _ = load_dense(args.output_vector_file)
46 |             output_matrix = align_matrix(matrix, output_matrix, vocab, output_vocab)
47 |             matrix = matrix + output_matrix
48 |         elif args.ensemble == "concat":
49 |             output_matrix, output_vocab, _ = load_dense(args.output_vector_file)
50 |             output_matrix = align_matrix(matrix, output_matrix, vocab, output_vocab)
51 |             matrix = np.concatenate([matrix, output_matrix], axis=1)
52 |         elif args.ensemble == "output":
53 |             matrix, vocab, _ = load_dense(args.output_vector_file)
54 |         else: # args.ensemble == "input":
55 |             pass
56 | 
57 |     if args.normalize:
58 |         matrix = normalize(matrix, args.sparse)
59 | 
60 |     results = []
61 |     for (w1, w2), sim_expected in testset:
62 |         sim_actual = similarity(matrix, vocab["w2i"], w1, w2, args.sparse)
63 |         if sim_actual is not None:
64 |             results.append((sim_actual, sim_expected))
65 |     actual, expected = zip(*results)
66 |     print("seen/total: {}/{}".format(len(results), len(testset)))
67 |     print("{}: {:.3f}".format(args.test_file, spearmanr(actual, expected)[0]))
68 | 
69 | 
70 | if __name__ == '__main__':
71 |     main()
72 | 


--------------------------------------------------------------------------------
/ngram2vec/utils/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-


--------------------------------------------------------------------------------
/ngram2vec/utils/constants.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | from __future__ import unicode_literals
4 | 
5 | 
6 | JOIN_OF_NGRAM = "@$"
7 | 


--------------------------------------------------------------------------------
/ngram2vec/utils/matrix.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import unicode_literals
 4 | import numpy as np
 5 | from scipy.sparse import csr_matrix
 6 | import codecs
 7 | 
 8 | 
 9 | def save_dense(path, matrix, vocab):
10 |     with codecs.open(path, "w", "utf-8") as f:
11 |         line = str(matrix.shape[0]) + " " + str(matrix.shape[1]) + "\n"
12 |         f.write(line)
13 |         for i in range(len(vocab)):
14 |             line = " ".join([str(v) for v in matrix[i, :]])
15 |             line = vocab[i] + " " + line + "\n"
16 |             f.write(line)
17 | 
18 | 
19 | def load_dense(path):
20 |     vocab_size, size = 0, 0
21 |     vocab = {}
22 |     vocab["i2w"], vocab["w2i"] = [], {}
23 |     with codecs.open(path, "r", "utf-8") as f:
24 |         first_line = True
25 |         for line in f:
26 |             if first_line:
27 |                 first_line = False
28 |                 vocab_size = int(line.strip().split()[0])
29 |                 size = int(line.strip().split()[1])
30 |                 matrix = np.zeros(shape=(vocab_size, size), dtype=np.float32)
31 |                 continue
32 |             vec = line.strip().split()
33 |             vocab["i2w"].append(vec[0])
34 |             matrix[len(vocab["i2w"])-1, :] = np.array([float(x) for x in vec[1:]])
35 |     for i, w in enumerate(vocab["i2w"]):
36 |         vocab["w2i"][w] = i
37 |     return matrix, vocab, size
38 | 
39 | 
40 | def save_sparse(path, matrix, vocab):
41 |     with codecs.open(path, "w", "utf-8") as f:
42 |         line = str(matrix.get_shape()[0]) + " " + str(matrix.get_shape()[1]) + "\n"
43 |         f.write(line)
44 |         for i in range(len(vocab)):
45 |             ind = matrix.indices[matrix.indptr[i]: matrix.indptr[i+1]]
46 |             dat = matrix.data[matrix.indptr[i]: matrix.indptr[i+1]]
47 |             line = [str(a1)+":"+str(a2) for a1, a2 in zip(ind, dat)]
48 |             line = " ".join(line)
49 |             line = vocab[i] + " " + line + "\n"
50 |             f.write(line)
51 | 
52 | 
53 | def load_sparse(path):
54 |     vocab_size, size = 0, 0
55 |     vocab = {}
56 |     vocab["i2w"], vocab["w2i"] = [], {}
57 |     row, col, data = [], [], []
58 |     with codecs.open(path, "r", "utf-8") as f:
59 |         first_line = True
60 |         lines_num = 0
61 |         for line in f:
62 |             if first_line:
63 |                 first_line = False
64 |                 vocab_size = int(line.strip().split()[0])
65 |                 size = int(line.strip().split()[1])
66 |                 continue
67 |             line = line.strip().split()
68 |             vocab["i2w"].append(line[0])
69 |             vector = line[1:]
70 |             for v in vector:
71 |                 row.append(lines_num)
72 |                 col.append(int(v.split(":")[0]))
73 |                 data.append(float(v.split(":")[1]))
74 |             lines_num += 1
75 |         for i, w in enumerate(vocab["i2w"]):
76 |             vocab["w2i"][w] = i
77 |         row, col, data = np.array(row), np.array(col), np.array(data)
78 |         matrix = csr_matrix((data, (row, col)), shape=(vocab_size, size))
79 |     return matrix, vocab, size
80 | 
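
`save_sparse` above writes a header line with the matrix shape, followed by one line per row of the form `word col:value col:value ...`. The sketch below parses that text layout into plain dictionaries, so the format can be inspected without SciPy. `parse_sparse_text` and the sample string are hypothetical and only illustrate the layout.

```python
def parse_sparse_text(text):
    """Parse the text layout written by save_sparse into (shape, rows)."""
    lines = text.strip().split("\n")
    # Header: "<number of rows> <number of columns>".
    n_rows, n_cols = (int(x) for x in lines[0].split())
    rows = {}
    for line in lines[1:]:
        fields = line.split()
        word, entries = fields[0], fields[1:]
        # Each entry is "column_index:value".
        rows[word] = {int(e.split(":")[0]): float(e.split(":")[1]) for e in entries}
    return (n_rows, n_cols), rows


sample = "2 3\ncat 0:1.5 2:0.25\ndog 1:2.0\n"
shape, rows = parse_sparse_text(sample)
print(shape, rows)
```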


--------------------------------------------------------------------------------
/ngram2vec/utils/misc.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import unicode_literals
 4 | import numpy as np
 5 | import codecs
 6 | import six
 7 | from scipy.sparse import csr_matrix
 8 | from utils.constants import JOIN_OF_NGRAM
 9 | 
10 | 
11 | def get_ngram(line, pos, order):
12 |     if pos < 0:
13 |         return None
14 |     if pos + order > len(line):
15 |         return None
16 |     ngram = line[pos]
17 |     for i in range(1, order):
18 |         ngram = ngram + JOIN_OF_NGRAM + line[pos + i]
19 |     return ngram
20 | 
21 | 
22 | def is_word(feature):
23 |     return feature.isalpha()
24 | 
25 | 
26 | def check_feature(feat, vocab, subsampler, random):
27 |     if feat is None:
28 |         return None
29 |     if subsampler is not None:
30 |         feat = feat if feat not in subsampler or random.random() > subsampler[feat] else None
31 |         if feat is None:
32 |             return None
33 |     feat = feat if feat in vocab else None
34 |     return feat
35 | 
36 | 
37 | def normalize(matrix, sparse = False):
38 |     if sparse:
39 |         norm = matrix.copy()
40 |         norm.data **= 2
41 |         norm = np.reciprocal(np.sqrt(np.array(norm.sum(axis=1))[:,0]))
42 |         normalizer = csr_matrix((norm, (range(len(norm)), range(len(norm)))), dtype=np.float32)
43 |         matrix = normalizer.dot(matrix)
44 |     else:
45 |         norm = np.sqrt(np.sum(matrix * matrix, axis=1))
46 |         matrix = matrix / norm[:, np.newaxis]
47 |     return matrix
48 | 
49 | 
50 | def merge_vocabularies(vocab_list):
51 |     vocab = {}
52 |     for vocab_p in vocab_list:
53 |         for w in vocab_p:
54 |             if w not in vocab:
55 |                 vocab[w] = vocab_p[w]
56 |             else:
57 |                 vocab[w] += vocab_p[w]
58 |     return vocab
59 | 
60 | 
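
As a quick illustration of `get_ngram` above, the sketch below re-implements it in a self-contained form and applies it to a short token list. The separator is copied from `utils/constants.py`; the token list is a made-up example.

```python
JOIN_OF_NGRAM = "@$"  # same separator as utils/constants.py


def get_ngram(line, pos, order):
    """Return the ngram of length `order` starting at `pos`, or None if out of range."""
    if pos < 0 or pos + order > len(line):
        return None
    return JOIN_OF_NGRAM.join(line[pos:pos + order])


tokens = ["new", "york", "city"]
bigrams = [get_ngram(tokens, pos, 2) for pos in range(len(tokens))]
print(bigrams)
```

The last position yields `None` because a bigram starting there would run past the end of the line.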


--------------------------------------------------------------------------------
/ngram2vec/utils/vocabulary.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import codecs
 5 | import six
 6 | 
 7 | 
 8 | def save_vocabulary(path, vocab):
 9 |     with codecs.open(path, "w", "utf-8") as f:
10 |         for w in vocab:
11 |             f.write("{}\n".format(w))
12 | 
13 | 
14 | def load_vocabulary(path):
15 |     with codecs.open(path, "r", "utf-8") as f:
16 |         i2w = [line.strip().split()[0] for line in f if line.strip()]
17 |     return i2w, dict([(w, i) for i, w in enumerate(i2w)])
18 | 
19 | 
20 | def save_count_vocabulary(path, vocab):
21 |     with codecs.open(path, "w", "utf-8") as f:
22 |         if isinstance(vocab, dict):
23 |             for w, c in six.iteritems(vocab):
24 |                 f.write("{} {}\n".format(w, c))
25 |         else: # isinstance(vocab, list) == True
26 |             for w, c in vocab:
27 |                 f.write("{} {}\n".format(w, c))
28 | 
29 | 
30 | def load_count_vocabulary(path, thr=1):
31 |     with codecs.open(path, "r", "utf-8") as f:
32 |         vocab = [line.strip().split() for line in f if line.strip()]
33 |         vocab = dict([(w, int(c)) for w, c in vocab if int(c) >= thr])
34 |     return vocab
35 | 


--------------------------------------------------------------------------------
/ngram_example.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh
 2 | 
 3 | memory_size=4
 4 | cpus_num=4
 5 | corpus=wikipedia
 6 | output_path=outputs/${corpus}/ngram_ngram
 7 | 
 8 | mkdir -p ${output_path}/sgns
 9 | mkdir -p ${output_path}/ppmi
10 | mkdir -p ${output_path}/svd
11 | mkdir -p ${output_path}/glove
12 | 
13 | python ngram2vec/corpus2vocab.py --corpus_file ${corpus} --vocab_file ${output_path}/vocab --memory_size ${memory_size} --feature ngram --order 2
14 | python ngram2vec/corpus2pairs.py --corpus_file ${corpus} --pairs_file ${output_path}/pairs --vocab_file ${output_path}/vocab --processes_num ${cpus_num} --cooccur ngram_ngram --input_order 1 --output_order 2
15 | 
16 | # Concatenate pair files. 
17 | if [ -f "${output_path}/pairs" ]; then
18 | 	rm ${output_path}/pairs
19 | fi
20 | for i in $(seq 0 $((${cpus_num}-1)))
21 | do
22 | 	cat ${output_path}/pairs_${i} >> ${output_path}/pairs
23 | 	rm ${output_path}/pairs_${i}
24 | done
25 | 
26 | # Generate input vocabulary and output vocabulary, which are used as vocabulary files for all models
27 | python ngram2vec/pairs2vocab.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output
28 | 
29 | # SGNS, learn representations from pairs.
30 | # pairs2sgns.py is a Python interface to the C code.
31 | python ngram2vec/pairs2sgns.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --input_vector_file ${output_path}/sgns/sgns.input --output_vector_file ${output_path}/sgns/sgns.output --threads_num ${cpus_num} --size 300
32 | 
33 | # SGNS evaluation.
34 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/sgns/sgns.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
35 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/sgns/sgns.input --test_file testsets/analogy/semantic.txt --normalize
36 | 
37 | # Generate co-occurrence matrix from pairs.
38 | python ngram2vec/pairs2counts.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --counts_file ${output_path}/counts --output_id --memory_size ${memory_size}
39 | 
40 | # PPMI: learn representations from counts (the co-occurrence matrix).
41 | python ngram2vec/counts2ppmi.py --counts_file ${output_path}/counts --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --ppmi_file ${output_path}/ppmi/ppmi
42 | 
43 | # PPMI evaluation.
44 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/ppmi/ppmi --test_file testsets/similarity/ws353_similarity.txt --normalize --sparse
45 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/ppmi/ppmi --test_file testsets/analogy/semantic.txt --normalize --sparse
46 | 
47 | # SVD: factorize the PPMI matrix.
48 | python ngram2vec/ppmi2svd.py --ppmi_file ${output_path}/ppmi/ppmi --svd_file ${output_path}/svd/svd --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output 
49 | 
50 | # SVD evaluation.
51 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/svd/svd.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
52 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/svd/svd.input --test_file testsets/analogy/semantic.txt --normalize
53 | 
54 | # Shuffle counts.
55 | python ngram2vec/shuffle.py --input_file ${output_path}/counts --output_file ${output_path}/counts.shuf --memory_size ${memory_size}
56 | 
57 | # GloVe: learn representations from counts (the co-occurrence matrix).
58 | python ngram2vec/counts2glove.py --counts_file ${output_path}/counts.shuf --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --input_vector_file ${output_path}/glove/glove.input --output_vector_file ${output_path}/glove/glove.output --threads_num ${cpus_num}
59 | 
60 | # GloVe evaluation.
61 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/glove/glove.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
62 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/glove/glove.input --test_file testsets/analogy/semantic.txt --normalize
63 | 
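
The PPMI step in the script above turns raw co-occurrence counts into positive pointwise mutual information scores. As a rough illustration only (this is not the code in `ngram2vec/counts2ppmi.py`, which may apply further options such as smoothing), the textbook definition PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))) can be sketched in plain Python. The function name `ppmi` and the toy counts are hypothetical.

```python
import math
from collections import defaultdict


def ppmi(counts):
    """Compute PPMI from a dict mapping (word, context) -> co-occurrence count.

    Only strictly positive PMI values are kept, so the result is sparse.
    """
    total = float(sum(counts.values()))
    word_sum = defaultdict(float)  # marginal count of each word
    ctx_sum = defaultdict(float)   # marginal count of each context
    for (word, ctx), n in counts.items():
        word_sum[word] += n
        ctx_sum[ctx] += n

    result = {}
    for (word, ctx), n in counts.items():
        # P(w, c) / (P(w) * P(c)) simplifies to n * total / (#w * #c).
        pmi = math.log(n * total / (word_sum[word] * ctx_sum[ctx]))
        if pmi > 0:
            result[(word, ctx)] = pmi
    return result


# Hypothetical toy counts, for illustration only.
toy_counts = {
    ("cat", "purr"): 3,
    ("cat", "the"): 1,
    ("dog", "the"): 3,
    ("dog", "bark"): 1,
}
toy_ppmi = ppmi(toy_counts)
for pair in sorted(toy_ppmi):
    print(pair, round(toy_ppmi[pair], 4))
```

Pairs whose PMI is zero or negative, such as `("cat", "the")` in the toy data, are dropped, which is why PPMI matrices are typically stored in sparse form.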


--------------------------------------------------------------------------------
/scripts/clean_corpus.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 | 
3 | # Clean an English corpus, following the strategy used in the hyperwords toolkit.
4 | # On Mac, please use gnu-sed by running `brew install gnu-sed --with-default-names`.
5 | # Note that UTF-8 and ASCII encodings are identical for plain English text.
6 | iconv -c -f utf-8 -t ascii "$1" | tr '[A-Z]' '[a-z]' | sed "s/[^a-z0-9]*[ \t\n\r][^a-z0-9]*/ /g" | sed "s/[^a-z0-9]*$/ /g" | sed "s/^[^a-z0-9]*//g" | sed "s/  */ /g"
7 | 
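
For readers who prefer to follow the cleaning steps outside the shell, the pipeline above can be approximated line by line in Python. This is a minimal sketch, not part of the toolkit: the function name `clean_line` is hypothetical, and the regular expressions are copied from the `sed` commands on the assumption that Python's `re` treats them equivalently for single-line input.

```python
import re


def clean_line(line):
    """Approximate scripts/clean_corpus.sh for one line of text."""
    # iconv -c -f utf-8 -t ascii: drop characters that are not ASCII.
    text = line.rstrip("\n").encode("ascii", "ignore").decode("ascii")
    # tr '[A-Z]' '[a-z]': lower-case everything.
    text = text.lower()
    # Collapse whitespace, plus any punctuation touching it, into one space.
    text = re.sub(r"[^a-z0-9]*[ \t\n\r][^a-z0-9]*", " ", text)
    # Replace trailing non-alphanumeric characters with a space.
    text = re.sub(r"[^a-z0-9]*$", " ", text)
    # Remove leading non-alphanumeric characters.
    text = re.sub(r"^[^a-z0-9]*", "", text, count=1)
    # Squeeze runs of spaces.
    text = re.sub(r" +", " ", text)
    return text


print(repr(clean_line("Hello, World!  (Test)")))
```

As in the shell version, punctuation inside a token (for example the apostrophe in "don't") is kept, because only punctuation adjacent to whitespace or to the line boundaries is stripped.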


--------------------------------------------------------------------------------
/scripts/compile_c.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | from __future__ import print_function, unicode_literals
 4 | import subprocess
 5 | import os
 6 | 
 7 | 
 8 | def compile(source, target):
 9 |     CC = "gcc"
10 |     CFLAGS = "-lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result"
11 |     command = [CC, source, "-o", target]
12 |     command.extend(CFLAGS.split())
13 |     print("Compilation command: " + " ".join(command))
14 |     return_code = subprocess.call(command)
15 |     if return_code != 0:
16 |         raise SystemExit(return_code)
17 | 
18 | def main():
19 |     # Word2vec toolkit.
20 |     source = os.path.join("word2vec", "word2vec.c")
21 |     target = os.path.join("word2vec", "word2vec")
22 |     compile(source, target)
23 |     # GloVe toolkit.
24 |     source = os.path.join("glove", "glove.c")
25 |     target = os.path.join("glove", "glove")
26 |     compile(source, target)
27 | 
28 | 
29 | if __name__ == '__main__':
30 |     main()
31 | 


--------------------------------------------------------------------------------
/testsets/analogy_zh/ca-translated/capital-common-countries.txt:
--------------------------------------------------------------------------------
  1 | 雅典 希腊 巴格达 伊拉克
  2 | 雅典 希腊 曼谷 泰国
  3 | 雅典 希腊 北京 中国
  4 | 雅典 希腊 柏林 德国
  5 | 雅典 希腊 伯尔尼 瑞士
  6 | 雅典 希腊 开罗 埃及
  7 | 雅典 希腊 堪培拉 澳大利亚
  8 | 雅典 希腊 河内 越南
  9 | 雅典 希腊 哈瓦那 古巴
 10 | 雅典 希腊 赫尔辛基 芬兰
 11 | 雅典 希腊 伊斯兰堡 巴基斯坦
 12 | 雅典 希腊 喀布尔 阿富汗
 13 | 雅典 希腊 伦敦 英国
 14 | 雅典 希腊 马德里 西班牙
 15 | 雅典 希腊 莫斯科 俄罗斯
 16 | 雅典 希腊 奥斯陆 挪威
 17 | 雅典 希腊 渥太华 加拿大
 18 | 雅典 希腊 巴黎 法国
 19 | 雅典 希腊 罗马 意大利
 20 | 雅典 希腊 斯德哥尔摩 瑞典
 21 | 雅典 希腊 德黑兰 伊朗
 22 | 雅典 希腊 东京 日本
 23 | 巴格达 伊拉克 曼谷 泰国
 24 | 巴格达 伊拉克 北京 中国
 25 | 巴格达 伊拉克 柏林 德国
 26 | 巴格达 伊拉克 伯尔尼 瑞士
 27 | 巴格达 伊拉克 开罗 埃及
 28 | 巴格达 伊拉克 堪培拉 澳大利亚
 29 | 巴格达 伊拉克 河内 越南
 30 | 巴格达 伊拉克 哈瓦那 古巴
 31 | 巴格达 伊拉克 赫尔辛基 芬兰
 32 | 巴格达 伊拉克 伊斯兰堡 巴基斯坦
 33 | 巴格达 伊拉克 喀布尔 阿富汗
 34 | 巴格达 伊拉克 伦敦 英国
 35 | 巴格达 伊拉克 马德里 西班牙
 36 | 巴格达 伊拉克 莫斯科 俄罗斯
 37 | 巴格达 伊拉克 奥斯陆 挪威
 38 | 巴格达 伊拉克 渥太华 加拿大
 39 | 巴格达 伊拉克 巴黎 法国
 40 | 巴格达 伊拉克 罗马 意大利
 41 | 巴格达 伊拉克 斯德哥尔摩 瑞典
 42 | 巴格达 伊拉克 德黑兰 伊朗
 43 | 巴格达 伊拉克 东京 日本
 44 | 巴格达 伊拉克 雅典 希腊
 45 | 曼谷 泰国 北京 中国
 46 | 曼谷 泰国 柏林 德国
 47 | 曼谷 泰国 伯尔尼 瑞士
 48 | 曼谷 泰国 开罗 埃及
 49 | 曼谷 泰国 堪培拉 澳大利亚
 50 | 曼谷 泰国 河内 越南
 51 | 曼谷 泰国 哈瓦那 古巴
 52 | 曼谷 泰国 赫尔辛基 芬兰
 53 | 曼谷 泰国 伊斯兰堡 巴基斯坦
 54 | 曼谷 泰国 喀布尔 阿富汗
 55 | 曼谷 泰国 伦敦 英国
 56 | 曼谷 泰国 马德里 西班牙
 57 | 曼谷 泰国 莫斯科 俄罗斯
 58 | 曼谷 泰国 奥斯陆 挪威
 59 | 曼谷 泰国 渥太华 加拿大
 60 | 曼谷 泰国 巴黎 法国
 61 | 曼谷 泰国 罗马 意大利
 62 | 曼谷 泰国 斯德哥尔摩 瑞典
 63 | 曼谷 泰国 德黑兰 伊朗
 64 | 曼谷 泰国 东京 日本
 65 | 曼谷 泰国 雅典 希腊
 66 | 曼谷 泰国 巴格达 伊拉克
 67 | 北京 中国 柏林 德国
 68 | 北京 中国 伯尔尼 瑞士
 69 | 北京 中国 开罗 埃及
 70 | 北京 中国 堪培拉 澳大利亚
 71 | 北京 中国 河内 越南
 72 | 北京 中国 哈瓦那 古巴
 73 | 北京 中国 赫尔辛基 芬兰
 74 | 北京 中国 伊斯兰堡 巴基斯坦
 75 | 北京 中国 喀布尔 阿富汗
 76 | 北京 中国 伦敦 英国
 77 | 北京 中国 马德里 西班牙
 78 | 北京 中国 莫斯科 俄罗斯
 79 | 北京 中国 奥斯陆 挪威
 80 | 北京 中国 渥太华 加拿大
 81 | 北京 中国 巴黎 法国
 82 | 北京 中国 罗马 意大利
 83 | 北京 中国 斯德哥尔摩 瑞典
 84 | 北京 中国 德黑兰 伊朗
 85 | 北京 中国 东京 日本
 86 | 北京 中国 雅典 希腊
 87 | 北京 中国 巴格达 伊拉克
 88 | 北京 中国 曼谷 泰国
 89 | 柏林 德国 伯尔尼 瑞士
 90 | 柏林 德国 开罗 埃及
 91 | 柏林 德国 堪培拉 澳大利亚
 92 | 柏林 德国 河内 越南
 93 | 柏林 德国 哈瓦那 古巴
 94 | 柏林 德国 赫尔辛基 芬兰
 95 | 柏林 德国 伊斯兰堡 巴基斯坦
 96 | 柏林 德国 喀布尔 阿富汗
 97 | 柏林 德国 伦敦 英国
 98 | 柏林 德国 马德里 西班牙
 99 | 柏林 德国 莫斯科 俄罗斯
100 | 柏林 德国 奥斯陆 挪威
101 | 柏林 德国 渥太华 加拿大
102 | 柏林 德国 巴黎 法国
103 | 柏林 德国 罗马 意大利
104 | 柏林 德国 斯德哥尔摩 瑞典
105 | 柏林 德国 德黑兰 伊朗
106 | 柏林 德国 东京 日本
107 | 柏林 德国 雅典 希腊
108 | 柏林 德国 巴格达 伊拉克
109 | 柏林 德国 曼谷 泰国
110 | 柏林 德国 北京 中国
111 | 伯尔尼 瑞士 开罗 埃及
112 | 伯尔尼 瑞士 堪培拉 澳大利亚
113 | 伯尔尼 瑞士 河内 越南
114 | 伯尔尼 瑞士 哈瓦那 古巴
115 | 伯尔尼 瑞士 赫尔辛基 芬兰
116 | 伯尔尼 瑞士 伊斯兰堡 巴基斯坦
117 | 伯尔尼 瑞士 喀布尔 阿富汗
118 | 伯尔尼 瑞士 伦敦 英国
119 | 伯尔尼 瑞士 马德里 西班牙
120 | 伯尔尼 瑞士 莫斯科 俄罗斯
121 | 伯尔尼 瑞士 奥斯陆 挪威
122 | 伯尔尼 瑞士 渥太华 加拿大
123 | 伯尔尼 瑞士 巴黎 法国
124 | 伯尔尼 瑞士 罗马 意大利
125 | 伯尔尼 瑞士 斯德哥尔摩 瑞典
126 | 伯尔尼 瑞士 德黑兰 伊朗
127 | 伯尔尼 瑞士 东京 日本
128 | 伯尔尼 瑞士 雅典 希腊
129 | 伯尔尼 瑞士 巴格达 伊拉克
130 | 伯尔尼 瑞士 曼谷 泰国
131 | 伯尔尼 瑞士 北京 中国
132 | 伯尔尼 瑞士 柏林 德国
133 | 开罗 埃及 堪培拉 澳大利亚
134 | 开罗 埃及 河内 越南
135 | 开罗 埃及 哈瓦那 古巴
136 | 开罗 埃及 赫尔辛基 芬兰
137 | 开罗 埃及 伊斯兰堡 巴基斯坦
138 | 开罗 埃及 喀布尔 阿富汗
139 | 开罗 埃及 伦敦 英国
140 | 开罗 埃及 马德里 西班牙
141 | 开罗 埃及 莫斯科 俄罗斯
142 | 开罗 埃及 奥斯陆 挪威
143 | 开罗 埃及 渥太华 加拿大
144 | 开罗 埃及 巴黎 法国
145 | 开罗 埃及 罗马 意大利
146 | 开罗 埃及 斯德哥尔摩 瑞典
147 | 开罗 埃及 德黑兰 伊朗
148 | 开罗 埃及 东京 日本
149 | 开罗 埃及 雅典 希腊
150 | 开罗 埃及 巴格达 伊拉克
151 | 开罗 埃及 曼谷 泰国
152 | 开罗 埃及 北京 中国
153 | 开罗 埃及 柏林 德国
154 | 开罗 埃及 伯尔尼 瑞士
155 | 堪培拉 澳大利亚 河内 越南
156 | 堪培拉 澳大利亚 哈瓦那 古巴
157 | 堪培拉 澳大利亚 赫尔辛基 芬兰
158 | 堪培拉 澳大利亚 伊斯兰堡 巴基斯坦
159 | 堪培拉 澳大利亚 喀布尔 阿富汗
160 | 堪培拉 澳大利亚 伦敦 英国
161 | 堪培拉 澳大利亚 马德里 西班牙
162 | 堪培拉 澳大利亚 莫斯科 俄罗斯
163 | 堪培拉 澳大利亚 奥斯陆 挪威
164 | 堪培拉 澳大利亚 渥太华 加拿大
165 | 堪培拉 澳大利亚 巴黎 法国
166 | 堪培拉 澳大利亚 罗马 意大利
167 | 堪培拉 澳大利亚 斯德哥尔摩 瑞典
168 | 堪培拉 澳大利亚 德黑兰 伊朗
169 | 堪培拉 澳大利亚 东京 日本
170 | 堪培拉 澳大利亚 雅典 希腊
171 | 堪培拉 澳大利亚 巴格达 伊拉克
172 | 堪培拉 澳大利亚 曼谷 泰国
173 | 堪培拉 澳大利亚 北京 中国
174 | 堪培拉 澳大利亚 柏林 德国
175 | 堪培拉 澳大利亚 伯尔尼 瑞士
176 | 堪培拉 澳大利亚 开罗 埃及
177 | 河内 越南 哈瓦那 古巴
178 | 河内 越南 赫尔辛基 芬兰
179 | 河内 越南 伊斯兰堡 巴基斯坦
180 | 河内 越南 喀布尔 阿富汗
181 | 河内 越南 伦敦 英国
182 | 河内 越南 马德里 西班牙
183 | 河内 越南 莫斯科 俄罗斯
184 | 河内 越南 奥斯陆 挪威
185 | 河内 越南 渥太华 加拿大
186 | 河内 越南 巴黎 法国
187 | 河内 越南 罗马 意大利
188 | 河内 越南 斯德哥尔摩 瑞典
189 | 河内 越南 德黑兰 伊朗
190 | 河内 越南 东京 日本
191 | 河内 越南 雅典 希腊
192 | 河内 越南 巴格达 伊拉克
193 | 河内 越南 曼谷 泰国
194 | 河内 越南 北京 中国
195 | 河内 越南 柏林 德国
196 | 河内 越南 伯尔尼 瑞士
197 | 河内 越南 开罗 埃及
198 | 河内 越南 堪培拉 澳大利亚
199 | 哈瓦那 古巴 赫尔辛基 芬兰
200 | 哈瓦那 古巴 伊斯兰堡 巴基斯坦
201 | 哈瓦那 古巴 喀布尔 阿富汗
202 | 哈瓦那 古巴 伦敦 英国
203 | 哈瓦那 古巴 马德里 西班牙
204 | 哈瓦那 古巴 莫斯科 俄罗斯
205 | 哈瓦那 古巴 奥斯陆 挪威
206 | 哈瓦那 古巴 渥太华 加拿大
207 | 哈瓦那 古巴 巴黎 法国
208 | 哈瓦那 古巴 罗马 意大利
209 | 哈瓦那 古巴 斯德哥尔摩 瑞典
210 | 哈瓦那 古巴 德黑兰 伊朗
211 | 哈瓦那 古巴 东京 日本
212 | 哈瓦那 古巴 雅典 希腊
213 | 哈瓦那 古巴 巴格达 伊拉克
214 | 哈瓦那 古巴 曼谷 泰国
215 | 哈瓦那 古巴 北京 中国
216 | 哈瓦那 古巴 柏林 德国
217 | 哈瓦那 古巴 伯尔尼 瑞士
218 | 哈瓦那 古巴 开罗 埃及
219 | 哈瓦那 古巴 堪培拉 澳大利亚
220 | 哈瓦那 古巴 河内 越南
221 | 赫尔辛基 芬兰 伊斯兰堡 巴基斯坦
222 | 赫尔辛基 芬兰 喀布尔 阿富汗
223 | 赫尔辛基 芬兰 伦敦 英国
224 | 赫尔辛基 芬兰 马德里 西班牙
225 | 赫尔辛基 芬兰 莫斯科 俄罗斯
226 | 赫尔辛基 芬兰 奥斯陆 挪威
227 | 赫尔辛基 芬兰 渥太华 加拿大
228 | 赫尔辛基 芬兰 巴黎 法国
229 | 赫尔辛基 芬兰 罗马 意大利
230 | 赫尔辛基 芬兰 斯德哥尔摩 瑞典
231 | 赫尔辛基 芬兰 德黑兰 伊朗
232 | 赫尔辛基 芬兰 东京 日本
233 | 赫尔辛基 芬兰 雅典 希腊
234 | 赫尔辛基 芬兰 巴格达 伊拉克
235 | 赫尔辛基 芬兰 曼谷 泰国
236 | 赫尔辛基 芬兰 北京 中国
237 | 赫尔辛基 芬兰 柏林 德国
238 | 赫尔辛基 芬兰 伯尔尼 瑞士
239 | 赫尔辛基 芬兰 开罗 埃及
240 | 赫尔辛基 芬兰 堪培拉 澳大利亚
241 | 赫尔辛基 芬兰 河内 越南
242 | 赫尔辛基 芬兰 哈瓦那 古巴
243 | 伊斯兰堡 巴基斯坦 喀布尔 阿富汗
244 | 伊斯兰堡 巴基斯坦 伦敦 英国
245 | 伊斯兰堡 巴基斯坦 马德里 西班牙
246 | 伊斯兰堡 巴基斯坦 莫斯科 俄罗斯
247 | 伊斯兰堡 巴基斯坦 奥斯陆 挪威
248 | 伊斯兰堡 巴基斯坦 渥太华 加拿大
249 | 伊斯兰堡 巴基斯坦 巴黎 法国
250 | 伊斯兰堡 巴基斯坦 罗马 意大利
251 | 伊斯兰堡 巴基斯坦 斯德哥尔摩 瑞典
252 | 伊斯兰堡 巴基斯坦 德黑兰 伊朗
253 | 伊斯兰堡 巴基斯坦 东京 日本
254 | 伊斯兰堡 巴基斯坦 雅典 希腊
255 | 伊斯兰堡 巴基斯坦 巴格达 伊拉克
256 | 伊斯兰堡 巴基斯坦 曼谷 泰国
257 | 伊斯兰堡 巴基斯坦 北京 中国
258 | 伊斯兰堡 巴基斯坦 柏林 德国
259 | 伊斯兰堡 巴基斯坦 伯尔尼 瑞士
260 | 伊斯兰堡 巴基斯坦 开罗 埃及
261 | 伊斯兰堡 巴基斯坦 堪培拉 澳大利亚
262 | 伊斯兰堡 巴基斯坦 河内 越南
263 | 伊斯兰堡 巴基斯坦 哈瓦那 古巴
264 | 伊斯兰堡 巴基斯坦 赫尔辛基 芬兰
265 | 喀布尔 阿富汗 伦敦 英国
266 | 喀布尔 阿富汗 马德里 西班牙
267 | 喀布尔 阿富汗 莫斯科 俄罗斯
268 | 喀布尔 阿富汗 奥斯陆 挪威
269 | 喀布尔 阿富汗 渥太华 加拿大
270 | 喀布尔 阿富汗 巴黎 法国
271 | 喀布尔 阿富汗 罗马 意大利
272 | 喀布尔 阿富汗 斯德哥尔摩 瑞典
273 | 喀布尔 阿富汗 德黑兰 伊朗
274 | 喀布尔 阿富汗 东京 日本
275 | 喀布尔 阿富汗 雅典 希腊
276 | 喀布尔 阿富汗 巴格达 伊拉克
277 | 喀布尔 阿富汗 曼谷 泰国
278 | 喀布尔 阿富汗 北京 中国
279 | 喀布尔 阿富汗 柏林 德国
280 | 喀布尔 阿富汗 伯尔尼 瑞士
281 | 喀布尔 阿富汗 开罗 埃及
282 | 喀布尔 阿富汗 堪培拉 澳大利亚
283 | 喀布尔 阿富汗 河内 越南
284 | 喀布尔 阿富汗 哈瓦那 古巴
285 | 喀布尔 阿富汗 赫尔辛基 芬兰
286 | 喀布尔 阿富汗 伊斯兰堡 巴基斯坦
287 | 伦敦 英国 马德里 西班牙
288 | 伦敦 英国 莫斯科 俄罗斯
289 | 伦敦 英国 奥斯陆 挪威
290 | 伦敦 英国 渥太华 加拿大
291 | 伦敦 英国 巴黎 法国
292 | 伦敦 英国 罗马 意大利
293 | 伦敦 英国 斯德哥尔摩 瑞典
294 | 伦敦 英国 德黑兰 伊朗
295 | 伦敦 英国 东京 日本
296 | 伦敦 英国 雅典 希腊
297 | 伦敦 英国 巴格达 伊拉克
298 | 伦敦 英国 曼谷 泰国
299 | 伦敦 英国 北京 中国
300 | 伦敦 英国 柏林 德国
301 | 伦敦 英国 伯尔尼 瑞士
302 | 伦敦 英国 开罗 埃及
303 | 伦敦 英国 堪培拉 澳大利亚
304 | 伦敦 英国 河内 越南
305 | 伦敦 英国 哈瓦那 古巴
306 | 伦敦 英国 赫尔辛基 芬兰
307 | 伦敦 英国 伊斯兰堡 巴基斯坦
308 | 伦敦 英国 喀布尔 阿富汗
309 | 马德里 西班牙 莫斯科 俄罗斯
310 | 马德里 西班牙 奥斯陆 挪威
311 | 马德里 西班牙 渥太华 加拿大
312 | 马德里 西班牙 巴黎 法国
313 | 马德里 西班牙 罗马 意大利
314 | 马德里 西班牙 斯德哥尔摩 瑞典
315 | 马德里 西班牙 德黑兰 伊朗
316 | 马德里 西班牙 东京 日本
317 | 马德里 西班牙 雅典 希腊
318 | 马德里 西班牙 巴格达 伊拉克
319 | 马德里 西班牙 曼谷 泰国
320 | 马德里 西班牙 北京 中国
321 | 马德里 西班牙 柏林 德国
322 | 马德里 西班牙 伯尔尼 瑞士
323 | 马德里 西班牙 开罗 埃及
324 | 马德里 西班牙 堪培拉 澳大利亚
325 | 马德里 西班牙 河内 越南
326 | 马德里 西班牙 哈瓦那 古巴
327 | 马德里 西班牙 赫尔辛基 芬兰
328 | 马德里 西班牙 伊斯兰堡 巴基斯坦
329 | 马德里 西班牙 喀布尔 阿富汗
330 | 马德里 西班牙 伦敦 英国
331 | 莫斯科 俄罗斯 奥斯陆 挪威
332 | 莫斯科 俄罗斯 渥太华 加拿大
333 | 莫斯科 俄罗斯 巴黎 法国
334 | 莫斯科 俄罗斯 罗马 意大利
335 | 莫斯科 俄罗斯 斯德哥尔摩 瑞典
336 | 莫斯科 俄罗斯 德黑兰 伊朗
337 | 莫斯科 俄罗斯 东京 日本
338 | 莫斯科 俄罗斯 雅典 希腊
339 | 莫斯科 俄罗斯 巴格达 伊拉克
340 | 莫斯科 俄罗斯 曼谷 泰国
341 | 莫斯科 俄罗斯 北京 中国
342 | 莫斯科 俄罗斯 柏林 德国
343 | 莫斯科 俄罗斯 伯尔尼 瑞士
344 | 莫斯科 俄罗斯 开罗 埃及
345 | 莫斯科 俄罗斯 堪培拉 澳大利亚
346 | 莫斯科 俄罗斯 河内 越南
347 | 莫斯科 俄罗斯 哈瓦那 古巴
348 | 莫斯科 俄罗斯 赫尔辛基 芬兰
349 | 莫斯科 俄罗斯 伊斯兰堡 巴基斯坦
350 | 莫斯科 俄罗斯 喀布尔 阿富汗
351 | 莫斯科 俄罗斯 伦敦 英国
352 | 莫斯科 俄罗斯 马德里 西班牙
353 | 奥斯陆 挪威 渥太华 加拿大
354 | 奥斯陆 挪威 巴黎 法国
355 | 奥斯陆 挪威 罗马 意大利
356 | 奥斯陆 挪威 斯德哥尔摩 瑞典
357 | 奥斯陆 挪威 德黑兰 伊朗
358 | 奥斯陆 挪威 东京 日本
359 | 奥斯陆 挪威 雅典 希腊
360 | 奥斯陆 挪威 巴格达 伊拉克
361 | 奥斯陆 挪威 曼谷 泰国
362 | 奥斯陆 挪威 北京 中国
363 | 奥斯陆 挪威 柏林 德国
364 | 奥斯陆 挪威 伯尔尼 瑞士
365 | 奥斯陆 挪威 开罗 埃及
366 | 奥斯陆 挪威 堪培拉 澳大利亚
367 | 奥斯陆 挪威 河内 越南
368 | 奥斯陆 挪威 哈瓦那 古巴
369 | 奥斯陆 挪威 赫尔辛基 芬兰
370 | 奥斯陆 挪威 伊斯兰堡 巴基斯坦
371 | 奥斯陆 挪威 喀布尔 阿富汗
372 | 奥斯陆 挪威 伦敦 英国
373 | 奥斯陆 挪威 马德里 西班牙
374 | 奥斯陆 挪威 莫斯科 俄罗斯
375 | 渥太华 加拿大 巴黎 法国
376 | 渥太华 加拿大 罗马 意大利
377 | 渥太华 加拿大 斯德哥尔摩 瑞典
378 | 渥太华 加拿大 德黑兰 伊朗
379 | 渥太华 加拿大 东京 日本
380 | 渥太华 加拿大 雅典 希腊
381 | 渥太华 加拿大 巴格达 伊拉克
382 | 渥太华 加拿大 曼谷 泰国
383 | 渥太华 加拿大 北京 中国
384 | 渥太华 加拿大 柏林 德国
385 | 渥太华 加拿大 伯尔尼 瑞士
386 | 渥太华 加拿大 开罗 埃及
387 | 渥太华 加拿大 堪培拉 澳大利亚
388 | 渥太华 加拿大 河内 越南
389 | 渥太华 加拿大 哈瓦那 古巴
390 | 渥太华 加拿大 赫尔辛基 芬兰
391 | 渥太华 加拿大 伊斯兰堡 巴基斯坦
392 | 渥太华 加拿大 喀布尔 阿富汗
393 | 渥太华 加拿大 伦敦 英国
394 | 渥太华 加拿大 马德里 西班牙
395 | 渥太华 加拿大 莫斯科 俄罗斯
396 | 渥太华 加拿大 奥斯陆 挪威
397 | 巴黎 法国 罗马 意大利
398 | 巴黎 法国 斯德哥尔摩 瑞典
399 | 巴黎 法国 德黑兰 伊朗
400 | 巴黎 法国 东京 日本
401 | 巴黎 法国 雅典 希腊
402 | 巴黎 法国 巴格达 伊拉克
403 | 巴黎 法国 曼谷 泰国
404 | 巴黎 法国 北京 中国
405 | 巴黎 法国 柏林 德国
406 | 巴黎 法国 伯尔尼 瑞士
407 | 巴黎 法国 开罗 埃及
408 | 巴黎 法国 堪培拉 澳大利亚
409 | 巴黎 法国 河内 越南
410 | 巴黎 法国 哈瓦那 古巴
411 | 巴黎 法国 赫尔辛基 芬兰
412 | 巴黎 法国 伊斯兰堡 巴基斯坦
413 | 巴黎 法国 喀布尔 阿富汗
414 | 巴黎 法国 伦敦 英国
415 | 巴黎 法国 马德里 西班牙
416 | 巴黎 法国 莫斯科 俄罗斯
417 | 巴黎 法国 奥斯陆 挪威
418 | 巴黎 法国 渥太华 加拿大
419 | 罗马 意大利 斯德哥尔摩 瑞典
420 | 罗马 意大利 德黑兰 伊朗
421 | 罗马 意大利 东京 日本
422 | 罗马 意大利 雅典 希腊
423 | 罗马 意大利 巴格达 伊拉克
424 | 罗马 意大利 曼谷 泰国
425 | 罗马 意大利 北京 中国
426 | 罗马 意大利 柏林 德国
427 | 罗马 意大利 伯尔尼 瑞士
428 | 罗马 意大利 开罗 埃及
429 | 罗马 意大利 堪培拉 澳大利亚
430 | 罗马 意大利 河内 越南
431 | 罗马 意大利 哈瓦那 古巴
432 | 罗马 意大利 赫尔辛基 芬兰
433 | 罗马 意大利 伊斯兰堡 巴基斯坦
434 | 罗马 意大利 喀布尔 阿富汗
435 | 罗马 意大利 伦敦 英国
436 | 罗马 意大利 马德里 西班牙
437 | 罗马 意大利 莫斯科 俄罗斯
438 | 罗马 意大利 奥斯陆 挪威
439 | 罗马 意大利 渥太华 加拿大
440 | 罗马 意大利 巴黎 法国
441 | 斯德哥尔摩 瑞典 德黑兰 伊朗
442 | 斯德哥尔摩 瑞典 东京 日本
443 | 斯德哥尔摩 瑞典 雅典 希腊
444 | 斯德哥尔摩 瑞典 巴格达 伊拉克
445 | 斯德哥尔摩 瑞典 曼谷 泰国
446 | 斯德哥尔摩 瑞典 北京 中国
447 | 斯德哥尔摩 瑞典 柏林 德国
448 | 斯德哥尔摩 瑞典 伯尔尼 瑞士
449 | 斯德哥尔摩 瑞典 开罗 埃及
450 | 斯德哥尔摩 瑞典 堪培拉 澳大利亚
451 | 斯德哥尔摩 瑞典 河内 越南
452 | 斯德哥尔摩 瑞典 哈瓦那 古巴
453 | 斯德哥尔摩 瑞典 赫尔辛基 芬兰
454 | 斯德哥尔摩 瑞典 伊斯兰堡 巴基斯坦
455 | 斯德哥尔摩 瑞典 喀布尔 阿富汗
456 | 斯德哥尔摩 瑞典 伦敦 英国
457 | 斯德哥尔摩 瑞典 马德里 西班牙
458 | 斯德哥尔摩 瑞典 莫斯科 俄罗斯
459 | 斯德哥尔摩 瑞典 奥斯陆 挪威
460 | 斯德哥尔摩 瑞典 渥太华 加拿大
461 | 斯德哥尔摩 瑞典 巴黎 法国
462 | 斯德哥尔摩 瑞典 罗马 意大利
463 | 德黑兰 伊朗 东京 日本
464 | 德黑兰 伊朗 雅典 希腊
465 | 德黑兰 伊朗 巴格达 伊拉克
466 | 德黑兰 伊朗 曼谷 泰国
467 | 德黑兰 伊朗 北京 中国
468 | 德黑兰 伊朗 柏林 德国
469 | 德黑兰 伊朗 伯尔尼 瑞士
470 | 德黑兰 伊朗 开罗 埃及
471 | 德黑兰 伊朗 堪培拉 澳大利亚
472 | 德黑兰 伊朗 河内 越南
473 | 德黑兰 伊朗 哈瓦那 古巴
474 | 德黑兰 伊朗 赫尔辛基 芬兰
475 | 德黑兰 伊朗 伊斯兰堡 巴基斯坦
476 | 德黑兰 伊朗 喀布尔 阿富汗
477 | 德黑兰 伊朗 伦敦 英国
478 | 德黑兰 伊朗 马德里 西班牙
479 | 德黑兰 伊朗 莫斯科 俄罗斯
480 | 德黑兰 伊朗 奥斯陆 挪威
481 | 德黑兰 伊朗 渥太华 加拿大
482 | 德黑兰 伊朗 巴黎 法国
483 | 德黑兰 伊朗 罗马 意大利
484 | 德黑兰 伊朗 斯德哥尔摩 瑞典
485 | 东京 日本 雅典 希腊
486 | 东京 日本 巴格达 伊拉克
487 | 东京 日本 曼谷 泰国
488 | 东京 日本 北京 中国
489 | 东京 日本 柏林 德国
490 | 东京 日本 伯尔尼 瑞士
491 | 东京 日本 开罗 埃及
492 | 东京 日本 堪培拉 澳大利亚
493 | 东京 日本 河内 越南
494 | 东京 日本 哈瓦那 古巴
495 | 东京 日本 赫尔辛基 芬兰
496 | 东京 日本 伊斯兰堡 巴基斯坦
497 | 东京 日本 喀布尔 阿富汗
498 | 东京 日本 伦敦 英国
499 | 东京 日本 马德里 西班牙
500 | 东京 日本 莫斯科 俄罗斯
501 | 东京 日本 奥斯陆 挪威
502 | 东京 日本 渥太华 加拿大
503 | 东京 日本 巴黎 法国
504 | 东京 日本 罗马 意大利
505 | 东京 日本 斯德哥尔摩 瑞典
506 | 东京 日本 德黑兰 伊朗
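
Each line of the analogy files above holds four whitespace-separated words, read here as "a is to b as c is to d". The sketch below shows one common way such a file can be scored, the 3CosAdd rule (pick the word closest to b - a + c, excluding a, b and c). It is an assumption-laden illustration, not the logic of `ngram2vec/analogy_eval.py`: the helper names and the three-dimensional toy vectors are hypothetical.

```python
import math


def parse_analogies(lines):
    """Keep only well-formed lines and return them as (a, b, c, d) tuples."""
    quads = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4:
            quads.append(tuple(parts))
    return quads


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def solve_3cosadd(vectors, a, b, c):
    """Return the word most similar to b - a + c, excluding a, b and c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    best_word, best_score = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = sum(t * u for t, u in zip(target, normalize(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def accuracy(vectors, quads):
    """Fraction of analogies answered correctly, skipping unknown words."""
    seen = correct = 0
    for a, b, c, d in quads:
        if not all(w in vectors for w in (a, b, c, d)):
            continue
        seen += 1
        correct += solve_3cosadd(vectors, a, b, c) == d
    return correct / seen if seen else 0.0


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "雅典": [0.0, 1.0, 0.0],
    "希腊": [1.0, 1.0, 0.0],
    "北京": [0.0, 0.0, 1.0],
    "中国": [1.0, 0.0, 1.0],
    "日本": [1.0, 0.0, 0.0],
}
toy_quads = parse_analogies(["雅典 希腊 北京 中国"])
print(solve_3cosadd(toy_vectors, "雅典", "希腊", "北京"))
print(accuracy(toy_vectors, toy_quads))
```

Excluding the three query words from the candidates is the usual convention for this benchmark style; without it the nearest neighbour is often one of the inputs.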


--------------------------------------------------------------------------------
/testsets/analogy_zh/ca-translated/city-in-state.txt:
--------------------------------------------------------------------------------
  1 | 石家庄 河北 南昌 江西
  2 | 石家庄 河北 海口 海南
  3 | 石家庄 河北 兰州 甘肃
  4 | 石家庄 河北 西宁 青海
  5 | 太原 山西 南昌 江西
  6 | 太原 山西 广州 广东
  7 | 太原 山西 西宁 青海
  8 | 沈阳 辽宁 哈尔滨 黑龙江
  9 | 沈阳 辽宁 杭州 浙江
 10 | 沈阳 辽宁 南昌 江西
 11 | 沈阳 辽宁 贵阳 贵州
 12 | 沈阳 辽宁 兰州 甘肃
 13 | 沈阳 辽宁 南宁 广西
 14 | 沈阳 辽宁 银川 宁夏
 15 | 长春 吉林 石家庄 河北
 16 | 长春 吉林 哈尔滨 黑龙江
 17 | 长春 吉林 南京 江苏
 18 | 长春 吉林 杭州 浙江
 19 | 长春 吉林 合肥 安徽
 20 | 长春 吉林 南昌 江西
 21 | 长春 吉林 广州 广东
 22 | 长春 吉林 贵阳 贵州
 23 | 长春 吉林 西安 陕西
 24 | 长春 吉林 呼和浩特 内蒙古
 25 | 哈尔滨 黑龙江 南京 江苏
 26 | 哈尔滨 黑龙江 南昌 江西
 27 | 哈尔滨 黑龙江 贵阳 贵州
 28 | 哈尔滨 黑龙江 昆明 云南
 29 | 哈尔滨 黑龙江 南宁 广西
 30 | 南京 江苏 杭州 浙江
 31 | 南京 江苏 合肥 安徽
 32 | 南京 江苏 福州 福建
 33 | 南京 江苏 郑州 河南
 34 | 南京 江苏 广州 广东
 35 | 南京 江苏 成都 四川
 36 | 南京 江苏 贵阳 贵州
 37 | 南京 江苏 西安 陕西
 38 | 南京 江苏 呼和浩特 内蒙古
 39 | 杭州 浙江 广州 广东
 40 | 杭州 浙江 海口 海南
 41 | 杭州 浙江 西宁 青海
 42 | 杭州 浙江 南宁 广西
 43 | 合肥 安徽 太原 山西
 44 | 合肥 安徽 沈阳 辽宁
 45 | 合肥 安徽 长春 吉林
 46 | 合肥 安徽 杭州 浙江
 47 | 合肥 安徽 成都 四川
 48 | 合肥 安徽 兰州 甘肃
 49 | 福州 福建 石家庄 河北
 50 | 福州 福建 南昌 江西
 51 | 福州 福建 郑州 河南
 52 | 福州 福建 贵阳 贵州
 53 | 福州 福建 昆明 云南
 54 | 福州 福建 乌鲁木齐 新疆
 55 | 南昌 江西 长春 吉林
 56 | 南昌 江西 福州 福建
 57 | 南昌 江西 海口 海南
 58 | 南昌 江西 银川 宁夏
 59 | 济南 山东 太原 山西
 60 | 济南 山东 杭州 浙江
 61 | 济南 山东 合肥 安徽
 62 | 济南 山东 长沙 湖南
 63 | 济南 山东 海口 海南
 64 | 济南 山东 贵阳 贵州
 65 | 济南 山东 西安 陕西
 66 | 郑州 河南 长春 吉林
 67 | 郑州 河南 福州 福建
 68 | 郑州 河南 武汉 湖北
 69 | 郑州 河南 长沙 湖南
 70 | 郑州 河南 成都 四川
 71 | 郑州 河南 昆明 云南
 72 | 郑州 河南 兰州 甘肃
 73 | 郑州 河南 银川 宁夏
 74 | 武汉 湖北 沈阳 辽宁
 75 | 武汉 湖北 杭州 浙江
 76 | 武汉 湖北 西安 陕西
 77 | 武汉 湖北 兰州 甘肃
 78 | 武汉 湖北 西宁 青海
 79 | 武汉 湖北 拉萨 西藏
 80 | 武汉 湖北 银川 宁夏
 81 | 长沙 湖南 合肥 安徽
 82 | 长沙 湖南 济南 山东
 83 | 长沙 湖南 广州 广东
 84 | 长沙 湖南 拉萨 西藏
 85 | 广州 广东 石家庄 河北
 86 | 广州 广东 沈阳 辽宁
 87 | 广州 广东 南京 江苏
 88 | 广州 广东 杭州 浙江
 89 | 广州 广东 福州 福建
 90 | 广州 广东 南昌 江西
 91 | 广州 广东 济南 山东
 92 | 广州 广东 拉萨 西藏
 93 | 广州 广东 呼和浩特 内蒙古
 94 | 海口 海南 南京 江苏
 95 | 海口 海南 济南 山东
 96 | 海口 海南 武汉 湖北
 97 | 海口 海南 长沙 湖南
 98 | 海口 海南 西安 陕西
 99 | 成都 四川 太原 山西
100 | 成都 四川 哈尔滨 黑龙江
101 | 成都 四川 南京 江苏
102 | 成都 四川 杭州 浙江
103 | 成都 四川 长沙 湖南
104 | 成都 四川 兰州 甘肃
105 | 成都 四川 南宁 广西
106 | 成都 四川 呼和浩特 内蒙古
107 | 成都 四川 银川 宁夏
108 | 贵阳 贵州 石家庄 河北
109 | 贵阳 贵州 太原 山西
110 | 贵阳 贵州 哈尔滨 黑龙江
111 | 贵阳 贵州 南昌 江西
112 | 贵阳 贵州 济南 山东
113 | 贵阳 贵州 广州 广东
114 | 贵阳 贵州 西安 陕西
115 | 贵阳 贵州 拉萨 西藏
116 | 昆明 云南 长春 吉林
117 | 昆明 云南 杭州 浙江
118 | 昆明 云南 合肥 安徽
119 | 昆明 云南 济南 山东
120 | 昆明 云南 武汉 湖北
121 | 昆明 云南 广州 广东
122 | 昆明 云南 兰州 甘肃
123 | 昆明 云南 西宁 青海
124 | 昆明 云南 呼和浩特 内蒙古
125 | 昆明 云南 乌鲁木齐 新疆
126 | 西安 陕西 石家庄 河北
127 | 西安 陕西 哈尔滨 黑龙江
128 | 西安 陕西 南京 江苏
129 | 西安 陕西 武汉 湖北
130 | 西安 陕西 海口 海南
131 | 西安 陕西 贵阳 贵州
132 | 西安 陕西 呼和浩特 内蒙古
133 | 兰州 甘肃 武汉 湖北
134 | 兰州 甘肃 海口 海南
135 | 兰州 甘肃 西宁 青海
136 | 兰州 甘肃 拉萨 西藏
137 | 兰州 甘肃 南宁 广西
138 | 兰州 甘肃 呼和浩特 内蒙古
139 | 兰州 甘肃 银川 宁夏
140 | 西宁 青海 哈尔滨 黑龙江
141 | 西宁 青海 南京 江苏
142 | 西宁 青海 杭州 浙江
143 | 西宁 青海 济南 山东
144 | 西宁 青海 成都 四川
145 | 西宁 青海 贵阳 贵州
146 | 西宁 青海 南宁 广西
147 | 西宁 青海 银川 宁夏
148 | 拉萨 西藏 石家庄 河北
149 | 拉萨 西藏 哈尔滨 黑龙江
150 | 拉萨 西藏 福州 福建
151 | 拉萨 西藏 郑州 河南
152 | 拉萨 西藏 长沙 湖南
153 | 拉萨 西藏 贵阳 贵州
154 | 拉萨 西藏 西宁 青海
155 | 南宁 广西 杭州 浙江
156 | 南宁 广西 福州 福建
157 | 南宁 广西 南昌 江西
158 | 南宁 广西 成都 四川
159 | 南宁 广西 昆明 云南
160 | 呼和浩特 内蒙古 太原 山西
161 | 呼和浩特 内蒙古 昆明 云南
162 | 呼和浩特 内蒙古 西安 陕西
163 | 呼和浩特 内蒙古 兰州 甘肃
164 | 呼和浩特 内蒙古 拉萨 西藏
165 | 银川 宁夏 福州 福建
166 | 银川 宁夏 拉萨 西藏
167 | 乌鲁木齐 新疆 石家庄 河北
168 | 乌鲁木齐 新疆 沈阳 辽宁
169 | 乌鲁木齐 新疆 哈尔滨 黑龙江
170 | 乌鲁木齐 新疆 合肥 安徽
171 | 乌鲁木齐 新疆 广州 广东
172 | 乌鲁木齐 新疆 成都 四川
173 | 乌鲁木齐 新疆 西安 陕西
174 | 乌鲁木齐 新疆 兰州 甘肃
175 | 乌鲁木齐 新疆 南宁 广西


--------------------------------------------------------------------------------
/testsets/analogy_zh/ca-translated/family.txt:
--------------------------------------------------------------------------------
  1 | 男孩 女孩 兄弟 姐妹
  2 | 男孩 女孩 爸爸 妈妈
  3 | 男孩 女孩 父亲 母亲
  4 | 男孩 女孩 祖父 祖母
  5 | 男孩 女孩 爷爷 奶奶
  6 | 男孩 女孩 孙子 孙女
  7 | 男孩 女孩 新郎 新娘
  8 | 男孩 女孩 丈夫 妻子
  9 | 男孩 女孩 国王 王后
 10 | 男孩 女孩 男人 女人
 11 | 男孩 女孩 侄子 侄女
 12 | 男孩 女孩 王子 公主
 13 | 男孩 女孩 儿子 女儿
 14 | 男孩 女孩 继父 继母
 15 | 男孩 女孩 继子 继女
 16 | 男孩 女孩 叔叔 阿姨
 17 | 兄弟 姐妹 爸爸 妈妈
 18 | 兄弟 姐妹 父亲 母亲
 19 | 兄弟 姐妹 祖父 祖母
 20 | 兄弟 姐妹 爷爷 奶奶
 21 | 兄弟 姐妹 孙子 孙女
 22 | 兄弟 姐妹 新郎 新娘
 23 | 兄弟 姐妹 丈夫 妻子
 24 | 兄弟 姐妹 国王 王后
 25 | 兄弟 姐妹 男人 女人
 26 | 兄弟 姐妹 侄子 侄女
 27 | 兄弟 姐妹 王子 公主
 28 | 兄弟 姐妹 儿子 女儿
 29 | 兄弟 姐妹 继父 继母
 30 | 兄弟 姐妹 继子 继女
 31 | 兄弟 姐妹 叔叔 阿姨
 32 | 兄弟 姐妹 男孩 女孩
 33 | 爸爸 妈妈 父亲 母亲
 34 | 爸爸 妈妈 祖父 祖母
 35 | 爸爸 妈妈 爷爷 奶奶
 36 | 爸爸 妈妈 孙子 孙女
 37 | 爸爸 妈妈 新郎 新娘
 38 | 爸爸 妈妈 丈夫 妻子
 39 | 爸爸 妈妈 国王 王后
 40 | 爸爸 妈妈 男人 女人
 41 | 爸爸 妈妈 侄子 侄女
 42 | 爸爸 妈妈 王子 公主
 43 | 爸爸 妈妈 儿子 女儿
 44 | 爸爸 妈妈 继父 继母
 45 | 爸爸 妈妈 继子 继女
 46 | 爸爸 妈妈 叔叔 阿姨
 47 | 爸爸 妈妈 男孩 女孩
 48 | 爸爸 妈妈 兄弟 姐妹
 49 | 父亲 母亲 祖父 祖母
 50 | 父亲 母亲 爷爷 奶奶
 51 | 父亲 母亲 孙子 孙女
 52 | 父亲 母亲 新郎 新娘
 53 | 父亲 母亲 丈夫 妻子
 54 | 父亲 母亲 国王 王后
 55 | 父亲 母亲 男人 女人
 56 | 父亲 母亲 侄子 侄女
 57 | 父亲 母亲 王子 公主
 58 | 父亲 母亲 儿子 女儿
 59 | 父亲 母亲 继父 继母
 60 | 父亲 母亲 继子 继女
 61 | 父亲 母亲 叔叔 阿姨
 62 | 父亲 母亲 男孩 女孩
 63 | 父亲 母亲 兄弟 姐妹
 64 | 父亲 母亲 爸爸 妈妈
 65 | 祖父 祖母 爷爷 奶奶
 66 | 祖父 祖母 孙子 孙女
 67 | 祖父 祖母 新郎 新娘
 68 | 祖父 祖母 丈夫 妻子
 69 | 祖父 祖母 国王 王后
 70 | 祖父 祖母 男人 女人
 71 | 祖父 祖母 侄子 侄女
 72 | 祖父 祖母 王子 公主
 73 | 祖父 祖母 儿子 女儿
 74 | 祖父 祖母 继父 继母
 75 | 祖父 祖母 继子 继女
 76 | 祖父 祖母 叔叔 阿姨
 77 | 祖父 祖母 男孩 女孩
 78 | 祖父 祖母 兄弟 姐妹
 79 | 祖父 祖母 爸爸 妈妈
 80 | 祖父 祖母 父亲 母亲
 81 | 爷爷 奶奶 孙子 孙女
 82 | 爷爷 奶奶 新郎 新娘
 83 | 爷爷 奶奶 丈夫 妻子
 84 | 爷爷 奶奶 国王 王后
 85 | 爷爷 奶奶 男人 女人
 86 | 爷爷 奶奶 侄子 侄女
 87 | 爷爷 奶奶 王子 公主
 88 | 爷爷 奶奶 儿子 女儿
 89 | 爷爷 奶奶 继父 继母
 90 | 爷爷 奶奶 继子 继女
 91 | 爷爷 奶奶 叔叔 阿姨
 92 | 爷爷 奶奶 男孩 女孩
 93 | 爷爷 奶奶 兄弟 姐妹
 94 | 爷爷 奶奶 爸爸 妈妈
 95 | 爷爷 奶奶 父亲 母亲
 96 | 爷爷 奶奶 祖父 祖母
 97 | 孙子 孙女 新郎 新娘
 98 | 孙子 孙女 丈夫 妻子
 99 | 孙子 孙女 国王 王后
100 | 孙子 孙女 男人 女人
101 | 孙子 孙女 侄子 侄女
102 | 孙子 孙女 王子 公主
103 | 孙子 孙女 儿子 女儿
104 | 孙子 孙女 继父 继母
105 | 孙子 孙女 继子 继女
106 | 孙子 孙女 叔叔 阿姨
107 | 孙子 孙女 男孩 女孩
108 | 孙子 孙女 兄弟 姐妹
109 | 孙子 孙女 爸爸 妈妈
110 | 孙子 孙女 父亲 母亲
111 | 孙子 孙女 祖父 祖母
112 | 孙子 孙女 爷爷 奶奶
113 | 新郎 新娘 丈夫 妻子
114 | 新郎 新娘 国王 王后
115 | 新郎 新娘 男人 女人
116 | 新郎 新娘 侄子 侄女
117 | 新郎 新娘 王子 公主
118 | 新郎 新娘 儿子 女儿
119 | 新郎 新娘 继父 继母
120 | 新郎 新娘 继子 继女
121 | 新郎 新娘 叔叔 阿姨
122 | 新郎 新娘 男孩 女孩
123 | 新郎 新娘 兄弟 姐妹
124 | 新郎 新娘 爸爸 妈妈
125 | 新郎 新娘 父亲 母亲
126 | 新郎 新娘 祖父 祖母
127 | 新郎 新娘 爷爷 奶奶
128 | 新郎 新娘 孙子 孙女
129 | 丈夫 妻子 国王 王后
130 | 丈夫 妻子 男人 女人
131 | 丈夫 妻子 侄子 侄女
132 | 丈夫 妻子 王子 公主
133 | 丈夫 妻子 儿子 女儿
134 | 丈夫 妻子 继父 继母
135 | 丈夫 妻子 继子 继女
136 | 丈夫 妻子 叔叔 阿姨
137 | 丈夫 妻子 男孩 女孩
138 | 丈夫 妻子 兄弟 姐妹
139 | 丈夫 妻子 爸爸 妈妈
140 | 丈夫 妻子 父亲 母亲
141 | 丈夫 妻子 祖父 祖母
142 | 丈夫 妻子 爷爷 奶奶
143 | 丈夫 妻子 孙子 孙女
144 | 丈夫 妻子 新郎 新娘
145 | 国王 王后 男人 女人
146 | 国王 王后 侄子 侄女
147 | 国王 王后 王子 公主
148 | 国王 王后 儿子 女儿
149 | 国王 王后 继父 继母
150 | 国王 王后 继子 继女
151 | 国王 王后 叔叔 阿姨
152 | 国王 王后 男孩 女孩
153 | 国王 王后 兄弟 姐妹
154 | 国王 王后 爸爸 妈妈
155 | 国王 王后 父亲 母亲
156 | 国王 王后 祖父 祖母
157 | 国王 王后 爷爷 奶奶
158 | 国王 王后 孙子 孙女
159 | 国王 王后 新郎 新娘
160 | 国王 王后 丈夫 妻子
161 | 男人 女人 侄子 侄女
162 | 男人 女人 王子 公主
163 | 男人 女人 儿子 女儿
164 | 男人 女人 继父 继母
165 | 男人 女人 继子 继女
166 | 男人 女人 叔叔 阿姨
167 | 男人 女人 男孩 女孩
168 | 男人 女人 兄弟 姐妹
169 | 男人 女人 爸爸 妈妈
170 | 男人 女人 父亲 母亲
171 | 男人 女人 祖父 祖母
172 | 男人 女人 爷爷 奶奶
173 | 男人 女人 孙子 孙女
174 | 男人 女人 新郎 新娘
175 | 男人 女人 丈夫 妻子
176 | 男人 女人 国王 王后
177 | 侄子 侄女 王子 公主
178 | 侄子 侄女 儿子 女儿
179 | 侄子 侄女 继父 继母
180 | 侄子 侄女 继子 继女
181 | 侄子 侄女 叔叔 阿姨
182 | 侄子 侄女 男孩 女孩
183 | 侄子 侄女 兄弟 姐妹
184 | 侄子 侄女 爸爸 妈妈
185 | 侄子 侄女 父亲 母亲
186 | 侄子 侄女 祖父 祖母
187 | 侄子 侄女 爷爷 奶奶
188 | 侄子 侄女 孙子 孙女
189 | 侄子 侄女 新郎 新娘
190 | 侄子 侄女 丈夫 妻子
191 | 侄子 侄女 国王 王后
192 | 侄子 侄女 男人 女人
193 | 王子 公主 儿子 女儿
194 | 王子 公主 继父 继母
195 | 王子 公主 继子 继女
196 | 王子 公主 叔叔 阿姨
197 | 王子 公主 男孩 女孩
198 | 王子 公主 兄弟 姐妹
199 | 王子 公主 爸爸 妈妈
200 | 王子 公主 父亲 母亲
201 | 王子 公主 祖父 祖母
202 | 王子 公主 爷爷 奶奶
203 | 王子 公主 孙子 孙女
204 | 王子 公主 新郎 新娘
205 | 王子 公主 丈夫 妻子
206 | 王子 公主 国王 王后
207 | 王子 公主 男人 女人
208 | 王子 公主 侄子 侄女
209 | 儿子 女儿 继父 继母
210 | 儿子 女儿 继子 继女
211 | 儿子 女儿 叔叔 阿姨
212 | 儿子 女儿 男孩 女孩
213 | 儿子 女儿 兄弟 姐妹
214 | 儿子 女儿 爸爸 妈妈
215 | 儿子 女儿 父亲 母亲
216 | 儿子 女儿 祖父 祖母
217 | 儿子 女儿 爷爷 奶奶
218 | 儿子 女儿 孙子 孙女
219 | 儿子 女儿 新郎 新娘
220 | 儿子 女儿 丈夫 妻子
221 | 儿子 女儿 国王 王后
222 | 儿子 女儿 男人 女人
223 | 儿子 女儿 侄子 侄女
224 | 儿子 女儿 王子 公主
225 | 继父 继母 继子 继女
226 | 继父 继母 叔叔 阿姨
227 | 继父 继母 男孩 女孩
228 | 继父 继母 兄弟 姐妹
229 | 继父 继母 爸爸 妈妈
230 | 继父 继母 父亲 母亲
231 | 继父 继母 祖父 祖母
232 | 继父 继母 爷爷 奶奶
233 | 继父 继母 孙子 孙女
234 | 继父 继母 新郎 新娘
235 | 继父 继母 丈夫 妻子
236 | 继父 继母 国王 王后
237 | 继父 继母 男人 女人
238 | 继父 继母 侄子 侄女
239 | 继父 继母 王子 公主
240 | 继父 继母 儿子 女儿
241 | 继子 继女 叔叔 阿姨
242 | 继子 继女 男孩 女孩
243 | 继子 继女 兄弟 姐妹
244 | 继子 继女 爸爸 妈妈
245 | 继子 继女 父亲 母亲
246 | 继子 继女 祖父 祖母
247 | 继子 继女 爷爷 奶奶
248 | 继子 继女 孙子 孙女
249 | 继子 继女 新郎 新娘
250 | 继子 继女 丈夫 妻子
251 | 继子 继女 国王 王后
252 | 继子 继女 男人 女人
253 | 继子 继女 侄子 侄女
254 | 继子 继女 王子 公主
255 | 继子 继女 儿子 女儿
256 | 继子 继女 继父 继母
257 | 叔叔 阿姨 男孩 女孩
258 | 叔叔 阿姨 兄弟 姐妹
259 | 叔叔 阿姨 爸爸 妈妈
260 | 叔叔 阿姨 父亲 母亲
261 | 叔叔 阿姨 祖父 祖母
262 | 叔叔 阿姨 爷爷 奶奶
263 | 叔叔 阿姨 孙子 孙女
264 | 叔叔 阿姨 新郎 新娘
265 | 叔叔 阿姨 丈夫 妻子
266 | 叔叔 阿姨 国王 王后
267 | 叔叔 阿姨 男人 女人
268 | 叔叔 阿姨 侄子 侄女
269 | 叔叔 阿姨 王子 公主
270 | 叔叔 阿姨 儿子 女儿
271 | 叔叔 阿姨 继父 继母
272 | 叔叔 阿姨 继子 继女


--------------------------------------------------------------------------------
/testsets/analogy_zh/ca8/semantic/nature.txt:
--------------------------------------------------------------------------------
   1 | 第一 冠军 第二 亚军
   2 | 第一 冠军 第三 季军
   3 | 第二 亚军 第三 季军
   4 | 第一 状元 第二 榜眼
   5 | 第一 状元 第三 探花
   6 | 第二 榜眼 第三 探花
   7 | 第一 长子 第二 次子
   8 | 一 冠军 二 亚军
   9 | 一 冠军 三 季军
  10 | 二 亚军 三 季军
  11 | 一 状元 二 榜眼
  12 | 一 状元 三 探花
  13 | 二 榜眼 三 探花
  14 | 冷 热 西北风 东南风
  15 | 冷 热 冬季 夏季
  16 | 冷 热 低温 高温
  17 | 冷 热 严寒 酷暑
  18 | 冷 热 冬天 夏天
  19 | 冷 热 寒冷 炎热
  20 | 冷 热 干燥 潮湿
  21 | 西北风 东南风 冬季 夏季
  22 | 西北风 东南风 低温 高温
  23 | 西北风 东南风 严寒 酷暑
  24 | 西北风 东南风 冬天 夏天
  25 | 西北风 东南风 寒冷 炎热
  26 | 西北风 东南风 干燥 潮湿
  27 | 冬季 夏季 低温 高温
  28 | 冬季 夏季 严寒 酷暑
  29 | 冬季 夏季 冬天 夏天
  30 | 冬季 夏季 寒冷 炎热
  31 | 冬季 夏季 干燥 潮湿
  32 | 低温 高温 严寒 酷暑
  33 | 低温 高温 冬天 夏天
  34 | 低温 高温 寒冷 炎热
  35 | 低温 高温 干燥 潮湿
  36 | 严寒 酷暑 冬天 夏天
  37 | 严寒 酷暑 寒冷 炎热
  38 | 严寒 酷暑 干燥 潮湿
  39 | 冬天 夏天 寒冷 炎热
  40 | 冬天 夏天 干燥 潮湿
  41 | 寒冷 炎热 干燥 潮湿
  42 | 立春 春天 立夏 夏天
  43 | 立春 春天 立秋 秋天
  44 | 立春 春天 立冬 冬天
  45 | 立夏 夏天 立秋 秋天
  46 | 立夏 夏天 立冬 冬天
  47 | 立秋 秋天 立冬 冬天
  48 | 雨水 春天 小满 夏天
  49 | 雨水 春天 处暑 秋天
  50 | 雨水 春天 小雪 冬天
  51 | 小满 夏天 处暑 秋天
  52 | 小满 夏天 小雪 冬天
  53 | 处暑 秋天 小雪 冬天
  54 | 惊蛰 春天 芒种 夏天
  55 | 惊蛰 春天 白露 秋天
  56 | 惊蛰 春天 大雪 冬天
  57 | 芒种 夏天 白露 秋天
  58 | 芒种 夏天 大雪 冬天
  59 | 白露 秋天 大雪 冬天
  60 | 春分 春天 夏至 夏天
  61 | 春分 春天 秋分 秋天
  62 | 春分 春天 冬至 冬天
  63 | 夏至 夏天 秋分 秋天
  64 | 夏至 夏天 冬至 冬天
  65 | 秋分 秋天 冬至 冬天
  66 | 清明 春天 小暑 夏天
  67 | 清明 春天 寒露 秋天
  68 | 清明 春天 小寒 冬天
  69 | 小暑 夏天 寒露 秋天
  70 | 小暑 夏天 小寒 冬天
  71 | 寒露 秋天 小寒 冬天
  72 | 谷雨 春天 大暑 夏天
  73 | 谷雨 春天 霜降 秋天
  74 | 谷雨 春天 大寒 冬天
  75 | 大暑 夏天 霜降 秋天
  76 | 大暑 夏天 大寒 冬天
  77 | 霜降 秋天 大寒 冬天
  78 | 春节 正月 端午 五月
  79 | 春节 正月 七夕 七月
  80 | 春节 正月 中秋 八月
  81 | 春节 正月 重阳 九月
  82 | 端午 五月 七夕 七月
  83 | 端午 五月 中秋 八月
  84 | 端午 五月 重阳 九月
  85 | 七夕 七月 中秋 八月
  86 | 七夕 七月 重阳 九月
  87 | 中秋 八月 重阳 九月
  88 | 手指 戒指 耳朵 耳环
  89 | 手指 戒指 脖子 项链
  90 | 手指 戒指 手腕 手镯
  91 | 手指 戒指 脚踝 脚链
  92 | 耳朵 耳环 脖子 项链
  93 | 耳朵 耳环 手腕 手镯
  94 | 耳朵 耳环 脚踝 脚链
  95 | 脖子 项链 手腕 手镯
  96 | 脖子 项链 脚踝 脚链
  97 | 手腕 手镯 脚踝 脚链
  98 | 公鸡 母鸡 公鸭 母鸭
  99 | 公鸡 母鸡 公牛 母牛
 100 | 公鸡 母鸡 公羊 母羊
 101 | 公鸡 母鸡 公马 母马
 102 | 公鸡 母鸡 公驴 母驴
 103 | 公鸡 母鸡 公狗 母狗
 104 | 公鸡 母鸡 公猫 母猫
 105 | 公鸡 母鸡 公猪 母猪
 106 | 公鸡 母鸡 公猴 母猴
 107 | 公鸡 母鸡 公狮 母狮
 108 | 公鸡 母鸡 公老虎 母老虎
 109 | 公鸡 母鸡 公狼 母狼
 110 | 公鸡 母鸡 公熊 母熊
 111 | 公鸡 母鸡 雄鸟 雌鸟
 112 | 公鸡 母鸡 雄兔 雌兔
 113 | 公鸭 母鸭 公牛 母牛
 114 | 公鸭 母鸭 公羊 母羊
 115 | 公鸭 母鸭 公马 母马
 116 | 公鸭 母鸭 公驴 母驴
 117 | 公鸭 母鸭 公狗 母狗
 118 | 公鸭 母鸭 公猫 母猫
 119 | 公鸭 母鸭 公猪 母猪
 120 | 公鸭 母鸭 公猴 母猴
 121 | 公鸭 母鸭 公狮 母狮
 122 | 公鸭 母鸭 公老虎 母老虎
 123 | 公鸭 母鸭 公狼 母狼
 124 | 公鸭 母鸭 公熊 母熊
 125 | 公鸭 母鸭 雄鸟 雌鸟
 126 | 公鸭 母鸭 雄兔 雌兔
 127 | 公牛 母牛 公羊 母羊
 128 | 公牛 母牛 公马 母马
 129 | 公牛 母牛 公驴 母驴
 130 | 公牛 母牛 公狗 母狗
 131 | 公牛 母牛 公猫 母猫
 132 | 公牛 母牛 公猪 母猪
 133 | 公牛 母牛 公猴 母猴
 134 | 公牛 母牛 公狮 母狮
 135 | 公牛 母牛 公老虎 母老虎
 136 | 公牛 母牛 公狼 母狼
 137 | 公牛 母牛 公熊 母熊
 138 | 公牛 母牛 雄鸟 雌鸟
 139 | 公牛 母牛 雄兔 雌兔
 140 | 公羊 母羊 公马 母马
 141 | 公羊 母羊 公驴 母驴
 142 | 公羊 母羊 公狗 母狗
 143 | 公羊 母羊 公猫 母猫
 144 | 公羊 母羊 公猪 母猪
 145 | 公羊 母羊 公猴 母猴
 146 | 公羊 母羊 公狮 母狮
 147 | 公羊 母羊 公老虎 母老虎
 148 | 公羊 母羊 公狼 母狼
 149 | 公羊 母羊 公熊 母熊
 150 | 公羊 母羊 雄鸟 雌鸟
 151 | 公羊 母羊 雄兔 雌兔
 152 | 公马 母马 公驴 母驴
 153 | 公马 母马 公狗 母狗
 154 | 公马 母马 公猫 母猫
 155 | 公马 母马 公猪 母猪
 156 | 公马 母马 公猴 母猴
 157 | 公马 母马 公狮 母狮
 158 | 公马 母马 公老虎 母老虎
 159 | 公马 母马 公狼 母狼
 160 | 公马 母马 公熊 母熊
 161 | 公马 母马 雄鸟 雌鸟
 162 | 公马 母马 雄兔 雌兔
 163 | 公驴 母驴 公狗 母狗
 164 | 公驴 母驴 公猫 母猫
 165 | 公驴 母驴 公猪 母猪
 166 | 公驴 母驴 公猴 母猴
 167 | 公驴 母驴 公狮 母狮
 168 | 公驴 母驴 公老虎 母老虎
 169 | 公驴 母驴 公狼 母狼
 170 | 公驴 母驴 公熊 母熊
 171 | 公驴 母驴 雄鸟 雌鸟
 172 | 公驴 母驴 雄兔 雌兔
 173 | 公狗 母狗 公猫 母猫
 174 | 公狗 母狗 公猪 母猪
 175 | 公狗 母狗 公猴 母猴
 176 | 公狗 母狗 公狮 母狮
 177 | 公狗 母狗 公老虎 母老虎
 178 | 公狗 母狗 公狼 母狼
 179 | 公狗 母狗 公熊 母熊
 180 | 公狗 母狗 雄鸟 雌鸟
 181 | 公狗 母狗 雄兔 雌兔
 182 | 公猫 母猫 公猪 母猪
 183 | 公猫 母猫 公猴 母猴
 184 | 公猫 母猫 公狮 母狮
 185 | 公猫 母猫 公老虎 母老虎
 186 | 公猫 母猫 公狼 母狼
 187 | 公猫 母猫 公熊 母熊
 188 | 公猫 母猫 雄鸟 雌鸟
 189 | 公猫 母猫 雄兔 雌兔
 190 | 公猪 母猪 公猴 母猴
 191 | 公猪 母猪 公狮 母狮
 192 | 公猪 母猪 公老虎 母老虎
 193 | 公猪 母猪 公狼 母狼
 194 | 公猪 母猪 公熊 母熊
 195 | 公猪 母猪 雄鸟 雌鸟
 196 | 公猪 母猪 雄兔 雌兔
 197 | 公猴 母猴 公狮 母狮
 198 | 公猴 母猴 公老虎 母老虎
 199 | 公猴 母猴 公狼 母狼
 200 | 公猴 母猴 公熊 母熊
 201 | 公猴 母猴 雄鸟 雌鸟
 202 | 公猴 母猴 雄兔 雌兔
 203 | 公狮 母狮 公老虎 母老虎
 204 | 公狮 母狮 公狼 母狼
 205 | 公狮 母狮 公熊 母熊
 206 | 公狮 母狮 雄鸟 雌鸟
 207 | 公狮 母狮 雄兔 雌兔
 208 | 公老虎 母老虎 公狼 母狼
 209 | 公老虎 母老虎 公熊 母熊
 210 | 公老虎 母老虎 雄鸟 雌鸟
 211 | 公老虎 母老虎 雄兔 雌兔
 212 | 公狼 母狼 公熊 母熊
 213 | 公狼 母狼 雄鸟 雌鸟
 214 | 公狼 母狼 雄兔 雌兔
 215 | 公熊 母熊 雄鸟 雌鸟
 216 | 公熊 母熊 雄兔 雌兔
 217 | 雄鸟 雌鸟 雄兔 雌兔
 218 | 鸟 鸟蛋 鸡 鸡蛋
 219 | 鸟 鸟蛋 鸭 鸭蛋
 220 | 鸟 鸟蛋 鹅 鹅蛋
 221 | 鸟 鸟蛋 恐龙 恐龙蛋
 222 | 鸟 鸟蛋 麻雀 麻雀蛋
 223 | 鸟 鸟蛋 鹌鹑 鹌鹑蛋
 224 | 鸡 鸡蛋 鸭 鸭蛋
 225 | 鸡 鸡蛋 鹅 鹅蛋
 226 | 鸡 鸡蛋 恐龙 恐龙蛋
 227 | 鸡 鸡蛋 麻雀 麻雀蛋
 228 | 鸡 鸡蛋 鹌鹑 鹌鹑蛋
 229 | 鸭 鸭蛋 鹅 鹅蛋
 230 | 鸭 鸭蛋 恐龙 恐龙蛋
 231 | 鸭 鸭蛋 麻雀 麻雀蛋
 232 | 鸭 鸭蛋 鹌鹑 鹌鹑蛋
 233 | 鹅 鹅蛋 恐龙 恐龙蛋
 234 | 鹅 鹅蛋 麻雀 麻雀蛋
 235 | 鹅 鹅蛋 鹌鹑 鹌鹑蛋
 236 | 恐龙 恐龙蛋 麻雀 麻雀蛋
 237 | 恐龙 恐龙蛋 鹌鹑 鹌鹑蛋
 238 | 麻雀 麻雀蛋 鹌鹑 鹌鹑蛋
 239 | 鸡 鸡肉 鸭 鸭肉
 240 | 鸡 鸡肉 羊 羊肉
 241 | 鸡 鸡肉 鹅 鹅肉
 242 | 鸡 鸡肉 牛 牛肉
 243 | 鸡 鸡肉 狗 狗肉
 244 | 鸡 鸡肉 猪 猪肉
 245 | 鸭 鸭肉 羊 羊肉
 246 | 鸭 鸭肉 鹅 鹅肉
 247 | 鸭 鸭肉 牛 牛肉
 248 | 鸭 鸭肉 狗 狗肉
 249 | 鸭 鸭肉 猪 猪肉
 250 | 羊 羊肉 鹅 鹅肉
 251 | 羊 羊肉 牛 牛肉
 252 | 羊 羊肉 狗 狗肉
 253 | 羊 羊肉 猪 猪肉
 254 | 鹅 鹅肉 牛 牛肉
 255 | 鹅 鹅肉 狗 狗肉
 256 | 鹅 鹅肉 猪 猪肉
 257 | 牛 牛肉 狗 狗肉
 258 | 牛 牛肉 猪 猪肉
 259 | 狗 狗肉 猪 猪肉
 260 | 鸡 鸡腿 鸭 鸭腿
 261 | 鸡 鸡腿 羊 羊腿
 262 | 鸡 鸡腿 猪 火腿
 263 | 鸭 鸭腿 羊 羊腿
 264 | 鸭 鸭腿 猪 火腿
 265 | 羊 羊腿 猪 火腿
 266 | 牛 哞哞 羊 咩咩
 267 | 牛 哞哞 猫 喵喵
 268 | 牛 哞哞 狗 汪汪
 269 | 牛 哞哞 蚊子 嗡嗡
 270 | 牛 哞哞 青蛙 呱呱
 271 | 牛 哞哞 公鸡 喔喔
 272 | 羊 咩咩 猫 喵喵
 273 | 羊 咩咩 狗 汪汪
 274 | 羊 咩咩 蚊子 嗡嗡
 275 | 羊 咩咩 青蛙 呱呱
 276 | 羊 咩咩 公鸡 喔喔
 277 | 猫 喵喵 狗 汪汪
 278 | 猫 喵喵 蚊子 嗡嗡
 279 | 猫 喵喵 青蛙 呱呱
 280 | 猫 喵喵 公鸡 喔喔
 281 | 狗 汪汪 蚊子 嗡嗡
 282 | 狗 汪汪 青蛙 呱呱
 283 | 狗 汪汪 公鸡 喔喔
 284 | 蚊子 嗡嗡 青蛙 呱呱
 285 | 蚊子 嗡嗡 公鸡 喔喔
 286 | 青蛙 呱呱 公鸡 喔喔
 287 | 杏树 杏 梨树 梨
 288 | 杏树 杏 桃树 桃
 289 | 杏树 杏 苹果树 苹果
 290 | 杏树 杏 香蕉树 香蕉
 291 | 杏树 杏 芭蕉树 芭蕉
 292 | 杏树 杏 葡萄藤 葡萄
 293 | 杏树 杏 椰子树 椰子
 294 | 杏树 杏 橘树 橘子
 295 | 杏树 杏 核桃树 核桃
 296 | 杏树 杏 樱桃树 樱桃
 297 | 杏树 杏 板栗树 板栗
 298 | 杏树 杏 石榴树 石榴
 299 | 杏树 杏 荔枝树 荔枝
 300 | 杏树 杏 桂圆树 桂圆
 301 | 杏树 杏 芒果树 芒果
 302 | 梨树 梨 桃树 桃
 303 | 梨树 梨 苹果树 苹果
 304 | 梨树 梨 香蕉树 香蕉
 305 | 梨树 梨 芭蕉树 芭蕉
 306 | 梨树 梨 葡萄藤 葡萄
 307 | 梨树 梨 椰子树 椰子
 308 | 梨树 梨 橘树 橘子
 309 | 梨树 梨 核桃树 核桃
 310 | 梨树 梨 樱桃树 樱桃
 311 | 梨树 梨 板栗树 板栗
 312 | 梨树 梨 石榴树 石榴
 313 | 梨树 梨 荔枝树 荔枝
 314 | 梨树 梨 桂圆树 桂圆
 315 | 梨树 梨 芒果树 芒果
 316 | 桃树 桃 苹果树 苹果
 317 | 桃树 桃 香蕉树 香蕉
 318 | 桃树 桃 芭蕉树 芭蕉
 319 | 桃树 桃 葡萄藤 葡萄
 320 | 桃树 桃 椰子树 椰子
 321 | 桃树 桃 橘树 橘子
 322 | 桃树 桃 核桃树 核桃
 323 | 桃树 桃 樱桃树 樱桃
 324 | 桃树 桃 板栗树 板栗
 325 | 桃树 桃 石榴树 石榴
 326 | 桃树 桃 荔枝树 荔枝
 327 | 桃树 桃 桂圆树 桂圆
 328 | 桃树 桃 芒果树 芒果
 329 | 苹果树 苹果 香蕉树 香蕉
 330 | 苹果树 苹果 芭蕉树 芭蕉
 331 | 苹果树 苹果 葡萄藤 葡萄
 332 | 苹果树 苹果 椰子树 椰子
 333 | 苹果树 苹果 橘树 橘子
 334 | 苹果树 苹果 核桃树 核桃
 335 | 苹果树 苹果 樱桃树 樱桃
 336 | 苹果树 苹果 板栗树 板栗
 337 | 苹果树 苹果 石榴树 石榴
 338 | 苹果树 苹果 荔枝树 荔枝
 339 | 苹果树 苹果 桂圆树 桂圆
 340 | 苹果树 苹果 芒果树 芒果
 341 | 香蕉树 香蕉 芭蕉树 芭蕉
 342 | 香蕉树 香蕉 葡萄藤 葡萄
 343 | 香蕉树 香蕉 椰子树 椰子
 344 | 香蕉树 香蕉 橘树 橘子
 345 | 香蕉树 香蕉 核桃树 核桃
 346 | 香蕉树 香蕉 樱桃树 樱桃
 347 | 香蕉树 香蕉 板栗树 板栗
 348 | 香蕉树 香蕉 石榴树 石榴
 349 | 香蕉树 香蕉 荔枝树 荔枝
 350 | 香蕉树 香蕉 桂圆树 桂圆
 351 | 香蕉树 香蕉 芒果树 芒果
 352 | 芭蕉树 芭蕉 葡萄藤 葡萄
 353 | 芭蕉树 芭蕉 椰子树 椰子
 354 | 芭蕉树 芭蕉 橘树 橘子
 355 | 芭蕉树 芭蕉 核桃树 核桃
 356 | 芭蕉树 芭蕉 樱桃树 樱桃
 357 | 芭蕉树 芭蕉 板栗树 板栗
 358 | 芭蕉树 芭蕉 石榴树 石榴
 359 | 芭蕉树 芭蕉 荔枝树 荔枝
 360 | 芭蕉树 芭蕉 桂圆树 桂圆
 361 | 芭蕉树 芭蕉 芒果树 芒果
 362 | 葡萄藤 葡萄 椰子树 椰子
 363 | 葡萄藤 葡萄 橘树 橘子
 364 | 葡萄藤 葡萄 核桃树 核桃
 365 | 葡萄藤 葡萄 樱桃树 樱桃
 366 | 葡萄藤 葡萄 板栗树 板栗
 367 | 葡萄藤 葡萄 石榴树 石榴
 368 | 葡萄藤 葡萄 荔枝树 荔枝
 369 | 葡萄藤 葡萄 桂圆树 桂圆
 370 | 葡萄藤 葡萄 芒果树 芒果
 371 | 椰子树 椰子 橘树 橘子
 372 | 椰子树 椰子 核桃树 核桃
 373 | 椰子树 椰子 樱桃树 樱桃
 374 | 椰子树 椰子 板栗树 板栗
 375 | 椰子树 椰子 石榴树 石榴
 376 | 椰子树 椰子 荔枝树 荔枝
 377 | 椰子树 椰子 桂圆树 桂圆
 378 | 椰子树 椰子 芒果树 芒果
 379 | 橘树 橘子 核桃树 核桃
 380 | 橘树 橘子 樱桃树 樱桃
 381 | 橘树 橘子 板栗树 板栗
 382 | 橘树 橘子 石榴树 石榴
 383 | 橘树 橘子 荔枝树 荔枝
 384 | 橘树 橘子 桂圆树 桂圆
 385 | 橘树 橘子 芒果树 芒果
 386 | 核桃树 核桃 樱桃树 樱桃
 387 | 核桃树 核桃 板栗树 板栗
 388 | 核桃树 核桃 石榴树 石榴
 389 | 核桃树 核桃 荔枝树 荔枝
 390 | 核桃树 核桃 桂圆树 桂圆
 391 | 核桃树 核桃 芒果树 芒果
 392 | 樱桃树 樱桃 板栗树 板栗
 393 | 樱桃树 樱桃 石榴树 石榴
 394 | 樱桃树 樱桃 荔枝树 荔枝
 395 | 樱桃树 樱桃 桂圆树 桂圆
 396 | 樱桃树 樱桃 芒果树 芒果
 397 | 板栗树 板栗 石榴树 石榴
 398 | 板栗树 板栗 荔枝树 荔枝
 399 | 板栗树 板栗 桂圆树 桂圆
 400 | 板栗树 板栗 芒果树 芒果
 401 | 石榴树 石榴 荔枝树 荔枝
 402 | 石榴树 石榴 桂圆树 桂圆
 403 | 石榴树 石榴 芒果树 芒果
 404 | 荔枝树 荔枝 桂圆树 桂圆
 405 | 荔枝树 荔枝 芒果树 芒果
 406 | 桂圆树 桂圆 芒果树 芒果
 407 | 麦 麦穗 稻 稻穗
 408 | 麦 麦穗 谷 谷穗
 409 | 麦 麦穗 玉米 玉米穗
 410 | 稻 稻穗 谷 谷穗
 411 | 稻 稻穗 玉米 玉米穗
 412 | 谷 谷穗 玉米 玉米穗
 413 | 水稻 大米 小麦 白面
 414 | 水稻 大米 谷子 小米
 415 | 水稻 大米 裸燕麦 莜面
 416 | 小麦 白面 谷子 小米
 417 | 小麦 白面 裸燕麦 莜面
 418 | 谷子 小米 裸燕麦 莜面
 419 | 盐 氯化钠 醋 乙酸
 420 | 盐 氯化钠 味精 谷氨酸钠
 421 | 盐 氯化钠 苏打 碳酸钠
 422 | 盐 氯化钠 小苏打 碳酸氢钠
 423 | 盐 氯化钠 火碱 氢氧化钠
 424 | 盐 氯化钠 钡餐 硫酸钡
 425 | 盐 氯化钠 石英 二氧化硅
 426 | 盐 氯化钠 天然气 甲烷
 427 | 盐 氯化钠 水 氧化氢
 428 | 盐 氯化钠 双氧水 过氧化氢
 429 | 盐 氯化钠 干冰 二氧化碳
 430 | 醋 乙酸 味精 谷氨酸钠
 431 | 醋 乙酸 苏打 碳酸钠
 432 | 醋 乙酸 小苏打 碳酸氢钠
 433 | 醋 乙酸 火碱 氢氧化钠
 434 | 醋 乙酸 钡餐 硫酸钡
 435 | 醋 乙酸 石英 二氧化硅
 436 | 醋 乙酸 天然气 甲烷
 437 | 醋 乙酸 水 氧化氢
 438 | 醋 乙酸 双氧水 过氧化氢
 439 | 醋 乙酸 干冰 二氧化碳
 440 | 味精 谷氨酸钠 苏打 碳酸钠
 441 | 味精 谷氨酸钠 小苏打 碳酸氢钠
 442 | 味精 谷氨酸钠 火碱 氢氧化钠
 443 | 味精 谷氨酸钠 钡餐 硫酸钡
 444 | 味精 谷氨酸钠 石英 二氧化硅
 445 | 味精 谷氨酸钠 天然气 甲烷
 446 | 味精 谷氨酸钠 水 氧化氢
 447 | 味精 谷氨酸钠 双氧水 过氧化氢
 448 | 味精 谷氨酸钠 干冰 二氧化碳
 449 | 苏打 碳酸钠 小苏打 碳酸氢钠
 450 | 苏打 碳酸钠 火碱 氢氧化钠
 451 | 苏打 碳酸钠 钡餐 硫酸钡
 452 | 苏打 碳酸钠 石英 二氧化硅
 453 | 苏打 碳酸钠 天然气 甲烷
 454 | 苏打 碳酸钠 水 氧化氢
 455 | 苏打 碳酸钠 双氧水 过氧化氢
 456 | 苏打 碳酸钠 干冰 二氧化碳
 457 | 小苏打 碳酸氢钠 火碱 氢氧化钠
 458 | 小苏打 碳酸氢钠 钡餐 硫酸钡
 459 | 小苏打 碳酸氢钠 石英 二氧化硅
 460 | 小苏打 碳酸氢钠 天然气 甲烷
 461 | 小苏打 碳酸氢钠 水 氧化氢
 462 | 小苏打 碳酸氢钠 双氧水 过氧化氢
 463 | 小苏打 碳酸氢钠 干冰 二氧化碳
 464 | 火碱 氢氧化钠 钡餐 硫酸钡
 465 | 火碱 氢氧化钠 石英 二氧化硅
 466 | 火碱 氢氧化钠 天然气 甲烷
 467 | 火碱 氢氧化钠 水 氧化氢
 468 | 火碱 氢氧化钠 双氧水 过氧化氢
 469 | 火碱 氢氧化钠 干冰 二氧化碳
 470 | 钡餐 硫酸钡 石英 二氧化硅
 471 | 钡餐 硫酸钡 天然气 甲烷
 472 | 钡餐 硫酸钡 水 氧化氢
 473 | 钡餐 硫酸钡 双氧水 过氧化氢
 474 | 钡餐 硫酸钡 干冰 二氧化碳
 475 | 石英 二氧化硅 天然气 甲烷
 476 | 石英 二氧化硅 水 氧化氢
 477 | 石英 二氧化硅 双氧水 过氧化氢
 478 | 石英 二氧化硅 干冰 二氧化碳
 479 | 天然气 甲烷 水 氧化氢
 480 | 天然气 甲烷 双氧水 过氧化氢
 481 | 天然气 甲烷 干冰 二氧化碳
 482 | 水 氧化氢 双氧水 过氧化氢
 483 | 水 氧化氢 干冰 二氧化碳
 484 | 双氧水 过氧化氢 干冰 二氧化碳
 485 | 冰 水蒸气 干冰 二氧化碳
 486 | 冰 水蒸气 固体 气体
 487 | 冰 水蒸气 固态 气态
 488 | 干冰 二氧化碳 固体 气体
 489 | 干冰 二氧化碳 固态 气态
 490 | 固体 气体 固态 气态
 491 | 冰 水 雪 雨
 492 | 冰 水 固体 液体
 493 | 冰 水 固态 液态
 494 | 雪 雨 固体 液体
 495 | 雪 雨 固态 液态
 496 | 固体 液体 固态 液态
 497 | 海 蓝色 胭脂 红色
 498 | 海 蓝色 草 绿色
 499 | 海 蓝色 象牙 白色
 500 | 海 蓝色 碳 黑色
 501 | 海 蓝色 稻草 黄色
 502 | 胭脂 红色 草 绿色
 503 | 胭脂 红色 象牙 白色
 504 | 胭脂 红色 碳 黑色
 505 | 胭脂 红色 稻草 黄色
 506 | 草 绿色 象牙 白色
 507 | 草 绿色 碳 黑色
 508 | 草 绿色 稻草 黄色
 509 | 象牙 白色 碳 黑色
 510 | 象牙 白色 稻草 黄色
 511 | 碳 黑色 稻草 黄色
 512 | 肯定 否定 赞扬 批评
 513 | 肯定 否定 支持 反对
 514 | 赞扬 批评 支持 反对
 515 | 松 紧 好 坏
 516 | 松 紧 大 小
 517 | 松 紧 宽 窄
 518 | 松 紧 强 弱
 519 | 松 紧 美 丑
 520 | 松 紧 真 假
 521 | 松 紧 动 静
 522 | 松 紧 胖 瘦
 523 | 松 紧 明 暗
 524 | 松 紧 冷 热
 525 | 松 紧 轻 重
 526 | 松 紧 贫 富
 527 | 松 紧 深 浅
 528 | 松 紧 粗 细
 529 | 松 紧 远 近
 530 | 松 紧 上 下
 531 | 松 紧 高 低
 532 | 松 紧 东 西
 533 | 松 紧 南 北
 534 | 松 紧 前 后
 535 | 松 紧 左 右
 536 | 松 紧 首 尾
 537 | 好 坏 大 小
 538 | 好 坏 宽 窄
 539 | 好 坏 强 弱
 540 | 好 坏 美 丑
 541 | 好 坏 真 假
 542 | 好 坏 动 静
 543 | 好 坏 胖 瘦
 544 | 好 坏 明 暗
 545 | 好 坏 冷 热
 546 | 好 坏 轻 重
 547 | 好 坏 贫 富
 548 | 好 坏 深 浅
 549 | 好 坏 粗 细
 550 | 好 坏 远 近
 551 | 好 坏 上 下
 552 | 好 坏 高 低
 553 | 好 坏 东 西
 554 | 好 坏 南 北
 555 | 好 坏 前 后
 556 | 好 坏 左 右
 557 | 好 坏 首 尾
 558 | 大 小 宽 窄
 559 | 大 小 强 弱
 560 | 大 小 美 丑
 561 | 大 小 真 假
 562 | 大 小 动 静
 563 | 大 小 胖 瘦
 564 | 大 小 明 暗
 565 | 大 小 冷 热
 566 | 大 小 轻 重
 567 | 大 小 贫 富
 568 | 大 小 深 浅
 569 | 大 小 粗 细
 570 | 大 小 远 近
 571 | 大 小 上 下
 572 | 大 小 高 低
 573 | 大 小 东 西
 574 | 大 小 南 北
 575 | 大 小 前 后
 576 | 大 小 左 右
 577 | 大 小 首 尾
 578 | 宽 窄 强 弱
 579 | 宽 窄 美 丑
 580 | 宽 窄 真 假
 581 | 宽 窄 动 静
 582 | 宽 窄 胖 瘦
 583 | 宽 窄 明 暗
 584 | 宽 窄 冷 热
 585 | 宽 窄 轻 重
 586 | 宽 窄 贫 富
 587 | 宽 窄 深 浅
 588 | 宽 窄 粗 细
 589 | 宽 窄 远 近
 590 | 宽 窄 上 下
 591 | 宽 窄 高 低
 592 | 宽 窄 东 西
 593 | 宽 窄 南 北
 594 | 宽 窄 前 后
 595 | 宽 窄 左 右
 596 | 宽 窄 首 尾
 597 | 强 弱 美 丑
 598 | 强 弱 真 假
 599 | 强 弱 动 静
 600 | 强 弱 胖 瘦
 601 | 强 弱 明 暗
 602 | 强 弱 冷 热
 603 | 强 弱 轻 重
 604 | 强 弱 贫 富
 605 | 强 弱 深 浅
 606 | 强 弱 粗 细
 607 | 强 弱 远 近
 608 | 强 弱 上 下
 609 | 强 弱 高 低
 610 | 强 弱 东 西
 611 | 强 弱 南 北
 612 | 强 弱 前 后
 613 | 强 弱 左 右
 614 | 强 弱 首 尾
 615 | 美 丑 真 假
 616 | 美 丑 动 静
 617 | 美 丑 胖 瘦
 618 | 美 丑 明 暗
 619 | 美 丑 冷 热
 620 | 美 丑 轻 重
 621 | 美 丑 贫 富
 622 | 美 丑 深 浅
 623 | 美 丑 粗 细
 624 | 美 丑 远 近
 625 | 美 丑 上 下
 626 | 美 丑 高 低
 627 | 美 丑 东 西
 628 | 美 丑 南 北
 629 | 美 丑 前 后
 630 | 美 丑 左 右
 631 | 美 丑 首 尾
 632 | 真 假 动 静
 633 | 真 假 胖 瘦
 634 | 真 假 明 暗
 635 | 真 假 冷 热
 636 | 真 假 轻 重
 637 | 真 假 贫 富
 638 | 真 假 深 浅
 639 | 真 假 粗 细
 640 | 真 假 远 近
 641 | 真 假 上 下
 642 | 真 假 高 低
 643 | 真 假 东 西
 644 | 真 假 南 北
 645 | 真 假 前 后
 646 | 真 假 左 右
 647 | 真 假 首 尾
 648 | 动 静 胖 瘦
 649 | 动 静 明 暗
 650 | 动 静 冷 热
 651 | 动 静 轻 重
 652 | 动 静 贫 富
 653 | 动 静 深 浅
 654 | 动 静 粗 细
 655 | 动 静 远 近
 656 | 动 静 上 下
 657 | 动 静 高 低
 658 | 动 静 东 西
 659 | 动 静 南 北
 660 | 动 静 前 后
 661 | 动 静 左 右
 662 | 动 静 首 尾
 663 | 胖 瘦 明 暗
 664 | 胖 瘦 冷 热
 665 | 胖 瘦 轻 重
 666 | 胖 瘦 贫 富
 667 | 胖 瘦 深 浅
 668 | 胖 瘦 粗 细
 669 | 胖 瘦 远 近
 670 | 胖 瘦 上 下
 671 | 胖 瘦 高 低
 672 | 胖 瘦 东 西
 673 | 胖 瘦 南 北
 674 | 胖 瘦 前 后
 675 | 胖 瘦 左 右
 676 | 胖 瘦 首 尾
 677 | 明 暗 冷 热
 678 | 明 暗 轻 重
 679 | 明 暗 贫 富
 680 | 明 暗 深 浅
 681 | 明 暗 粗 细
 682 | 明 暗 远 近
 683 | 明 暗 上 下
 684 | 明 暗 高 低
 685 | 明 暗 东 西
 686 | 明 暗 南 北
 687 | 明 暗 前 后
 688 | 明 暗 左 右
 689 | 明 暗 首 尾
 690 | 冷 热 轻 重
 691 | 冷 热 贫 富
 692 | 冷 热 深 浅
 693 | 冷 热 粗 细
 694 | 冷 热 远 近
 695 | 冷 热 上 下
 696 | 冷 热 高 低
 697 | 冷 热 东 西
 698 | 冷 热 南 北
 699 | 冷 热 前 后
 700 | 冷 热 左 右
 701 | 冷 热 首 尾
 702 | 轻 重 贫 富
 703 | 轻 重 深 浅
 704 | 轻 重 粗 细
 705 | 轻 重 远 近
 706 | 轻 重 上 下
 707 | 轻 重 高 低
 708 | 轻 重 东 西
 709 | 轻 重 南 北
 710 | 轻 重 前 后
 711 | 轻 重 左 右
 712 | 轻 重 首 尾
 713 | 贫 富 深 浅
 714 | 贫 富 粗 细
 715 | 贫 富 远 近
 716 | 贫 富 上 下
 717 | 贫 富 高 低
 718 | 贫 富 东 西
 719 | 贫 富 南 北
 720 | 贫 富 前 后
 721 | 贫 富 左 右
 722 | 贫 富 首 尾
 723 | 深 浅 粗 细
 724 | 深 浅 远 近
 725 | 深 浅 上 下
 726 | 深 浅 高 低
 727 | 深 浅 东 西
 728 | 深 浅 南 北
 729 | 深 浅 前 后
 730 | 深 浅 左 右
 731 | 深 浅 首 尾
 732 | 粗 细 远 近
 733 | 粗 细 上 下
 734 | 粗 细 高 低
 735 | 粗 细 东 西
 736 | 粗 细 南 北
 737 | 粗 细 前 后
 738 | 粗 细 左 右
 739 | 粗 细 首 尾
 740 | 远 近 上 下
 741 | 远 近 高 低
 742 | 远 近 东 西
 743 | 远 近 南 北
 744 | 远 近 前 后
 745 | 远 近 左 右
 746 | 远 近 首 尾
 747 | 上 下 高 低
 748 | 上 下 东 西
 749 | 上 下 南 北
 750 | 上 下 前 后
 751 | 上 下 左 右
 752 | 上 下 首 尾
 753 | 高 低 东 西
 754 | 高 低 南 北
 755 | 高 低 前 后
 756 | 高 低 左 右
 757 | 高 低 首 尾
 758 | 东 西 南 北
 759 | 东 西 前 后
 760 | 东 西 左 右
 761 | 东 西 首 尾
 762 | 南 北 前 后
 763 | 南 北 左 右
 764 | 南 北 首 尾
 765 | 前 后 左 右
 766 | 前 后 首 尾
 767 | 左 右 首 尾
 768 | 增 减 进 退
 769 | 增 减 攻 守
 770 | 增 减 来 去
 771 | 增 减 买 卖
 772 | 增 减 升 降
 773 | 增 减 进 出
 774 | 进 退 攻 守
 775 | 进 退 来 去
 776 | 进 退 买 卖
 777 | 进 退 升 降
 778 | 进 退 进 出
 779 | 攻 守 来 去
 780 | 攻 守 买 卖
 781 | 攻 守 升 降
 782 | 攻 守 进 出
 783 | 来 去 买 卖
 784 | 来 去 升 降
 785 | 来 去 进 出
 786 | 买 卖 升 降
 787 | 买 卖 进 出
 788 | 升 降 进 出
 789 | 成功 失败 安全 危险
 790 | 成功 失败 聪明 愚蠢
 791 | 成功 失败 美丽 丑陋
 792 | 成功 失败 锋利 迟钝
 793 | 成功 失败 细心 粗心
 794 | 成功 失败 宽敞 狭窄
 795 | 成功 失败 宽广 狭隘
 796 | 成功 失败 清楚 模糊
 797 | 成功 失败 好事 坏事
 798 | 成功 失败 熟悉 陌生
 799 | 成功 失败 光明 黑暗
 800 | 成功 失败 温暖 寒冷
 801 | 成功 失败 坚强 脆弱
 802 | 成功 失败 简单 复杂
 803 | 成功 失败 增大 减小
 804 | 成功 失败 节约 浪费
 805 | 成功 失败 打开 关闭
 806 | 成功 失败 伟大 渺小
 807 | 成功 失败 天堂 地狱
 808 | 成功 失败 出现 消失
 809 | 成功 失败 团结 分裂
 810 | 成功 失败 希望 失望
 811 | 成功 失败 笔直 弯曲
 812 | 成功 失败 正面 反面
 813 | 成功 失败 积极 消极
 814 | 成功 失败 勤劳 懒惰
 815 | 成功 失败 正常 异常
 816 | 成功 失败 支持 反对
 817 | 安全 危险 聪明 愚蠢
 818 | 安全 危险 美丽 丑陋
 819 | 安全 危险 锋利 迟钝
 820 | 安全 危险 细心 粗心
 821 | 安全 危险 宽敞 狭窄
 822 | 安全 危险 宽广 狭隘
 823 | 安全 危险 清楚 模糊
 824 | 安全 危险 好事 坏事
 825 | 安全 危险 熟悉 陌生
 826 | 安全 危险 光明 黑暗
 827 | 安全 危险 温暖 寒冷
 828 | 安全 危险 坚强 脆弱
 829 | 安全 危险 简单 复杂
 830 | 安全 危险 增大 减小
 831 | 安全 危险 节约 浪费
 832 | 安全 危险 打开 关闭
 833 | 安全 危险 伟大 渺小
 834 | 安全 危险 天堂 地狱
 835 | 安全 危险 出现 消失
 836 | 安全 危险 团结 分裂
 837 | 安全 危险 希望 失望
 838 | 安全 危险 笔直 弯曲
 839 | 安全 危险 正面 反面
 840 | 安全 危险 积极 消极
 841 | 安全 危险 勤劳 懒惰
 842 | 安全 危险 正常 异常
 843 | 安全 危险 支持 反对
 844 | 聪明 愚蠢 美丽 丑陋
 845 | 聪明 愚蠢 锋利 迟钝
 846 | 聪明 愚蠢 细心 粗心
 847 | 聪明 愚蠢 宽敞 狭窄
 848 | 聪明 愚蠢 宽广 狭隘
 849 | 聪明 愚蠢 清楚 模糊
 850 | 聪明 愚蠢 好事 坏事
 851 | 聪明 愚蠢 熟悉 陌生
 852 | 聪明 愚蠢 光明 黑暗
 853 | 聪明 愚蠢 温暖 寒冷
 854 | 聪明 愚蠢 坚强 脆弱
 855 | 聪明 愚蠢 简单 复杂
 856 | 聪明 愚蠢 增大 减小
 857 | 聪明 愚蠢 节约 浪费
 858 | 聪明 愚蠢 打开 关闭
 859 | 聪明 愚蠢 伟大 渺小
 860 | 聪明 愚蠢 天堂 地狱
 861 | 聪明 愚蠢 出现 消失
 862 | 聪明 愚蠢 团结 分裂
 863 | 聪明 愚蠢 希望 失望
 864 | 聪明 愚蠢 笔直 弯曲
 865 | 聪明 愚蠢 正面 反面
 866 | 聪明 愚蠢 积极 消极
 867 | 聪明 愚蠢 勤劳 懒惰
 868 | 聪明 愚蠢 正常 异常
 869 | 聪明 愚蠢 支持 反对
 870 | 美丽 丑陋 锋利 迟钝
 871 | 美丽 丑陋 细心 粗心
 872 | 美丽 丑陋 宽敞 狭窄
 873 | 美丽 丑陋 宽广 狭隘
 874 | 美丽 丑陋 清楚 模糊
 875 | 美丽 丑陋 好事 坏事
 876 | 美丽 丑陋 熟悉 陌生
 877 | 美丽 丑陋 光明 黑暗
 878 | 美丽 丑陋 温暖 寒冷
 879 | 美丽 丑陋 坚强 脆弱
 880 | 美丽 丑陋 简单 复杂
 881 | 美丽 丑陋 增大 减小
 882 | 美丽 丑陋 节约 浪费
 883 | 美丽 丑陋 打开 关闭
 884 | 美丽 丑陋 伟大 渺小
 885 | 美丽 丑陋 天堂 地狱
 886 | 美丽 丑陋 出现 消失
 887 | 美丽 丑陋 团结 分裂
 888 | 美丽 丑陋 希望 失望
 889 | 美丽 丑陋 笔直 弯曲
 890 | 美丽 丑陋 正面 反面
 891 | 美丽 丑陋 积极 消极
 892 | 美丽 丑陋 勤劳 懒惰
 893 | 美丽 丑陋 正常 异常
 894 | 美丽 丑陋 支持 反对
 895 | 锋利 迟钝 细心 粗心
 896 | 锋利 迟钝 宽敞 狭窄
 897 | 锋利 迟钝 宽广 狭隘
 898 | 锋利 迟钝 清楚 模糊
 899 | 锋利 迟钝 好事 坏事
 900 | 锋利 迟钝 熟悉 陌生
 901 | 锋利 迟钝 光明 黑暗
 902 | 锋利 迟钝 温暖 寒冷
 903 | 锋利 迟钝 坚强 脆弱
 904 | 锋利 迟钝 简单 复杂
 905 | 锋利 迟钝 增大 减小
 906 | 锋利 迟钝 节约 浪费
 907 | 锋利 迟钝 打开 关闭
 908 | 锋利 迟钝 伟大 渺小
 909 | 锋利 迟钝 天堂 地狱
 910 | 锋利 迟钝 出现 消失
 911 | 锋利 迟钝 团结 分裂
 912 | 锋利 迟钝 希望 失望
 913 | 锋利 迟钝 笔直 弯曲
 914 | 锋利 迟钝 正面 反面
 915 | 锋利 迟钝 积极 消极
 916 | 锋利 迟钝 勤劳 懒惰
 917 | 锋利 迟钝 正常 异常
 918 | 锋利 迟钝 支持 反对
 919 | 细心 粗心 宽敞 狭窄
 920 | 细心 粗心 宽广 狭隘
 921 | 细心 粗心 清楚 模糊
 922 | 细心 粗心 好事 坏事
 923 | 细心 粗心 熟悉 陌生
 924 | 细心 粗心 光明 黑暗
 925 | 细心 粗心 温暖 寒冷
 926 | 细心 粗心 坚强 脆弱
 927 | 细心 粗心 简单 复杂
 928 | 细心 粗心 增大 减小
 929 | 细心 粗心 节约 浪费
 930 | 细心 粗心 打开 关闭
 931 | 细心 粗心 伟大 渺小
 932 | 细心 粗心 天堂 地狱
 933 | 细心 粗心 出现 消失
 934 | 细心 粗心 团结 分裂
 935 | 细心 粗心 希望 失望
 936 | 细心 粗心 笔直 弯曲
 937 | 细心 粗心 正面 反面
 938 | 细心 粗心 积极 消极
 939 | 细心 粗心 勤劳 懒惰
 940 | 细心 粗心 正常 异常
 941 | 细心 粗心 支持 反对
 942 | 宽敞 狭窄 宽广 狭隘
 943 | 宽敞 狭窄 清楚 模糊
 944 | 宽敞 狭窄 好事 坏事
 945 | 宽敞 狭窄 熟悉 陌生
 946 | 宽敞 狭窄 光明 黑暗
 947 | 宽敞 狭窄 温暖 寒冷
 948 | 宽敞 狭窄 坚强 脆弱
 949 | 宽敞 狭窄 简单 复杂
 950 | 宽敞 狭窄 增大 减小
 951 | 宽敞 狭窄 节约 浪费
 952 | 宽敞 狭窄 打开 关闭
 953 | 宽敞 狭窄 伟大 渺小
 954 | 宽敞 狭窄 天堂 地狱
 955 | 宽敞 狭窄 出现 消失
 956 | 宽敞 狭窄 团结 分裂
 957 | 宽敞 狭窄 希望 失望
 958 | 宽敞 狭窄 笔直 弯曲
 959 | 宽敞 狭窄 正面 反面
 960 | 宽敞 狭窄 积极 消极
 961 | 宽敞 狭窄 勤劳 懒惰
 962 | 宽敞 狭窄 正常 异常
 963 | 宽敞 狭窄 支持 反对
 964 | 宽广 狭隘 清楚 模糊
 965 | 宽广 狭隘 好事 坏事
 966 | 宽广 狭隘 熟悉 陌生
 967 | 宽广 狭隘 光明 黑暗
 968 | 宽广 狭隘 温暖 寒冷
 969 | 宽广 狭隘 坚强 脆弱
 970 | 宽广 狭隘 简单 复杂
 971 | 宽广 狭隘 增大 减小
 972 | 宽广 狭隘 节约 浪费
 973 | 宽广 狭隘 打开 关闭
 974 | 宽广 狭隘 伟大 渺小
 975 | 宽广 狭隘 天堂 地狱
 976 | 宽广 狭隘 出现 消失
 977 | 宽广 狭隘 团结 分裂
 978 | 宽广 狭隘 希望 失望
 979 | 宽广 狭隘 笔直 弯曲
 980 | 宽广 狭隘 正面 反面
 981 | 宽广 狭隘 积极 消极
 982 | 宽广 狭隘 勤劳 懒惰
 983 | 宽广 狭隘 正常 异常
 984 | 宽广 狭隘 支持 反对
 985 | 清楚 模糊 好事 坏事
 986 | 清楚 模糊 熟悉 陌生
 987 | 清楚 模糊 光明 黑暗
 988 | 清楚 模糊 温暖 寒冷
 989 | 清楚 模糊 坚强 脆弱
 990 | 清楚 模糊 简单 复杂
 991 | 清楚 模糊 增大 减小
 992 | 清楚 模糊 节约 浪费
 993 | 清楚 模糊 打开 关闭
 994 | 清楚 模糊 伟大 渺小
 995 | 清楚 模糊 天堂 地狱
 996 | 清楚 模糊 出现 消失
 997 | 清楚 模糊 团结 分裂
 998 | 清楚 模糊 希望 失望
 999 | 清楚 模糊 笔直 弯曲
1000 | 清楚 模糊 正面 反面
1001 | 清楚 模糊 积极 消极
1002 | 清楚 模糊 勤劳 懒惰
1003 | 清楚 模糊 正常 异常
1004 | 清楚 模糊 支持 反对
1005 | 好事 坏事 熟悉 陌生
1006 | 好事 坏事 光明 黑暗
1007 | 好事 坏事 温暖 寒冷
1008 | 好事 坏事 坚强 脆弱
1009 | 好事 坏事 简单 复杂
1010 | 好事 坏事 增大 减小
1011 | 好事 坏事 节约 浪费
1012 | 好事 坏事 打开 关闭
1013 | 好事 坏事 伟大 渺小
1014 | 好事 坏事 天堂 地狱
1015 | 好事 坏事 出现 消失
1016 | 好事 坏事 团结 分裂
1017 | 好事 坏事 希望 失望
1018 | 好事 坏事 笔直 弯曲
1019 | 好事 坏事 正面 反面
1020 | 好事 坏事 积极 消极
1021 | 好事 坏事 勤劳 懒惰
1022 | 好事 坏事 正常 异常
1023 | 好事 坏事 支持 反对
1024 | 熟悉 陌生 光明 黑暗
1025 | 熟悉 陌生 温暖 寒冷
1026 | 熟悉 陌生 坚强 脆弱
1027 | 熟悉 陌生 简单 复杂
1028 | 熟悉 陌生 增大 减小
1029 | 熟悉 陌生 节约 浪费
1030 | 熟悉 陌生 打开 关闭
1031 | 熟悉 陌生 伟大 渺小
1032 | 熟悉 陌生 天堂 地狱
1033 | 熟悉 陌生 出现 消失
1034 | 熟悉 陌生 团结 分裂
1035 | 熟悉 陌生 希望 失望
1036 | 熟悉 陌生 笔直 弯曲
1037 | 熟悉 陌生 正面 反面
1038 | 熟悉 陌生 积极 消极
1039 | 熟悉 陌生 勤劳 懒惰
1040 | 熟悉 陌生 正常 异常
1041 | 熟悉 陌生 支持 反对
1042 | 光明 黑暗 温暖 寒冷
1043 | 光明 黑暗 坚强 脆弱
1044 | 光明 黑暗 简单 复杂
1045 | 光明 黑暗 增大 减小
1046 | 光明 黑暗 节约 浪费
1047 | 光明 黑暗 打开 关闭
1048 | 光明 黑暗 伟大 渺小
1049 | 光明 黑暗 天堂 地狱
1050 | 光明 黑暗 出现 消失
1051 | 光明 黑暗 团结 分裂
1052 | 光明 黑暗 希望 失望
1053 | 光明 黑暗 笔直 弯曲
1054 | 光明 黑暗 正面 反面
1055 | 光明 黑暗 积极 消极
1056 | 光明 黑暗 勤劳 懒惰
1057 | 光明 黑暗 正常 异常
1058 | 光明 黑暗 支持 反对
1059 | 温暖 寒冷 坚强 脆弱
1060 | 温暖 寒冷 简单 复杂
1061 | 温暖 寒冷 增大 减小
1062 | 温暖 寒冷 节约 浪费
1063 | 温暖 寒冷 打开 关闭
1064 | 温暖 寒冷 伟大 渺小
1065 | 温暖 寒冷 天堂 地狱
1066 | 温暖 寒冷 出现 消失
1067 | 温暖 寒冷 团结 分裂
1068 | 温暖 寒冷 希望 失望
1069 | 温暖 寒冷 笔直 弯曲
1070 | 温暖 寒冷 正面 反面
1071 | 温暖 寒冷 积极 消极
1072 | 温暖 寒冷 勤劳 懒惰
1073 | 温暖 寒冷 正常 异常
1074 | 温暖 寒冷 支持 反对
1075 | 坚强 脆弱 简单 复杂
1076 | 坚强 脆弱 增大 减小
1077 | 坚强 脆弱 节约 浪费
1078 | 坚强 脆弱 打开 关闭
1079 | 坚强 脆弱 伟大 渺小
1080 | 坚强 脆弱 天堂 地狱
1081 | 坚强 脆弱 出现 消失
1082 | 坚强 脆弱 团结 分裂
1083 | 坚强 脆弱 希望 失望
1084 | 坚强 脆弱 笔直 弯曲
1085 | 坚强 脆弱 正面 反面
1086 | 坚强 脆弱 积极 消极
1087 | 坚强 脆弱 勤劳 懒惰
1088 | 坚强 脆弱 正常 异常
1089 | 坚强 脆弱 支持 反对
1090 | 简单 复杂 增大 减小
1091 | 简单 复杂 节约 浪费
1092 | 简单 复杂 打开 关闭
1093 | 简单 复杂 伟大 渺小
1094 | 简单 复杂 天堂 地狱
1095 | 简单 复杂 出现 消失
1096 | 简单 复杂 团结 分裂
1097 | 简单 复杂 希望 失望
1098 | 简单 复杂 笔直 弯曲
1099 | 简单 复杂 正面 反面
1100 | 简单 复杂 积极 消极
1101 | 简单 复杂 勤劳 懒惰
1102 | 简单 复杂 正常 异常
1103 | 简单 复杂 支持 反对
1104 | 增大 减小 节约 浪费
1105 | 增大 减小 打开 关闭
1106 | 增大 减小 伟大 渺小
1107 | 增大 减小 天堂 地狱
1108 | 增大 减小 出现 消失
1109 | 增大 减小 团结 分裂
1110 | 增大 减小 希望 失望
1111 | 增大 减小 笔直 弯曲
1112 | 增大 减小 正面 反面
1113 | 增大 减小 积极 消极
1114 | 增大 减小 勤劳 懒惰
1115 | 增大 减小 正常 异常
1116 | 增大 减小 支持 反对
1117 | 节约 浪费 打开 关闭
1118 | 节约 浪费 伟大 渺小
1119 | 节约 浪费 天堂 地狱
1120 | 节约 浪费 出现 消失
1121 | 节约 浪费 团结 分裂
1122 | 节约 浪费 希望 失望
1123 | 节约 浪费 笔直 弯曲
1124 | 节约 浪费 正面 反面
1125 | 节约 浪费 积极 消极
1126 | 节约 浪费 勤劳 懒惰
1127 | 节约 浪费 正常 异常
1128 | 节约 浪费 支持 反对
1129 | 打开 关闭 伟大 渺小
1130 | 打开 关闭 天堂 地狱
1131 | 打开 关闭 出现 消失
1132 | 打开 关闭 团结 分裂
1133 | 打开 关闭 希望 失望
1134 | 打开 关闭 笔直 弯曲
1135 | 打开 关闭 正面 反面
1136 | 打开 关闭 积极 消极
1137 | 打开 关闭 勤劳 懒惰
1138 | 打开 关闭 正常 异常
1139 | 打开 关闭 支持 反对
1140 | 伟大 渺小 天堂 地狱
1141 | 伟大 渺小 出现 消失
1142 | 伟大 渺小 团结 分裂
1143 | 伟大 渺小 希望 失望
1144 | 伟大 渺小 笔直 弯曲
1145 | 伟大 渺小 正面 反面
1146 | 伟大 渺小 积极 消极
1147 | 伟大 渺小 勤劳 懒惰
1148 | 伟大 渺小 正常 异常
1149 | 伟大 渺小 支持 反对
1150 | 天堂 地狱 出现 消失
1151 | 天堂 地狱 团结 分裂
1152 | 天堂 地狱 希望 失望
1153 | 天堂 地狱 笔直 弯曲
1154 | 天堂 地狱 正面 反面
1155 | 天堂 地狱 积极 消极
1156 | 天堂 地狱 勤劳 懒惰
1157 | 天堂 地狱 正常 异常
1158 | 天堂 地狱 支持 反对
1159 | 出现 消失 团结 分裂
1160 | 出现 消失 希望 失望
1161 | 出现 消失 笔直 弯曲
1162 | 出现 消失 正面 反面
1163 | 出现 消失 积极 消极
1164 | 出现 消失 勤劳 懒惰
1165 | 出现 消失 正常 异常
1166 | 出现 消失 支持 反对
1167 | 团结 分裂 希望 失望
1168 | 团结 分裂 笔直 弯曲
1169 | 团结 分裂 正面 反面
1170 | 团结 分裂 积极 消极
1171 | 团结 分裂 勤劳 懒惰
1172 | 团结 分裂 正常 异常
1173 | 团结 分裂 支持 反对
1174 | 希望 失望 笔直 弯曲
1175 | 希望 失望 正面 反面
1176 | 希望 失望 积极 消极
1177 | 希望 失望 勤劳 懒惰
1178 | 希望 失望 正常 异常
1179 | 希望 失望 支持 反对
1180 | 笔直 弯曲 正面 反面
1181 | 笔直 弯曲 积极 消极
1182 | 笔直 弯曲 勤劳 懒惰
1183 | 笔直 弯曲 正常 异常
1184 | 笔直 弯曲 支持 反对
1185 | 正面 反面 积极 消极
1186 | 正面 反面 勤劳 懒惰
1187 | 正面 反面 正常 异常
1188 | 正面 反面 支持 反对
1189 | 积极 消极 勤劳 懒惰
1190 | 积极 消极 正常 异常
1191 | 积极 消极 支持 反对
1192 | 勤劳 懒惰 正常 异常
1193 | 勤劳 懒惰 支持 反对
1194 | 正常 异常 支持 反对
1195 | 深 浅 深刻 肤浅
1196 | 前 后 前进 后退
1197 | 前 后 进 退
1198 | 前 后 进攻 退守
1199 | 前进 后退 进 退
1200 | 前进 后退 进攻 退守
1201 | 进 退 进攻 退守
1202 | 进攻 退守 攻 守
1203 | 高 低 升高 降低
1204 | 高 低 高音 低音
1205 | 高 低 高亢 低沉
1206 | 高 低 抬高 贬低
1207 | 高 低 提高 降低
1208 | 升高 降低 高音 低音
1209 | 升高 降低 高亢 低沉
1210 | 升高 降低 抬高 贬低
1211 | 升高 降低 提高 降低
1212 | 高音 低音 高亢 低沉
1213 | 高音 低音 抬高 贬低
1214 | 高音 低音 提高 降低
1215 | 高亢 低沉 抬高 贬低
1216 | 高亢 低沉 提高 降低
1217 | 抬高 贬低 提高 降低
1218 | 内 外 入 出
1219 | 内 外 进来 出去
1220 | 入 出 进来 出去
1221 | 出 入 支出 收入
1222 | 高 矮 高大 矮小
1223 | 上 下 天 地
1224 | 上 下 天堂 地狱
1225 | 上 下 上天 下地
1226 | 上 下 天空 大地
1227 | 上 下 上级 下级
1228 | 上 下 仰视 俯视
1229 | 上 下 起飞 降落
1230 | 上 下 上楼 下楼
1231 | 天 地 天堂 地狱
1232 | 天 地 上天 下地
1233 | 天 地 天空 大地
1234 | 天 地 上级 下级
1235 | 天 地 仰视 俯视
1236 | 天 地 起飞 降落
1237 | 天 地 上楼 下楼
1238 | 天堂 地狱 上天 下地
1239 | 天堂 地狱 天空 大地
1240 | 天堂 地狱 上级 下级
1241 | 天堂 地狱 仰视 俯视
1242 | 天堂 地狱 起飞 降落
1243 | 天堂 地狱 上楼 下楼
1244 | 上天 下地 天空 大地
1245 | 上天 下地 上级 下级
1246 | 上天 下地 仰视 俯视
1247 | 上天 下地 起飞 降落
1248 | 上天 下地 上楼 下楼
1249 | 天空 大地 上级 下级
1250 | 天空 大地 仰视 俯视
1251 | 天空 大地 起飞 降落
1252 | 天空 大地 上楼 下楼
1253 | 上级 下级 仰视 俯视
1254 | 上级 下级 起飞 降落
1255 | 上级 下级 上楼 下楼
1256 | 仰视 俯视 起飞 降落
1257 | 仰视 俯视 上楼 下楼
1258 | 起飞 降落 上楼 下楼
1259 | 大 小 伟大 渺小
1260 | 大 小 增大 减小
1261 | 大 小 提高 降低
1262 | 伟大 渺小 增大 减小
1263 | 伟大 渺小 提高 降低
1264 | 增大 减小 提高 降低
1265 | 彼 此 这 那
1266 | 彼 此 这个 那个
1267 | 彼 此 这些 那些
1268 | 彼 此 这次 那次
1269 | 彼 此 这回 那回
1270 | 彼 此 这里 那里
1271 | 彼 此 这边 那边
1272 | 彼 此 这种 那种
1273 | 彼 此 这家 那家
1274 | 彼 此 这件 那件
1275 | 彼 此 这么 那么
1276 | 这 那 这个 那个
1277 | 这 那 这些 那些
1278 | 这 那 这次 那次
1279 | 这 那 这回 那回
1280 | 这 那 这里 那里
1281 | 这 那 这边 那边
1282 | 这 那 这种 那种
1283 | 这 那 这家 那家
1284 | 这 那 这件 那件
1285 | 这 那 这么 那么
1286 | 这个 那个 这些 那些
1287 | 这个 那个 这次 那次
1288 | 这个 那个 这回 那回
1289 | 这个 那个 这里 那里
1290 | 这个 那个 这边 那边
1291 | 这个 那个 这种 那种
1292 | 这个 那个 这家 那家
1293 | 这个 那个 这件 那件
1294 | 这个 那个 这么 那么
1295 | 这些 那些 这次 那次
1296 | 这些 那些 这回 那回
1297 | 这些 那些 这里 那里
1298 | 这些 那些 这边 那边
1299 | 这些 那些 这种 那种
1300 | 这些 那些 这家 那家
1301 | 这些 那些 这件 那件
1302 | 这些 那些 这么 那么
1303 | 这次 那次 这回 那回
1304 | 这次 那次 这里 那里
1305 | 这次 那次 这边 那边
1306 | 这次 那次 这种 那种
1307 | 这次 那次 这家 那家
1308 | 这次 那次 这件 那件
1309 | 这次 那次 这么 那么
1310 | 这回 那回 这里 那里
1311 | 这回 那回 这边 那边
1312 | 这回 那回 这种 那种
1313 | 这回 那回 这家 那家
1314 | 这回 那回 这件 那件
1315 | 这回 那回 这么 那么
1316 | 这里 那里 这边 那边
1317 | 这里 那里 这种 那种
1318 | 这里 那里 这家 那家
1319 | 这里 那里 这件 那件
1320 | 这里 那里 这么 那么
1321 | 这边 那边 这种 那种
1322 | 这边 那边 这家 那家
1323 | 这边 那边 这件 那件
1324 | 这边 那边 这么 那么
1325 | 这种 那种 这家 那家
1326 | 这种 那种 这件 那件
1327 | 这种 那种 这么 那么
1328 | 这家 那家 这件 那件
1329 | 这家 那家 这么 那么
1330 | 这件 那件 这么 那么
1331 | 公 私 公用 私用
1332 | 公 私 公有 私有
1333 | 公 私 公有制 私有制
1334 | 公用 私用 公有 私有
1335 | 公用 私用 公有制 私有制
1336 | 公有 私有 公有制 私有制
1337 | 早 晚 朝 暮
1338 | 早 晚 潮 汐
1339 | 早 晚 早晨 夜晚
1340 | 早 晚 早上 晚上
1341 | 早 晚 日出 日落
1342 | 早 晚 上班 下班
1343 | 早 晚 上学 放学
1344 | 朝 暮 潮 汐
1345 | 朝 暮 早晨 夜晚
1346 | 朝 暮 早上 晚上
1347 | 朝 暮 日出 日落
1348 | 朝 暮 上班 下班
1349 | 朝 暮 上学 放学
1350 | 潮 汐 早晨 夜晚
1351 | 潮 汐 早上 晚上
1352 | 潮 汐 日出 日落
1353 | 潮 汐 上班 下班
1354 | 潮 汐 上学 放学
1355 | 早晨 夜晚 早上 晚上
1356 | 早晨 夜晚 日出 日落
1357 | 早晨 夜晚 上班 下班
1358 | 早晨 夜晚 上学 放学
1359 | 早上 晚上 日出 日落
1360 | 早上 晚上 上班 下班
1361 | 早上 晚上 上学 放学
1362 | 日出 日落 上班 下班
1363 | 日出 日落 上学 放学
1364 | 上班 下班 上学 放学
1365 | 首 尾 开始 结束
1366 | 首 尾 开幕 闭幕
1367 | 首 尾 上课 下课
1368 | 开始 结束 开幕 闭幕
1369 | 开始 结束 上课 下课
1370 | 开幕 闭幕 上课 下课
1371 | 


--------------------------------------------------------------------------------
/testsets/similarity/radinsky_mturk.txt:
--------------------------------------------------------------------------------
  1 | episcopal	russia	2.75
  2 | water	shortage	2.714285714
  3 | horse	wedding	2.266666667
  4 | plays	losses	3.2
  5 | classics	advertiser	2.25
  6 | latin	credit	2.0625
  7 | ship	ballots	2.3125
  8 | mistake	error	4.352941176
  9 | disease	plague	4.117647059
 10 | sake	shade	2.529411765
 11 | saints	observatory	1.9375
 12 | treaty	wheat	1.8125
 13 | texas	death	1.533333333
 14 | republicans	challenge	2.3125
 15 | body	peaceful	2.058823529
 16 | admiralty	intensity	2.647058824
 17 | body	improving	2.117647059
 18 | heroin	marijuana	3.375
 19 | scottish	commuters	2.6875
 20 | apollo	myth	2.6
 21 | film	cautious	2.125
 22 | exhibition	art	4.117647059
 23 | chocolate	candy	3.764705882
 24 | republic	candidate	2.8125
 25 | gospel	church	4.0625
 26 | momentum	desirable	2.4
 27 | singapore	sanctions	2.117647059
 28 | english	french	3.823529412
 29 | exile	church	2.941176471
 30 | navy	coordinator	2.235294118
 31 | adventure	flood	2.4375
 32 | radar	plane	3.235294118
 33 | pacific	ocean	4.266666667
 34 | scotch	liquor	4.571428571
 35 | kennedy	gun	3
 36 | garfield	cat	2.866666667
 37 | scale	budget	3.5
 38 | rhythm	blues	3.071428571
 39 | rich	privileges	3.2
 40 | navy	withdrawn	1.571428571
 41 | marble	marching	2.615384615
 42 | polo	charged	2.125
 43 | mark	missing	2.333333333
 44 | battleship	army	4.235294118
 45 | medium	organization	2.5625
 46 | pennsylvania	writer	1.466666667
 47 | hamlet	poet	3.882352941
 48 | battle	prisoners	3.705882353
 49 | guild	smith	2.75
 50 | mud	soil	4.235294118
 51 | crime	assaulted	3.941176471
 52 | mussolini	stability	2.133333333
 53 | lincoln	division	2.4375
 54 | slaves	insured	2.2
 55 | summer	winter	4.375
 56 | integration	dignity	3.058823529
 57 | money	quota	2.5
 58 | honolulu	vacation	3.6875
 59 | libya	forged	2.461538462
 60 | cheers	musician	2.823529412
 61 | session	surprises	1.8125
 62 | billion	campaigning	2.571428571
 63 | perjury	soybean	2.0625
 64 | forswearing	perjury	3.3125
 65 | costume	halloween	3.4375
 66 | bulgarian	nurses	1.941176471
 67 | costume	ultimate	2.5
 68 | faith	judging	2.235294118
 69 | france	bridges	2.235294118
 70 | citizenship	casey	2.2
 71 | recreation	dish	1.4
 72 | intelligence	troubles	1.625
 73 | germany	worst	1.4375
 74 | chaos	death	2.75
 75 | sydney	hancock	2.857142857
 76 | sabbath	stevenson	2.214285714
 77 | espionage	passport	2.3125
 78 | political	today	1.6875
 79 | pipe	convertible	2
 80 | scouting	demonstrate	2.5625
 81 | salute	patterns	2.235294118
 82 | reichstag	germany	2.285714286
 83 | radiation	costumes	1.5625
 84 | horace	grief	1.764705882
 85 | sale	rental	3.470588235
 86 | open	close	4.058823529
 87 | photography	proving	2.375
 88 | propaganda	germany	1.705882353
 89 | assassination	forbes	2.071428571
 90 | mirror	duel	1.928571429
 91 | probability	hanging	2.058823529
 92 | africa	theater	1.5
 93 | hell	heaven	4.117647059
 94 | mussolini	italy	3
 95 | composer	beethoven	3.647058824
 96 | minister	forthcoming	1.764705882
 97 | brussels	sweden	3.176470588
 98 | neutral	parish	1.6
 99 | emotion	taxation	1.733333333
100 | louisiana	simple	2
101 | quarantine	disease	3
102 | cannon	imprisoned	2.625
103 | bronze	suspicion	2
104 | pearl	interim	2.352941176
105 | artist	paint	4.117647059
106 | relay	family	2.0625
107 | art	mortality	2.294117647
108 | food	investment	2.25
109 | alt	tenor	2.692307692
110 | catholics	protestant	3.5625
111 | militia	landlord	3.0625
112 | battle	warships	4.176470588
113 | alcohol	fleeing	2.5625
114 | coil	ashes	3.117647059
115 | poland	russia	4
116 | explosive	builders	2.4375
117 | aeronautics	plane	4.277777778
118 | charge	sentence	3.133333333
119 | pet	retiring	2
120 | drink	alcohol	4.352941176
121 | stability	species	2.375
122 | colonies	depression	2
123 | easter	preference	2.0625
124 | genius	intellect	4.090909091
125 | diamond	killed	1.555555556
126 | slavery	african	2.8
127 | jurisdiction	law	4.454545455
128 | saints	repeal	1.555555556
129 | conspiracy	campaign	2.166666667
130 | operator	extracts	2.214285714
131 | physician	action	2.153846154
132 | electronics	guess	1.916666667
133 | slavery	diamond	2.285714286
134 | quarterback	sport	3.142857143
135 | assassination	killed	4.285714286
136 | slavery	klan	2.230769231
137 | heroin	shoot	2.692307692
138 | birds	disturbances	1.692307692
139 | palestinians	turks	2.5
140 | citizenship	court	2.5
141 | immunity	violation	2.076923077
142 | alternative	contend	2.461538462
143 | chile	plates	2.692307692
144 | abraham	stranger	1.846153846
145 | kansas	city	3.769230769
146 | month	year	3.857142857
147 | month	day	3.857142857
148 | amateur	actor	2.333333333
149 | afghanistan	war	3.384615385
150 | transmission	maxwell	2.25
151 | manchester	ambitious	1.923076923
152 | program	battered	1.928571429
153 | drawing	music	2.583333333
154 | exile	pledges	2.307692308
155 | adventure	sixteen	1.538461538
156 | exile	threats	2.166666667
157 | concrete	wings	1.428571429
158 | seizure	bishops	2
159 | submarine	sea	3.857142857
160 | villa	mayor	2.25
161 | trade	farley	2.375
162 | nature	forest	3.636363636
163 | chronicle	young	1.9
164 | radical	bishops	1.818181818
165 | pakistan	radical	2.875
166 | fire	water	4.266666667
167 | gossip	nuisance	3.0625
168 | con	examiner	2.266666667
169 | satellite	space	3.75
170 | essay	boston	2
171 | miniature	statue	3.6
172 | spill	pollution	3.5
173 | minister	council	3.5625
174 | landscape	mountain	3.5625
175 | religion	remedy	2.5625
176 | ship	storm	3.5
177 | college	scientist	2.8125
178 | crystal	oldest	2.5625
179 | afghanistan	wise	2.066666667
180 | trinity	religion	3.133333333
181 | homer	odyssey	2.857142857
182 | parish	clue	2.4375
183 | actress	actor	4.0625
184 | patent	professionals	2.375
185 | chaos	horrible	3.066666667
186 | acre	earthquake	2.125
187 | goverment	immunity	2
188 | football	justice	1.8
189 | gambling	money	3.75
190 | corruption	nervous	1.875
191 | cardinals	villages	2.375
192 | life	death	4.103448276
193 | artillery	sanctions	2.428571429
194 | jerusalem	murdered	2.357142857
195 | cell	brick	3.285714286
196 | knowledge	promoter	2.642857143
197 | adventure	rails	2.571428571
198 | houston	crash	2.357142857
199 | oxford	subcommittee	2.642857143
200 | militia	weapon	3.785714286
201 | manufacturer	meat	1.857142857
202 | damages	reaction	3.071428571
203 | sea	fishing	4.357142857
204 | atomic	clash	2.785714286
205 | broadcasting	athletics	3
206 | mystery	expedition	2.538461538
207 | kremlin	soviets	3.166666667
208 | pig	blaze	1.75
209 | riverside	vietnamese	2.25
210 | bitter	protective	1.923076923
211 | disaster	announced	2.384615385
212 | pork	blaze	2.230769231
213 | feet	international	1.916666667
214 | radical	uniform	2.5
215 | gossip	condemned	2.692307692
216 | mozart	wagner	3.166666667
217 | soccer	boxing	3.4
218 | radical	roles	2.75
219 | rescued	slaying	3
220 | researchers	tested	3.538461538
221 | sales	season	2.307692308
222 | homeless	refugees	3.615384615
223 | pakistan	repair	1.75
224 | athens	painting	2.294117647
225 | tiger	woods	3.375
226 | aircraft	plane	4.473684211
227 | solar	carbon	2.842105263
228 | enterprise	bankruptcy	2.5
229 | homer	springfield	2.833333333
230 | coin	awards	2.166666667
231 | rhodes	native	2.25
232 | soccer	curator	2.125
233 | gasoline	stock	2.888888889
234 | guilt	extended	2.105263158
235 | rapid	singapore	1.764705882
236 | coin	banker	3.631578947
237 | london	correspondence	1.944444444
238 | pop	sex	2.6
239 | medicine	bread	2.176470588
240 | asia	animal	1.555555556
241 | pop	clubhouse	3.210526316
242 | nazi	defensive	2.055555556
243 | earth	poles	3.421052632
244 | thailand	crowded	2.166666667
245 | day	independence	3.473684211
246 | controversy	pitch	2.375
247 | stock	gasoline	3.166666667
248 | composers	mozart	3.833333333
249 | tone	piano	3.722222222
250 | paris	chef	2.111111111
251 | profession	responsible	2.722222222
252 | bankruptcy	chronicle	2
253 | lebanon	war	2.722222222
254 | israel	terror	3.055555556
255 | angola	military	2.941176471
256 | chemistry	patients	2.357142857
257 | munich	constitution	3.071428571
258 | piano	theater	3.266666667
259 | poetry	artist	3.8
260 | acre	burned	1.769230769
261 | religion	abortion	2.076923077
262 | jazz	music	4.533333333
263 | government	transportation	3
264 | color	wine	2.533333333
265 | jackson	quota	1.692307692
266 | shariff	deputy	3.642857143
267 | boat	negroes	2
268 | shooting	sentenced	2.933333333
269 | republicans	friedman	2.416666667
270 | politics	brokerage	2.5
271 | russian	stalin	3.357142857
272 | love	philip	2.5
273 | nuclear	plant	3.733333333
274 | jamaica	queens	3.076923077
275 | dollar	asylum	1.846153846
276 | bridge	rowing	2.785714286
277 | berlin	germany	4
278 | funeral	death	4.714285714
279 | albert	einstein	4.266666667
280 | gulf	shore	3.857142857
281 | ecuador	argentina	3.266666667
282 | britain	france	3.714285714
283 | sports	score	3.866666667
284 | socialism	capitalism	3.785714286
285 | treaty	peace	4.166666667
286 | exchange	market	4.266666667
287 | marriage	anniversary	4.333333333
288 | 


--------------------------------------------------------------------------------
/testsets/similarity/sim999.txt:
--------------------------------------------------------------------------------
   1 | old new 1.58
   2 | smart intelligent 9.2
   3 | hard difficult 8.77
   4 | happy cheerful 9.55
   5 | hard easy 0.95
   6 | fast rapid 8.75
   7 | happy glad 9.17
   8 | short long 1.23
   9 | stupid dumb 9.58
  10 | weird strange 8.93
  11 | wide narrow 1.03
  12 | bad awful 8.42
  13 | easy difficult 0.58
  14 | bad terrible 7.78
  15 | hard simple 1.38
  16 | smart dumb 0.55
  17 | insane crazy 9.57
  18 | happy mad 0.95
  19 | large huge 9.47
  20 | hard tough 8.05
  21 | new fresh 6.83
  22 | sharp dull 0.6
  23 | quick rapid 9.7
  24 | dumb foolish 6.67
  25 | wonderful terrific 8.63
  26 | strange odd 9.02
  27 | happy angry 1.28
  28 | narrow broad 1.18
  29 | simple easy 9.4
  30 | old fresh 0.87
  31 | apparent obvious 8.47
  32 | inexpensive cheap 8.72
  33 | nice generous 5
  34 | weird normal 0.72
  35 | weird odd 9.2
  36 | bad immoral 7.62
  37 | sad funny 0.95
  38 | wonderful great 8.05
  39 | guilty ashamed 6.38
  40 | beautiful wonderful 6.5
  41 | confident sure 8.27
  42 | dumb dense 7.27
  43 | large big 9.55
  44 | nice cruel 0.67
  45 | impatient anxious 6.03
  46 | big broad 6.73
  47 | strong proud 3.17
  48 | unnecessary necessary 0.63
  49 | restless young 1.6
  50 | dumb intelligent 0.75
  51 | bad great 0.35
  52 | difficult simple 0.87
  53 | necessary important 7.37
  54 | bad terrific 0.65
  55 | mad glad 1.45
  56 | honest guilty 1.18
  57 | easy tough 0.52
  58 | easy flexible 4.1
  59 | certain sure 8.42
  60 | essential necessary 8.97
  61 | different normal 1.08
  62 | sly clever 7.25
  63 | crucial important 8.82
  64 | harsh cruel 8.18
  65 | childish foolish 5.5
  66 | scarce rare 9.17
  67 | friendly generous 5.9
  68 | fragile frigid 2.38
  69 | long narrow 3.57
  70 | big heavy 6.18
  71 | rough frigid 2.47
  72 | bizarre strange 9.37
  73 | illegal immoral 4.28
  74 | bad guilty 4.2
  75 | modern ancient 0.73
  76 | new ancient 0.23
  77 | dull funny 0.55
  78 | happy young 2
  79 | easy big 1.12
  80 | great awful 1.17
  81 | tiny huge 0.6
  82 | polite proper 7.63
  83 | modest ashamed 2.65
  84 | exotic rare 8.05
  85 | dumb clever 1.17
  86 | delightful wonderful 8.65
  87 | noticeable obvious 8.48
  88 | afraid anxious 5.07
  89 | formal proper 8.02
  90 | dreary dull 8.25
  91 | delightful cheerful 6.58
  92 | unhappy mad 5.95
  93 | sad terrible 5.4
  94 | sick crazy 3.57
  95 | violent angry 6.98
  96 | laden heavy 5.9
  97 | dirty cheap 1.6
  98 | elastic flexible 7.78
  99 | hard dense 5.9
 100 | recent new 7.05
 101 | bold proud 3.97
 102 | sly strange 1.97
 103 | strange sly 2.07
 104 | dumb rare 0.48
 105 | sly tough 0.58
 106 | terrific mad 0.4
 107 | modest flexible 0.98
 108 | fresh wide 0.4
 109 | huge dumb 0.48
 110 | large flexible 0.48
 111 | dirty narrow 0.3
 112 | wife husband 2.3
 113 | book text 6.35
 114 | groom bride 3.17
 115 | night day 1.88
 116 | south north 2.2
 117 | plane airport 3.65
 118 | uncle aunt 5.5
 119 | horse mare 8.33
 120 | bottom top 0.7
 121 | friend buddy 8.78
 122 | student pupil 9.35
 123 | world globe 6.67
 124 | leg arm 2.88
 125 | plane jet 8.1
 126 | woman man 3.33
 127 | horse colt 7.07
 128 | actress actor 7.12
 129 | teacher instructor 9.25
 130 | movie film 8.87
 131 | bird hawk 7.85
 132 | word dictionary 3.68
 133 | money salary 7.88
 134 | dog cat 1.75
 135 | area region 9.47
 136 | navy army 6.43
 137 | book literature 7.53
 138 | clothes closet 3.27
 139 | sunset sunrise 2.47
 140 | child adult 2.98
 141 | cow cattle 9.52
 142 | book story 5.63
 143 | winter summer 2.38
 144 | taxi cab 9.2
 145 | tree maple 5.53
 146 | bed bedroom 3.4
 147 | roof ceiling 7.58
 148 | disease infection 7.15
 149 | arm shoulder 4.85
 150 | sheep lamb 8.42
 151 | lady gentleman 3.42
 152 | boat anchor 2.25
 153 | priest monk 6.28
 154 | toe finger 4.68
 155 | river stream 7.3
 156 | anger fury 8.73
 157 | date calendar 4.42
 158 | sea ocean 8.27
 159 | second minute 4.62
 160 | hand thumb 3.88
 161 | wood log 7.3
 162 | mud dirt 7.32
 163 | hallway corridor 9.28
 164 | way manner 7.62
 165 | mouse cat 1.12
 166 | cop sheriff 9.05
 167 | death burial 4.93
 168 | music melody 6.98
 169 | beer alcohol 7.5
 170 | mouth lip 7.1
 171 | storm hurricane 6.38
 172 | tax income 2.38
 173 | flower violet 6.95
 174 | paper cardboard 5.38
 175 | floor ceiling 1.73
 176 | beach seashore 8.33
 177 | rod curtain 3.03
 178 | hound fox 2.38
 179 | street alley 5.48
 180 | boat deck 4.28
 181 | car horn 2.57
 182 | friend guest 4.25
 183 | employer employee 3.65
 184 | hand wrist 3.97
 185 | ball cannon 2.58
 186 | alcohol brandy 6.98
 187 | victory triumph 8.98
 188 | telephone booth 3.63
 189 | door doorway 5.4
 190 | motel inn 8.17
 191 | clothes cloth 5.47
 192 | steak meat 7.47
 193 | nail thumb 3.55
 194 | band orchestra 7.08
 195 | book bible 5
 196 | business industry 7.02
 197 | winter season 6.27
 198 | decade century 3.48
 199 | alcohol gin 8.65
 200 | hat coat 2.67
 201 | window door 3.33
 202 | arm wrist 3.57
 203 | house apartment 5.8
 204 | glass crystal 6.27
 205 | wine brandy 5.15
 206 | creator maker 9.62
 207 | dinner breakfast 3.33
 208 | arm muscle 3.72
 209 | bubble suds 8.57
 210 | bread flour 3.33
 211 | death tragedy 5.8
 212 | absence presence 0.4
 213 | gun cannon 5.68
 214 | grass blade 4.57
 215 | ball basket 1.67
 216 | hose garden 1.67
 217 | boy kid 7.5
 218 | church choir 2.95
 219 | clothes drawer 3.02
 220 | tower bell 1.9
 221 | father parent 7.07
 222 | school grade 4.42
 223 | parent adult 5.37
 224 | bar jail 1.9
 225 | car highway 3.4
 226 | dictionary definition 6.25
 227 | door cellar 1.97
 228 | army legion 5.95
 229 | metal aluminum 7.25
 230 | chair bench 6.67
 231 | cloud fog 6
 232 | boy son 6.75
 233 | water ice 6.47
 234 | bed blanket 3.02
 235 | attorney lawyer 9.35
 236 | area zone 8.33
 237 | business company 9.02
 238 | clothes fabric 5.87
 239 | sweater jacket 7.15
 240 | money capital 6.67
 241 | hand foot 4.17
 242 | alcohol cocktail 6.73
 243 | yard inch 3.78
 244 | molecule atom 6.45
 245 | lens camera 4.28
 246 | meal dinner 7.15
 247 | eye tear 3.55
 248 | god devil 1.8
 249 | loop belt 3.1
 250 | rat mouse 7.78
 251 | motor engine 8.65
 252 | car cab 7.42
 253 | cat lion 6.75
 254 | size magnitude 6.33
 255 | reality fantasy 1.03
 256 | door gate 5.25
 257 | cat pet 5.95
 258 | tin aluminum 6.42
 259 | bone jaw 4.17
 260 | cereal wheat 3.75
 261 | house key 1.9
 262 | blood flesh 4.28
 263 | door corridor 3.73
 264 | god spirit 7.3
 265 | capability competence 7.62
 266 | abundance plenty 8.97
 267 | sofa chair 6.67
 268 | wall brick 4.68
 269 | horn drum 2.68
 270 | organ liver 6.15
 271 | strength might 7.07
 272 | phrase word 5.48
 273 | band parade 3.92
 274 | stomach waist 5.9
 275 | cloud storm 5.6
 276 | joy pride 5
 277 | noise rattle 6.17
 278 | rain mist 5.97
 279 | beer beverage 5.42
 280 | man uncle 3.92
 281 | apple juice 2.88
 282 | intelligence logic 6.5
 283 | communication language 7.47
 284 | mink fur 6.83
 285 | mob crowd 7.85
 286 | shore coast 8.83
 287 | wire cord 7.62
 288 | bird turkey 6.58
 289 | bed crib 7.3
 290 | competence ability 7.5
 291 | cloud haze 7.32
 292 | supper meal 7.53
 293 | bar cage 2.8
 294 | water salt 1.3
 295 | sense intuition 7.68
 296 | situation condition 6.58
 297 | crime theft 7.53
 298 | style fashion 8.5
 299 | boundary border 9.08
 300 | arm body 4.05
 301 | boat car 2.37
 302 | sandwich lunch 6.3
 303 | bride princess 2.8
 304 | heroine hero 8.78
 305 | car gauge 1.13
 306 | insect bee 6.07
 307 | crib cradle 8.55
 308 | animal person 3.05
 309 | marijuana herb 6.5
 310 | bed hospital 0.92
 311 | cheek tongue 4.52
 312 | disc computer 3.2
 313 | curve angle 3.33
 314 | grass moss 5
 315 | school law 1.13
 316 | foot head 2.3
 317 | mother guardian 6.5
 318 | orthodontist dentist 8.27
 319 | alcohol whiskey 7.27
 320 | mouth tooth 6.3
 321 | breakfast bacon 4.37
 322 | bathroom bedroom 3.4
 323 | plate bowl 5.23
 324 | meat bacon 5.8
 325 | air helium 3.63
 326 | worker employer 5.37
 327 | body chest 4.45
 328 | son father 3.82
 329 | heart surgery 1.08
 330 | woman secretary 1.98
 331 | man father 4.83
 332 | beach island 5.6
 333 | story topic 5
 334 | game fun 3.42
 335 | weekend week 4
 336 | couple pair 8.33
 337 | woman wife 5.72
 338 | sheep cattle 4.77
 339 | purse bag 8.33
 340 | ceiling cathedral 2.42
 341 | bean coffee 5.15
 342 | wood paper 2.88
 343 | top side 1.9
 344 | crime fraud 5.65
 345 | pain harm 5.38
 346 | lover companion 5.97
 347 | evening dusk 7.78
 348 | father daughter 2.62
 349 | wine liquor 7.85
 350 | cow goat 2.93
 351 | belief opinion 7.7
 352 | reality illusion 1.42
 353 | pact agreement 9.02
 354 | wealth poverty 1.27
 355 | accident emergency 4.93
 356 | battle conquest 7.22
 357 | friend teacher 2.62
 358 | illness infection 6.9
 359 | game trick 2.32
 360 | brother son 3.48
 361 | aunt nephew 3.1
 362 | worker mechanic 4.92
 363 | doctor orthodontist 5.58
 364 | oak maple 6.03
 365 | bee queen 3.27
 366 | car bicycle 3.47
 367 | goal quest 5.83
 368 | august month 5.53
 369 | army squad 5.08
 370 | cloud weather 4.87
 371 | physician doctor 8.88
 372 | canyon valley 6.75
 373 | river valley 1.67
 374 | sun sky 2.27
 375 | target arrow 3.25
 376 | chocolate pie 2.27
 377 | circumstance situation 7.85
 378 | opinion choice 5.43
 379 | rhythm melody 6.12
 380 | gut nerve 4.93
 381 | day dawn 5.47
 382 | cattle beef 7.03
 383 | doctor professor 4.65
 384 | arm vein 3.65
 385 | room bath 3.33
 386 | corporation business 9.02
 387 | fun football 1.97
 388 | hill cliff 4.28
 389 | bone ankle 3.82
 390 | apple candy 2.08
 391 | helper maid 5.58
 392 | leader manager 7.27
 393 | lemon tea 1.6
 394 | bee ant 2.78
 395 | basketball baseball 4.92
 396 | rice bean 2.72
 397 | bed furniture 6.08
 398 | emotion passion 7.72
 399 | anarchy chaos 7.93
 400 | crime violation 7.12
 401 | machine engine 5.58
 402 | beach sea 4.68
 403 | alley bowl 1.53
 404 | jar bottle 7.83
 405 | strength capability 5.28
 406 | seed mustard 3.48
 407 | guitar drum 3.78
 408 | opinion idea 5.7
 409 | north west 3.63
 410 | diet salad 2.98
 411 | mother wife 3.02
 412 | dad mother 3.55
 413 | captain sailor 5
 414 | meter yard 5.6
 415 | beer champagne 4.45
 416 | motor boat 2.57
 417 | card bridge 1.97
 418 | science psychology 4.92
 419 | sinner saint 1.6
 420 | destruction construction 0.98
 421 | crowd bunch 7.42
 422 | beach reef 3.77
 423 | man child 4.13
 424 | bread cheese 1.95
 425 | champion winner 8.73
 426 | celebration ceremony 7.72
 427 | menu order 3.62
 428 | king princess 3.27
 429 | wealth prestige 6.07
 430 | endurance strength 6.58
 431 | danger threat 8.78
 432 | god priest 4.5
 433 | men fraternity 3.13
 434 | buddy companion 8.65
 435 | teacher helper 4.28
 436 | body stomach 3.93
 437 | tongue throat 3.1
 438 | house carpet 1.38
 439 | intelligence skill 5.35
 440 | journey conquest 4.72
 441 | god prey 1.23
 442 | brother soul 0.97
 443 | adversary opponent 9.05
 444 | death catastrophe 4.13
 445 | monster demon 6.95
 446 | day morning 4.87
 447 | man victor 1.9
 448 | friend guy 3.88
 449 | song story 3.97
 450 | ray sunshine 6.83
 451 | guy stud 5.83
 452 | chicken rice 1.43
 453 | box elevator 1.32
 454 | butter potato 1.22
 455 | apartment furniture 1.28
 456 | lake swamp 4.92
 457 | salad vinegar 1.13
 458 | flower bulb 4.48
 459 | cloud mist 6.67
 460 | driver pilot 6.28
 461 | sugar honey 5.13
 462 | body shoulder 2.88
 463 | idea image 3.55
 464 | father brother 4.2
 465 | moon planet 5.87
 466 | ball costume 2.32
 467 | rail fence 5.22
 468 | room bed 2.35
 469 | flower bush 4.25
 470 | bone knee 4.17
 471 | arm knee 2.75
 472 | bottom side 2.63
 473 | vessel vein 5.15
 474 | cat rabbit 2.37
 475 | meat sandwich 2.35
 476 | belief concept 5.08
 477 | intelligence insight 5.9
 478 | attention interest 7.22
 479 | attitude confidence 4.35
 480 | right justice 7.05
 481 | argument agreement 1.45
 482 | depth magnitude 6.12
 483 | medium news 3.65
 484 | winner candidate 2.78
 485 | birthday date 5.08
 486 | fee payment 7.15
 487 | bible hymn 5.15
 488 | exit doorway 5.5
 489 | man sentry 3.25
 490 | aisle hall 6.35
 491 | whiskey gin 6.28
 492 | blood marrow 3.4
 493 | oil mink 1.23
 494 | floor deck 5.55
 495 | roof floor 2.62
 496 | door floor 1.67
 497 | shoulder head 3.42
 498 | wagon carriage 7.7
 499 | car carriage 5.13
 500 | elbow ankle 3.13
 501 | wealth fame 4.02
 502 | sorrow shame 4.77
 503 | administration management 7.25
 504 | communication conversation 8.02
 505 | pollution atmosphere 4.25
 506 | anatomy biology 5.33
 507 | college profession 3.12
 508 | book topic 2.07
 509 | formula equation 7.95
 510 | book information 5
 511 | boy partner 1.9
 512 | sky universe 4.68
 513 | population people 7.68
 514 | college class 4.13
 515 | chief mayor 4.85
 516 | rabbi minister 7.62
 517 | meter inch 5.08
 518 | polyester cotton 5.63
 519 | lawyer banker 1.88
 520 | violin instrument 6.58
 521 | camp cabin 4.2
 522 | pot appliance 2.53
 523 | linen fabric 7.47
 524 | whiskey champagne 5.33
 525 | girl child 5.38
 526 | cottage cabin 7.72
 527 | bird hen 7.03
 528 | racket noise 8.1
 529 | sunset evening 5.98
 530 | drizzle rain 9.17
 531 | adult baby 2.22
 532 | charcoal coal 7.63
 533 | body spine 4.78
 534 | head nail 2.47
 535 | log timber 8.05
 536 | spoon cup 2.02
 537 | body nerve 3.13
 538 | man husband 5.32
 539 | bone neck 2.53
 540 | frustration anger 6.5
 541 | river sea 5.72
 542 | task job 8.87
 543 | club society 5.23
 544 | reflection image 7.27
 545 | prince king 5.92
 546 | snow weather 5.48
 547 | people party 2.2
 548 | boy brother 6.67
 549 | root grass 3.55
 550 | brow eye 3.82
 551 | money pearl 2.1
 552 | money diamond 3.42
 553 | vehicle bus 6.47
 554 | cab bus 5.6
 555 | house barn 4.33
 556 | finger palm 3.33
 557 | car bridge 0.95
 558 | effort difficulty 4.45
 559 | fact insight 4.77
 560 | job management 3.97
 561 | cancer sickness 7.93
 562 | word newspaper 2.47
 563 | composer writer 6.58
 564 | actor singer 4.52
 565 | shelter hut 6.47
 566 | bathroom kitchen 3.1
 567 | cabin hut 6.53
 568 | door kitchen 1.67
 569 | value belief 7.07
 570 | wisdom intelligence 7.47
 571 | ignorance intelligence 1.5
 572 | happiness luck 2.38
 573 | idea scheme 6.75
 574 | mood emotion 8.12
 575 | happiness peace 6.03
 576 | despair misery 7.22
 577 | logic arithmetic 3.97
 578 | denial confession 1.03
 579 | argument criticism 5.08
 580 | aggression hostility 8.48
 581 | hysteria confusion 6.33
 582 | chemistry theory 3.17
 583 | trial verdict 3.33
 584 | comfort safety 5.8
 585 | confidence self 3.12
 586 | vision perception 6.88
 587 | era decade 5.4
 588 | biography fiction 1.38
 589 | discussion argument 5.48
 590 | code symbol 6.03
 591 | danger disease 3
 592 | accident catastrophe 5.9
 593 | journey trip 8.88
 594 | activity movement 7.15
 595 | gossip news 5.22
 596 | father god 3.57
 597 | action course 5.45
 598 | fever illness 7.65
 599 | aviation flight 8.18
 600 | game action 4.85
 601 | molecule air 3.05
 602 | home state 2.58
 603 | word literature 4.77
 604 | adult guardian 6.9
 605 | newspaper information 5.65
 606 | communication television 5.6
 607 | cousin uncle 4.63
 608 | author reader 1.6
 609 | guy partner 3.57
 610 | area corner 2.07
 611 | ballad song 7.53
 612 | wall decoration 2.62
 613 | word page 2.92
 614 | nurse scientist 2.08
 615 | politician president 7.38
 616 | president mayor 5.68
 617 | book essay 4.72
 618 | man warrior 4.72
 619 | article journal 6.18
 620 | breakfast supper 4.4
 621 | crowd parade 3.93
 622 | aisle hallway 6.75
 623 | teacher rabbi 4.37
 624 | hip lip 1.43
 625 | book article 5.43
 626 | room cell 4.58
 627 | box booth 3.8
 628 | daughter kid 4.17
 629 | limb leg 6.9
 630 | liver lung 2.7
 631 | classroom hallway 2
 632 | mountain ledge 3.73
 633 | car elevator 1.03
 634 | bed couch 3.42
 635 | clothes button 2.3
 636 | clothes coat 5.35
 637 | kidney organ 6.17
 638 | apple sauce 1.43
 639 | chicken steak 3.73
 640 | car hose 0.87
 641 | tobacco cigarette 7.5
 642 | student professor 1.95
 643 | baby daughter 5
 644 | pipe cigar 6.03
 645 | milk juice 4.05
 646 | box cigar 1.25
 647 | apartment hotel 3.33
 648 | cup cone 3.17
 649 | horse ox 3.02
 650 | throat nose 2.8
 651 | bone teeth 4.17
 652 | bone elbow 3.78
 653 | bacon bean 1.22
 654 | cup jar 5.13
 655 | proof fact 7.3
 656 | appointment engagement 6.75
 657 | birthday year 1.67
 658 | word clue 2.53
 659 | author creator 8.02
 660 | atom carbon 3.1
 661 | archbishop bishop 7.05
 662 | letter paragraph 4
 663 | page paragraph 3.03
 664 | steeple chapel 7.08
 665 | muscle bone 3.65
 666 | muscle tongue 5
 667 | boy soldier 2.15
 668 | belly abdomen 8.13
 669 | guy girl 3.33
 670 | bed chair 3.5
 671 | clothes jacket 5.15
 672 | gun knife 3.65
 673 | tin metal 5.63
 674 | bottle container 7.93
 675 | hen turkey 6.13
 676 | meat bread 1.67
 677 | arm bone 3.83
 678 | neck spine 5.32
 679 | apple lemon 4.05
 680 | agony grief 7.63
 681 | assignment task 8.7
 682 | night dawn 2.95
 683 | dinner soup 3.72
 684 | calf bull 4.93
 685 | snow storm 4.8
 686 | nail hand 3.42
 687 | dog horse 2.38
 688 | arm neck 1.58
 689 | ball glove 1.75
 690 | flu fever 6.08
 691 | fee salary 3.72
 692 | nerve brain 3.88
 693 | beast animal 7.83
 694 | dinner chicken 2.85
 695 | girl maid 2.93
 696 | child boy 5.75
 697 | alcohol wine 7.42
 698 | nose mouth 3.73
 699 | street car 2.38
 700 | bell door 2.2
 701 | box hat 1.3
 702 | belief impression 5.95
 703 | bias opinion 5.6
 704 | attention awareness 8.73
 705 | anger mood 4.1
 706 | elegance style 5.72
 707 | beauty age 1.58
 708 | book theme 2.58
 709 | friend mother 2.53
 710 | vitamin iron 5.55
 711 | car factory 2.75
 712 | pact condition 2.45
 713 | chapter choice 0.48
 714 | arithmetic rhythm 2.35
 715 | winner presence 1.08
 716 | belief flower 0.4
 717 | winner goal 3.23
 718 | trick size 0.48
 719 | choice vein 0.98
 720 | hymn conquest 0.68
 721 | endurance band 0.4
 722 | jail choice 1.08
 723 | condition boy 0.48
 724 | flower endurance 0.4
 725 | hole agreement 0.3
 726 | doctor temper 0.48
 727 | fraternity door 0.68
 728 | task woman 0.68
 729 | fraternity baseball 0.88
 730 | cent size 0.4
 731 | presence door 0.48
 732 | mouse management 0.48
 733 | task highway 0.48
 734 | liquor century 0.4
 735 | task straw 0.68
 736 | island task 0.3
 737 | night chapter 0.48
 738 | pollution president 0.68
 739 | gun trick 0.48
 740 | bath trick 0.58
 741 | diet apple 1.18
 742 | cent wife 0.58
 743 | chapter tail 0.3
 744 | course stomach 0.58
 745 | hymn straw 0.4
 746 | dentist colonel 0.4
 747 | wife straw 0.4
 748 | hole wife 0.68
 749 | pupil president 0.78
 750 | bath wife 0.48
 751 | people cent 0.48
 752 | formula log 1.77
 753 | woman fur 0.58
 754 | apple sunshine 0.58
 755 | gun dawn 1.18
 756 | meal waist 0.98
 757 | camera president 0.48
 758 | liquor band 0.68
 759 | stomach vein 2.35
 760 | gun fur 0.3
 761 | couch baseball 0.88
 762 | worker camera 0.68
 763 | deck mouse 0.48
 764 | rice boy 0.4
 765 | people gun 0.68
 766 | cliff tail 0.3
 767 | ankle window 0.3
 768 | princess island 0.3
 769 | container mouse 0.3
 770 | wagon container 2.65
 771 | people balloon 0.48
 772 | dollar people 0.4
 773 | bath balloon 0.4
 774 | stomach bedroom 0.4
 775 | bicycle bedroom 0.4
 776 | log bath 0.4
 777 | bowl tail 0.48
 778 | go come 2.42
 779 | take steal 6.18
 780 | listen hear 8.17
 781 | think rationalize 8.25
 782 | occur happen 9.32
 783 | vanish disappear 9.8
 784 | multiply divide 1.75
 785 | plead beg 9.08
 786 | begin originate 8.2
 787 | protect defend 9.13
 788 | kill destroy 5.9
 789 | create make 8.72
 790 | accept reject 0.83
 791 | ignore avoid 6.87
 792 | carry bring 5.8
 793 | leave enter 0.95
 794 | choose elect 7.62
 795 | lose fail 7.33
 796 | encourage discourage 1.58
 797 | achieve accomplish 8.57
 798 | make construct 8.33
 799 | listen obey 4.93
 800 | inform notify 9.25
 801 | receive give 1.47
 802 | borrow beg 2.62
 803 | take obtain 7.1
 804 | advise recommend 8.1
 805 | imitate portray 6.75
 806 | win succeed 7.9
 807 | think decide 5.13
 808 | greet meet 6.17
 809 | agree argue 0.77
 810 | enjoy entertain 5.92
 811 | destroy make 1.6
 812 | save protect 6.58
 813 | give lend 7.22
 814 | understand know 7.47
 815 | take receive 5.08
 816 | accept acknowledge 6.88
 817 | decide choose 8.87
 818 | accept believe 6.75
 819 | keep possess 8.27
 820 | roam wander 8.83
 821 | succeed fail 0.83
 822 | spend save 0.55
 823 | leave go 7.63
 824 | come attend 8.1
 825 | know believe 5.5
 826 | gather meet 7.3
 827 | make earn 7.62
 828 | forget ignore 3.07
 829 | multiply add 2.7
 830 | shrink grow 0.23
 831 | arrive leave 1.33
 832 | succeed try 3.98
 833 | accept deny 1.75
 834 | arrive come 7.05
 835 | agree differ 1.05
 836 | send receive 1.08
 837 | win dominate 5.68
 838 | add divide 2.3
 839 | kill choke 4.92
 840 | acquire get 8.82
 841 | participate join 7.7
 842 | leave remain 2.53
 843 | go enter 4
 844 | take carry 5.23
 845 | forget learn 1.18
 846 | appoint elect 8.17
 847 | engage marry 5.43
 848 | ask pray 3.72
 849 | go send 3.75
 850 | take deliver 4.37
 851 | speak hear 3.02
 852 | analyze evaluate 8.03
 853 | argue rationalize 4.2
 854 | lose keep 1.05
 855 | compare analyze 8.1
 856 | disorganize organize 1.45
 857 | go allow 3.62
 858 | take possess 7.2
 859 | learn listen 3.88
 860 | destroy construct 0.92
 861 | create build 8.48
 862 | steal buy 1.13
 863 | kill hang 4.45
 864 | forget know 0.92
 865 | create imagine 5.13
 866 | do happen 4.23
 867 | win accomplish 7.85
 868 | give deny 1.43
 869 | deserve earn 5.8
 870 | get put 1.98
 871 | locate find 8.73
 872 | appear attend 6.28
 873 | know comprehend 7.63
 874 | pretend imagine 8.47
 875 | satisfy please 7.67
 876 | cherish keep 4.85
 877 | argue differ 5.15
 878 | overcome dominate 6.25
 879 | behave obey 7.3
 880 | cooperate participate 6.43
 881 | achieve try 4.42
 882 | fail discourage 3.33
 883 | begin quit 1.28
 884 | say participate 3.82
 885 | come bring 2.42
 886 | declare announce 9.08
 887 | read comprehend 4.7
 888 | take leave 2.47
 889 | proclaim announce 8.18
 890 | acquire obtain 8.57
 891 | conclude decide 7.75
 892 | please plead 2.98
 893 | argue prove 4.83
 894 | ask plead 6.47
 895 | find disappear 0.77
 896 | inspect examine 8.75
 897 | verify justify 4.08
 898 | assume predict 4.85
 899 | learn evaluate 4.17
 900 | argue justify 5
 901 | make become 4.77
 902 | discover originate 4.83
 903 | achieve succeed 7.5
 904 | give put 3.65
 905 | understand listen 4.68
 906 | expand grow 8.27
 907 | borrow sell 1.73
 908 | keep protect 5.4
 909 | explain prove 4.1
 910 | assume pretend 3.72
 911 | agree please 4.13
 912 | forgive forget 3.92
 913 | clarify explain 8.33
 914 | understand forgive 4.87
 915 | remind forget 0.87
 916 | get remain 1.6
 917 | realize discover 7.47
 918 | require inquire 1.82
 919 | ignore ask 1.07
 920 | think inquire 4.77
 921 | reject avoid 4.78
 922 | argue persuade 6.23
 923 | pursue persuade 3.17
 924 | accept forgive 3.73
 925 | do quit 1.17
 926 | investigate examine 8.1
 927 | discuss explain 6.67
 928 | owe lend 2.32
 929 | explore discover 8.48
 930 | complain argue 4.8
 931 | withdraw reject 6.38
 932 | keep borrow 2.25
 933 | beg ask 6
 934 | arrange organize 8.27
 935 | reduce shrink 8.02
 936 | speak acknowledge 4.67
 937 | give borrow 2.22
 938 | kill defend 2.63
 939 | disappear shrink 5.8
 940 | deliver carry 3.88
 941 | breathe choke 1.37
 942 | acknowledge notify 5.3
 943 | become seem 2.63
 944 | pretend seem 4.68
 945 | accomplish become 4
 946 | contemplate think 8.82
 947 | determine predict 5.8
 948 | please entertain 5
 949 | remain retain 5.75
 950 | pretend portray 7.03
 951 | forget retain 0.63
 952 | want choose 4.78
 953 | lose get 0.77
 954 | try think 2.62
 955 | become appear 4.77
 956 | leave ignore 4.42
 957 | accept recommend 2.75
 958 | leave wander 3.57
 959 | keep give 1.05
 960 | give allow 5.15
 961 | bring send 2.97
 962 | absorb learn 5.48
 963 | acquire find 6.38
 964 | leave appear 0.97
 965 | create destroy 0.63
 966 | begin go 7.42
 967 | get buy 5.08
 968 | collect save 6.67
 969 | replace restore 5.73
 970 | join add 8.1
 971 | join marry 5.35
 972 | accept deliver 1.58
 973 | attach join 7.75
 974 | put hang 3
 975 | go sell 0.97
 976 | communicate pray 3.55
 977 | give steal 0.5
 978 | add build 4.92
 979 | bring restore 2.62
 980 | comprehend satisfy 2.55
 981 | portray decide 1.18
 982 | organize become 1.77
 983 | give know 0.88
 984 | say verify 4.9
 985 | cooperate join 5.18
 986 | arrange require 0.98
 987 | borrow want 1.77
 988 | investigate pursue 7.15
 989 | ignore explore 0.4
 990 | bring complain 0.98
 991 | enter owe 0.68
 992 | portray notify 0.78
 993 | remind sell 0.4
 994 | absorb possess 5
 995 | join acquire 2.85
 996 | send attend 1.67
 997 | gather attend 4.8
 998 | absorb withdraw 2.97
 999 | attend arrive 6.08
1000 | 


--------------------------------------------------------------------------------
/testsets/similarity/ws353.txt:
--------------------------------------------------------------------------------
  1 | love	sex	6.77
  2 | tiger	cat	7.35
  3 | tiger	tiger	10.00
  4 | book	paper	7.46
  5 | computer	keyboard	7.62
  6 | computer	internet	7.58
  7 | plane	car	5.77
  8 | train	car	6.31
  9 | telephone	communication	7.50
 10 | television	radio	6.77
 11 | media	radio	7.42
 12 | drug	abuse	6.85
 13 | bread	butter	6.19
 14 | cucumber	potato	5.92
 15 | doctor	nurse	7.00
 16 | professor	doctor	6.62
 17 | student	professor	6.81
 18 | smart	student	4.62
 19 | smart	stupid	5.81
 20 | company	stock	7.08
 21 | stock	market	8.08
 22 | stock	phone	1.62
 23 | stock	CD	1.31
 24 | stock	jaguar	0.92
 25 | stock	egg	1.81
 26 | fertility	egg	6.69
 27 | stock	live	3.73
 28 | stock	life	0.92
 29 | book	library	7.46
 30 | bank	money	8.12
 31 | wood	forest	7.73
 32 | money	cash	9.15
 33 | professor	cucumber	0.31
 34 | king	cabbage	0.23
 35 | king	queen	8.58
 36 | king	rook	5.92
 37 | bishop	rabbi	6.69
 38 | Jerusalem	Israel	8.46
 39 | Jerusalem	Palestinian	7.65
 40 | holy	sex	1.62
 41 | fuck	sex	9.44
 42 | Maradona	football	8.62
 43 | football	soccer	9.03
 44 | football	basketball	6.81
 45 | football	tennis	6.63
 46 | tennis	racket	7.56
 47 | Arafat	peace	6.73
 48 | Arafat	terror	7.65
 49 | Arafat	Jackson	2.50
 50 | law	lawyer	8.38
 51 | movie	star	7.38
 52 | movie	popcorn	6.19
 53 | movie	critic	6.73
 54 | movie	theater	7.92
 55 | physics	proton	8.12
 56 | physics	chemistry	7.35
 57 | space	chemistry	4.88
 58 | alcohol	chemistry	5.54
 59 | vodka	gin	8.46
 60 | vodka	brandy	8.13
 61 | drink	car	3.04
 62 | drink	ear	1.31
 63 | drink	mouth	5.96
 64 | drink	eat	6.87
 65 | baby	mother	7.85
 66 | drink	mother	2.65
 67 | car	automobile	8.94
 68 | gem	jewel	8.96
 69 | journey	voyage	9.29
 70 | boy	lad	8.83
 71 | coast	shore	9.10
 72 | asylum	madhouse	8.87
 73 | magician	wizard	9.02
 74 | midday	noon	9.29
 75 | furnace	stove	8.79
 76 | food	fruit	7.52
 77 | bird	cock	7.10
 78 | bird	crane	7.38
 79 | tool	implement	6.46
 80 | brother	monk	6.27
 81 | crane	implement	2.69
 82 | lad	brother	4.46
 83 | journey	car	5.85
 84 | monk	oracle	5.00
 85 | cemetery	woodland	2.08
 86 | food	rooster	4.42
 87 | coast	hill	4.38
 88 | forest	graveyard	1.85
 89 | shore	woodland	3.08
 90 | monk	slave	0.92
 91 | coast	forest	3.15
 92 | lad	wizard	0.92
 93 | chord	smile	0.54
 94 | glass	magician	2.08
 95 | noon	string	0.54
 96 | rooster	voyage	0.62
 97 | money	dollar	8.42
 98 | money	cash	9.08
 99 | money	currency	9.04
100 | money	wealth	8.27
101 | money	property	7.57
102 | money	possession	7.29
103 | money	bank	8.50
104 | money	deposit	7.73
105 | money	withdrawal	6.88
106 | money	laundering	5.65
107 | money	operation	3.31
108 | tiger	jaguar	8.00
109 | tiger	feline	8.00
110 | tiger	carnivore	7.08
111 | tiger	mammal	6.85
112 | tiger	animal	7.00
113 | tiger	organism	4.77
114 | tiger	fauna	5.62
115 | tiger	zoo	5.87
116 | psychology	psychiatry	8.08
117 | psychology	anxiety	7.00
118 | psychology	fear	6.85
119 | psychology	depression	7.42
120 | psychology	clinic	6.58
121 | psychology	doctor	6.42
122 | psychology	Freud	8.21
123 | psychology	mind	7.69
124 | psychology	health	7.23
125 | psychology	science	6.71
126 | psychology	discipline	5.58
127 | psychology	cognition	7.48
128 | planet	star	8.45
129 | planet	constellation	8.06
130 | planet	moon	8.08
131 | planet	sun	8.02
132 | planet	galaxy	8.11
133 | planet	space	7.92
134 | planet	astronomer	7.94
135 | precedent	example	5.85
136 | precedent	information	3.85
137 | precedent	cognition	2.81
138 | precedent	law	6.65
139 | precedent	collection	2.50
140 | precedent	group	1.77
141 | precedent	antecedent	6.04
142 | cup	coffee	6.58
143 | cup	tableware	6.85
144 | cup	article	2.40
145 | cup	artifact	2.92
146 | cup	object	3.69
147 | cup	entity	2.15
148 | cup	drink	7.25
149 | cup	food	5.00
150 | cup	substance	1.92
151 | cup	liquid	5.90
152 | jaguar	cat	7.42
153 | jaguar	car	7.27
154 | energy	secretary	1.81
155 | secretary	senate	5.06
156 | energy	laboratory	5.09
157 | computer	laboratory	6.78
158 | weapon	secret	6.06
159 | FBI	fingerprint	6.94
160 | FBI	investigation	8.31
161 | investigation	effort	4.59
162 | Mars	water	2.94
163 | Mars	scientist	5.63
164 | news	report	8.16
165 | canyon	landscape	7.53
166 | image	surface	4.56
167 | discovery	space	6.34
168 | water	seepage	6.56
169 | sign	recess	2.38
170 | Wednesday	news	2.22
171 | mile	kilometer	8.66
172 | computer	news	4.47
173 | territory	surface	5.34
174 | atmosphere	landscape	3.69
175 | president	medal	3.00
176 | war	troops	8.13
177 | record	number	6.31
178 | skin	eye	6.22
179 | Japanese	American	6.50
180 | theater	history	3.91
181 | volunteer	motto	2.56
182 | prejudice	recognition	3.00
183 | decoration	valor	5.63
184 | century	year	7.59
185 | century	nation	3.16
186 | delay	racism	1.19
187 | delay	news	3.31
188 | minister	party	6.63
189 | peace	plan	4.75
190 | minority	peace	3.69
191 | attempt	peace	4.25
192 | government	crisis	6.56
193 | deployment	departure	4.25
194 | deployment	withdrawal	5.88
195 | energy	crisis	5.94
196 | announcement	news	7.56
197 | announcement	effort	2.75
198 | stroke	hospital	7.03
199 | disability	death	5.47
200 | victim	emergency	6.47
201 | treatment	recovery	7.91
202 | journal	association	4.97
203 | doctor	personnel	5.00
204 | doctor	liability	5.19
205 | liability	insurance	7.03
206 | school	center	3.44
207 | reason	hypertension	2.31
208 | reason	criterion	5.91
209 | hundred	percent	7.38
210 | Harvard	Yale	8.13
211 | hospital	infrastructure	4.63
212 | death	row	5.25
213 | death	inmate	5.03
214 | lawyer	evidence	6.69
215 | life	death	7.88
216 | life	term	4.50
217 | word	similarity	4.75
218 | board	recommendation	4.47
219 | governor	interview	3.25
220 | OPEC	country	5.63
221 | peace	atmosphere	3.69
222 | peace	insurance	2.94
223 | territory	kilometer	5.28
224 | travel	activity	5.00
225 | competition	price	6.44
226 | consumer	confidence	4.13
227 | consumer	energy	4.75
228 | problem	airport	2.38
229 | car	flight	4.94
230 | credit	card	8.06
231 | credit	information	5.31
232 | hotel	reservation	8.03
233 | grocery	money	5.94
234 | registration	arrangement	6.00
235 | arrangement	accommodation	5.41
236 | month	hotel	1.81
237 | type	kind	8.97
238 | arrival	hotel	6.00
239 | bed	closet	6.72
240 | closet	clothes	8.00
241 | situation	conclusion	4.81
242 | situation	isolation	3.88
243 | impartiality	interest	5.16
244 | direction	combination	2.25
245 | street	place	6.44
246 | street	avenue	8.88
247 | street	block	6.88
248 | street	children	4.94
249 | listing	proximity	2.56
250 | listing	category	6.38
251 | cell	phone	7.81
252 | production	hike	1.75
253 | benchmark	index	4.25
254 | media	trading	3.88
255 | media	gain	2.88
256 | dividend	payment	7.63
257 | dividend	calculation	6.48
258 | calculation	computation	8.44
259 | currency	market	7.50
260 | OPEC	oil	8.59
261 | oil	stock	6.34
262 | announcement	production	3.38
263 | announcement	warning	6.00
264 | profit	warning	3.88
265 | profit	loss	7.63
266 | dollar	yen	7.78
267 | dollar	buck	9.22
268 | dollar	profit	7.38
269 | dollar	loss	6.09
270 | computer	software	8.50
271 | network	hardware	8.31
272 | phone	equipment	7.13
273 | equipment	maker	5.91
274 | luxury	car	6.47
275 | five	month	3.38
276 | report	gain	3.63
277 | investor	earning	7.13
278 | liquid	water	7.89
279 | baseball	season	5.97
280 | game	victory	7.03
281 | game	team	7.69
282 | marathon	sprint	7.47
283 | game	series	6.19
284 | game	defeat	6.97
285 | seven	series	3.56
286 | seafood	sea	7.47
287 | seafood	food	8.34
288 | seafood	lobster	8.70
289 | lobster	food	7.81
290 | lobster	wine	5.70
291 | food	preparation	6.22
292 | video	archive	6.34
293 | start	year	4.06
294 | start	match	4.47
295 | game	round	5.97
296 | boxing	round	7.61
297 | championship	tournament	8.36
298 | fighting	defeating	7.41
299 | line	insurance	2.69
300 | day	summer	3.94
301 | summer	drought	7.16
302 | summer	nature	5.63
303 | day	dawn	7.53
304 | nature	environment	8.31
305 | environment	ecology	8.81
306 | nature	man	6.25
307 | man	woman	8.30
308 | man	governor	5.25
309 | murder	manslaughter	8.53
310 | soap	opera	7.94
311 | opera	performance	6.88
312 | life	lesson	5.94
313 | focus	life	4.06
314 | production	crew	6.25
315 | television	film	7.72
316 | lover	quarrel	6.19
317 | viewer	serial	2.97
318 | possibility	girl	1.94
319 | population	development	3.75
320 | morality	importance	3.31
321 | morality	marriage	3.69
322 | Mexico	Brazil	7.44
323 | gender	equality	6.41
324 | change	attitude	5.44
325 | family	planning	6.25
326 | opera	industry	2.63
327 | sugar	approach	0.88
328 | practice	institution	3.19
329 | ministry	culture	4.69
330 | problem	challenge	6.75
331 | size	prominence	5.31
332 | country	citizen	7.31
333 | planet	people	5.75
334 | development	issue	3.97
335 | experience	music	3.47
336 | music	project	3.63
337 | glass	metal	5.56
338 | aluminum	metal	7.83
339 | chance	credibility	3.88
340 | exhibit	memorabilia	5.31
341 | concert	virtuoso	6.81
342 | rock	jazz	7.59
343 | museum	theater	7.19
344 | observation	architecture	4.38
345 | space	world	6.53
346 | preservation	world	6.19
347 | admission	ticket	7.69
348 | shower	thunderstorm	6.31
349 | shower	flood	6.03
350 | weather	forecast	8.34
351 | disaster	area	6.25
352 | governor	office	6.34
353 | architecture	century	3.78
354 | 


--------------------------------------------------------------------------------
/testsets/similarity/ws353_relatedness.txt:
--------------------------------------------------------------------------------
  1 | computer	keyboard	7.62
  2 | Jerusalem	Israel	8.46
  3 | planet	galaxy	8.11
  4 | canyon	landscape	7.53
  5 | OPEC	country	5.63
  6 | day	summer	3.94
  7 | day	dawn	7.53
  8 | country	citizen	7.31
  9 | planet	people	5.75
 10 | environment	ecology	8.81
 11 | Maradona	football	8.62
 12 | OPEC	oil	8.59
 13 | money	bank	8.50
 14 | computer	software	8.50
 15 | law	lawyer	8.38
 16 | weather	forecast	8.34
 17 | network	hardware	8.31
 18 | nature	environment	8.31
 19 | FBI	investigation	8.31
 20 | money	wealth	8.27
 21 | psychology	Freud	8.21
 22 | news	report	8.16
 23 | war	troops	8.13
 24 | physics	proton	8.12
 25 | bank	money	8.12
 26 | stock	market	8.08
 27 | planet	constellation	8.06
 28 | credit	card	8.06
 29 | hotel	reservation	8.03
 30 | closet	clothes	8.00
 31 | soap	opera	7.94
 32 | planet	astronomer	7.94
 33 | planet	space	7.92
 34 | movie	theater	7.92
 35 | treatment	recovery	7.91
 36 | baby	mother	7.85
 37 | money	deposit	7.73
 38 | television	film	7.72
 39 | psychology	mind	7.69
 40 | game	team	7.69
 41 | admission	ticket	7.69
 42 | Jerusalem	Palestinian	7.65
 43 | Arafat	terror	7.65
 44 | boxing	round	7.61
 45 | computer	internet	7.58
 46 | money	property	7.57
 47 | tennis	racket	7.56
 48 | telephone	communication	7.50
 49 | currency	market	7.50
 50 | psychology	cognition	7.48
 51 | seafood	sea	7.47
 52 | book	paper	7.46
 53 | book	library	7.46
 54 | psychology	depression	7.42
 55 | fighting	defeating	7.41
 56 | movie	star	7.38
 57 | hundred	percent	7.38
 58 | dollar	profit	7.38
 59 | money	possession	7.29
 60 | cup	drink	7.25
 61 | psychology	health	7.23
 62 | summer	drought	7.16
 63 | investor	earning	7.13
 64 | company	stock	7.08
 65 | stroke	hospital	7.03
 66 | liability	insurance	7.03
 67 | game	victory	7.03
 68 | psychology	anxiety	7.00
 69 | game	defeat	6.97
 70 | FBI	fingerprint	6.94
 71 | money	withdrawal	6.88
 72 | psychology	fear	6.85
 73 | drug	abuse	6.85
 74 | concert	virtuoso	6.81
 75 | computer	laboratory	6.78
 76 | love	sex	6.77
 77 | problem	challenge	6.75
 78 | movie	critic	6.73
 79 | Arafat	peace	6.73
 80 | bed	closet	6.72
 81 | lawyer	evidence	6.69
 82 | fertility	egg	6.69
 83 | precedent	law	6.65
 84 | minister	party	6.63
 85 | psychology	clinic	6.58
 86 | cup	coffee	6.58
 87 | water	seepage	6.56
 88 | government	crisis	6.56
 89 | space	world	6.53
 90 | dividend	calculation	6.48
 91 | victim	emergency	6.47
 92 | luxury	car	6.47
 93 | tool	implement	6.46
 94 | competition	price	6.44
 95 | psychology	doctor	6.42
 96 | gender	equality	6.41
 97 | listing	category	6.38
 98 | video	archive	6.34
 99 | oil	stock	6.34
100 | governor	office	6.34
101 | discovery	space	6.34
102 | record	number	6.31
103 | brother	monk	6.27
104 | production	crew	6.25
105 | nature	man	6.25
106 | family	planning	6.25
107 | disaster	area	6.25
108 | food	preparation	6.22
109 | preservation	world	6.19
110 | movie	popcorn	6.19
111 | lover	quarrel	6.19
112 | game	series	6.19
113 | dollar	loss	6.09
114 | weapon	secret	6.06
115 | shower	flood	6.03
116 | registration	arrangement	6.00
117 | arrival	hotel	6.00
118 | announcement	warning	6.00
119 | game	round	5.97
120 | baseball	season	5.97
121 | drink	mouth	5.96
122 | life	lesson	5.94
123 | grocery	money	5.94
124 | energy	crisis	5.94
125 | reason	criterion	5.91
126 | equipment	maker	5.91
127 | cup	liquid	5.90
128 | deployment	withdrawal	5.88
129 | tiger	zoo	5.87
130 | journey	car	5.85
131 | money	laundering	5.65
132 | summer	nature	5.63
133 | decoration	valor	5.63
134 | Mars	scientist	5.63
135 | alcohol	chemistry	5.54
136 | disability	death	5.47
137 | change	attitude	5.44
138 | arrangement	accommodation	5.41
139 | territory	surface	5.34
140 | size	prominence	5.31
141 | exhibit	memorabilia	5.31
142 | credit	information	5.31
143 | territory	kilometer	5.28
144 | death	row	5.25
145 | doctor	liability	5.19
146 | impartiality	interest	5.16
147 | energy	laboratory	5.09
148 | secretary	senate	5.06
149 | death	inmate	5.03
150 | monk	oracle	5.00
151 | cup	food	5.00
152 | journal	association	4.97
153 | street	children	4.94
154 | car	flight	4.94
155 | space	chemistry	4.88
156 | situation	conclusion	4.81
157 | word	similarity	4.75
158 | peace	plan	4.75
159 | consumer	energy	4.75
160 | ministry	culture	4.69
161 | smart	student	4.62
162 | investigation	effort	4.59
163 | image	surface	4.56
164 | life	term	4.50
165 | start	match	4.47
166 | computer	news	4.47
167 | board	recommendation	4.47
168 | lad	brother	4.46
169 | observation	architecture	4.38
170 | coast	hill	4.38
171 | deployment	departure	4.25
172 | benchmark	index	4.25
173 | attempt	peace	4.25
174 | consumer	confidence	4.13
175 | start	year	4.06
176 | focus	life	4.06
177 | development	issue	3.97
178 | theater	history	3.91
179 | situation	isolation	3.88
180 | profit	warning	3.88
181 | media	trading	3.88
182 | chance	credibility	3.88
183 | precedent	information	3.85
184 | architecture	century	3.78
185 | population	development	3.75
186 | stock	live	3.73
187 | peace	atmosphere	3.69
188 | morality	marriage	3.69
189 | minority	peace	3.69
190 | atmosphere	landscape	3.69
191 | report	gain	3.63
192 | music	project	3.63
193 | seven	series	3.56
194 | experience	music	3.47
195 | school	center	3.44
196 | five	month	3.38
197 | announcement	production	3.38
198 | morality	importance	3.31
199 | money	operation	3.31
200 | delay	news	3.31
201 | governor	interview	3.25
202 | practice	institution	3.19
203 | century	nation	3.16
204 | coast	forest	3.15
205 | shore	woodland	3.08
206 | drink	car	3.04
207 | president	medal	3.00
208 | prejudice	recognition	3.00
209 | viewer	serial	2.97
210 | peace	insurance	2.94
211 | Mars	water	2.94
212 | media	gain	2.88
213 | precedent	cognition	2.81
214 | announcement	effort	2.75
215 | line	insurance	2.69
216 | crane	implement	2.69
217 | drink	mother	2.65
218 | opera	industry	2.63
219 | volunteer	motto	2.56
220 | listing	proximity	2.56
221 | precedent	collection	2.50
222 | cup	article	2.40
223 | sign	recess	2.38
224 | problem	airport	2.38
225 | reason	hypertension	2.31
226 | direction	combination	2.25
227 | Wednesday	news	2.22
228 | glass	magician	2.08
229 | cemetery	woodland	2.08
230 | possibility	girl	1.94
231 | cup	substance	1.92
232 | forest	graveyard	1.85
233 | stock	egg	1.81
234 | month	hotel	1.81
235 | energy	secretary	1.81
236 | precedent	group	1.77
237 | production	hike	1.75
238 | stock	phone	1.62
239 | holy	sex	1.62
240 | stock	CD	1.31
241 | drink	ear	1.31
242 | delay	racism	1.19
243 | stock	life	0.92
244 | stock	jaguar	0.92
245 | monk	slave	0.92
246 | lad	wizard	0.92
247 | sugar	approach	0.88
248 | rooster	voyage	0.62
249 | noon	string	0.54
250 | chord	smile	0.54
251 | professor	cucumber	0.31
252 | king	cabbage	0.23
253 | 


--------------------------------------------------------------------------------
/testsets/similarity/ws353_similarity.txt:
--------------------------------------------------------------------------------
  1 | tiger	cat	7.35
  2 | tiger	tiger	10.00
  3 | plane	car	5.77
  4 | train	car	6.31
  5 | television	radio	6.77
  6 | media	radio	7.42
  7 | bread	butter	6.19
  8 | cucumber	potato	5.92
  9 | doctor	nurse	7.00
 10 | professor	doctor	6.62
 11 | student	professor	6.81
 12 | smart	stupid	5.81
 13 | wood	forest	7.73
 14 | money	cash	9.15
 15 | king	queen	8.58
 16 | king	rook	5.92
 17 | bishop	rabbi	6.69
 18 | fuck	sex	9.44
 19 | football	soccer	9.03
 20 | football	basketball	6.81
 21 | football	tennis	6.63
 22 | Arafat	Jackson	2.50
 23 | physics	chemistry	7.35
 24 | vodka	gin	8.46
 25 | vodka	brandy	8.13
 26 | drink	eat	6.87
 27 | car	automobile	8.94
 28 | gem	jewel	8.96
 29 | journey	voyage	9.29
 30 | boy	lad	8.83
 31 | coast	shore	9.10
 32 | asylum	madhouse	8.87
 33 | magician	wizard	9.02
 34 | midday	noon	9.29
 35 | furnace	stove	8.79
 36 | food	fruit	7.52
 37 | bird	cock	7.10
 38 | bird	crane	7.38
 39 | food	rooster	4.42
 40 | money	dollar	8.42
 41 | money	currency	9.04
 42 | tiger	jaguar	8.00
 43 | tiger	feline	8.00
 44 | tiger	carnivore	7.08
 45 | tiger	mammal	6.85
 46 | tiger	animal	7.00
 47 | tiger	organism	4.77
 48 | tiger	fauna	5.62
 49 | psychology	psychiatry	8.08
 50 | psychology	science	6.71
 51 | psychology	discipline	5.58
 52 | planet	star	8.45
 53 | planet	moon	8.08
 54 | planet	sun	8.02
 55 | precedent	example	5.85
 56 | precedent	antecedent	6.04
 57 | cup	tableware	6.85
 58 | cup	artifact	2.92
 59 | cup	object	3.69
 60 | cup	entity	2.15
 61 | jaguar	cat	7.42
 62 | jaguar	car	7.27
 63 | mile	kilometer	8.66
 64 | skin	eye	6.22
 65 | Japanese	American	6.50
 66 | century	year	7.59
 67 | announcement	news	7.56
 68 | doctor	personnel	5.00
 69 | Harvard	Yale	8.13
 70 | hospital	infrastructure	4.63
 71 | life	death	7.88
 72 | travel	activity	5.00
 73 | type	kind	8.97
 74 | street	place	6.44
 75 | street	avenue	8.88
 76 | street	block	6.88
 77 | cell	phone	7.81
 78 | dividend	payment	7.63
 79 | calculation	computation	8.44
 80 | profit	loss	7.63
 81 | dollar	yen	7.78
 82 | dollar	buck	9.22
 83 | phone	equipment	7.13
 84 | liquid	water	7.89
 85 | marathon	sprint	7.47
 86 | seafood	food	8.34
 87 | seafood	lobster	8.70
 88 | lobster	food	7.81
 89 | lobster	wine	5.70
 90 | championship	tournament	8.36
 91 | man	woman	8.30
 92 | man	governor	5.25
 93 | murder	manslaughter	8.53
 94 | opera	performance	6.88
 95 | Mexico	Brazil	7.44
 96 | glass	metal	5.56
 97 | aluminum	metal	7.83
 98 | rock	jazz	7.59
 99 | museum	theater	7.19
100 | shower	thunderstorm	6.31
101 | monk	oracle	5.00
102 | cup	food	5.00
103 | journal	association	4.97
104 | street	children	4.94
105 | car	flight	4.94
106 | space	chemistry	4.88
107 | situation	conclusion	4.81
108 | word	similarity	4.75
109 | peace	plan	4.75
110 | consumer	energy	4.75
111 | ministry	culture	4.69
112 | smart	student	4.62
113 | investigation	effort	4.59
114 | image	surface	4.56
115 | life	term	4.50
116 | start	match	4.47
117 | computer	news	4.47
118 | board	recommendation	4.47
119 | lad	brother	4.46
120 | observation	architecture	4.38
121 | coast	hill	4.38
122 | deployment	departure	4.25
123 | benchmark	index	4.25
124 | attempt	peace	4.25
125 | consumer	confidence	4.13
126 | start	year	4.06
127 | focus	life	4.06
128 | development	issue	3.97
129 | theater	history	3.91
130 | situation	isolation	3.88
131 | profit	warning	3.88
132 | media	trading	3.88
133 | chance	credibility	3.88
134 | precedent	information	3.85
135 | architecture	century	3.78
136 | population	development	3.75
137 | stock	live	3.73
138 | peace	atmosphere	3.69
139 | morality	marriage	3.69
140 | minority	peace	3.69
141 | atmosphere	landscape	3.69
142 | report	gain	3.63
143 | music	project	3.63
144 | seven	series	3.56
145 | experience	music	3.47
146 | school	center	3.44
147 | five	month	3.38
148 | announcement	production	3.38
149 | morality	importance	3.31
150 | money	operation	3.31
151 | delay	news	3.31
152 | governor	interview	3.25
153 | practice	institution	3.19
154 | century	nation	3.16
155 | coast	forest	3.15
156 | shore	woodland	3.08
157 | drink	car	3.04
158 | president	medal	3.00
159 | prejudice	recognition	3.00
160 | viewer	serial	2.97
161 | peace	insurance	2.94
162 | Mars	water	2.94
163 | media	gain	2.88
164 | precedent	cognition	2.81
165 | announcement	effort	2.75
166 | line	insurance	2.69
167 | crane	implement	2.69
168 | drink	mother	2.65
169 | opera	industry	2.63
170 | volunteer	motto	2.56
171 | listing	proximity	2.56
172 | precedent	collection	2.50
173 | cup	article	2.40
174 | sign	recess	2.38
175 | problem	airport	2.38
176 | reason	hypertension	2.31
177 | direction	combination	2.25
178 | Wednesday	news	2.22
179 | glass	magician	2.08
180 | cemetery	woodland	2.08
181 | possibility	girl	1.94
182 | cup	substance	1.92
183 | forest	graveyard	1.85
184 | stock	egg	1.81
185 | month	hotel	1.81
186 | energy	secretary	1.81
187 | precedent	group	1.77
188 | production	hike	1.75
189 | stock	phone	1.62
190 | holy	sex	1.62
191 | stock	CD	1.31
192 | drink	ear	1.31
193 | delay	racism	1.19
194 | stock	life	0.92
195 | stock	jaguar	0.92
196 | monk	slave	0.92
197 | lad	wizard	0.92
198 | sugar	approach	0.88
199 | rooster	voyage	0.62
200 | noon	string	0.54
201 | chord	smile	0.54
202 | professor	cucumber	0.31
203 | king	cabbage	0.23
204 | 


--------------------------------------------------------------------------------
/word2vec/makefile:
--------------------------------------------------------------------------------
 1 | CC = gcc
 2 | CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
 3 | 
 4 | all: word2vec
 5 | 
 6 | word2vec : word2vec.c
 7 | 	$(CC) word2vec.c -o word2vec $(CFLAGS)
 8 | 
 9 | clean:
10 | 	rm -rf word2vec
11 | 


--------------------------------------------------------------------------------
/word2vec/word2vec.c:
--------------------------------------------------------------------------------
  1 | // Modified by ZheZhao, Renmin University of China
  2 | // fix some bugs such as training info printing 
  3 | // delete irrelevant code
  4 | //
  5 | /////////////////////////////////////////////////////////////////
  6 | // TODO: add total word count to vocabulary, instead of "train_words"
  7 | //
  8 | // Modified by Yoav Goldberg, Jan-Feb 2014
  9 | // Removed:
 10 | //    hierarchical-softmax training
 11 | //    cbow
 12 | // Added:
 13 | //   - support for different vocabularies for words and contexts
 14 | //   - different input syntax
 15 | //
 16 | /////////////////////////////////////////////////////////////////
 17 | //
 18 | //  Copyright 2013 Google Inc. All Rights Reserved.
 19 | //
 20 | //  Licensed under the Apache License, Version 2.0 (the "License");
 21 | //  you may not use this file except in compliance with the License.
 22 | //  You may obtain a copy of the License at
 23 | //
 24 | //      http://www.apache.org/licenses/LICENSE-2.0
 25 | //
 26 | //  Unless required by applicable law or agreed to in writing, software
 27 | //  distributed under the License is distributed on an "AS IS" BASIS,
 28 | //  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 29 | //  See the License for the specific language governing permissions and
 30 | //  limitations under the License.
 31 | 
 32 | #include <stdio.h>
 33 | #include <stdlib.h>
 34 | #include <string.h>
 35 | #include <math.h>
 36 | #include <pthread.h>
 37 | #include <time.h>  // clock(), clock_t, CLOCKS_PER_SEC
 38 | 
 39 | #define MAX_STRING 100
 40 | #define EXP_TABLE_SIZE 1000
 41 | #define MAX_EXP 6
 42 | 
 43 | const long long vocab_hash_size = 70000000;
 44 | typedef float real;
 45 | 
 46 | struct vocab_word {
 47 |   long long cn;
 48 |   char *word;
 49 | };
 50 | 
 51 | struct vocabulary {
 52 |    struct vocab_word *vocab;
 53 |    long long *vocab_hash;
 54 |    long long vocab_max_size; //1000
 55 |    long long vocab_size;
 56 |    long long pairs_num;
 57 | };
 58 | 
 59 | char pairs_file[MAX_STRING];
 60 | char input_vocab_file[MAX_STRING], output_vocab_file[MAX_STRING];
 61 | char input_vector_file[MAX_STRING], output_vector_file[MAX_STRING];
 62 | int debug_mode = 2, num_threads = 1;
 63 | long long vec_size = 100;
 64 | long long pairs_num = 0, pairs_count_actual = 0, file_size = 0, classes = 0;
 65 | real alpha = 0.025, starting_alpha, sample = 0;
 66 | real *syn0, *syn1neg, *expTable;
 67 | clock_t start;
 68 | int numiters = 1;
 69 | 
 70 | struct vocabulary *input_vocab;
 71 | struct vocabulary *output_vocab;
 72 | 
 73 | int negative = 15;
 74 | const long long table_size = 1e8;
 75 | long long *samplingtable;
 76 | 
 77 | void InitSamplingTable(struct vocabulary *v) {
 78 |   long long a, i;
 79 |   double normalizer = 0;  // floating-point sum, so pow() terms are not truncated
 80 |   real d1, power = 0.75;
 81 |   samplingtable = (long long *)malloc(table_size * sizeof(long long));
 82 |   for (a = 0; a < v->vocab_size; a++) normalizer += pow(v->vocab[a].cn, power);
 83 |   i = 0;
 84 |   d1 = pow(v->vocab[i].cn, power) / (real)normalizer;
 85 |   for (a = 0; a < table_size; a++) {
 86 |     samplingtable[a] = i;
 87 |     if (a / (real)table_size > d1) {
 88 |       i++;
 89 |       d1 += pow(v->vocab[i].cn, power) / (real)normalizer;
 90 |     }
 91 |     if (i >= v->vocab_size) i = v->vocab_size - 1;
 92 |   }
 93 | }
 94 | 
 95 | void ReadWord(char *word, FILE *fin) {
 96 |   int a = 0, ch;
 97 |   while (!feof(fin)) {
 98 |     ch = fgetc(fin);
 99 |     if (ch == 13) continue;
100 |     if ((ch == ' ') || (ch == '\t') || (ch == '\n')) {
101 |       if (a > 0) break;
102 |       else continue; 
103 |     }
104 |     word[a] = ch;
105 |     a++;
106 |     if (a >= MAX_STRING - 1) a--;
107 |   }
108 |   word[a] = 0;
109 | }
110 | 
111 | // Returns hash value of a word
112 | unsigned long long GetWordHash(char *word) {
113 |   unsigned long long a, hash = 0;
114 |   for (a = 0; a < strlen(word); a++) hash = hash * 257 + word[a];
115 |   hash = hash % vocab_hash_size;
116 |   return hash;
117 | }
118 | 
119 | 
120 | long long SearchVocab(struct vocabulary *v, char *word) {
121 |   unsigned long long hash = GetWordHash(word);
122 |   while (1) {
123 |     if ((v->vocab_hash)[hash] == -1) return -1;
124 |     if (!strcmp(word, v->vocab[v->vocab_hash[hash]].word)) return v->vocab_hash[hash];
125 |     hash = (hash + 1) % vocab_hash_size;
126 |   }
127 |   return -1;
128 | }
129 | 
130 | // Adds a word to the vocabulary
131 | long long AddWordToVocab(struct vocabulary *v, char *word) {
132 |   unsigned long long hash;
133 |   int length = strlen(word) + 1;
134 |   if (length > MAX_STRING) length = MAX_STRING;
135 |   v->vocab[v->vocab_size].word = (char *)calloc(length, sizeof(char));
136 |   strcpy(v->vocab[v->vocab_size].word, word);
137 |   v->vocab[v->vocab_size].cn = 0;
138 |   v->vocab_size++;
139 |   if (v->vocab_size + 2 >= v->vocab_max_size) {
140 |     v->vocab_max_size += 1000;
141 |     v->vocab = (struct vocab_word *)realloc(v->vocab, v->vocab_max_size * sizeof(struct vocab_word));
142 |   }
143 |   hash = GetWordHash(word);
144 |   while (v->vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size;
145 |   v->vocab_hash[hash] = v->vocab_size - 1;
146 |   return v->vocab_size - 1;
147 | }
148 | 
149 | struct vocabulary *CreateVocabulary() {
150 |    struct vocabulary *v = malloc(sizeof(struct vocabulary));
151 |    long long a;
152 |    v->vocab_max_size = 1000;
153 |    v->vocab_size = 0;
154 |    v->vocab = (struct vocab_word *)calloc(v->vocab_max_size, sizeof(struct vocab_word));
155 |    v->vocab_hash = (long long *)calloc(vocab_hash_size, sizeof(long long));
156 |    for (a = 0; a < vocab_hash_size; a++) v->vocab_hash[a] = -1;
157 |    return v;
158 | }
159 | 
160 | void SaveVocab(struct vocabulary *v, char *save_vocab_file) {
161 |   long long i;
162 |   FILE *fo = fopen(save_vocab_file, "wb");
163 |   for (i = 0; i < v->vocab_size; i++) fprintf(fo, "%s %lld\n", v->vocab[i].word, v->vocab[i].cn);
164 |   fclose(fo);
165 | }
166 | 
167 | // Reads a word and returns its index in the vocabulary
168 | long long ReadWordIndex(struct vocabulary *v, FILE *fin) {
169 |   char word[MAX_STRING];
170 |   ReadWord(word, fin);
171 |   if (feof(fin)) return -1;
172 |   return SearchVocab(v, word);
173 | }
174 | 
175 | struct vocabulary *ReadVocab(char *vocabfile) {
176 |   long long a, i = 0;
177 |   char c;
178 |   char word[MAX_STRING];
179 |   FILE *fin = fopen(vocabfile, "rb");
180 |   if (fin == NULL) {
181 |     printf("Vocabulary file not found\n");
182 |     exit(1);
183 |   }
184 |   struct vocabulary *v = CreateVocabulary();
185 |   v->pairs_num = 0;
186 |   while (1) {
187 |     ReadWord(word, fin);
188 |     if (feof(fin)) break;
189 |     a = AddWordToVocab(v, word);
190 |     fscanf(fin, "%lld%c", &v->vocab[a].cn, &c);
191 |     v->pairs_num += v->vocab[a].cn;
192 |     i++;
193 |   }
194 |   printf("Vocab size: %lld\n", v->vocab_size);
195 |   printf("Number of pairs: %lld\n", v->pairs_num);
196 |   return v;
197 | }
198 | 
199 | long long GetFileSize(char *fname) {
200 |   long long fsize;
201 |   FILE *fin = fopen(fname, "rb");
202 |   if (fin == NULL) {
203 |     printf("ERROR: file not found! %s\n", fname);
204 |     exit(1);
205 |   }
206 |   fseek(fin, 0, SEEK_END);
207 |   fsize = ftell(fin);
208 |   fclose(fin);
209 |   return fsize;
210 | }
211 | 
212 | void InitNet(struct vocabulary *input_vocab, struct vocabulary *output_vocab) {
213 |    long long a, b;
214 |    a = posix_memalign((void **)&syn0, 128, (long long)input_vocab->vocab_size * vec_size * sizeof(real));
215 |    if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);}
216 |    for (b = 0; b < vec_size; b++) 
217 |       for (a = 0; a < input_vocab->vocab_size; a++)
218 |          syn0[a * vec_size + b] = (rand() / (real)RAND_MAX - 0.5) / vec_size;
219 | 
220 |    a = posix_memalign((void **)&syn1neg, 128, (long long)output_vocab->vocab_size * vec_size * sizeof(real));
221 |    if (syn1neg == NULL) {printf("Memory allocation failed\n"); exit(1);}
222 |    for (b = 0; b < vec_size; b++)
223 |       for (a = 0; a < output_vocab->vocab_size; a++)
224 |         syn1neg[a * vec_size + b] = 0;
225 | }
226 | 
227 | void *TrainModelThread(void *id) {
228 |   long long input = -1, output = -1;
229 |   long long d;
230 |   long long pairs_count = 0, last_pairs_count = 0;
231 |   long long l1, l2, c, target, label;
232 |   unsigned long long next_random = (unsigned long long)id;
233 |   real f, g;
234 |   clock_t now;
235 |   real *neu1 = (real *)calloc(vec_size, sizeof(real));
236 |   real *neu1e = (real *)calloc(vec_size, sizeof(real));
237 |   FILE *fi = fopen(pairs_file, "rb");
238 |   long long start_offset = file_size / (long long)num_threads * (long long)id;
239 |   long long end_offset = file_size / (long long)num_threads * (long long)(id+1);
240 |   int iter;
241 |   for (iter=0; iter < numiters; ++iter) {
242 |      fseek(fi, start_offset, SEEK_SET);
243 |      for (int ch = fgetc(fi); ch != '\n' && ch != EOF; ch = fgetc(fi)) { }  // skip to the next line start; stop at EOF
244 |      long long pairs_num = input_vocab->pairs_num;
245 |      while (1) {
246 |         if (pairs_count - last_pairs_count > 10000) {
247 |            pairs_count_actual += pairs_count - last_pairs_count;
248 |            last_pairs_count = pairs_count;
249 |            if ((debug_mode > 1)) {
250 |               now=clock();
251 |               printf("%cAlpha: %f  Progress: %.2f%%  Pairs/thread/sec: %.2fk  ", 13, alpha,
252 |                     pairs_count_actual / (real)(numiters*pairs_num + 1) * 100,
253 |                     pairs_count_actual / ((real)(now - start + 1) / (real)CLOCKS_PER_SEC * 1000));
254 |               fflush(stdout);
255 |            }
256 |            alpha = starting_alpha * (1 - pairs_count_actual / (real)(numiters*pairs_num + 1));
257 |            if (alpha < starting_alpha * 0.0001) alpha = starting_alpha * 0.0001;
258 |         }
259 |         if (feof(fi) || ftell(fi) > end_offset) break;
260 |         for (c = 0; c < vec_size; c++) neu1[c] = 0;
261 |         for (c = 0; c < vec_size; c++) neu1e[c] = 0;
262 |         input = ReadWordIndex(input_vocab, fi);
263 |         output = ReadWordIndex(output_vocab, fi);
264 |         pairs_count++;
265 |         if (input < 0 || output < 0) continue;
266 |         // Negative sampling.
267 |         l1 = input * vec_size;
268 |         for (d = 0; d < negative + 1; d++) {
269 |            if (d == 0) {
270 |               target = output;
271 |               label = 1;
272 |            } else {
273 |               next_random = next_random * (unsigned long long)25214903917 + 11;
274 |               target = samplingtable[(next_random >> 16) % table_size];
275 |               if (target == 0) target = next_random % (output_vocab->vocab_size - 1) + 1;
276 |               if (target == output) continue;
277 |               label = 0;
278 |            }
279 |            l2 = target * vec_size;
280 |            f = 0;
281 |            for (c = 0; c < vec_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
282 |            if (f > MAX_EXP) g = (label - 1) * alpha;
283 |            else if (f < -MAX_EXP) g = (label - 0) * alpha;
284 |            else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
285 |            for (c = 0; c < vec_size; c++) neu1e[c] += g * syn1neg[c + l2];
286 |            for (c = 0; c < vec_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
287 |         }
288 |         for (c = 0; c < vec_size; c++) syn0[c + l1] += neu1e[c];
289 |      }
290 |   }
291 |   fclose(fi);
292 |   free(neu1);
293 |   free(neu1e);
294 |   pthread_exit(NULL);
295 | }
296 | 
297 | void TrainModel() {
298 |   long a, b;
299 |   FILE *fo1;
300 |   FILE *fo2;
301 |   file_size = GetFileSize(pairs_file);
302 |   pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));
303 |   printf("Starting training using file %s\n", pairs_file);
304 |   starting_alpha = alpha;
305 |   input_vocab = ReadVocab(input_vocab_file);
306 |   output_vocab = ReadVocab(output_vocab_file);
307 |   InitNet(input_vocab, output_vocab);
308 |   InitSamplingTable(output_vocab);
309 |   start = clock();
310 |   for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a);
311 |   for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);
312 |   fo1 = fopen(input_vector_file, "wb");
313 |   fprintf(fo1, "%lld %lld\n", input_vocab->vocab_size, vec_size);
314 |   for (a = 0; a < input_vocab->vocab_size; a++) {
315 |     fprintf(fo1, "%s ", input_vocab->vocab[a].word);
316 |     for (b = 0; b < vec_size; b++) fprintf(fo1, "%lf ", syn0[a * vec_size + b]);
317 |     fprintf(fo1, "\n");
318 |   }
319 |   fclose(fo1);
320 | 
321 |   fo2 = fopen(output_vector_file, "wb");
322 |   fprintf(fo2, "%lld %lld\n", output_vocab->vocab_size, vec_size);
323 |   for (a = 0; a < output_vocab->vocab_size; a++) {
324 |       fprintf(fo2, "%s ", output_vocab->vocab[a].word);
325 |       for (b = 0; b < vec_size; b++) fprintf(fo2, "%lf ", syn1neg[a * vec_size + b]);
326 |       fprintf(fo2, "\n");
327 |   }
328 |   fclose(fo2);
329 |   
330 | }
331 | 
332 | int ArgPos(char *str, int argc, char **argv) {
333 |   int a;
334 |   for (a = 1; a < argc; a++) if (!strcmp(str, argv[a])) {
335 |     if (a == argc - 1) {
336 |       printf("Argument missing for %s\n", str);
337 |       exit(1);
338 |     }
339 |     return a;
340 |   }
341 |   return -1;
342 | }
343 | 
344 | int main(int argc, char **argv) {
345 |   int i;
346 |   if (argc == 1) {
347 |     printf("WORD VECTOR estimation toolkit\n\n");
348 |     printf("Options:\n");
349 |     printf("Hyper-parameters for training:\n");
350 |     printf("\t--pairs_file <file>\n");
351 |     printf("\t\tUse pairs data from <file> to train the model\n");
352 |     printf("\t--input_vocab_file <file>\n");
353 |     printf("\t\tInput vocabulary file\n");
354 |     printf("\t--output_vocab_file <file>\n");
355 |     printf("\t\tOutput vocabulary file\n");
356 |     printf("\t--input_vector_file <file>\n");
357 |     printf("\t\tUse <file> to save the resulting input vectors\n");
358 |     printf("\t--output_vector_file <file>\n");
359 |     printf("\t\tUse <file> to save the resulting output vectors\n");
360 | 
361 |     printf("\t--size <int>\n");
362 |     printf("\t\tSet size of word vectors; default is 100\n");
363 |     printf("\t--negative <int>\n");
364 |     printf("\t\tNumber of negative examples; default is 15, common values are 5 - 10 (0 = not used)\n");
365 |     printf("\t--threads_num <int>\n");
366 |     printf("\t\tUse <int> threads (default 1)\n");
367 |     printf("\t--alpha <float>\n");
368 |     printf("\t\tSet the starting learning rate; default is 0.025\n");
369 |     printf("\t--iter <int>\n");
370 |     printf("\t\tPerform <int> iterations over the data; default is 1\n");
371 |     return 0;
372 |   }
373 |   pairs_file[0] = 0;
374 |   input_vocab_file[0] = 0;
375 |   output_vocab_file[0] = 0;
376 |   input_vector_file[0] = 0;
377 |   output_vector_file[0] = 0;
378 |   if ((i = ArgPos((char *)"--pairs_file", argc, argv)) > 0) strcpy(pairs_file, argv[i + 1]);
379 |   if ((i = ArgPos((char *)"--input_vocab_file", argc, argv)) > 0) strcpy(input_vocab_file, argv[i + 1]);
380 |   if ((i = ArgPos((char *)"--output_vocab_file", argc, argv)) > 0) strcpy(output_vocab_file, argv[i + 1]);
381 |   if ((i = ArgPos((char *)"--input_vector_file", argc, argv)) > 0) strcpy(input_vector_file, argv[i + 1]);
382 |   if ((i = ArgPos((char *)"--output_vector_file", argc, argv)) > 0) strcpy(output_vector_file, argv[i + 1]);
383 |   if ((i = ArgPos((char *)"--size", argc, argv)) > 0) vec_size = atoi(argv[i + 1]);
384 |   if ((i = ArgPos((char *)"--debug", argc, argv)) > 0) debug_mode = atoi(argv[i + 1]);
385 |   if ((i = ArgPos((char *)"--alpha", argc, argv)) > 0) alpha = atof(argv[i + 1]);
386 |   if ((i = ArgPos((char *)"--negative", argc, argv)) > 0) negative = atoi(argv[i + 1]);
387 |   if ((i = ArgPos((char *)"--threads_num", argc, argv)) > 0) num_threads = atoi(argv[i + 1]);
388 |   if ((i = ArgPos((char *)"--iter", argc, argv)) > 0) numiters = atoi(argv[i+1]);
389 | 
390 |   if (pairs_file[0] == 0) { printf("must supply --pairs_file.\n\n"); return 0; }
391 |   if (input_vocab_file[0] == 0) { printf("must supply --input_vocab_file.\n\n"); return 0; }
392 |   if (output_vocab_file[0] == 0) { printf("must supply --output_vocab_file.\n\n"); return 0; }
393 |   if (input_vector_file[0] == 0) { printf("must supply --input_vector_file.\n\n"); return 0; }
394 |   if (output_vector_file[0] == 0) { printf("must supply --output_vector_file.\n\n"); return 0; }
395 |   expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
396 |   for (i = 0; i < EXP_TABLE_SIZE; i++) {
397 |     expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); // Precompute the exp() table
398 |     expTable[i] = expTable[i] / (expTable[i] + 1);                   // Precompute f(x) = x / (x + 1)
399 |   }
400 |   TrainModel();
401 |   return 0;
402 | }
403 | 


--------------------------------------------------------------------------------
/word_example.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/sh
 2 | 
 3 | memory_size=4
 4 | cpus_num=4
 5 | corpus=wikipedia
 6 | output_path=outputs/${corpus}/word_word
 7 | 
 8 | mkdir -p ${output_path}/sgns
 9 | mkdir -p ${output_path}/ppmi
10 | mkdir -p ${output_path}/svd
11 | mkdir -p ${output_path}/glove
12 | 
13 | python ngram2vec/corpus2vocab.py --corpus_file ${corpus} --vocab_file ${output_path}/vocab --memory_size ${memory_size}
14 | python ngram2vec/corpus2pairs.py --corpus_file ${corpus} --pairs_file ${output_path}/pairs --vocab_file ${output_path}/vocab --processes_num ${cpus_num}
15 | 
16 | # Concatenate pair files. 
17 | if [ -f "${output_path}/pairs" ]; then
18 | 	rm ${output_path}/pairs
19 | fi
20 | for i in $(seq 0 $((${cpus_num}-1)))
21 | do
22 | 	cat ${output_path}/pairs_${i} >> ${output_path}/pairs
23 | 	rm ${output_path}/pairs_${i}
24 | done
25 | 
26 | # Generate input vocabulary and output vocabulary, which are used as vocabulary files for all models
27 | python ngram2vec/pairs2vocab.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output
28 | 
29 | # SGNS: learn representations from pairs.
30 | # pairs2sgns.py is a Python interface to the C code.
31 | python ngram2vec/pairs2sgns.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --input_vector_file ${output_path}/sgns/sgns.input --output_vector_file ${output_path}/sgns/sgns.output --threads_num ${cpus_num}
32 | 
33 | # SGNS evaluation.
34 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/sgns/sgns.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
35 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/sgns/sgns.input --test_file testsets/analogy/semantic.txt --normalize
36 | 
37 | # Generate co-occurrence matrix from pairs.
38 | python ngram2vec/pairs2counts.py --pairs_file ${output_path}/pairs --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --counts_file ${output_path}/counts --output_id --memory_size ${memory_size}
39 | 
40 | # PPMI: learn representations from counts (co-occurrence matrix).
41 | python ngram2vec/counts2ppmi.py --counts_file ${output_path}/counts --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --ppmi_file ${output_path}/ppmi/ppmi
42 | 
43 | # PPMI evaluation.
44 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/ppmi/ppmi --test_file testsets/similarity/ws353_similarity.txt --normalize --sparse
45 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/ppmi/ppmi --test_file testsets/analogy/semantic.txt --normalize --sparse
46 | 
47 | # SVD, factorize PPMI matrix.
48 | python ngram2vec/ppmi2svd.py --ppmi_file ${output_path}/ppmi/ppmi --svd_file ${output_path}/svd/svd --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output 
49 | 
50 | # SVD evaluation.
51 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/svd/svd.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
52 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/svd/svd.input --test_file testsets/analogy/semantic.txt --normalize
53 | 
54 | # Shuffle counts.
55 | python ngram2vec/shuffle.py --input_file ${output_path}/counts --output_file ${output_path}/counts.shuf --memory_size ${memory_size}
56 | 
57 | # GloVe: learn representations from counts (co-occurrence matrix).
58 | python ngram2vec/counts2glove.py --counts_file ${output_path}/counts.shuf --input_vocab_file ${output_path}/vocab.input --output_vocab_file ${output_path}/vocab.output --input_vector_file ${output_path}/glove/glove.input --output_vector_file ${output_path}/glove/glove.output --threads_num ${cpus_num}
59 | 
60 | # GloVe evaluation.
61 | python ngram2vec/similarity_eval.py --input_vector_file ${output_path}/glove/glove.input  --test_file testsets/similarity/ws353_similarity.txt --normalize
62 | python ngram2vec/analogy_eval.py --input_vector_file ${output_path}/glove/glove.input --test_file testsets/analogy/semantic.txt --normalize
63 | 


--------------------------------------------------------------------------------
/workflow.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhezhaoa/ngram2vec/e57c623a5dfe5df4ea0be3f04ad52045aaa6360f/workflow.jpg


--------------------------------------------------------------------------------