├── .gitignore
├── README.md
├── convolution
│   ├── __init__.py
│   └── cnn.py
├── embeddings
│   ├── __init__.py
│   └── text_embeddings.py
├── eval.py
├── evaluation
│   ├── __init__.py
│   └── confusion_matrix.py
├── graphs
│   ├── __init__.py
│   └── graph.py
├── helpers
│   ├── __init__.py
│   ├── data_helper.py
│   ├── data_shaper.py
│   └── io_helper.py
├── ml
│   ├── __init__.py
│   ├── batcher.py
│   ├── loss_functions.py
│   └── trainer.py
├── nlp.py
├── scaler.py
├── sts
│   ├── __init__.py
│   └── simple_sts.py
├── supervised-scaler.py
├── wfcode
│   ├── __init__.py
│   ├── corpus.py
│   └── scaler.py
└── wordfish.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SemScale
2 | An easy-to-use tool for semantic scaling of political text, based on word embeddings. Check out the working draft of our [political science article](https://arxiv.org/pdf/1904.06217.pdf) (plus its [online appendix](https://umanlp.github.io/semantic-scaling/)) and the [original NLP paper](https://ub-madoc.bib.uni-mannheim.de/42002/1/E17-2109.pdf).
3 |
4 | ## How to use it
5 |
6 | Clone or download the project, then go into the SemScale directory. The script scaler.py needs just the following inputs:
7 |
8 | __datadir__ -> A path to the directory containing the input text
9 | files for scaling (one score will be assigned per
10 | file).
11 |
12 | __embs__ -> A path to the file containing pre-trained word
13 | embeddings
14 |
15 | __output__ -> A file path to which to store the scaling results.
16 |
17 |
18 | optional arguments:
19 |
20 | -h, --help -> show this help message and exit
21 |
22 | --stopwords STOPWORDS -> A path to the file containing stopwords.
23 |
24 | --emb_cutoff EMB_CUTOFF -> A cutoff on the vocabulary size of the embeddings.
25 |
26 | ### Data directory
27 |
28 | The expected input is in the one-text-per-file format. Each text file in the referenced directory should contain a language code (e.g., "en") on the first line, i.e., the format should be "*language*\n*text of the file*", as in the sketch below.
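For example, a hypothetical input file `speech_partyA.txt` for an English text would look as follows (the language code on the first line, the document text below it):

```
en
We will invest in public schools, reform the tax system and protect the environment ...
```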
29 |
30 | ### (Multilingual) Word Embeddings
31 |
32 | For an easy set-up, we provide pre-trained FastText embeddings in a single file for the following five languages: English, French, German, Italian, and Spanish; the file can be obtained from [here](https://drive.google.com/file/d/1Oy61TV0DpruUXOK9qO3IFsvL5DMvwGwD/view?usp=sharing).
33 |
34 | Nonetheless, you can easily use the tool for texts in other languages or with different word embeddings, as long as you:
35 |
36 | 1) provide a (language-prefixed) word embedding file in the following format: for each word, the language abbreviation, a double underscore, the word itself, and then the embedding vector. For instance, each word in a Bulgarian word embedding file would be prefixed with "bg__" (see the sketch after this list);
37 |
38 | 2) in case you employ embeddings in a language other than the five listed above, update the list of supported languages at the beginning of the code file *nlp.py* and at the beginning of the task script you're using (e.g., *scaler.py*).
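A minimal sketch of the expected embedding-file format, assuming whitespace-separated values (the Bulgarian words and the toy 4-dimensional vectors below are made up purely for illustration):

```
bg__народ 0.052 -0.177 0.493 0.101
bg__избори -0.031 0.264 0.090 -0.213
bg__данъци 0.118 0.044 -0.305 0.262
```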
39 |
40 | ### Output File
41 |
42 | A simple .txt file, which will contain one line per input file with the filename and its positional score.
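The output might look roughly like the following (hypothetical filenames and scores, shown only to illustrate the filename/score pairs; the exact delimiter may differ):

```
speech_partyA.txt 0.00
speech_partyB.txt 0.63
speech_partyC.txt 1.00
```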
43 |
44 | ### (Optional) Stopwords
45 |
46 | Stopwords can be automatically excluded by providing this optional input file (one stopword per line).
47 |
48 | ### Prerequisites
49 |
50 | The script requires basic libraries from the Python scientific stack: *numpy* (tested with version 1.12.1), *scipy* (tested with version 0.19.0), and *nltk* (tested with version 3.2.3).
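Assuming a working Python installation, one way to set up these prerequisites is via pip (newer versions of the libraries should generally work as well):

```
pip install numpy scipy nltk
```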
51 |
52 | ## Run it!
53 |
54 | In the SemScale folder, just run the following command:
55 |
56 | ```
57 | python scaler.py path-to-embeddings-file path-to-input-folder output.txt
58 | ```
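If you also want to use the optional arguments listed above, a sketch of the call (assuming a stopword list in `stopwords.txt` and a vocabulary cutoff of 200,000 words) would be:

```
python scaler.py path-to-embeddings-file path-to-input-folder output.txt --stopwords stopwords.txt --emb_cutoff 200000
```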
59 |
60 | ## Other functionalities
61 |
62 | To use the supervised scaling version of our approach (dubbed __SemScores__), just run:
63 |
64 | ```
65 | python supervised-scaler.py
66 | ```
67 |
68 | and add as final arguments the two pivot texts to be used.
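For instance, assuming the same positional arguments as *scaler.py* with the two (hypothetical) pivot text files appended at the end:

```
python supervised-scaler.py path-to-embeddings-file path-to-input-folder output.txt pivot-text-1.txt pivot-text-2.txt
```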
69 |
70 | We also offer a Python implementation of the famous Wordfish algorithm for text scaling. To see how to use it, just run:
71 |
72 | ```
73 | python wordfish.py -h
74 | ```
75 |
76 | Additional functionalities (classification, topical-scaling) are available in the [main branch](https://github.com/codogogo/topfish) of this project.
77 |
78 | ## License
79 |
80 | This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
81 |
82 | ## Referencing
83 |
84 | If you're using this tool, please cite the following paper:
85 |
86 | ```
87 | @InProceedings{glavavs-nanni-ponzetto:2017:EACLshort,
88 | author = {Glava\v{s}, Goran and Nanni, Federico and Ponzetto, Simone Paolo},
89 | title = {Unsupervised Cross-Lingual Scaling of Political Texts},
90 | booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
91 | month = {April},
92 | year = {2017},
93 | address = {Valencia, Spain},
94 | publisher = {Association for Computational Linguistics},
95 | pages = {688--693},
96 | url = {http://www.aclweb.org/anthology/E17-2109}
97 | }
98 | ```
99 |
--------------------------------------------------------------------------------
/convolution/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/convolution/__init__.py
--------------------------------------------------------------------------------
/convolution/cnn.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | from helpers import io_helper
4 |
5 | def load_labels_and_max_length(path):
6 | parameters, model = io_helper.deserialize(path)
7 | return parameters["dist_labels"], parameters["max_text_length"]
8 |
9 | def load_model(path, embeddings, loss_function, just_predict = True):
10 | parameters, model = io_helper.deserialize(path)
11 |
12 | print("Defining and initializing model...")
13 | classifier = CNN(embeddings = (parameters["embedding_size"], embeddings), num_conv_layers = parameters["num_convolutions"], filters = parameters["filters"], k_max_pools = parameters["k_max_pools"], manual_features_size = parameters["manual_features_size"])
14 | classifier.define_model(parameters["max_text_length"], parameters["num_classes"], loss_function, -1, l2_reg_factor = parameters["reg_factor"], update_embeddings = parameters["upd_embs"])
15 | if not just_predict:
16 | classifier.define_optimization(learning_rate = parameters["learning_rate"])
17 |
18 | print("Initializing session...", flush = True)
19 | session = tf.InteractiveSession()
20 | session.run(tf.global_variables_initializer())
21 |
22 | classifier.set_variable_values(session, model)
23 | classifier.set_distinct_labels(parameters["dist_labels"])
24 |
25 | return classifier, session
26 |
27 | class CNN(object):
28 | """
29 | A general convolutional neural network for text classification.
30 | The CNN is highly customizable, the user may determine the number of convolutional and pooling layers and all other parameters of the network (e.g., the number of filters and filter sizes)
31 | Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
32 | """
33 |
34 | def __init__(self, embeddings = (100, None), num_conv_layers = 1, filters = [[(3, 64), (4, 128), (5, 64)]], k_max_pools = [1], manual_features_size = 0):
35 | self.emb_size = embeddings[0]
36 | self.embs = embeddings[1]
37 | self.num_convolutions = num_conv_layers
38 | self.filters = filters
39 | self.k_max_pools = k_max_pools
40 |
41 | self.variable_memory = {}
42 | self.manual_features_size = manual_features_size
43 |
44 | def define_model(self, max_text_length, num_classes, loss_function, vocab_size, l2_reg_factor = 0.0, update_embeddings = False):
45 | self.update_embeddings = update_embeddings
46 | self.reg_factor = l2_reg_factor
47 | self.max_text_length = max_text_length
48 | self.num_classes = num_classes
49 | self.loss_function = loss_function
50 |
51 | self.input_x = tf.placeholder(tf.int32, [None, max_text_length], name="input_x")
52 | if self.manual_features_size > 0:
53 | self.manual_features = tf.placeholder(tf.float32, [None, self.manual_features_size], name="man_feats")
54 | self.dropout = tf.placeholder(tf.float32, name="dropout")
55 |
56 | if self.embs is None:
57 | self.W_embeddings = tf.Variable(tf.random_uniform([vocab_size, self.emb_size], -1.0, 1.0), name="W_embeddings")
58 | elif update_embeddings:
59 | self.W_embeddings = tf.Variable(self.embs, dtype = tf.float32, name="W_embeddings")
60 | else:
61 | self.W_embeddings = tf.constant(self.embs, dtype = tf.float32, name="W_embeddings")
62 |
63 | self.mb_embeddings = tf.expand_dims(tf.nn.embedding_lookup(self.W_embeddings, self.input_x), -1)
64 |
65 | for i in range(self.num_convolutions):
66 | current_filters = self.filters[i]
67 | current_max_pool_size = self.k_max_pools[i]
68 |
69 | if i > 0:
70 | pooled = tf.reshape(pooled, [-1, self.k_max_pools[i - 1], sum_filt, 1])
71 |
72 | input = self.mb_embeddings if i == 0 else pooled
73 | input_dim = self.emb_size if i == 0 else sum_filt
74 | num_units = max_text_length if i == 0 else self.k_max_pools[i - 1]
75 |
76 | sum_filt = 0
77 | for filter_size, num_filters in current_filters:
78 | filter_shape = [filter_size, input_dim, 1, num_filters]
79 |
80 | W_conv = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1, dtype = tf.float32), name="W_conv_" + str(i) + "_" + str(filter_size))
81 | self.variable_memory["W_conv_" + str(i) + "_" + str(filter_size)] = W_conv
82 |
83 | b_conv = tf.Variable(tf.constant(0.1, shape=[num_filters], dtype = tf.float32), name="b_" + str(i) + "_" + str(filter_size))
84 | self.variable_memory["b_" + str(i) + "_" + str(filter_size)] = b_conv
85 |
86 | conv = tf.nn.conv2d(input, W_conv, strides=[1, 1, 1, 1], padding="VALID", name="conv_" + str(i) + "_" + str(filter_size))
87 | h = tf.nn.relu(tf.nn.bias_add(conv, b_conv), name="relu" + str(i) + "_" + str(filter_size))
88 |
89 | if sum_filt == 0:
90 | pooled = tf.nn.max_pool(h, ksize=[1, (num_units - filter_size + 1) - current_max_pool_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool_" + str(i) + "_" + str(filter_size))
91 |
92 | else:
93 | new_pool = tf.nn.max_pool(h, ksize=[1, (num_units - filter_size + 1) - current_max_pool_size + 1, 1, 1], strides=[1, 1, 1, 1], padding='VALID', name="pool_" + str(i) + "_" + str(filter_size))
94 | pooled = tf.concat(axis=3, values=[pooled, new_pool])
95 |
96 | sum_filt += num_filters
97 |
98 | self.pooled_flat = tf.reshape(pooled, [-1, self.k_max_pools[-1] * sum_filt])
99 | self.pooled_dropout = tf.nn.dropout(self.pooled_flat, self.dropout)  # note: in TF1, the second argument of tf.nn.dropout is the keep probability
100 |
101 | W_softmax = tf.get_variable("W_softmax", shape=[self.k_max_pools[-1] * sum_filt + self.manual_features_size, num_classes], initializer=tf.contrib.layers.xavier_initializer(), dtype = tf.float32)
102 | self.variable_memory["W_softmax"] = W_softmax
103 |
104 | b_softmax = tf.Variable(tf.constant(0.1, shape=[num_classes], dtype = tf.float32), name="b_softmax")
105 | self.variable_memory["b_softmax"] = b_softmax
106 |
107 | self.final_features = tf.concat(axis=1, values=[self.pooled_dropout, self.manual_features]) if self.manual_features_size > 0 else self.pooled_dropout
108 | self.preds = tf.nn.xw_plus_b(self.final_features, W_softmax, b_softmax, name="scores")
109 | #self.preds_sftmx = tf.nn.softmax(self.preds)
110 |
111 | self.l2_loss = tf.constant(0.0)
112 | self.l2_loss += tf.nn.l2_loss(W_softmax)
113 | self.l2_loss += tf.nn.l2_loss(b_softmax)
114 |
115 |
116 | def define_optimization(self, learning_rate = 1e-3):
117 | self.input_y = tf.placeholder(tf.float32, [None, self.num_classes], name="input_y")
118 | self.pure_loss = self.loss_function(self.preds, self.input_y)
119 | self.loss = self.pure_loss + self.reg_factor * self.l2_loss
120 |
121 | self.learning_rate = learning_rate
122 | self.train_step = tf.train.RMSPropOptimizer(learning_rate).minimize(self.loss)
123 |
124 | def set_distinct_labels(self, dist_labels):
125 | self.dist_labels = dist_labels
126 |
127 | def get_feed_dict(self, input_data, labels, dropout, manual_feats = None):
128 | fd_mine = { self.input_x : input_data, self.dropout : dropout }
129 | if labels is not None:
130 | fd_mine.update({self.input_y : labels})
131 | if manual_feats is not None:
132 | fd_mine.update({self.manual_features : manual_feats})
133 | return fd_mine
134 |
135 | def get_variable_values(self, session):
136 | variables = {}
137 | for v in self.variable_memory:
138 | value = self.variable_memory[v].eval(session = session)
139 | variables[v] = value
140 | return variables
141 |
142 | def set_variable_values(self, session, var_values):
143 | for v in var_values:
144 | session.run(self.variable_memory[v].assign(var_values[v]))
145 |
146 | def get_hyperparameters(self):
147 | params = { "embedding_size" : self.emb_size,
148 | "num_convolutions" : self.num_convolutions,
149 | "filters" : self.filters,
150 | "k_max_pools" : self.k_max_pools,
151 | "upd_embs" : self.update_embeddings,
152 | "reg_factor" : self.reg_factor,
153 | "learning_rate" : self.learning_rate,
154 | "manual_features_size" : self.manual_features_size,
155 | "max_text_length" : self.max_text_length,
156 | "num_classes" : self.num_classes,
157 | "dist_labels" : self.dist_labels }
158 | return params
159 |
160 | def get_model(self, session):
161 | return [self.get_hyperparameters(), self.get_variable_values(session)]
162 |
163 | def serialize(self, session, path):
164 | variables = self.get_variable_values(session)
165 | to_serialize = [self.get_hyperparameters(), variables]
166 | io_helper.serialize(to_serialize, path)
167 |
168 |
169 |
170 |
171 |
--------------------------------------------------------------------------------
/embeddings/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/embeddings/__init__.py
--------------------------------------------------------------------------------
/embeddings/text_embeddings.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from helpers import io_helper as ioh
3 | import codecs
4 | from helpers import io_helper
5 |
6 | def aggregate_phrase_embedding(words, stopwords, embs, emb_size, l2_norm_vec = True, lang = 'en'):
7 | vec_res = np.zeros(emb_size)
8 | fit_words = [w.lower() for w in words if w.lower() not in stopwords and w.lower() in embs.lang_vocabularies[lang]]
9 | if len(fit_words) == 0:
10 | return None
11 |
12 | for w in fit_words:
13 | vec_res += embs.get_vector(lang, w)
14 | res = np.multiply(1.0 / (float(len(fit_words))), vec_res)
15 | if l2_norm_vec:
16 | res = np.multiply(1.0 / np.linalg.norm(res), res)
17 | return res
18 |
19 |
20 | class Embeddings(object):
21 | """Captures functionality to load and store textual embeddings"""
22 |
23 | def __init__(self, cache_similarities = False):
24 | self.lang_embeddings = {}
25 | self.lang_emb_norms = {}
26 | self.lang_vocabularies = {}
27 | self.emb_sizes = {}
28 | self.cache = {}
29 | self.do_cache = cache_similarities
30 |
31 | def inverse_vocabularies(self):
32 | self.inverse_vocabularies = {}  # note: this instance attribute shadows the method of the same name after the first call
33 | for l in self.lang_vocabularies:
34 | self.inverse_vocabularies[l] = {v: k for k, v in self.lang_vocabularies[l].items()}
35 |
36 | def get_word_from_index(self, index, lang = 'en'):
37 | if index in self.inverse_vocabularies[lang]:
38 | return self.inverse_vocabularies[lang][index]
39 | else:
40 | return None
41 |
42 | def get_vector(self, lang, word):
43 | if word in self.lang_vocabularies[lang]:
44 | return self.lang_embeddings[lang][self.lang_vocabularies[lang][word]]
45 | else:
46 | return None
47 |
48 | def set_vector(self, lang, word, vector):
49 | if word in self.lang_vocabularies[lang]:
50 | self.lang_embeddings[lang][self.lang_vocabularies[lang][word]] = vector
51 |
52 | def get_norm(self, lang, word):
53 | if word in self.lang_vocabularies[lang]:
54 | return self.lang_emb_norms[lang][self.lang_vocabularies[lang][word]]
55 | else:
56 | return None
57 |
58 | def set_norm(self, lang, word, norm):
59 | if word in self.lang_vocabularies[lang]:
60 | self.lang_emb_norms[lang][self.lang_vocabularies[lang][word]] = norm
61 |
62 | def add_word(self, lang, word, vector = None):
63 | if word not in self.lang_vocabularies[lang]:
64 | self.lang_vocabularies[lang][word] = len(self.lang_vocabularies[lang])
65 | rvec = np.random.uniform(-1.0, 1.0, size = [self.emb_sizes[lang]]) if vector is None else vector
66 | rnrm = np.linalg.norm(rvec, 2)
67 | self.lang_embeddings[lang] = np.vstack((self.lang_embeddings[lang], rvec))
68 | self.lang_emb_norms[lang] = np.concatenate((self.lang_emb_norms[lang], [rnrm]))
69 |
70 | def remove_word(self, lang, word):
71 | self.lang_vocabularies[lang].pop(word, None)
72 |
73 | def load_embeddings(self, filepath, limit, language = 'en', print_loading = False, skip_first_line = False, min_one_letter = False, special_tokens = None):
74 | vocabulary, embs, norms = ioh.load_embeddings_dict_with_norms(filepath, limit = limit, special_tokens = special_tokens, print_load_progress = print_loading, skip_first_line = skip_first_line, min_one_letter = min_one_letter)
75 | self.lang_embeddings[language] = embs
76 | self.lang_emb_norms[language] = norms
77 | self.emb_sizes[language] = embs.shape[1]
78 | self.lang_vocabularies[language] = vocabulary
79 |
80 |
81 | def word_similarity(self, first_word, second_word, first_language = 'en', second_language = 'en'):
82 | if self.do_cache:
83 | cache_str = min(first_word, second_word) + "-" + max(first_word, second_word)
84 | if (first_language + "-" + second_language) in self.cache and cache_str in self.cache[first_language + "-" + second_language]:
85 | return self.cache[first_language + "-" + second_language][cache_str]
86 | elif (first_word not in self.lang_vocabularies[first_language] and first_word.lower() not in self.lang_vocabularies[first_language]) or (second_word not in self.lang_vocabularies[second_language] and second_word.lower() not in self.lang_vocabularies[second_language]):
87 | if ((first_word in second_word or second_word in first_word) and first_language == second_language) or first_word == second_word:
88 | return 1
89 | else:
90 | return 0
91 |
92 | index_first = self.lang_vocabularies[first_language][first_word] if first_word in self.lang_vocabularies[first_language] else (self.lang_vocabularies[first_language][first_word.lower()] if first_word.lower() in self.lang_vocabularies[first_language] else -1)
93 | index_second = self.lang_vocabularies[second_language][second_word] if second_word in self.lang_vocabularies[second_language] else (self.lang_vocabularies[second_language][second_word.lower()] if second_word.lower() in self.lang_vocabularies[second_language] else -1)
94 |
95 | if index_first >= 0 and index_second >= 0:
96 | first_emb = self.lang_embeddings[first_language][index_first]
97 | second_emb = self.lang_embeddings[second_language][index_second]
98 |
99 | first_norm = self.lang_emb_norms[first_language][index_first]
100 | second_norm = self.lang_emb_norms[second_language][index_second]
101 |
102 | score = np.dot(first_emb, second_emb) / (first_norm * second_norm)
103 | else:
104 | score = 0
105 |
106 | if self.do_cache:
107 | if (first_language + "-" + second_language) not in self.cache:
108 | self.cache[first_language + "-" + second_language] = {}
109 | if cache_str not in self.cache[first_language + "-" + second_language]:
110 | self.cache[first_language + "-" + second_language][cache_str] = score
111 | return score
112 |
113 | def most_similar(self, embedding, target_lang, num, similarity = True):
114 | ms = []
115 | for w in self.lang_vocabularies[target_lang]:
116 | targ_w_emb = self.get_vector(target_lang, w)
117 | if len(embedding) != len(targ_w_emb):
118 | print("Unaligned embedding length: " + w)
119 | else:
120 | if similarity:
121 | nrm = np.linalg.norm(embedding, 2)
122 | trg_nrm = self.get_norm(target_lang, w)
123 | sim = np.dot(embedding, targ_w_emb) / (nrm * trg_nrm)
124 | if (len(ms) < num) or (sim > ms[-1][1]):
125 | ms.append((w, sim))
126 | ms.sort(key = lambda x: x[1], reverse = True)
127 | else:
128 | dist = np.linalg.norm(embedding - targ_w_emb)
129 | if (len(ms) < num) or (dist < ms[-1][1]):
130 | ms.append((w, dist))
131 | ms.sort(key = lambda x: x[1])
132 | if len(ms) > num:
133 | ms.pop()
134 | return ms
135 |
136 | def merge_embedding_spaces(self, languages, emb_size, merge_name = 'merge', lang_prefix_delimiter = '__', special_tokens = None):
137 | print("Merging embedding spaces...")
138 | merge_vocabulary = {}
139 | merge_embs = []
140 | merge_norms = []
141 |
142 | for lang in languages:
143 | print("For language: " + lang)
144 | norms =[]
145 | embs = []
146 | for word in self.lang_vocabularies[lang]:
147 | if special_tokens is None or word not in special_tokens:
148 | merge_vocabulary[lang + lang_prefix_delimiter + word] = len(merge_vocabulary)
149 | else:
150 | merge_vocabulary[word] = len(merge_vocabulary)
151 | embs.append(self.get_vector(lang, word))
152 | norms.append(self.get_norm(lang, word))
153 | merge_embs = np.copy(embs) if len(merge_embs) == 0 else np.vstack((merge_embs, embs))
154 | merge_norms = np.copy(norms) if len(merge_norms) == 0 else np.concatenate((merge_norms, norms))
155 |
156 | self.lang_vocabularies[merge_name] = merge_vocabulary
157 | self.lang_embeddings[merge_name] = merge_embs
158 | self.lang_emb_norms[merge_name] = merge_norms
159 | self.emb_sizes[merge_name] = emb_size
160 |
161 | def store_embeddings(self, path, language):
162 | io_helper.store_embeddings(path, self, language)
--------------------------------------------------------------------------------
/eval.py:
--------------------------------------------------------------------------------
1 | import codecs
2 | from scipy import stats
3 | import os
4 | import numpy as np
5 | import sys
6 |
7 | def pairwise_accuracy(golds, preds):
8 | count_good = 0.0
9 | count_all = 0.0
10 | for i in range(len(golds) - 1):
11 | for j in range(i+1, len(golds)):
12 | count_all += 1.0
13 | diff_gold = golds[i] - golds[j]
14 | diff_pred = preds[i] - preds[j]
15 | if (diff_gold * diff_pred >= 0):
16 | count_good += 1.0
17 | return count_good / count_all
18 |
19 |
20 | def evaluate(gold_path, predicted_path):
21 | golds = [(x.split()[0].strip(), float(x.split()[1].strip())) for x in list(codecs.open(gold_path, "r", "utf-8").readlines())]
22 | predicts = [(x.split()[0].strip(), float(x.split()[1].strip())) for x in list(codecs.open(predicted_path, "r", "utf-8").readlines())]
23 |
24 | gold_scores = [x[1] for x in golds]
25 | gold_min = min(gold_scores)
26 | gold_max = max(gold_scores)
27 |
28 | predict_scores = [x[1] for x in predicts]
29 | preds_min = min(predict_scores)
30 | preds_max = max(predict_scores)
31 |
32 | golds_norm = {x[0] : (x[1] - gold_min) / (gold_max - gold_min) for x in golds }
33 | preds_norm = {x[0] : (x[1] - preds_min) / (preds_max - preds_min) for x in predicts }
34 | preds_inv_norm = {key : 1.0 - preds_norm[key] for key in preds_norm}
35 |
36 | g_last = []
37 | p_last = []
38 | pinv_last = []
39 | for k in golds_norm:
40 | g_last.append(golds_norm[k])
41 | p_last.append(preds_norm[k])
42 | pinv_last.append(preds_inv_norm[k])
43 |
44 | pearson = stats.pearsonr(g_last, p_last)[0]
45 | spearman = stats.spearmanr(g_last, p_last)[0]
46 | pa = pairwise_accuracy(g_last, p_last)
47 |
48 | pearson_inv = stats.pearsonr(g_last, pinv_last)[0]
49 | spearman_inv = stats.spearmanr(g_last, pinv_last)[0]
50 | pa_inv = pairwise_accuracy(g_last, pinv_last)
51 |
52 | return max(pearson, pearson_inv), max(spearman, spearman_inv), max(pa, pa_inv)
53 |
54 | gold_path = sys.argv[1]
55 | pred_path = sys.argv[2]
56 | pears, spear, pa = evaluate(gold_path, pred_path)
57 | print("Pearson coefficient: " + str(pears))
58 | print("Spearman coefficient: " + str(spear))
59 | print("Pairwise accuracy: " + str(pa))
--------------------------------------------------------------------------------
/evaluation/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/evaluation/__init__.py
--------------------------------------------------------------------------------
/evaluation/confusion_matrix.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | def merge_confusion_matrices(conf_mats):
4 | res_mat = ConfusionMatrix(conf_mats[0].labels)
5 | for cm in conf_mats:
6 | res_mat.matrix = np.add(res_mat.matrix, cm.matrix)
7 | res_mat.compute_all_scores()
8 | return res_mat
9 |
10 | class ConfusionMatrix(object):
11 | """
12 | Confusion matrix for evaluating classification tasks.
13 | """
14 |
15 | def __init__(self, labels = [], predictions = [], gold = [], one_hot_encoding = False, class_indices = False):
16 | # rows are true labels, columns predictions
17 | self.matrix = np.zeros(shape = (len(labels), len(labels)))
18 | self.labels = labels
19 |
20 | if len(predictions) != len(gold):
21 | raise ValueError("Predictions and gold labels do not have the same count.")
22 | for i in range(len(predictions)):
23 | index_pred = np.argmax(predictions[i]) if one_hot_encoding else (predictions[i] if class_indices else labels.index(predictions[i]))
24 | index_gold = np.argmax(gold[i]) if one_hot_encoding else (gold[i] if class_indices else labels.index(gold[i]))
25 | self.matrix[index_gold][index_pred] += 1
26 |
27 | if len(predictions) > 0:
28 | self.compute_all_scores()
29 |
30 | def compute_all_scores(self):
31 | self.class_performances = {}
32 | self.counts = {}
33 | for i in range(len(self.labels)):
34 | tp = np.float32(self.matrix[i][i])
35 | fp_plus_tp = np.float32(np.sum(self.matrix, axis = 0)[i])
36 | fn_plus_tp = np.float32(np.sum(self.matrix, axis = 1)[i])
37 | p = tp / fp_plus_tp
38 | r = tp / fn_plus_tp
39 | self.class_performances[self.labels[i]] = (p, r, 2*p*r/(p+r))
40 | self.counts[self.labels[i]] = (tp, fp_plus_tp - tp, fn_plus_tp - tp)
41 |
42 | self.microf1 = np.float32(np.trace(self.matrix)) / np.sum(self.matrix)
43 | self.macrof1 = float(sum([x[2] for x in self.class_performances.values()])) / float(len(self.labels))
44 | self.macroP = float(sum([x[0] for x in self.class_performances.values()])) / float(len(self.labels))
45 | self.macroR = float(sum([x[1] for x in self.class_performances.values()])) / float(len(self.labels))
46 | self.accuracy = float(sum([self.matrix[i, i] for i in range(len(self.labels))])) / float(np.sum(self.matrix))
47 |
48 |
49 | def print_results(self):
50 | for l in self.labels:
51 | print(l + ": " + str(self.get_class_performance(l)))
52 | print("Micro avg: " + str(self.accuracy))
53 | print("Macro avg: " + str(self.macrof1))
54 |
55 | def get_class_performance(self, label):
56 | if label in self.labels:
57 | return self.class_performances[label]
58 | else:
59 | raise ValueError("Unknown label")
60 |
61 | def aggregate_class_performance(self, classes):
62 | true_sum = 0.0
63 | fp_sum = 0.0
64 | fn_sum = 0.0
65 | for l in classes:
66 | tp, fp, fn = self.counts[l]
67 | true_sum += tp
68 | fp_sum += fp
69 | fn_sum += fn
70 | p = true_sum / (fp_sum + true_sum)
71 | r = true_sum / (fn_sum + true_sum)
72 | f = (2 * r * p) / (r + p)
73 | return p, r, f
74 |
75 |
76 |
77 |
78 |
--------------------------------------------------------------------------------
/graphs/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/graphs/__init__.py
--------------------------------------------------------------------------------
/graphs/graph.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | class Graph(object):
4 | """description of class"""
5 | def __init__(self, nodes = [], edges = [], symmetric = True):
6 | self.nodes = list(nodes)  # copy to avoid sharing the mutable default argument across instances
7 | self.edges = []
8 | for edge in edges:
9 | self.add_edge(edge)
10 | self.build_adjacency_matrix(symmetric)
11 |
12 | def add_node(self, node):
13 | self.nodes.append(node)
14 |
15 | def add_edge(self, edge):
16 | if len(edge) != 3:
17 | raise ValueError('An edge needs to have three values: starting node, ending node, and the weight (1 for unweighted graph)')
18 | if edge[0] not in self.nodes:
19 | raise ValueError('Starting node of the edge is unknown, i.e., not in the node list of the graph')
20 | if edge[1] not in self.nodes:
21 | raise ValueError('Ending node of the edge is unknown, i.e., not in the node list of the graph')
22 | self.edges.append((self.nodes.index(edge[0]), self.nodes.index(edge[1]), edge[2]))
23 |
24 | def build_adjacency_matrix(self, symmetric = True):
25 | self.adj_mat = np.zeros((len(self.nodes), len(self.nodes)))
26 | for edge in self.edges:
27 | self.adj_mat[edge[0]][edge[1]] = edge[2]
28 | if symmetric:
29 | self.adj_mat[edge[1]][edge[0]] = edge[2]
30 |
31 | def harmonic_function_label_propagation(self, fixed_indices_vals, rescale_extremes = True, normalize = True):
32 | self.wedeg_mat = np.zeros((len(self.nodes), len(self.nodes)))
33 | for i in range(len(self.nodes)):
34 | self.wedeg_mat[i][i] = sum(self.adj_mat[i])
35 |
36 | lap_mat = np.subtract(self.wedeg_mat, self.adj_mat)  # graph Laplacian L = D - W
37 | lap_mat_uu = lap_mat[np.ix_([x for x in range(len(self.nodes)) if x not in [y[0] for y in fixed_indices_vals]], [x for x in range(len(self.nodes)) if x not in [y[0] for y in fixed_indices_vals]])]
38 | lap_mat_ul = lap_mat[np.ix_([x for x in range(len(self.nodes)) if x not in [y[0] for y in fixed_indices_vals]], [y[0] for y in fixed_indices_vals])]
39 | scores_l = np.expand_dims(np.array([y[1] for y in fixed_indices_vals]), axis = 0)
40 |
41 | scores_u = np.dot(np.dot(np.multiply(-1.0, np.linalg.inv(lap_mat_uu)), lap_mat_ul), scores_l.T)  # harmonic solution for unlabeled nodes: f_u = -L_uu^{-1} L_ul f_l
42 | unlab_docs = [x for x in self.nodes if self.nodes.index(x) not in [y[0] for y in fixed_indices_vals]]
43 | all_scores = dict(zip(unlab_docs, scores_u.T[0]))
44 |
45 | for e in fixed_indices_vals:
46 | if not rescale_extremes:
47 | all_scores[self.nodes[e[0]]] = e[1]
48 | else:
49 | adj_row = self.adj_mat[e[0]]
50 | adj_row = np.multiply(1.0 / np.sum(adj_row), adj_row)
51 | all_scores[self.nodes[e[0]]] = sum([adj_row[i] * all_scores[self.nodes[i]] for i in range(len(self.nodes)) if i not in [y[0] for y in fixed_indices_vals]])
52 |
53 | if normalize:
54 | min_score = min(all_scores.values())
55 | max_score = max(all_scores.values())
56 | for k in all_scores:
57 | all_scores[k] = (all_scores[k] - min_score) / (max_score - min_score)
58 | return all_scores
59 |
60 |
61 | def pagerank(self, alpha = 0.15, init_pr_vector = None, fixed_indices = None, rescale_extremes = True):
62 | #print("Running PageRank...")
63 | if init_pr_vector is None:
64 | init_pr_vector = np.expand_dims(np.full((len(self.nodes)), 1.0/((float)(len(self.nodes)))), axis = 0)
65 |
66 | # normalization and stochasticity adjustment of the adjacence matrix
67 | pr_mat = np.zeros((len(self.nodes), len(self.nodes)))
68 | for i in range(len(self.nodes)):
69 | if np.count_nonzero(self.adj_mat[i]) == 0:
70 | pr_mat[i][:] = np.full((len(self.nodes)), 1.0/((float)(len(self.nodes))))
71 | else:
72 | pr_mat[i][:] = np.multiply(1.0 / np.sum(self.adj_mat[i]), self.adj_mat[i])
73 |
74 | # primitivity adjustment
75 | pr_mat = np.multiply(1 - alpha, pr_mat) + np.multiply(alpha, np.full((len(self.nodes), len(self.nodes)), 1.0/((float)(len(self.nodes)))))
76 |
77 | # pagerank iterations
78 | diff = 1
79 | it = 1
80 | while diff > 0.001:
81 | old_vec = init_pr_vector
82 | init_pr_vector = np.dot(init_pr_vector, pr_mat)
83 | #init_pr_vector = np.multiply(1.0 / np.sum(init_pr_vector), init_pr_vector)
84 |
85 | if fixed_indices is not None:
86 | for ind in fixed_indices:
87 | init_pr_vector[0][ind] = fixed_indices[ind]
88 |
89 | diff = np.sum(np.abs(init_pr_vector - old_vec))
90 | #print("PR iteration " + str(it) + ": " + str(init_pr_vector))
91 | it += 1
92 |
93 |
94 | if fixed_indices is not None and rescale_extremes:
95 | for ind in fixed_indices:
96 | adj_row = self.adj_mat[ind]
97 | adj_row = np.multiply(1.0 / np.sum(adj_row), adj_row)
98 | init_pr_vector[0][ind] = sum([adj_row[i] * init_pr_vector[0][i] for i in range(len(self.nodes)) if i != ind])
99 |
100 | return dict(zip(self.nodes, init_pr_vector[0]))
101 |
--------------------------------------------------------------------------------
/helpers/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/helpers/__init__.py
--------------------------------------------------------------------------------
/helpers/data_helper.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import re
3 | import itertools
4 | from collections import Counter
5 | import sys
6 | import codecs
7 | import random
8 | from embeddings import text_embeddings
9 | from sys import stdin
10 |
11 | def clean_str(string):
12 | """
13 | Tokenization/string cleaning.
14 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
15 | """
16 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
17 | string = re.sub(r"\'s", " \'s", string)
18 | string = re.sub(r"\'ve", " \'ve", string)
19 | string = re.sub(r"n\'t", " n\'t", string)
20 | string = re.sub(r"\'re", " \'re", string)
21 | string = re.sub(r"\'d", " \'d", string)
22 | string = re.sub(r"\'ll", " \'ll", string)
23 | string = re.sub(r",", " , ", string)
24 | string = re.sub(r"!", " ! ", string)
25 | string = re.sub(r"\(", " \( ", string)
26 | string = re.sub(r"\)", " \) ", string)
27 | string = re.sub(r"\?", " \? ", string)
28 | string = re.sub(r"\s{2,}", " ", string)
29 | return string.strip().lower()
30 |
31 | def load_text_and_labels(path, lowercase = True, multilingual = False, distinct_labels_index = None):
32 | """Loads text instances from files (one text one line), splits the data into words and generates labels (as one-hot vectors).
33 | Returns split sentences and labels.
34 | """
35 | # Load data from files
36 | lines = [(s.lower() if lowercase else s).strip().split() for s in list(codecs.open(path,'r',encoding='utf8', errors='replace').readlines())]
37 | x_instances = [l[1:-1] for l in lines] if multilingual else [l[:-1] for l in lines]
38 |
39 | if multilingual:
40 | langs = [l[0] for l in lines]
41 | labels = [l[-1] for l in lines]
42 |
43 | dist_labels = list(set(labels)) if distinct_labels_index is None else distinct_labels_index
44 | y_instances = [np.zeros(len(dist_labels)) for l in labels]
45 | for i in range(len(y_instances)):
46 | y_instances[i][dist_labels.index(labels[i])] = 1
47 |
48 | return [x_instances, y_instances, langs, dist_labels] if multilingual else [x_instances, y_instances, dist_labels]
49 |
50 | def build_text_and_labels(texts, class_labels, lowercase = True, multilingual = False, langs = None, distinct_labels_index = None):
51 | # Load data from files
52 | lines = [(text.lower() if lowercase else text).strip().split() for text in texts]
53 | x_instances = [l[1:-1] for l in lines] if multilingual else [l[:-1] for l in lines]
54 |
55 | dist_labels = list(set(class_labels)) if distinct_labels_index is None else distinct_labels_index
56 | y_instances = [np.zeros(len(dist_labels)) for l in class_labels]
57 | for i in range(len(y_instances)):
58 | y_instances[i][dist_labels.index(class_labels[i])] = 1
59 |
60 | return [x_instances, y_instances, langs, dist_labels] if multilingual else [x_instances, y_instances, dist_labels]
61 |
62 | def pad_texts(texts, padding_word="", max_length = None):
63 | """
64 | Pads all sentences to the same length. The length is defined by the longest sentence.
65 | Returns padded sentences.
66 | """
67 | sequence_length = max(len(x) for x in texts) if max_length is None else max_length
68 | padded_texts = []
69 | for i in range(len(texts)):
70 | text = texts[i]
71 | num_padding = sequence_length - len(text)
72 | padded_text = text + [padding_word] * num_padding if num_padding >= 0 else text[ : sequence_length]
73 | padded_texts.append(padded_text)
74 | return padded_texts
75 |
76 | def build_vocab(texts):
77 | """
78 | Builds a vocabulary mapping from word to index based on the sentences.
79 | Returns vocabulary mapping and inverse vocabulary mapping.
80 | """
81 | # Build vocabulary
82 | word_counts = Counter(itertools.chain(*texts))
83 | # Mapping from index to word
84 | vocabulary_invariable = [x[0] for x in word_counts.most_common()]
85 | vocabulary_invariable = list(sorted(vocabulary_invariable))
86 | # Mapping from word to index
87 | vocabulary = {x: i for i, x in enumerate(vocabulary_invariable)}
88 | inverse_vocabulary = {v: k for k, v in vocabulary.items()}
89 | return [vocabulary, inverse_vocabulary]
90 |
91 | def build_input_data(texts, labels, vocabulary, padding_token = "", langs = None, ignore_empty = True):
92 | x = []
93 | y = []
94 | if langs is not None:
95 | filt_langs = []
96 | for i in range(len(texts)):
97 | num_not_pad = len([x for x in texts[i] if x != padding_token])
98 | if (num_not_pad > 0 or (not ignore_empty)):
99 | x.append([vocabulary[t] for t in texts[i]])
100 | y.append(labels[i])
101 | if langs is not None:
102 | filt_langs.append(langs[i])
103 | if langs is not None:
104 | return [np.array(x), np.array(y), filt_langs]
105 | else:
106 | return [np.array(x), np.array(y)]
107 |
108 | def remove_stopwords(texts, langs, stopwords, lowercase = True, multilingual = False, lang_prefix_delimiter = '__'):
109 | for i in range(len(texts)):
110 | texts[i] = [x for x in texts[i] if (x.split(lang_prefix_delimiter)[1].strip() if multilingual else x).lower() not in (stopwords[langs[i]] if multilingual else stopwords)]
111 |
112 | def filter_against_vocabulary(texts, vocabulary, lowercase = False):
113 | return [[(t.lower() if lowercase else t) for t in s if (t.lower() if lowercase else t) in vocabulary] for s in texts]
114 |
115 | def load_data_build_vocabulary(path, stopwords = None, lowercase = True, multilingual = False, lang_prefix_delimiter = '__'):
116 | """
117 | Loads and preprocesses data.
118 | Returns input vectors, labels, vocabulary, and inverse vocabulary.
119 | """
120 | # Load and preprocess data
121 | if multilingual:
122 | texts, labels, langs, dist_labels = load_text_and_labels(path, lowercase = lowercase, multilingual = True)
123 | for i in range(len(texts)):
124 | texts[i] = [langs[i].lower() + lang_prefix_delimiter + t for t in texts[i]]
125 | else:
126 | texts, labels, dist_labels = load_text_and_labels(path, lowercase = lowercase, multilingual = False)
127 |
128 | if stopwords is not None:
129 | remove_stopwords(texts, langs if multilingual else None, stopwords, lowercase = lowercase, multilingual = multilingual)
130 | texts_padded = pad_texts(texts)
131 |
132 | vocabulary, vocabulary_inverse = build_vocab(texts_padded)
133 | x, y = build_input_data(texts_padded, labels, vocabulary)
134 |
135 | return [x, y, dist_labels, vocabulary, vocabulary_inverse]
136 |
137 |
138 | def load_data_given_vocabulary(path, vocabulary, stopwords = None, lowercase = False, multilingual = False, lang_prefix_delimiter = '__', max_length = None, split = None, ignore_empty = True, distinct_labels_index = None):
139 | """
140 | Loads and preprocesses data given the vocabulary.
141 | Returns input vectors, labels, vocabulary, and inverse vocabulary.
142 | """
143 | # Load and preprocess data
144 | if multilingual:
145 | texts, labels, langs, dist_labels = load_text_and_labels(path, lowercase = lowercase, multilingual = True, distinct_labels_index = distinct_labels_index)
146 | for i in range(len(texts)):
147 | for j in range(len(texts[i])):
148 | texts[i][j] = langs[i].lower() + lang_prefix_delimiter + texts[i][j]
149 | else:
150 | texts, labels, dist_labels = load_text_and_labels(path, lowercase = lowercase, multilingual = False, distinct_labels_index = distinct_labels_index)
151 |
152 | if stopwords is not None:
153 | remove_stopwords(texts, langs if multilingual else None, stopwords, lowercase = lowercase, multilingual = multilingual)
154 |
155 | texts = filter_against_vocabulary(texts, vocabulary)
156 | texts_padded = pad_texts(texts, max_length = max_length)
157 |
158 | if multilingual:
159 | x, y, flangs = build_input_data(texts_padded, labels, vocabulary, langs = langs, ignore_empty = ignore_empty)
160 | dist_langs = set(flangs)
161 | for dl in dist_langs:
162 | num = len([l for l in flangs if l == dl])
163 | print("Language: " + dl + ", num: " + str(num))
164 | else:
165 | x, y = build_input_data(texts_padded, labels, vocabulary, ignore_empty = ignore_empty)
166 | if split is None:
167 | return [x, y, dist_labels]
168 | else:
169 | x_train = x[:split]
170 | y_train = y[:split]
171 | x_test = x[split:]
172 | y_test = y[split:]
173 | return [x_train, y_train, x_test, y_test, dist_labels]
174 |
175 | def build_data_given_vocabulary(data, class_labels, vocabulary, stopwords = None, lowercase = False, multilingual = False, lang_prefix_delimiter = '__', max_length = None, split = None, ignore_empty = True, distinct_labels_index = None):
176 | """
177 | Loads and preprocesses data given the vocabulary.
178 | Returns input vectors, labels, vocabulary, and inverse vocabulary.
179 | """
180 | # Load and preprocess data
181 | if multilingual:
182 | texts, labels, langs, dist_labels = build_text_and_labels(data, class_labels, lowercase = lowercase, multilingual = True, distinct_labels_index = distinct_labels_index)
183 | for i in range(len(texts)):
184 | for j in range(len(texts[i])):
185 | texts[i][j] = langs[i].lower() + lang_prefix_delimiter + texts[i][j]
186 | else:
187 | texts, labels, dist_labels = build_text_and_labels(data, class_labels, lowercase = lowercase, multilingual = False, distinct_labels_index = distinct_labels_index)
188 |
189 | if stopwords is not None:
190 | remove_stopwords(texts, langs if multilingual else None, stopwords, lowercase = lowercase, multilingual = multilingual)
191 |
192 | texts = filter_against_vocabulary(texts, vocabulary)
193 | texts_padded = pad_texts(texts, max_length = max_length)
194 |
195 | if multilingual:
196 | x, y, flangs = build_input_data(texts_padded, labels, vocabulary, langs = langs, ignore_empty = ignore_empty)
197 | dist_langs = set(flangs)
198 | for dl in dist_langs:
199 | num = len([l for l in flangs if l == dl])
200 | print("Language: " + dl + ", num: " + str(num))
201 | else:
202 | x, y = build_input_data(texts_padded, labels, vocabulary, ignore_empty = ignore_empty)
203 | if split is None:
204 | return [x, y, dist_labels]
205 | else:
206 | x_train = x[:split]
207 | y_train = y[:split]
208 | x_test = x[split:]
209 | y_test = y[split:]
210 | return [x_train, y_train, x_test, y_test, dist_labels]
211 |
212 |
213 |
214 | def load_vocabulary_embeddings(vocabulary_inv, embeddings, emb_size, padding = ""):
215 | voc_embs = []
216 | for i in range(len(vocabulary_inv)):
217 | if i not in vocabulary_inv:
218 | raise Exception("Index not in index vocabulary!" + " Index: " + str(i))
219 | word = vocabulary_inv[i]
220 | if word == padding:
221 | voc_embs.append(np.random.uniform(-1.0, 1.0, size = [emb_size]))
222 | elif word not in embeddings:
223 | raise Exception("Word not found in embeddings! " + word)
224 | else:
225 | voc_embs.append(embeddings[word])
226 | return np.array(voc_embs, dtype = np.float32)
227 |
228 | def prepare_data_for_kb_embedding(data, prebuilt_dicts = None, valid_triples_dict = None, generate_corrupt = True, num_corrupt = 2):
229 | if valid_triples_dict is None:
230 | valid_triples_dict = {}
231 |
232 | if prebuilt_dicts is None:
233 | cnt_ent = 0
234 | cnt_rel = 0
235 | entity_dict = {}
236 | relations_dict = {}
237 | else:
238 | entity_dict = prebuilt_dicts[0]
239 | relations_dict = prebuilt_dicts[1]
240 |
241 | for d in data:
242 | if prebuilt_dicts is None:
243 | if d[0] not in entity_dict:
244 | entity_dict[d[0]] = cnt_ent
245 | cnt_ent += 1
246 | if d[2] not in entity_dict:
247 | entity_dict[d[2]] = cnt_ent
248 | cnt_ent += 1
249 | if d[1] not in relations_dict:
250 | relations_dict[d[1]] = cnt_rel
251 | cnt_rel += 1
252 |
253 | str_rep = str(entity_dict[d[0]]) + "_" + str(relations_dict[d[1]]) + "_" + str(entity_dict[d[2]])
254 | valid_triples_dict[str_rep] = str_rep
255 |
256 | e1_indices = []
257 | e2_indices = []
258 | r_indices = []
259 | y_vals = []
260 |
261 | count_corrupt_valid = 0
262 | for d in data:
263 | e1_ind = entity_dict[d[0]]
264 | e2_ind = entity_dict[d[2]]
265 | r_ind = relations_dict[d[1]]
266 |
267 | e1_indices.append(e1_ind)
268 | e2_indices.append(e2_ind)
269 | r_indices.append(r_ind)
270 | y_vals.append(1)
271 |
272 | if generate_corrupt:
273 | for i in range(num_corrupt):
274 | corr_type = random.randint(1,3)
275 | fake_ind = random.randint(0, (len(entity_dict) if (corr_type == 1 or corr_type == 3) else len(relations_dict)) - 1)
276 | corr_triple_str_rep = (str(fake_ind) + "_" + str(r_ind) + "_" + str(e2_ind) if corr_type == 1 else (str(e1_ind) + "_" + str(r_ind) + "_" + str(fake_ind) if corr_type == 3 else str(e1_ind) + "_" + str(fake_ind) + "_" + str(e2_ind)))
277 |
278 | while corr_triple_str_rep in valid_triples_dict:
279 | fake_ind = random.randint(0, (len(entity_dict) if (corr_type == 1 or corr_type == 3) else len(relations_dict)) - 1)
280 | corr_triple_str_rep = (str(fake_ind) + "_" + str(r_ind) + "_" + str(e2_ind) if corr_type == 1 else (str(e1_ind) + "_" + str(r_ind) + "_" + str(fake_ind) if corr_type == 3 else str(e1_ind) + "_" + str(fake_ind) + "_" + str(e2_ind)))
281 | count_corrupt_valid += 1
282 |
283 | if corr_type == 1:
284 | e1_indices.append(fake_ind)
285 | e2_indices.append(e2_ind)
286 | r_indices.append(r_ind)
287 | elif corr_type == 2:
288 | e1_indices.append(e1_ind)
289 | e2_indices.append(e2_ind)
290 | r_indices.append(fake_ind)
291 | elif corr_type == 3:
292 | e1_indices.append(e1_ind)
293 | e2_indices.append(fake_ind)
294 | r_indices.append(r_ind)
295 | y_vals.append(-1)
296 |
297 | return [(entity_dict, relations_dict), valid_triples_dict, np.array(e1_indices, dtype = np.int32), np.array(e2_indices, dtype = np.int32), np.array(r_indices, dtype = np.int32), np.array(y_vals, dtype = np.float32) ]
298 |
299 | def prepare_wn_data(data, concept_dict, rel_string, rel_string_inv, prev_dict = None):
300 | data_out = []
301 | if prev_dict is None:
302 | prev_dict = {}
303 |
304 | data = [x for x in data if x[1] == rel_string or x[1] == rel_string_inv]
305 |
306 | for i in range(len(data)):
307 | d = data[i]
308 | if d[1] == rel_string:
309 | rel_str = concept_dict[d[0]] + "_" + concept_dict[d[2]]
310 | if rel_str not in prev_dict:
311 | data_out.append((concept_dict[d[0]], concept_dict[d[2]], "1"))
312 | prev_dict[rel_str] = 1
313 | elif d[1] == rel_string_inv:
314 | rel_str = concept_dict[d[2]] + "_" + concept_dict[d[0]]
315 | if rel_str not in prev_dict:
316 | data_out.append((concept_dict[d[2]], concept_dict[d[0]], "1"))
317 | prev_dict[rel_str] = 1
318 | return data_out
319 |
320 | def create_corrupts(correct_train, correct_test, concept_dict, prev_dict, num_corrupt = 2, shuffle = True):
321 | concepts = list(concept_dict.values())
322 | train_corrupt = []
323 | test_corrupt = []
324 | current_dict = {}
325 |
326 | merged = []
327 | merged.extend(correct_train)
328 | merged.extend(correct_test)
329 |
330 | for i in range(len(merged)):
331 | rel_str = merged[i][1] + "_" + merged[i][0]
332 | if rel_str not in prev_dict and rel_str not in current_dict:
333 | (train_corrupt if i < len(correct_train) else test_corrupt).append((merged[i][1], merged[i][0], "0"))
334 | current_dict[rel_str] = 1
335 |
336 | for j in range(num_corrupt - 1):
337 | c1 = concepts[random.randint(0, len(concepts) - 1)]
338 | c2 = concepts[random.randint(0, len(concepts) - 1)]
339 | rel_str = c1 + "_" + c2
340 | while(rel_str in prev_dict or rel_str in current_dict):
341 | c1 = concepts[random.randint(0, len(concepts) - 1)]
342 | c2 = concepts[random.randint(0, len(concepts) - 1)]
343 | rel_str = c1 + "_" + c2
344 | (train_corrupt if i < len(correct_train) else test_corrupt).append((c1, c2, "0"))
345 | current_dict[rel_str] = 1
346 |
347 | fdata_train = []
348 | fdata_train.extend(correct_train)
349 | fdata_train.extend(train_corrupt)
350 |
351 | fdata_test = []
352 | fdata_test.extend(correct_test)
353 | fdata_test.extend(test_corrupt)
354 |
355 | if shuffle:
356 | random.shuffle(fdata_train)
357 | random.shuffle(fdata_test)
358 |
359 | return (fdata_train, fdata_test)
360 |
361 | def lexically_independent_train_set(data_train, data_test):
362 | ents_test = [x[0] for x in data_test]
363 | ents_test.extend([x[1] for x in data_test])
364 | ents_test = set(ents_test)
365 |
366 | filtered_train = [x for x in data_train if x[0] not in ents_test and x[1] not in ents_test]
367 | return filtered_train
368 |
369 | def prepare_eval_semrel_emb(word_embeddings, stopwords, emb_size, data, y_direct = False, keep_words = False):
370 | left_mat = []
371 | right_mat = []
372 | gold_labels = []
373 | words = []
374 |
375 | for i in range(len(data)):
376 | first_word = data[i][0]
377 | emb1 = text_embeddings.aggregate_phrase_embedding(first_word.strip().split(), stopwords, word_embeddings, emb_size, l2_norm_vec = False)
378 | second_word = data[i][1]
379 | emb2 = text_embeddings.aggregate_phrase_embedding(second_word.strip().split(), stopwords, word_embeddings, emb_size, l2_norm_vec = False)
380 |
381 | if emb1 is not None and emb2 is not None:
382 | left_mat.append(emb1)
383 | right_mat.append(emb2)
384 | if keep_words:
385 | words.append(first_word + '\t' + second_word)
386 | if not y_direct:
387 | gold_labels.append(-1.0 if data[i][2] == "0" else 1.0)
388 | else:
389 | gold_labels.append(data[i][2])
390 |
391 | if keep_words:
392 | return [np.array(left_mat), np.array(right_mat), gold_labels, words]
393 | else:
394 | return [np.array(left_mat), np.array(right_mat), gold_labels]
395 |
396 | def prepare_dataset_semrel_emb(entity_dict, selected_embeddings, stopwords, word_embeddings, emb_size, data, dict_examples):
397 | cnt_ent = len(entity_dict)
398 | e1_inds = []
399 | e2_inds = []
400 | y_vals = []
401 |
402 | cnt_emb_fail = 0
403 | cnt_existing = 0
404 |
405 | for i in range(len(data)):
406 | first_word = data[i][0]
407 | if first_word not in entity_dict:
408 | emb = text_embeddings.aggregate_phrase_embedding(first_word.strip().split(), stopwords, word_embeddings, emb_size, l2_norm_vec = False)
409 | if emb is not None:
410 | selected_embeddings.append(emb)
411 | entity_dict[first_word] = cnt_ent
412 | cnt_ent += 1
413 | else:
414 | cnt_emb_fail += 1
415 | continue
416 | second_word = data[i][1]
417 | if second_word not in entity_dict:
418 | emb = text_embeddings.aggregate_phrase_embedding(second_word.strip().split(), stopwords, word_embeddings, emb_size, l2_norm_vec = False)
419 | if emb is not None:
420 | selected_embeddings.append(emb)
421 | entity_dict[second_word] = cnt_ent
422 | cnt_ent += 1
423 | else:
424 | cnt_emb_fail += 1
425 | continue
426 |
427 | e1i = entity_dict[first_word]
428 | e2i = entity_dict[second_word]
429 | stres = str(e1i) + "_" + str(e2i)
430 | if stres not in dict_examples:
431 | e1_inds.append(e1i)
432 | e2_inds.append(e2i)
433 | y_vals.append(-1.0 if data[i][2] == "0" else 1.0)
434 | dict_examples[stres] = stres
435 | else:
436 | #print("Example (pair of entities) already seen: "+ "\"" + first_word + "\" ; \"" + second_word + "\"")
437 | cnt_existing += 1
438 |
439 | return [list(zip(e1_inds, e2_inds, y_vals)), selected_embeddings]
440 |
441 |
442 |
443 |
444 |
--------------------------------------------------------------------------------
/helpers/data_shaper.py:
--------------------------------------------------------------------------------
1 | from helpers import io_helper
2 | import numpy as np
3 | import re
4 |
5 | def punctuation():
6 | return ['—', '-', '.', ',', ';', ':', '\'', '"', '{', '}', '(', ')', '[', ']']
7 |
8 | def is_number(token):
9 | return re.match(r'^[\d]+[,]*\.?\d*$', token) is not None
10 |
11 | def decode_predictions(labels, predictions, flatten = False):
12 | if len(predictions.shape) == 2:
13 | labs = [labels[np.nonzero(instance)[0][0]] if len(np.nonzero(instance)[0]) > 0 else '' for instance in predictions]
14 | elif len(predictions.shape) == 3:
15 | labs = [[labels[np.nonzero(instance)[0][0]] if len(np.nonzero(instance)[0]) > 0 else '' for instance in sequence] for sequence in predictions]
16 | if flatten:
17 | labs = [item for sublist in labs for item in sublist]
18 | else:
19 | raise ValueError("Not supported. Only list of single instances or list of sequences supported for decoding labels.")
20 | return labs
21 |
22 | def prep_labels_one_hot_encoding(labels, dist_labels = None, multilabel = False):
23 | if dist_labels is None:
24 | if multilabel:
25 | dist_labels = list(set([y for s in labels for y in s]))
26 | else:
27 | dist_labels = list(set(labels))
28 | y = []
29 | for i in range(len(labels)):
30 | lab_vec = [0] * len(dist_labels)
31 | if multilabel:
32 | for j in range(len(labels[i])):
33 | lab_vec[dist_labels.index(labels[i][j])] = 1.0
34 | else:
35 | lab_vec[dist_labels.index(labels[i])] = 1.0
36 | y.append(lab_vec)
37 | return np.array(y, dtype = np.float64), dist_labels
38 |
39 | def prep_word_tuples(word_lists, embeddings, embeddings_language, langs = None, labels = None):
40 | examples = []
41 | if labels:
42 | labs = []
43 | for i in range(len(word_lists)):
44 | example = []
45 | add_example = True
46 | for j in range(len(word_lists[i])):
47 | w = ("" if langs is None else langs[i] + "__") + word_lists[i][j]
48 | if w in embeddings.lang_vocabularies[embeddings_language]:
49 | example.append(embeddings.lang_vocabularies[embeddings_language][w])
50 | elif w.lower() in embeddings.lang_vocabularies[embeddings_language]:
51 | example.append(embeddings.lang_vocabularies[embeddings_language][w.lower()])
52 | else:
53 | add_example = False
54 | break
55 | if add_example:
56 | examples.append(example)
57 | if labels:
58 | labs.append(labels[i])
59 | if labels:
60 | return examples, labs
61 | else:
62 | return examples
63 |
64 | def prep_sequence_labelling(texts, labels, embeddings, stopwords = None, embeddings_language = 'en', multilingual_langs = None, lowercase = False, pad = True, pad_token = '', numbers_token = None, punct_token = None, dist_labels = None, max_seq_len = None, add_missing_tokens = False):
65 | x = []
66 | if labels:
67 | y = []
68 |
69 | for i in range(len(texts)):
70 | if i % 100 == 0:
71 | print("Line: " + str(i) + " of " + str(len(texts)))
72 | tok_list = []
73 | if labels:
74 | lab_list = []
75 | language = embeddings_language if multilingual_langs is None else multilingual_langs[i]
76 |
77 | for j in range(len(texts[i])):
78 | token_clean = texts[i][j].lower() if lowercase else texts[i][j]
79 | token = token_clean if multilingual_langs is None else multilingual_langs[i] + "__" + token_clean
80 |
81 | if token_clean.strip() in punctuation() and punct_token is not None:
82 | token = punct_token
83 | if is_number(token_clean) and numbers_token is not None:
84 | token = numbers_token
85 |
86 | if stopwords is not None and (token_clean in stopwords[language] or token_clean.lower() in stopwords[language]):
87 | continue
88 | if token not in embeddings.lang_vocabularies[embeddings_language] and token.lower() not in embeddings.lang_vocabularies[embeddings_language]:
89 | if add_missing_tokens:
90 | embeddings.add_word(embeddings_language, token)
91 | else:
92 | continue
93 |
94 | tok_list.append(embeddings.lang_vocabularies[embeddings_language][token] if token in embeddings.lang_vocabularies[embeddings_language] else embeddings.lang_vocabularies[embeddings_language][token.lower()])
95 | if labels:
96 | lab_list.append(labels[i][j])
97 | x.append(tok_list)
98 | if labels:
99 | y.append(lab_list)
100 |
101 | if labels:
102 | y_clean = []
103 | if dist_labels is None:
104 | dist_labels = list(set([l for txt_labs in y for l in txt_labs]))
105 | for i in range(len(y)):
106 | lab_list = []
107 | for j in range(len(y[i])):
108 | lab_vec = [0] * len(dist_labels)
109 | lab_vec[dist_labels.index(y[i][j])] = 1.0
110 | lab_list.append(lab_vec)
111 | y_clean.append(lab_list)
112 |
113 | if pad:
114 | ind_pad = embeddings.lang_vocabularies[embeddings_language][pad_token]
115 | max_len = max([len(t) for t in x]) if max_seq_len is None else max_seq_len
116 | x = [t + [ind_pad] * (max_len - len(t)) for t in x]
117 | if labels:
118 | for r in y_clean:
119 | extension = [[0] * len(dist_labels)] * (max_len - len(r))
120 | r.extend(extension)
121 | sent_lengths = [len([ind for ind in txt if ind != ind_pad]) for txt in x]
122 | else:
123 | sent_lengths = [len(txt) for txt in x]
124 |
125 | if labels:
126 | return np.array(x, dtype = np.int32), np.array(y_clean, dtype = np.float64), dist_labels, sent_lengths
127 | else:
128 | return np.array(x, dtype = np.int32), sent_lengths
129 |
130 | def prep_classification(texts, labels, embeddings, stopwords = None, embeddings_language = 'en', multilingual_langs = None, lowercase = False, pad = True, pad_token = '', numbers_token = None, punct_token = None, dist_labels = None, max_seq_len = None, add_out_of_vocabulary_terms = False):
131 | x = []
132 | y = []
133 |
134 | for i in range(len(texts)):
135 | tok_list = []
136 | lab_list = []
137 | language = embeddings_language if multilingual_langs is None else multilingual_langs[i]
138 |
139 | for j in range(len(texts[i])):
140 | token_clean = texts[i][j].lower() if lowercase else texts[i][j]
141 | token = token_clean if multilingual_langs is None else multilingual_langs[i] + "__" + token_clean
142 |
143 | if token_clean.strip() in punctuation() and punct_token is not None:
144 | token = punct_token
145 | if is_number(token_clean) and numbers_token is not None:
146 | token = numbers_token
147 |
148 | if stopwords is not None and (token_clean in stopwords[language] or token_clean.lower() in stopwords[language]):
149 | continue
150 | if token not in embeddings.lang_vocabularies[embeddings_language] and token.lower() not in embeddings.lang_vocabularies[embeddings_language]:
151 | if add_out_of_vocabulary_terms:
152 | embeddings.add_word(embeddings_language, token)
153 | else:
154 | continue
155 | if max_seq_len is None or len(tok_list) < max_seq_len:
156 | tok_list.append(embeddings.lang_vocabularies[embeddings_language][token] if token in embeddings.lang_vocabularies[embeddings_language] else embeddings.lang_vocabularies[embeddings_language][token.lower()])
157 | else:
158 | break
159 | x.append(tok_list)
160 |
161 | if labels is not None:
162 | if dist_labels is None:
163 | dist_labels = list(set([l for txt_labs in labels for l in txt_labs]))
164 | for i in range(len(labels)):
165 | lab_vec = [0] * len(dist_labels)
166 | for j in range(len(labels[i])):
167 | lab_vec[dist_labels.index(labels[i][j])] = 1.0
168 | y.append(lab_vec)
169 |
170 | if pad:
171 | ind_pad = embeddings.lang_vocabularies[embeddings_language][pad_token]
172 | max_len = max([len(t) for t in x]) if max_seq_len is None else max_seq_len
173 | x = [t + [ind_pad] * (max_len - len(t)) for t in x]
174 |
175 | if labels is not None:
176 | x_ret = np.array(x, dtype = np.int32)
177 | y_ret = np.array(y, dtype = np.float64)
178 | return x_ret, y_ret, dist_labels
179 | else:
180 | return np.array(x, dtype = np.int32)
181 |
182 | def prepare_contrastive_learning_examples(positives, negatives, num_negatives_per_positive):
183 | if len(negatives) != len(positives) * num_negatives_per_positive:
184 | raise ValueError("The number of negative examples per positive example is incorrect!")
185 | examples = []
186 | for i in range(len(positives)):
187 | examples.append(positives[i])
188 | examples.extend(negatives[i*num_negatives_per_positive : (i+1)*num_negatives_per_positive])
189 | return examples
190 |
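A minimal usage sketch (my own toy labels, not part of data_shaper.py; it assumes the repository root is on PYTHONPATH) of the label-encoding helpers above, which need no embeddings object:

from helpers import data_shaper

labels = ["left", "right", "left", "centre"]
y, dist_labels = data_shaper.prep_labels_one_hot_encoding(labels)
print(dist_labels)                                      # distinct labels (order follows set() iteration)
print(y)                                                # one-hot matrix of shape (4, len(dist_labels))
print(data_shaper.decode_predictions(dist_labels, y))   # recovers the input labels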
--------------------------------------------------------------------------------
/helpers/io_helper.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | import codecs
3 | from os import listdir
4 | from os.path import isfile, join
5 | import pickle
6 | import numpy as np
7 | from helpers import data_helper
8 | import re
9 |
10 | ################################################################################################################################
11 |
12 | def serialize(item, path):
13 | pickle.dump(item, open(path, "wb" ))
14 |
15 | def deserialize(path):
16 | return pickle.load(open(path, "rb" ))
17 |
18 | def load_file(filepath):
19 | return (codecs.open(filepath, 'r', encoding = 'utf8', errors = 'replace')).read()
20 |
21 | def load_lines(filepath):
22 | return [l.strip() for l in list(codecs.open(filepath, "r", encoding = 'utf8', errors = 'replace').readlines())]
23 |
24 | def load_blocked_lines(filepath):
25 | lines = [l.strip() for l in list(codecs.open(filepath, "r", encoding = 'utf8', errors = 'replace').readlines())]
26 | blocks = []
27 | block = []
28 | for l in lines:
29 | if l == "":
30 | blocks.append(block)
31 | block = []
32 | else:
33 | block.append(l)
34 | if len(block) > 0:
35 | blocks.append(block)
36 | return blocks
37 |
38 | def load_all_files(dirpath):
39 | files = []
40 | for filename in listdir(dirpath):
41 | files.append((filename, load_file(dirpath + "/" + filename)))
42 | return files
43 |
44 | ################################################################################################################################
45 |
46 | def store_embeddings(path, embeddings, language, print_progress = True):
47 | f = codecs.open(path,'w',encoding='utf8')
48 | vocab = embeddings.lang_vocabularies[language]
49 | embs = embeddings.lang_embeddings[language]
50 |
51 | cnt = 0
52 | for word in vocab:
53 | cnt += 1
54 | if print_progress and cnt % 1000 == 0:
55 | print("Storing embeddings " + str(cnt))
56 | f.write(word + " ")
57 | for i in range(len(embs[vocab[word]])):
58 | f.write(str(embs[vocab[word]][i]) + " ")
59 | f.write("\n")
60 | f.close()
61 |
62 | def load_embeddings_dict_with_norms(filepath, limit = None, special_tokens = None, print_load_progress = False, min_one_letter = False, skip_first_line = False):
63 | norms = []
64 | vocabulary = {}
65 | embeddings = []
66 | cnt = 0
67 | cnt_dict = 0
68 | emb_size = -1
69 |
70 | with codecs.open(filepath,'r',encoding='utf8', errors='replace') as f:
71 | for line in f:
72 | try:
73 | cnt += 1
74 | if limit and cnt > limit:
75 | break
76 | if print_load_progress and (cnt % 1000 == 0):
77 | print("Loading embeddings: " + str(cnt))
78 | if cnt > 1 or not skip_first_line:
79 | splt = line.split()
80 | word = splt[0]
81 | if min_one_letter and not any(c.isalpha() for c in word):
82 | continue
83 |
84 | vec = [np.float32(x) for x in splt[1:]]
85 | if emb_size < 0 and len(vec) > 10:
86 | emb_size = len(vec)
87 |
88 | if emb_size > 0 and len(vec) == emb_size:
89 | vocabulary[word] = cnt_dict
90 | cnt_dict += 1
91 | norms.append(np.linalg.norm(vec, 2))
92 | embeddings.append(vec)
93 | except(ValueError,IndexError,UnicodeEncodeError):
94 | print("Incorrect format line!")
95 |
96 | if special_tokens is not None:
97 | for st in special_tokens:
98 | vocabulary[st] = cnt_dict
99 | cnt_dict += 1
100 | vec = np.array([0.1 * (special_tokens.index(st) + 1)] * emb_size) #np.random.uniform(-1.0, 1.0, size = [emb_size])
101 | norms.append(np.linalg.norm(vec, 2))
102 | embeddings.append(vec)
103 |
104 | return vocabulary, np.array(embeddings, dtype = np.float32), norms
105 |
106 | ############################################################################################################################
107 |
108 | def load_whitespace_separated_data(filepath):
109 | lines = list(codecs.open(filepath,'r',encoding='utf8', errors='replace').readlines())
110 | return [[x.strip() for x in l.strip().split()] for l in lines]
111 |
112 | def load_tab_separated_data(filepath):
113 | lines = list(codecs.open(filepath,'r',encoding='utf8', errors='replace').readlines())
114 | return [[x.strip() for x in l.strip().split('\t')] for l in lines]
115 |
116 | def load_wn_concepts_dict(path):
117 | lines = list(codecs.open(path,'r',encoding='utf8', errors='replace').readlines())
118 | lcols = {x[0] : ' '.join((x[1].split('_'))[2:-2]) for x in [l.strip().split() for l in lines]}
119 | return lcols
120 |
121 | def load_bless_dataset(path):
122 | lines = list(codecs.open(path,'r',encoding='utf8', errors='replace').readlines())
123 | lcols = [(x[0].split('-')[0], x[3].split('-')[0], "1" if x[2] == "hyper" else "0") for x in [l.strip().split() for l in lines]]
124 | return lcols
125 |
126 | def write_list(path, list):
127 | f = codecs.open(path,'w',encoding='utf8')
128 | for l in list:
129 | f.write(l + "\n")
130 | f.close()
131 |
132 | def write_dictionary(path, dictionary, append = False):
133 | f = codecs.open(path,'a' if append else 'w',encoding='utf8')
134 | for k in dictionary:
135 | f.write(str(k) + "\t" + str(dictionary[k]) + "\n")
136 | f.close()
137 |
138 | def load_translation_pairs(filepath):
139 | lines = list(codecs.open(filepath,'r',encoding='utf8', errors='replace').readlines())
140 | dataset = []
141 | for line in lines:
142 | spl = line.split(',')
143 | srcword = spl[0].strip()
144 | trgword = spl[1].strip()
145 | if (" " not in srcword.strip()) and (" " not in trgword.strip()):
146 | dataset.append((srcword, trgword))
147 | return dataset
148 |
149 | def write_list_tuples_separated(path, list, delimiter = '\t'):
150 | f = codecs.open(path,'w',encoding='utf8')
151 | for i in range(len(list)):
152 | for j in range(len(list[i])):
153 | if j == len(list[i]) - 1:
154 | f.write(str(list[i][j]) + '\n')
155 | else:
156 | f.write(str(list[i][j]) + delimiter)
157 | f.close()
158 |
159 | def store_wordnet_rels(dirpath, relname, pos, lang, instances):
160 | f = codecs.open(dirpath + "/" + lang + "_" + relname + "_" + pos + ".txt",'w',encoding='utf8')
161 | for i in instances:
162 | splt = i.split('::')
163 | f.write(splt[0].replace("_", " ") + "\t" + splt[1].replace("_", " ") + "\t" + str(instances[i]) + "\n")
164 | f.close()
165 |
166 | def load_csv_lines(path, delimiter = ',', indices = None):
167 | f = codecs.open(path,'r',encoding='utf8')
168 | lines = [l.strip().split(delimiter) for l in f.readlines()]
169 | if indices is None:
170 | return lines
171 | else:
172 | return [sublist(l, indices) for l in lines]
173 |
174 | def load_csv_lines_line_by_line(path, delimiter = ',', indices = None, limit = None):
175 | lines = []
176 | f = codecs.open(path,'r',encoding='utf8')
177 | line = f.readline().strip()
178 | cnt = 1
179 | while line != '':
180 | lines.append(sublist(line.split(delimiter), indices) if indices is not None else line.split(delimiter))
181 | line = f.readline().strip()
182 | cnt += 1
183 | if limit is not None and cnt > limit:
184 | break
185 | return lines
186 |
187 | def sublist(list, indices):
188 | sublist = []
189 | for i in indices:
190 | sublist.append(list[i])
191 | return sublist
192 |
193 |
194 | ############################################################################################################################
195 |
196 | def load_sequence_labelling_data(path, delimiter = '\t', indices = None, line_start_skip = None):
197 | f = codecs.open(path,'r',encoding='utf8')
198 | lines = [[t.strip() for t in l.split(delimiter)] for l in f.readlines()]
199 | instances = []
200 | instance = []
201 | for i in range(len(lines)):
202 | if line_start_skip is not None and lines[i][0].startswith(line_start_skip):
203 | continue
204 | if len(lines[i]) == 1 and lines[i][0] == "":
205 | instances.append(instance)
206 | instance = []
207 | else:
208 | if indices is None:
209 | instance.append(lines[i])
210 | else:
211 | instance.append(sublist(lines[i], indices))
212 | if len(instance) > 0:
213 | instances.append(instance)
214 | return instances
215 |
216 | def load_classification_data(path, delimiter_text_labels = '\t', delimiter_labels = '\t', line_start_skip = None):
217 | f = codecs.open(path,'r',encoding='utf8')
218 | lines = [[t.strip() for t in l.split(delimiter_text_labels)] for l in f.readlines()]
219 | instances = []
220 | for i in range(len(lines)):
221 | if line_start_skip is not None and lines[i][0].startswith(line_start_skip):
222 | continue
223 | text = data_helper.clean_str(lines[i][0].strip()).split()
224 | if delimiter_text_labels == delimiter_labels:
225 | labels = lines[i][1:]
226 | else:
227 | labels = lines[i][1].strip().split(delimiter_labels)
228 | instances.append((text, labels))
229 | return instances
230 |
231 | ############################################################################################################################
232 | # Applications specific loading
233 | ############################################################################################################################
234 |
235 | def load_snli_data(path):
236 | l = load_csv_lines(path, delimiter = '\t', indices = [0, 5, 6])
237 | l.pop(0)
238 |
239 | labels = [x[0] for x in l]
240 | premises = [x[1] for x in l]
241 | implications = [x[2] for x in l]
242 |
243 | return premises, implications, labels
244 |
245 |
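A quick illustration (placeholder path and data of my own, not part of io_helper.py) of the generic I/O helpers above: write a score dictionary to a tab-separated file and read it back.

from helpers import io_helper

scores = {"doc_a.txt": -0.7, "doc_b.txt": 0.4}
io_helper.write_dictionary("/tmp/scores.txt", scores)
rows = io_helper.load_csv_lines("/tmp/scores.txt", delimiter = '\t')
print(rows)   # [['doc_a.txt', '-0.7'], ['doc_b.txt', '0.4']]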
--------------------------------------------------------------------------------
/ml/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/ml/__init__.py
--------------------------------------------------------------------------------
/ml/batcher.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import random
3 |
4 | def batch_iter(data, batch_size, num_epochs, shuffle = True):
5 | """
6 | Generates a batch iterator for a dataset.
7 | """
8 | #data = np.array(data, dtype = np.int32)
9 | data_size = len(data)
10 |
11 | num_batches_per_epoch = int(data_size/batch_size) + 1
12 | for epoch in range(num_epochs):
13 | # Shuffle the data at each epoch
14 | if shuffle:
15 | #shuffle_indices = np.random.permutation(np.arange(data_size))
16 | #shuffled_data = data[shuffle_indices]
17 | random.shuffle(data)
18 | #else:
19 | # shuffled_data = data
20 |
21 | for batch_num in range(num_batches_per_epoch):
22 | start_index = batch_num * batch_size
23 | end_index = min((batch_num + 1) * batch_size, data_size)
24 | yield data[start_index:end_index]
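A minimal sketch (toy data of my own) of how batch_iter is consumed: it yields consecutive slices of the data list, cycling over the data for the requested number of epochs, with an optional in-place shuffle at each epoch.

from ml import batcher

data = [(i, i * i) for i in range(10)]
for batch in batcher.batch_iter(data, batch_size = 4, num_epochs = 1, shuffle = False):
    print(len(batch), batch)   # batches of 4, 4 and 2 examples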
--------------------------------------------------------------------------------
/ml/loss_functions.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 |
3 | def softmax_cross_entropy(predictions, golds):
4 | losses = tf.nn.softmax_cross_entropy_with_logits(logits=predictions, labels=golds)
5 | loss = tf.reduce_mean(losses)
6 | return loss
7 |
8 | def softmax_cross_entropy_micro_batches(predictions, golds, params):
9 | print("Defining micro-batched cross-entropy loss...")
10 | micro_batch_size, batch_size = params
11 | print("Micro-batch size: " + str(micro_batch_size))
12 |
13 | preds_unstacked = tf.unstack(predictions, num = batch_size)
14 | golds_unstacked = tf.unstack(golds, num = batch_size)
15 |
16 | if (len(preds_unstacked) % micro_batch_size != 0 or len(preds_unstacked) != len(golds_unstacked)):
17 | raise ValueError("Unexpected batch size: it must be a multiple of the micro-batch size, and the numbers of predictions and golds must match!")
18 |
19 | loss = 0
20 | k = 0
21 | while k*micro_batch_size < len(preds_unstacked):
22 | print("Micro-batch iteration: " + str(k+1))
23 | preds_micro_batch = tf.nn.softmax(tf.stack(preds_unstacked[k*micro_batch_size : (k+1)*micro_batch_size]))
24 | golds_micro_batch = tf.nn.softmax(tf.stack(golds_unstacked[k*micro_batch_size : (k+1)*micro_batch_size]))
25 | loss += softmax_cross_entropy(preds_micro_batch, golds_micro_batch)
26 | k += 1
27 | return loss
28 |
29 | def margin_based_loss(predictions, golds):
30 | return tf.reduce_sum(tf.maximum(tf.subtract(tf.constant(1.0, dtype = tf.float64), tf.multiply(predictions, golds)), 0.0))
31 |
32 | def mse_loss(predictions, golds):
33 | return tf.reduce_sum(tf.square(tf.subtract(predictions, golds)))
34 |
35 | def contrastive_loss(predictions, golds, params):
36 | print("Defining contrastive loss...")
37 | num_pos_pairs, num_neg_pairs, gamma = params
38 | preds_unstacked = tf.unstack(predictions)
39 | size = num_pos_pairs + num_neg_pairs
40 | if (len(preds_unstacked) % size != 0):
41 | raise ValueError("Unexpected batch size: it must be a multiple of the number of contrastive examples!")
42 |
43 | loss = 0
44 | k = 0
45 | while k*size < len(preds_unstacked):
46 | pos_pairs = preds_unstacked[k*size : k*size + num_pos_pairs]
47 | print("Len of pos pair preds: " + str(len(pos_pairs)))
48 | neg_pairs = preds_unstacked[k*size + num_pos_pairs : (k+1) * size]
49 | print("Len of neg pair preds: " + str(len(neg_pairs)))
50 | for p in pos_pairs:
51 | for n in neg_pairs:
52 | loss += tf.maximum(tf.constant(0.0, dtype = tf.float64), gamma - (p - n))
53 | k += 1
54 | return loss
55 |
56 | def contrastive_loss_nonbinary(predictions, golds, params):
57 | print("Defining contrastive loss...")
58 | num_pos_pairs, num_neg_pairs, mean_square_error, batch_size = params
59 | preds_unstacked = tf.unstack(predictions, num = batch_size)
60 | golds_unstacked = tf.unstack(golds, num = batch_size)
61 |
62 | size = num_pos_pairs + num_neg_pairs
63 | if (len(preds_unstacked) % size != 0 or len(preds_unstacked) != len(golds_unstacked)):
64 | raise ValueError("Unexpected batch size: it must be a multiple of the number of contrastive examples, and the numbers of predictions and golds must match!")
65 |
66 | loss = 0
67 | k = 0
68 | while k*size < len(preds_unstacked):
69 | print("Micro-batch iteration: " + str(k+1))
70 | pos_pairs = preds_unstacked[k*size : k*size + num_pos_pairs]
71 | pos_golds = golds_unstacked[k*size : k*size + num_pos_pairs]
72 | print("Len of pos pair preds: " + str(len(pos_pairs)))
73 |
74 | neg_pairs = preds_unstacked[k*size + num_pos_pairs : (k+1) * size]
75 | neg_golds = golds_unstacked[k*size + num_pos_pairs : (k+1) * size]
76 | print("Len of neg pair preds: " + str(len(neg_pairs)))
77 |
78 | for i in range(len(pos_pairs)):
79 | for j in range(len(neg_pairs)):
80 | if mean_square_error:
81 | if k == 0 and i == 0 and j == 0:
82 | print("MSE NCE loss for pair...")
83 | loss += tf.square((pos_golds[i] - neg_golds[j]) - (pos_pairs[i] - neg_pairs[j]))
84 | else:
85 | if k == 0 and i == 0 and j == 0:
86 | print("Hinge, margin loss for pair...")
87 | loss += tf.maximum(tf.constant(0.0, dtype = tf.float64), (pos_golds[i] - neg_golds[j]) - (pos_pairs[i] - neg_pairs[j]))
88 | k += 1
89 | return loss
90 |
91 |
92 |
93 |
94 |
95 |
96 |
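The simpler losses above can be sanity-checked on constant tensors. A small sketch in TensorFlow 1.x graph mode (the API this module targets), with made-up prediction and gold values:

import tensorflow as tf
from ml import loss_functions

preds = tf.constant([0.8, -0.3], dtype = tf.float64)
golds = tf.constant([1.0, -1.0], dtype = tf.float64)
with tf.Session() as sess:
    print(sess.run(loss_functions.margin_based_loss(preds, golds)))   # 0.2 + 0.7 = 0.9
    print(sess.run(loss_functions.mse_loss(preds, golds)))            # 0.04 + 0.49 = 0.53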
--------------------------------------------------------------------------------
/ml/trainer.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from evaluation import confusion_matrix
3 | from ml import batcher
4 | import random
5 | import copy
6 | import tensorflow as tf
7 | from sys import stdin
8 |
9 | class SimpleTrainer(object):
10 | def __init__(self, model, session, feed_dict_function, eval_func, configuration_func = None, labels = None, additional_results_func = None):
11 | self.model = model
12 | self.session = session
13 | self.feed_dict_function = feed_dict_function
14 | self.eval_func = eval_func
15 | self.config_func = configuration_func
16 | self.additional_results_function = additional_results_func
17 | self.labels = labels
18 |
19 | def train_model_single_iteration(self, feed_dict):
20 | self.model.train_step.run(session = self.session, feed_dict = feed_dict)
21 |
22 | def predict(self, feed_dict):
23 | return self.model.preds.eval(session = self.session, feed_dict = feed_dict)
24 |
25 | def evaluate(self, feed_dict, gold):
26 | preds = self.predict(feed_dict)
27 | return preds, self.eval_func(gold, preds)
28 |
29 | def test(self, test_data, batch_size, eval_params = None, print_batches = False, batch_size_irrelevant = True, compute_loss = False):
30 | if compute_loss:
31 | epoch_loss = 0
32 | batches_eval = batcher.batch_iter(test_data, batch_size, 1, shuffle = False)
33 | eval_batch_counter = 1
34 |
35 | for batch_eval in batches_eval:
36 | if (batch_size_irrelevant or len(batch_eval) == batch_size):
37 | feed_dict_eval, golds_batch_eval = self.feed_dict_function(self.model, batch_eval, None, predict = True)
38 | preds_batch_eval = self.predict(feed_dict_eval)
39 | if compute_loss:
40 | batch_eval_loss = self.model.loss.eval(session = self.session, feed_dict = feed_dict_eval)
41 | epoch_loss += batch_eval_loss
42 |
43 | if eval_batch_counter == 1:
44 | golds = golds_batch_eval
45 | preds = preds_batch_eval
46 | else:
47 | golds = np.concatenate((golds, golds_batch_eval), axis = 0)
48 | preds = np.concatenate((preds, preds_batch_eval), axis = 0)
49 | if print_batches:
50 | print("Eval batch counter: " + str(eval_batch_counter), flush=True)
51 | eval_batch_counter += 1
52 |
53 | if self.eval_func is not None:
54 | score = self.eval_func(golds, preds, eval_params)
55 | if compute_loss:
56 | return preds, score, epoch_loss
57 | else:
58 | return preds, score
59 | else:
60 | if compute_loss:
61 | return preds, epoch_loss
62 | else:
63 | return preds
64 |
65 | def train(self, train_data, batch_size, max_num_epochs, num_epochs_not_better_end = 5, epoch_diff_smaller_end = 1e-5, print_batch_losses = True, configuration = None, eval_params = None, shuffle_data = True, batch_size_irrelevant = True):
66 | batch_counter = 0
67 | epoch_counter = 0
68 | epoch_losses = []
69 | epoch_loss = 0
70 | batches_in_epoch = int(len(train_data)/batch_size) + 1
71 |
72 | batches = batcher.batch_iter(train_data, batch_size, max_num_epochs, shuffle = shuffle_data)
73 | for batch in batches:
74 | batch_counter += 1
75 |
76 | if (batch_size_irrelevant or len(batch) == batch_size):
77 | feed_dict, gold_labels = self.feed_dict_function(self.model, batch, config = configuration, predict = False)
78 | self.train_model_single_iteration(feed_dict)
79 | batch_loss = self.model.loss.eval(session = self.session, feed_dict = feed_dict)
80 | if print_batch_losses:
81 | print("Batch " + str(batch_counter) + ": " + str(batch_loss), flush=True)
82 |
83 | if batch_counter % batches_in_epoch == 0:
84 | epoch_counter += 1
85 | print("Evaluating the epoch loss for epoch " + str(epoch_counter), flush=True)
86 |
87 | if self.eval_func:
88 | preds, score, epoch_loss = self.test(train_data, batch_size, eval_params, False, batch_size_irrelevant = batch_size_irrelevant, compute_loss = True)
89 | else:
90 | preds, epoch_loss = self.test(train_data, batch_size, None, False, batch_size_irrelevant = batch_size_irrelevant, compute_loss = True)
91 |
92 | print("Epoch " + str(epoch_counter) + ": " + str(epoch_loss), flush=True)
93 | if self.eval_func:
94 | print("Epoch (train) performance: " + str(score), flush=True)
95 | print("Previous epochs: " + str(epoch_losses), flush=True)
96 |
97 | if len(epoch_losses) == num_epochs_not_better_end and (epoch_losses[0] - epoch_loss < epoch_diff_smaller_end):
98 | break
99 | else:
100 | epoch_losses.append(epoch_loss)
101 | epoch_loss = 0
102 | if len(epoch_losses) > num_epochs_not_better_end:
103 | epoch_losses.pop(0)
104 |
105 | def train_dev(self, train_data, dev_data, batch_size, max_num_epochs, num_devs_not_better_end = 5, batch_dev_perf = 100, print_batch_losses = True, dev_score_maximize = True, configuration = None, print_training = False, shuffle_data = True):
106 | batch_counter = 0
107 | epoch_counter = 0
108 | epoch_losses = []
109 | dev_performances = []
110 | dev_losses = []
111 | epoch_loss = 0
112 |
113 | best_model = None
114 | best_performance = -1
115 | best_preds_dev = None
116 | batches_in_epoch = int(len(train_data)/batch_size) + 1
117 |
118 | batches = batcher.batch_iter(train_data, batch_size, max_num_epochs, shuffle = shuffle_data)
119 | for batch in batches:
120 | batch_counter += 1
121 |
122 | if (len(batch) == batch_size):
123 | feed_dict, gold_labels = self.feed_dict_function(self.model, batch, configuration)
124 | self.train_model_single_iteration(feed_dict)
125 |
126 | batch_loss = self.model.pure_loss.eval(session = self.session, feed_dict = feed_dict)
127 | #batch_dist_loss = self.model.distance_loss.eval(session = self.session, feed_dict = feed_dict)
128 | epoch_loss += batch_loss
129 |
130 | if print_training and print_batch_losses:
131 | print("Batch loss" + str(batch_counter) + ": " + str(batch_loss), flush=True)
132 | #print("Batch distance loss" + str(batch_counter) + ": " + str(batch_dist_loss))
133 |
134 | if batch_counter % batches_in_epoch == 0:
135 | epoch_counter += 1
136 | if print_training:
137 | print("\nEpoch " + str(epoch_counter) + ": " + str(epoch_loss), flush=True)
138 | print("Previous epochs: " + str(epoch_losses) + "\n", flush=True)
139 | epoch_losses.append(epoch_loss)
140 | epoch_loss = 0
141 | if len(epoch_losses) > num_devs_not_better_end:
142 | epoch_losses.pop(0)
143 |
144 | if batch_counter % batch_dev_perf == 0:
145 | if print_training:
146 | print("\n### Evaluation of development set, after batch " + str(batch_counter) + " ###", flush=True)
147 | batches_dev = batcher.batch_iter(dev_data, batch_size, 1, shuffle = False)
148 | dev_batch_counter = 1
149 | dev_loss = 0
150 | for batch_dev in batches_dev:
151 | if (len(batch_dev) == batch_size):
152 | feed_dict_dev, golds_batch_dev = self.feed_dict_function(self.model, batch_dev, configuration, predict = True)
153 | dev_batch_loss = self.model.pure_loss.eval(session = self.session, feed_dict = feed_dict_dev)
154 | dev_loss += dev_batch_loss
155 | if print_training and print_batch_losses:
156 | print("Dev batch: " + str(dev_batch_counter) + ": " + str(dev_batch_loss), flush=True)
157 | preds_batch_dev = self.predict(feed_dict_dev)
158 | if dev_batch_counter == 1:
159 | golds = golds_batch_dev
160 | preds = preds_batch_dev
161 | else:
162 | golds = np.concatenate((golds, golds_batch_dev), axis = 0)
163 | preds = np.concatenate((preds, preds_batch_dev), axis = 0)
164 | dev_batch_counter += 1
165 | print("Development pure loss: " + str(dev_loss), flush=True)
166 | score = self.eval_func(golds, preds, self.labels)
167 | if self.additional_results_function:
168 | self.additional_results_function(self.model, self.session)
169 | if print_training:
170 | print("Peformance: " + str(score) + "\n", flush=True)
171 | print("Previous performances: " + str(dev_performances), flush=True)
172 | print("\nLoss: " + str(dev_loss) + "\n", flush=True)
173 | print("Previous losses: " + str(dev_losses), flush=True)
174 | if score > best_performance:
175 | best_model = self.model.get_model(self.session)
176 | best_preds_dev = preds
177 | best_performance = score
178 |
179 | #if len(dev_performances) == num_devs_not_better_end and ((dev_score_maximize and dev_performances[0] >= score) or (not dev_score_maximize and dev_performances[0] <= score)):
180 | if len(dev_losses) == num_devs_not_better_end and dev_losses[0] < dev_loss:
181 | break
182 | else:
183 | dev_performances.append(score)
184 | dev_losses.append(dev_loss)
185 | if len(dev_performances) > num_devs_not_better_end:
186 | dev_performances.pop(0)
187 | dev_losses.pop(0)
188 | return (best_model, best_performance, best_preds_dev, golds)
189 |
190 | def cross_validate(self, data, batch_size, max_num_epochs, num_folds = 5, num_devs_not_better_end = 5, batch_dev_perf = 100, print_batch_losses = True, dev_score_maximize = True, configuration = None, print_training = False, micro_performance = True, shuffle_data = True):
191 | folds = np.array_split(data, num_folds)
192 | results = {}
193 |
194 | for i in range(num_folds):
195 | train_data = []
196 | for j in range(num_folds):
197 | if j != i:
198 | train_data.extend(folds[j])
199 | dev_data = folds[i]
200 |
201 | print("Sizes: train " + str(len(train_data)) + "; dev " + str(len(dev_data)), flush=True)
202 | print("Fold " + str(i+1) + ", creating model...", flush=True)
203 | model, conf_str, session = self.config_func(configuration)
204 | self.model = model
205 | self.session = session
206 | print("Fold " + str(i+1) + ", training the model...", flush=True)
207 | results[conf_str + "__fold-" + str(i+1)] = self.train_dev(train_data, dev_data, batch_size, max_num_epochs, num_devs_not_better_end, batch_dev_perf, print_batch_losses, dev_score_maximize, configuration, print_training, shuffle_data = shuffle_data)
208 |
209 | print("Closing session, reseting the default graph (freeing memory)", flush=True)
210 | self.session.close()
211 | tf.reset_default_graph()
212 | print("Performance: " + str(results[conf_str + "__fold-" + str(i+1)][1]), flush=True)
213 |
214 | if micro_performance:
215 | print("Concatenating fold predictions for micro-performance computation", flush=True)
216 | cntr = 0
217 | for k in results:
218 | cntr += 1
219 | if cntr == 1:
220 | all_preds = results[k][2]
221 | all_golds = results[k][3]
222 | else:
223 | all_preds = np.concatenate((all_preds, results[k][2]), axis = 0)
224 | all_golds = np.concatenate((all_golds, results[k][3]), axis = 0)
225 | micro_perf = self.eval_func(all_golds, all_preds, self.labels)
226 | return results, micro_perf
227 | else:
228 | return results
229 |
230 | def grid_search(self, configurations, train_data, dev_data, batch_size, max_num_epochs, num_devs_not_better_end = 5, batch_dev_perf = 100, print_batch_losses = True, dev_score_maximize = True, cross_validate = False, cv_folds = None, print_training = False, micro_performance = False, shuffle_data = True):
231 | if self.config_func is None:
232 | raise ValueError("Function that creates a concrete model for a given hyperparameter configuration must be defined!")
233 | results = {}
234 | config_cnt = 0
235 | for config in configurations:
236 | config_cnt += 1
237 | print("Config: #" + str(config_cnt), flush=True)
238 | if cross_validate:
239 | results[str(config)] = self.cross_validate(train_data, batch_size, max_num_epochs, cv_folds, num_devs_not_better_end, batch_dev_perf, print_batch_losses, dev_score_maximize, config, print_training, micro_performance = micro_performance, shuffle_data = shuffle_data)
240 | if micro_performance:
241 | print("### Configuration performance: " + str(results[str(config)][1]), flush=True)
242 | else:
243 | model, conf_str, session = self.config_func(config)
244 | self.model = model
245 | self.session = session
246 | results[conf_str] = self.train_dev(train_data, dev_data, batch_size, max_num_epochs, num_devs_not_better_end, batch_dev_perf, print_batch_losses, dev_score_maximize, config, print_training, shuffle_data = shuffle_data)
247 |
248 | print("Closing session, reseting the default graph (freeing memory)", flush=True)
249 | self.session.close()
250 | tf.reset_default_graph()
251 | return results
252 |
253 |
254 | class Trainer(object):
255 | """
256 | A wrapper around the classifiers, implementing functionality like cross-validation, batching, grid search, etc.
257 | """
258 | def __init__(self, classifier, one_hot_encoding_preds = False, class_indexes = True):
259 | self.classifier = classifier
260 | self.one_hot_encoding_preds = one_hot_encoding_preds
261 | self.class_indices = class_indexes
262 |
263 | def cross_validate(self, tf_session, class_labels, data_input, data_labels, num_folds, batch_size, num_epochs, model_reset_function = None, shuffle = False, fold_avg = 'micro', cl_perf = None, overall_perf = True, num_epochs_not_better_end = 2):
264 | conf_matrices = []
265 | best_epochs = []
266 | if shuffle:
267 | paired = list(zip(data_input, data_labels))
268 | random.shuffle(paired)
269 | data_input, data_labels = zip(*paired)
270 |
271 | folds = self.cross_validation_fold(data_input, data_labels, num_folds)
272 | fold_counter = 1
273 | for fold in folds:
274 | print("Fold: " + str(fold_counter), flush=True)
275 | train_input = fold[0]; train_labels = fold[1]; dev_input = fold[2]; dev_labels = fold[3]
276 | model_reset_function(tf_session)
277 | conf_mat, epoch, _ = self.train_and_test(tf_session, class_labels, train_input, train_labels, dev_input, dev_labels, batch_size, num_epochs, cl_perf, overall_perf, num_epochs_not_better_end = num_epochs_not_better_end)
278 | conf_matrices.append(conf_mat)
279 | best_epochs.append(epoch)
280 | fold_counter += 1
281 | if fold_avg == 'macro':
282 | return conf_matrices, best_epochs
283 | elif fold_avg == 'micro':
284 | return confusion_matrix.merge_confusion_matrices(conf_matrices), best_epochs
285 | else:
286 | raise ValueError("Unknown value for fold_avg")
287 |
288 |
289 | def cross_validation_fold(self, data_input, data_labels, num_folds):
290 | folds_x_train = np.array_split(data_input, num_folds)
291 | folds_y_train = np.array_split(data_labels, num_folds)
292 | for i in range(num_folds):
293 | train_set_x = []
294 | train_set_y = []
295 | for j in range(num_folds):
296 | if j != i:
297 | train_set_x.extend(folds_x_train[j])
298 | train_set_y.extend(folds_y_train[j])
299 | dev_set_x = folds_x_train[i]
300 | dev_set_y = folds_y_train[i]
301 | yield [np.array(train_set_x), np.array(train_set_y), dev_set_x, dev_set_y]
302 |
303 | def train_and_test(self, session, class_labels, x_train, y_train, x_test, y_test, batch_size, num_epochs, cl_perf = None, overall_perf = True, num_epochs_not_better_end = 10, manual_features = False):
304 | batch_counter = 0
305 | epoch_loss = 0
306 | epoch_counter = 0
307 | last_epoch_results = []
308 | best_f = 0
309 | best_epoch = 0
310 | best_conf_mat = None
311 | best_predictions = []
312 |
313 | num_batches_per_epoch = int((len(x_train) if not manual_features else len(x_train[0])) / batch_size) + 1
314 |
315 | batches = batcher.batch_iter(list(zip(x_train, y_train)), batch_size, num_epochs) if not manual_features else batcher.batch_iter(list(zip(x_train[0], x_train[1], y_train)), batch_size, num_epochs)
316 | for batch in batches:
317 | if manual_features:
318 | x_b, x_b_man, y_b = zip(*batch)
319 | batch_loss = self.classifier.train(session, x_b, y_b, man_feats = x_b_man)
320 | else:
321 | x_b, y_b = zip(*batch)
322 | x_b = np.array(x_b)
323 | y_b = np.array(y_b)
324 | batch_loss = self.classifier.train(session, x_b, y_b)
325 | epoch_loss += batch_loss
326 |
327 | batch_counter += 1
328 |
329 | #if batch_counter % 50 == 0:
330 | #print("Batch " + str(batch_counter) + " loss: " + str(batch_loss))
331 | # evaluating current model's performance on test
332 | #preds, gold = self.classifier.predict(session, x_test, y_test)
333 | #self.evaluate_performance(class_labels, preds, gold, cl_perf, overall_perf, " (test set) ")
334 |
335 | if batch_counter % num_batches_per_epoch == 0:
336 | epoch_counter += 1
337 | print("Epoch " + str(epoch_counter) + " loss: " + str(epoch_loss), flush=True)
338 | last_epoch_results.append(epoch_loss)
339 | epoch_loss = 0
340 |
341 | if manual_features:
342 | x_test_text = x_test[0]
343 | x_test_manual = x_test[1]
344 | preds, gold = self.classifier.predict(session, x_test_text, y_test, man_feats = x_test_manual)
345 |
346 | else:
347 | preds, gold = self.classifier.predict(session, x_test, y_test)
348 |
349 | cm = self.evaluate_performance(class_labels, preds, gold, cl_perf, overall_perf, " (test set) ")
350 |
351 | fepoch = cm.accuracy # cm.get_class_performance("1")[2]
352 | if fepoch > best_f:
353 | best_f = fepoch
354 | best_epoch = epoch_counter
355 | best_conf_mat = cm
356 | best_predictions = preds
357 |
358 | if len(last_epoch_results) > num_epochs_not_better_end:
359 | last_epoch_results.pop(0)
360 | print("Last epochs: " + str(last_epoch_results), flush=True)
361 |
362 | if len(last_epoch_results) == num_epochs_not_better_end and last_epoch_results[0] < last_epoch_results[-1]:
363 | print("End condition satisfied, training finished. ", flush=True)
364 | break
365 |
366 | #preds, gold = self.classifier.predict(session, x_train, y_train)
367 | #self.evaluate_performance(class_labels, preds, gold, cl_perf, overall_perf, " (train set) ")
368 |
369 | #preds, gold = self.classifier.predict(session, x_test, y_test)
370 | #conf_mat = self.evaluate_performance(class_labels, preds, gold, cl_perf, overall_perf, " (test set) ")
371 | #return conf_mat
372 | return best_conf_mat, best_epoch, best_predictions
373 |
374 | def evaluate_performance(self, class_labels, preds, gold, cl_perf = None, overall_perf = True, desc = " () ", print_perf = True):
375 | conf_matrix = confusion_matrix.ConfusionMatrix(class_labels, preds, gold, self.one_hot_encoding_preds, self.class_indices)
376 | if print_perf:
377 | if cl_perf is not None:
378 | for cl in cl_perf:
379 | p, r, f = conf_matrix.get_class_performance(cl)
380 | print(desc + " Class: " + cl + "\nP: " + str(p) + "\nR: " + str(r) + "\nF: " + str(f) + "\n", flush=True)
381 | if overall_perf:
382 | print(desc + " Micro F1: " + str(conf_matrix.microf1) + "\nMacro F1: " + str(conf_matrix.macrof1) + "\n", flush=True)
383 | return conf_matrix
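SimpleTrainer is model-agnostic: it only relies on the callbacks passed to it. The sketch below (the names example_feed_dict and example_accuracy are mine, not part of the repository) shows the shapes those callbacks are expected to have, mirroring how SimpleTrainer invokes them as feed_dict_function(model, batch, config, predict) and eval_func(golds, preds, params), and how nlp.py wires them up:

import numpy as np

def example_feed_dict(model, batch, config = None, predict = False):
    # A batch is a list of (x, y) pairs; return the model's feed dict and the gold labels.
    x, y = zip(*batch)
    return model.get_feed_dict(x, y, 1.0 if predict else 0.5), y

def example_accuracy(golds, preds, params = None):
    # Accuracy over one-hot golds and raw prediction scores.
    return float(np.mean(np.argmax(golds, axis = 1) == np.argmax(preds, axis = 1)))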
--------------------------------------------------------------------------------
/nlp.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | from embeddings import text_embeddings
4 | from helpers import io_helper
5 | from helpers import data_shaper
6 | from convolution import cnn
7 | from ml import loss_functions
8 | from evaluation import confusion_matrix
9 | from ml import trainer
10 | from helpers import data_helper
11 | import math
12 | import nltk
13 | from sts import simple_sts
14 | from graphs import graph
15 | import math
16 | import os
17 | from sys import stdin
18 | from datetime import datetime
19 | from scipy import spatial
20 | import codecs
21 | import pickle
22 |
23 | def map_lang(lang):
24 | if lang.lower() == 'english':
25 | return 'en'
26 | elif lang.lower() == 'french':
27 | return 'fr'
28 | elif lang.lower() == 'german':
29 | return 'de'
30 | elif lang.lower() == 'italian':
31 | return 'it'
32 | elif lang.lower() == 'spanish':
33 | return 'es'
34 | elif lang.lower() in ["en", "es", "de", "fr", "it"]:
35 | return lang.lower()
36 | else:
37 | return None
38 |
39 | def inverse_map_lang(lang):
40 | if lang.lower() == 'en':
41 | return 'english'
42 | elif lang.lower() == 'fr':
43 | return 'french'
44 | elif lang.lower() == 'de':
45 | return 'german'
46 | elif lang.lower() == 'it':
47 | return 'italian'
48 | elif lang.lower() == 'es':
49 | return 'spanish'
50 | elif lang.lower() in ["english", "spanish", "german", "french", "italian"]:
51 | return lang.lower()
52 | else:
53 | return None
54 |
55 | def load_embeddings(path):
56 | embeddings = text_embeddings.Embeddings()
57 | embeddings.load_embeddings(path, limit = None, language = 'default', print_loading = False)
58 | return embeddings
59 |
60 | def build_feed_dict_func(model, data, config = None, predict = False):
61 | x, y = zip(*data)
62 | fd = model.get_feed_dict(x, None if None in y else y, 1.0 if predict else 0.5)
63 | return fd, y
64 |
65 | def eval_func(golds, preds, params = None):
66 | gold_labs = np.argmax(golds, axis = 1)
67 | pred_labs = np.argmax(preds, axis = 1)
68 |
69 | conf_matrix = confusion_matrix.ConfusionMatrix(params["dist_labels"], pred_labs, gold_labs, False, class_indices = True)
70 | res = conf_matrix.accuracy
71 | return 0 if math.isnan(res) else res
72 |
73 | def get_prediction_labels(preds, dist_labels):
74 | pred_labs = [dist_labels[x] for x in np.argmax(preds, axis = 1)]
75 | return pred_labs
76 |
77 | def train_cnn(texts, languages, labels, embeddings, parameters, model_serialization_path, emb_lang = 'default'):
78 | # preparing texts
79 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Preparing texts...', flush=True)
80 | texts_clean = [data_helper.clean_str(t.strip()).split() for t in texts]
81 | # encoding languages (full name to abbreviation)
82 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Encoding languages (full name to abbreviation)...', flush=True)
83 | langs = [map_lang(x) for x in languages]
84 | # preparing training examples
85 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Preparing training examples...', flush=True)
86 | x_train, y_train, dist_labels = data_shaper.prep_classification(texts_clean, labels, embeddings, embeddings_language = emb_lang, multilingual_langs = langs, numbers_token = '', punct_token = '', add_out_of_vocabulary_terms = False)
87 |
88 | # defining the CNN model
89 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Defining the CNN model...', flush=True)
90 | cnn_classifier = cnn.CNN(embeddings = (embeddings.emb_sizes[emb_lang], embeddings.lang_embeddings[emb_lang]), num_conv_layers = parameters["num_convolutions"], filters = parameters["filters"], k_max_pools = parameters["k_max_pools"], manual_features_size = 0)
91 | cnn_classifier.define_model(len(x_train[0]), len(dist_labels), loss_functions.softmax_cross_entropy, len(embeddings.lang_vocabularies[emb_lang]), l2_reg_factor = parameters["reg_factor"], update_embeddings = parameters["update_embeddings"])
92 | cnn_classifier.define_optimization(learning_rate = parameters["learning_rate"])
93 | cnn_classifier.set_distinct_labels(dist_labels)
94 |
95 | # initializing a Tensorflow session
96 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Initializing a Tensorflow session...', flush=True)
97 | session = tf.InteractiveSession()
98 | session.run(tf.global_variables_initializer())
99 |
100 | # training the model
101 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Training the model...', flush=True)
102 | simp_trainer = trainer.SimpleTrainer(cnn_classifier, session, build_feed_dict_func, eval_func, configuration_func = None)
103 | simp_trainer.train(list(zip(x_train, y_train)), parameters["batch_size"], parameters["num_epochs"], num_epochs_not_better_end = 5, epoch_diff_smaller_end = parameters["epoch_diff_smaller_end"], print_batch_losses = True, eval_params = { "dist_labels" : dist_labels })
104 |
105 | # storing the model
106 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Storing the model...', flush=True)
107 | cnn_classifier.serialize(session, model_serialization_path)
108 | session.close()
109 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Model training is done!', flush=True)
110 |
111 | def test_cnn(texts, languages, labels, embeddings, model_serialization_path, predictions_file_path, parameters, emb_lang = 'default'):
112 | # loading the serialized
113 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Loading the serialized model...', flush=True)
114 | cnn_classifier, session = cnn.load_model(model_serialization_path, embeddings.lang_embeddings[emb_lang], loss_functions.softmax_cross_entropy, just_predict = (labels is None))
115 |
116 | # preparing/cleaning the texts
117 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Preparing/cleaning the texts...', flush=True)
118 | texts_clean = [data_helper.clean_str(t.strip()).split() for t in texts]
119 | # encoding languages (full name to abbreviation)
120 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Encoding languages (full name to abbreviation)...', flush=True)
121 | langs = [map_lang(x) for x in languages]
122 | # preparing testing examples
123 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Preparing test examples...', flush=True)
124 | if labels:
125 | x_test, y_test, dist_labels = data_shaper.prep_classification(texts_clean, labels, embeddings, embeddings_language = emb_lang, multilingual_langs = langs, numbers_token = '', punct_token = '', add_out_of_vocabulary_terms = False, dist_labels = cnn_classifier.dist_labels, max_seq_len = cnn_classifier.max_text_length)
126 | else:
127 | x_test = data_shaper.prep_classification(texts_clean, labels, embeddings, embeddings_language = emb_lang, multilingual_langs = langs, numbers_token = '', punct_token = '', add_out_of_vocabulary_terms = False, dist_labels = cnn_classifier.dist_labels, max_seq_len = cnn_classifier.max_text_length)
128 |
129 | simp_trainer = trainer.SimpleTrainer(cnn_classifier, session, build_feed_dict_func, None if not labels else eval_func, configuration_func = None)
130 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Starting test...', flush=True)
131 | results = simp_trainer.test(list(zip(x_test, y_test if labels else [None] * len(x_test))), parameters["batch_size"], eval_params = { "dist_labels" : cnn_classifier.dist_labels }, batch_size_irrelevant = True)
132 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Getting prediction labels...', flush=True)
133 | pred_labs = get_prediction_labels(results[0] if labels else results, cnn_classifier.dist_labels)
134 |
135 | if labels is None:
136 | io_helper.write_list(predictions_file_path, pred_labs)
137 | else:
138 | list_pairs = list(zip(pred_labs, labels))
139 | list_pairs.insert(0, ("Prediction", "Real label"))
140 | list_pairs.append(("Performance: ", str(results[1])))
141 | io_helper.write_list_tuples_separated(predictions_file_path, list_pairs)
142 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + ' Prediction is done!', flush=True)
143 |
144 | def scale_supervised(filenames, texts, languages, embeddings, predictions_file_path, pivot1, pivot2, stopwords = [], emb_lang = 'default'):
145 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Tokenizing documents...", flush = True)
146 | texts_tokenized = []
147 | for i in range(len(texts)):
148 | print("Document " + str(i + 1) + " of " + str(len(texts)), flush = True)
149 | texts_tokenized.append(simple_sts.simple_tokenize(texts[i], stopwords, lang_prefix = map_lang(languages[i])))
150 |
151 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Building tf-idf indices for weighted aggregation...", flush = True)
152 | tf_index, idf_index = simple_sts.build_tf_idf_indices(texts_tokenized)
153 | agg_vecs = []
154 | for i in range(len(texts_tokenized)):
155 | print("Aggregating vector of the document: " + str(i+1) + " of " + str(len(texts_tokenized)), flush = True)
156 | #agg_vec = simple_sts.aggregate_weighted_text_embedding(embeddings, tf_index[i], idf_index, emb_lang, weigh_idf = (len(set(languages)) == 1))
157 | agg_vec = simple_sts.aggregate_weighted_text_embedding(embeddings, tf_index[i], idf_index, emb_lang, weigh_idf = False)
158 | agg_vecs.append(agg_vec)
159 | pairs = []
160 | cntr = 0
161 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Computing pairwise similarities...", flush = True)
162 | for i in range(len(agg_vecs) - 1):
163 | for j in range(i+1, len(agg_vecs)):
164 | cntr += 1
165 | #print("Pair: " + filenames[i] + " - " + filenames[j] + " (" + str(cntr) + " of " + str((len(filenames) * (len(filenames) - 1)) / 2))
166 | sim = 1.0 - spatial.distance.cosine(agg_vecs[i], agg_vecs[j])
167 | print (sim)
168 | #print("Similarity: " + str(sim))
169 | pairs.append((filenames[i], filenames[j], sim))
170 |
171 | # rescale distances and produce similarities
172 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Normalizing pairwise similarities...", flush = True)
173 | max_sim = max([x[2] for x in pairs])
174 | min_sim = min([x[2] for x in pairs])
175 | pairs = [(x[0], x[1], (x[2] - min_sim) / (max_sim - min_sim)) for x in pairs]
176 |
177 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Fixing the pivot documents for scaling...", flush = True)
178 | min_sim_pair = [pivot1,pivot2,0.0]
179 |
180 | fixed = [(filenames.index(min_sim_pair[0]), -1.0), (filenames.index(min_sim_pair[1]), 1.0)]
181 | # fixed = [(pivot1, -1.0), (pivot2, 1.0)]
182 |
183 | # propagating position scores, i.e., scaling
184 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Running graph-based label propagation with pivot rescaling and score normalization...", flush = True)
185 | g = graph.Graph(nodes = filenames, edges = pairs)
186 | scores = g.harmonic_function_label_propagation(fixed, rescale_extremes = False, normalize = True)
187 |
188 | embs_to_store = {filenames[x]: [agg_vecs[x],scores[filenames[x]]] for x in range(len(agg_vecs))}
189 | print ("embs_to_store", len(embs_to_store))
190 |
191 | with open('docs-embs.pickle', 'wb') as handle:
192 | pickle.dump(embs_to_store, handle, protocol=pickle.HIGHEST_PROTOCOL)
193 |
194 | if predictions_file_path:
195 | io_helper.write_dictionary(predictions_file_path, scores)
196 |
197 | return scores
198 |
199 | def scale_efficient(filenames, texts, languages, embeddings, predictions_file_path, parameters, emb_lang = 'default', stopwords = []):
200 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Tokenizing documents...", flush = True)
201 | texts_tokenized = []
202 | for i in range(len(texts)):
203 | print("Document " + str(i + 1) + " of " + str(len(texts)), flush = True)
204 | texts_tokenized.append(simple_sts.simple_tokenize(texts[i], stopwords, lang_prefix = map_lang(languages[i])))
205 |
206 |
207 | embs_to_store = {filenames[x]: [texts_tokenized[x]] for x in range(len(texts_tokenized))}
208 | print ("embs_to_store", len(embs_to_store))
209 |
210 | with open('tok-text.pickle', 'wb') as handle:
211 | pickle.dump(embs_to_store, handle, protocol=pickle.HIGHEST_PROTOCOL)
212 |
213 |
214 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Building tf-idf indices for weighted aggregation...", flush = True)
215 | tf_index, idf_index = simple_sts.build_tf_idf_indices(texts_tokenized)
216 | agg_vecs = []
217 | for i in range(len(texts_tokenized)):
218 | print("Aggregating vector of the document: " + str(i+1) + " of " + str(len(texts_tokenized)), flush = True)
219 | #agg_vec = simple_sts.aggregate_weighted_text_embedding(embeddings, tf_index[i], idf_index, emb_lang, weigh_idf = (len(set(languages)) == 1))
220 | agg_vec = simple_sts.aggregate_weighted_text_embedding(embeddings, tf_index[i], idf_index, emb_lang, weigh_idf = False)
221 | agg_vecs.append(agg_vec)
222 |
223 |
224 |
225 | pairs = []
226 | cntr = 0
227 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Computing pairwise similarities...", flush = True)
228 | for i in range(len(agg_vecs) - 1):
229 | for j in range(i+1, len(agg_vecs)):
230 | cntr += 1
231 | #print("Pair: " + filenames[i] + " - " + filenames[j] + " (" + str(cntr) + " of " + str((len(filenames) * (len(filenames) - 1)) / 2))
232 | sim = 1.0 - spatial.distance.cosine(agg_vecs[i], agg_vecs[j])
233 | print (sim)
234 | #print("Similarity: " + str(sim))
235 | pairs.append((filenames[i], filenames[j], sim))
236 |
237 | # rescale distances and produce similarities
238 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Normalizing pairwise similarities...", flush = True)
239 | max_sim = max([x[2] for x in pairs])
240 | min_sim = min([x[2] for x in pairs])
241 | pairs = [(x[0], x[1], (x[2] - min_sim) / (max_sim - min_sim)) for x in pairs]
242 |
243 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Fixing the pivot documents for scaling...", flush = True)
244 | min_sim_pair = [x for x in pairs if x[2] == 0][0]
245 | fixed = [(filenames.index(min_sim_pair[0]), -1.0), (filenames.index(min_sim_pair[1]), 1.0)]
246 |
247 | # propagating position scores, i.e., scaling
248 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Running graph-based label propagation with pivot rescaling and score normalization...", flush = True)
249 | g = graph.Graph(nodes = filenames, edges = pairs)
250 | scores = g.harmonic_function_label_propagation(fixed, rescale_extremes = True, normalize = True)
251 |
252 | embs_to_store = {filenames[x]: [agg_vecs[x],scores[filenames[x]]] for x in range(len(agg_vecs))}
253 | print ("embs_to_store", len(embs_to_store))
254 |
255 | with open('docs-embs.pickle', 'wb') as handle:
256 | pickle.dump(embs_to_store, handle, protocol=pickle.HIGHEST_PROTOCOL)
257 |
258 | if predictions_file_path:
259 | io_helper.write_dictionary(predictions_file_path, scores)
260 | return scores
261 |
262 | def scale(filenames, texts, languages, embeddings, predictions_file_path, parameters, emb_lang = 'default'):
263 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Tokenizing documents...", flush=True)
264 | texts_tokenized = []
265 | for i in range(len(texts)):
266 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Document " + str(i + 1) + " of " + str(len(texts)), flush=True)
267 | texts_tokenized.append(simple_sts.simple_tokenize(texts[i], [], lang_prefix = map_lang(languages[i])))
268 |
269 | doc_dicts = []
270 | cntr = 0
271 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Building vocabularies for documents...", flush=True)
272 | for x in texts_tokenized:
273 | cntr += 1
274 | print("Document " + str(cntr) + " of " + str(len(texts)))
275 | doc_dicts.append(simple_sts.build_vocab(x, count_treshold = 1))
276 |
277 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Computing similarities between document pairs...", flush=True)
278 | items = list(zip(filenames, languages, doc_dicts))
279 | pairs = []
280 | cntr = 0
281 | for i in range(len(items) - 1):
282 | for j in range(i+1, len(items)):
283 | cntr += 1
284 | print("Pair: " + items[i][0] + " - " + items[j][0] + " (" + str(cntr) + " of " + str((len(items) * (len(items) - 1)) / 2), flush=True)
285 | sim = simple_sts.greedy_alignment_similarity(embeddings, items[i][2], items[j][2], lowest_sim = 0.01, length_factor = 0.01)
286 | print("Similarity: " + str(sim), flush=True)
287 | print("\n", flush=True)
288 | pairs.append((items[i][0], items[j][0], sim))
289 |
290 | # rescale distances and produce similarities
291 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Normalizing pairwise similarities...", flush=True)
292 | max_sim = max([x[2] for x in pairs])
293 | min_sim = min([x[2] for x in pairs])
294 | pairs = [(x[0], x[1], (x[2] - min_sim) / (max_sim - min_sim)) for x in pairs]
295 |
296 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Fixing the pivot documents for scaling...", flush=True)
297 | min_sim_pair = [x for x in pairs if x[2] == 0][0]
298 | fixed = [(filenames.index(min_sim_pair[0]), -1.0), (filenames.index(min_sim_pair[1]), 1.0)]
299 |
300 | # propagating position scores, i.e., scaling
301 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+" Running graph-based label propagation with pivot rescaling and score normalization...", flush=True)
302 | g = graph.Graph(nodes = filenames, edges = pairs)
303 | scores = g.harmonic_function_label_propagation(fixed, rescale_extremes = True, normalize = True)
304 | if predictions_file_path:
305 | io_helper.write_dictionary(predictions_file_path, scores)
306 | return scores
307 |
308 | def topically_scale(filenames, texts, languages, embeddings, model_serialization_path, predictions_file_path, parameters, emb_lang = 'default', stopwords = []):
309 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Loading classifier...", flush=True)
310 | cnn_classifier, session = cnn.load_model(model_serialization_path, embeddings.lang_embeddings[emb_lang], loss_functions.softmax_cross_entropy, just_predict = True)
311 | simp_trainer = trainer.SimpleTrainer(cnn_classifier, session, build_feed_dict_func, None, configuration_func = None)
312 |
313 | classified_texts = {}
314 | items = list(zip(filenames, texts, [map_lang(x) for x in languages]))
315 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Topically classifying texts...", flush = True)
316 | for item in items:
317 | fn, text, lang = item
318 | print(fn, flush=True)
319 | # split text in sentences
320 | sentences = nltk.sent_tokenize(text)
321 | sents_clean = [data_helper.clean_str(s.strip()).split() for s in sentences]
322 | langs = [lang] * len(sentences)
323 |
324 | # preparing training examples
325 | x_test = data_shaper.prep_classification(sents_clean, None, embeddings, embeddings_language = emb_lang, multilingual_langs = langs, numbers_token = '', punct_token = '', add_out_of_vocabulary_terms = False, dist_labels = cnn_classifier.dist_labels, max_seq_len = cnn_classifier.max_text_length)
326 |
327 | results = simp_trainer.test(list(zip(x_test, [None]*len(x_test))), parameters["batch_size"], batch_size_irrelevant = True, print_batches = True)
328 |
329 | pred_labs = get_prediction_labels(results, cnn_classifier.dist_labels)
330 | print("Predictions: ", flush=True)
331 | print(pred_labs, flush=True)
332 |
333 | classified_texts[fn] = list(zip(sentences, pred_labs, langs))
334 |
335 | print("Languages: " + str(langs), flush=True)
336 | print("Done with classifying: " + fn, flush=True)
337 |
338 | lines_to_write = []
339 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+ " Topical scaling...", flush=True)
340 | for l in cnn_classifier.dist_labels:
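   |         # For each topic label l: concatenate, per file, all sentences classified as l
   |         # (the file's language is taken from its first sentence); files with 50 characters
   |         # or fewer of such text are dropped, and scaling only runs if more than 3 files remain.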
341 | label_filtered = [(fn, classified_texts[fn][0][2], ' '.join([sent_label[0] for sent_label in classified_texts[fn] if sent_label[1] == l])) for fn in classified_texts]
342 | label_filtered = [x for x in label_filtered if len(x[2].strip()) > 50]
343 | if len(label_filtered) > 3:
344 | print("Topic: " + l, flush=True)
345 | fns = [x[0] for x in label_filtered]
346 | langs = [x[1] for x in label_filtered]
347 | filt_texts = [x[2] for x in label_filtered]
348 |
349 | for i in range(len(fns)):
350 | io_helper.write_list(os.path.dirname(predictions_file_path) + "/" + fns[i].split(".")[0] + "_" + l.replace(" ", "-") + ".txt", [filt_texts[i]])
351 |
352 | label_scale = scale_efficient(fns, filt_texts, [inverse_map_lang(x) for x in langs], embeddings, None, parameters, emb_lang = emb_lang, stopwords = stopwords)
353 | lines_to_write.append("Scaling for class: " + l)
354 | lines_to_write.extend([k + " " + str(label_scale[k]) for k in label_scale])
355 | lines_to_write.append("\n")
356 | else:
357 |             lines_to_write.append("Topic: " + l + ": too few files contain text for this topic (i.e., class) to allow scaling for this topic.")
358 |             print("Topic: " + l + ": too few files contain text for this topic (i.e., class) to allow scaling for this topic.", flush=True)
359 |
360 | io_helper.write_list(predictions_file_path, lines_to_write)
361 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S')+' Topical Scaling is done!', flush=True)
362 |
363 |
364 |
--------------------------------------------------------------------------------
/scaler.py:
--------------------------------------------------------------------------------
1 | from embeddings import text_embeddings
2 | import nlp
3 | from helpers import io_helper
4 | from sts import simple_sts
5 | from sys import stdin
6 | import argparse
7 | import os
8 | from datetime import datetime
9 |
10 | supported_lang_strings = {"en" : "english", "fr" : "french", "de" : "german", "es" : "spanish", "it" : "italian"}
11 |
12 | parser = argparse.ArgumentParser(description='Performs text scaling (assigns a score to each text on a linear scale).')
13 | parser.add_argument('datadir', help='A path to the directory containing the input text files for scaling (one score will be assigned per file).')
14 | parser.add_argument('embs', help='A path to the file containing pre-trained word embeddings')
15 | parser.add_argument('output', help='A file path to which to store the scaling results.')
16 | parser.add_argument('--emb_cutoff', help='A cutoff on the vocabulary size of the embeddings (number of embedding vectors to load).')
17 | parser.add_argument('--stopwords', help='A path to the file containing stopwords')
18 |
19 | args = parser.parse_args()
20 |
21 | if not os.path.isdir(args.datadir):
22 | print("Error: Directory containing the input files not found.")
23 | exit(code = 1)
24 |
25 | if not os.path.isfile(args.embs):
26 | print("Error: File containing pre-trained word embeddings not found.")
27 | exit(code = 1)
28 |
29 | if not os.path.isdir(os.path.dirname(args.output)) and not os.path.dirname(args.output) == "":
30 | print("Error: Directory of the output file does not exist.")
31 | exit(code = 1)
32 |
33 | if not args.emb_cutoff:
34 | args.emb_cutoff = None
35 | 	print("Note: No embedding cutoff provided; the entire embedding vocabulary will be loaded.")
36 | else:
37 | args.emb_cutoff = int(args.emb_cutoff)
38 |
39 | if args.stopwords and not os.path.isfile(args.stopwords):
40 | print("Error: File containing stopwords not found.")
41 | exit(code = 1)
42 |
43 | files = io_helper.load_all_files(args.datadir)
44 | if len(files) < 4:
45 | print("Error: There need to be at least 4 texts for a meaningful scaling.")
46 | exit(code = 1)
47 |
48 | filenames = [x[0] for x in files]
49 | texts = [x[1] for x in files]
50 |
51 | wrong_lang = False
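   | # The first line of each input file must name its language (a key or value of
   | # supported_lang_strings, e.g. "en" or "english"); the remainder of the file is the
   | # text that gets scaled.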
52 | languages = [x.split("\n", 1)[0].strip().lower() for x in texts]
53 | texts = [x.split("\n", 1)[1].strip().lower() for x in texts]
54 | for i in range(len(languages)):
55 | if languages[i] not in supported_lang_strings.keys() and languages[i] not in supported_lang_strings.values():
56 | 		print("Error: Incorrect file format or unspecified/unsupported language in file: " + str(filenames[i]))
57 | wrong_lang = True
58 | if wrong_lang:
59 | exit(code = 2)
60 |
61 | langs = [(l if l in supported_lang_strings.values() else supported_lang_strings[l]) for l in languages]
62 |
63 | if args.stopwords:
64 | stopwords = io_helper.load_lines(args.stopwords)
65 | else:
66 | stopwords = []
67 |
68 | predictions_serialization_path = args.output
69 |
70 | embeddings = text_embeddings.Embeddings()
71 | embeddings.load_embeddings(args.embs, limit = args.emb_cutoff, language = 'default', print_loading = True, skip_first_line = True)
72 | nlp.scale_efficient(filenames, texts, langs, embeddings, predictions_serialization_path, stopwords)
73 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Scaling completed.", flush = True)
--------------------------------------------------------------------------------
/sts/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/sts/__init__.py
--------------------------------------------------------------------------------
/sts/simple_sts.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import numpy as np
3 | import nltk
4 | import math
5 |
6 | def build_tf_idf_indices(texts_tokenized):
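   |     # Term frequencies are normalised by the most frequent term of each document;
   |     # idf(w) = log(number of documents / number of documents containing w).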
7 | idf_index = {}
8 | tf_index = {}
9 | for i in range(len(texts_tokenized)):
10 | tf_index[i] = {}
11 | for j in range(len(texts_tokenized[i])):
12 | w = texts_tokenized[i][j]
13 | if w not in tf_index[i]:
14 | tf_index[i][w] = 1
15 | else:
16 | tf_index[i][w] += 1
17 | if w not in idf_index:
18 | idf_index[w] = {}
19 | if i not in idf_index[w]:
20 | idf_index[w][i] = 1
21 | max_word_freq = max([tf_index[i][x] for x in tf_index[i]])
22 | print("Max word freq: " + str(max_word_freq))
23 | for w in tf_index[i]:
24 | tf_index[i][w] = tf_index[i][w] / max_word_freq
25 | for w in idf_index:
26 | idf_index[w] = math.log(len(texts_tokenized) / len(idf_index[w]))
27 | return tf_index, idf_index
28 |
29 | def fix_tokenization(tokens):
30 | punctuation = [".", ",", "!", ":", "?", ";", "-", ")", "(", "[", "]", "{", "}", "...", "/", "\\", "''", "\"", "'"]
31 | for i in range(len(tokens)):
32 | pcs = [p for p in punctuation if tokens[i].endswith(p)]
33 | if (len(pcs) > 0):
34 | tokens[i] = tokens[i][:-1]
35 | pcs = [p for p in punctuation if tokens[i].startswith(p)]
36 | if (len(pcs) > 0):
37 | tokens[i] = tokens[i][1:]
38 |
39 | def build_vocab(tokens, count_treshold = 1):
40 | print("Building full vocabulary...")
41 | full_vocab = {}
42 | for t in tokens:
43 | if t in full_vocab:
44 | full_vocab[t] = full_vocab[t] + 1
45 | else:
46 | full_vocab[t] = 1
47 |
48 |     print("Thresholding vocabulary...")
49 | vocab = [x for x in full_vocab if full_vocab[x] >= count_treshold]
50 | print("Vocabulary length: " + str(len(vocab)))
51 | print("Building index dicts...")
52 |     word_to_index = { w : i for i, w in enumerate(vocab) }
53 |     index_to_word = { i : w for i, w in enumerate(vocab) }
54 |     print("Building count dict...")
55 |     counts = { word_to_index[w] : full_vocab[w] for w in vocab }
56 | 
57 |     return (word_to_index, index_to_word, counts)
58 |
59 | def simple_tokenize(text, stopwords, lower = True, lang_prefix = None):
60 | print("Tokenizing text...")
61 | punctuation = [".", ",", "!", ":", "?", ";", "-", ")", "(", "[", "]", "{", "}", "...", "/", "\\", "''", "\"", "'"]
62 | toks = [(x.strip().lower() if lower else x.strip()) for x in nltk.word_tokenize(text) if x.strip().lower() not in stopwords and x.strip().lower() not in punctuation]
63 | #toks = [(x.strip().lower() if lower else x.strip()) for x in nltk.word_tokenize(text) if x.strip().lower() not in stopwords and x.strip().lower()]
64 |
65 | fix_tokenization(toks)
66 |
67 | if lang_prefix:
68 | toks = [lang_prefix + "__" + x for x in toks]
69 | return toks
70 |
71 | def aggregate_weighted_text_embedding(embeddings, tf_index, idf_index, lang = "default", weigh_idf = True):
72 | agg_vec = np.zeros(embeddings.emb_sizes[lang])
73 | for t in tf_index:
74 | emb = embeddings.get_vector(lang, t)
75 | if emb is not None:
76 | if weigh_idf:
77 | weight = tf_index[t] * idf_index[t]
78 | else:
79 | weight = tf_index[t]
80 | agg_vec = np.add(agg_vec, np.multiply(weight, emb))
81 | return agg_vec
82 |
83 | def word_movers_distance(embeddings, first_tokens, second_tokens):
84 | return embeddings.wmdistance(first_tokens, second_tokens)
85 |
86 | def greedy_alignment_similarity(embeddings, first_doc, second_doc, lowest_sim = 0.3, length_factor = 0.1):
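   |     # Greedy word alignment between the two documents' vocabularies: repeatedly pick the
   |     # most similar still-available word pair (by embedding similarity), align
   |     # min(remaining counts) occurrences of it and add similarity * aligned_count to the
   |     # score, until one document is exhausted or the best remaining similarity drops below
   |     # lowest_sim. The result mixes a precision-like term (score / shorter document length)
   |     # and a recall-like term (score / longer document length), with recall weighted by length_factor.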
87 | print("Greedy aligning...")
88 | first_vocab, first_vocab_inv, first_counts_cpy = first_doc
89 | second_vocab, second_vocab_inv, second_counts_cpy = second_doc
90 |
91 | if len(first_vocab) == 0 or len(second_vocab) == 0:
92 | return 0
93 |
94 | first_counts = {x : first_counts_cpy[x] for x in first_counts_cpy }
95 | second_counts = {x : second_counts_cpy[x] for x in second_counts_cpy}
96 |
97 | #print("Computing actual document lengths...")
98 | len_first = sum(first_counts_cpy.values())
99 | len_second = sum(second_counts_cpy.values())
100 |
101 | # similarity matrix computation
102 | matrix = np.zeros((len(first_vocab), len(second_vocab)))
103 | print("Computing the similarity matrix...")
104 | #print("Vocab. size first: " + str(len(first_vocab)) + "Vocab. size second: " + str(len(second_vocab)))
105 | cntr = 0
106 | for ft in first_vocab:
107 | cntr += 1
108 | #if cntr % 10 == 0:
109 | # print("First vocab, item: " + str(cntr))
110 | first_index = first_vocab[ft]
111 | for st in second_vocab:
112 | second_index = second_vocab[st]
113 | sim = embeddings.word_similarity(ft, st, "default", "default")
114 | #print("Embedding similarity, " + ft + "::" + st + ": " + str(sim))
115 | matrix[first_index, second_index] = sim
116 |
117 | # greedy alignment
118 | print("Computing the alignment...")
119 | greedy_align_sum = 0.0
120 | counter_left_first = len_first
121 | counter_left_second = len_second
122 | tok_to_align = min(counter_left_first, counter_left_second)
123 | while counter_left_first > 0 and counter_left_second > 0:
124 | new_tok_to_align = min(counter_left_first, counter_left_second)
125 | if new_tok_to_align == tok_to_align or (tok_to_align - new_tok_to_align > 10000):
126 | tok_to_align = new_tok_to_align
127 | print("Left tokens to align: " + str(tok_to_align))
128 | ind = np.argmax(matrix.flatten())
129 | ind_src = ind // matrix.shape[1]
130 | ind_trg = ind % matrix.shape[1]
131 |
132 | simil = matrix[ind_src, ind_trg]
133 | #print("Similarity: " + str(simil))
134 |
135 | #print("Index src: " + str(ind_src) + ", word src: " + str(first_vocab_inv[ind_src].encode(encoding='UTF-8', errors='ignore')))
136 | #print("Index trg: " + str(ind_trg) + ", word trg: " + str(second_vocab_inv[ind_trg].encode(encoding='UTF-8', errors='ignore')))
137 |
138 | if simil < lowest_sim:
139 |             break
140 |
141 | min_freq = min(first_counts[ind_src], second_counts[ind_trg])
142 | greedy_align_sum += simil * min_freq
143 | matrix[ind_src, ind_trg] = -2
144 |
145 | first_counts[ind_src] = first_counts[ind_src] - min_freq
146 | second_counts[ind_trg] = second_counts[ind_trg] - min_freq
147 |
148 | if first_counts[ind_src] == 0:
149 | matrix[ind_src, :] = -2
150 | if second_counts[ind_trg] == 0:
151 | matrix[:, ind_trg] = -2
152 |
153 | counter_left_first = counter_left_first - min_freq
154 | counter_left_second = counter_left_second - min_freq
155 |
156 | prec = greedy_align_sum / min(len_first, len_second)
157 | rec = greedy_align_sum / max(len_first, len_second)
158 | return (((1 - length_factor) * prec) + (length_factor * rec))
--------------------------------------------------------------------------------
/supervised-scaler.py:
--------------------------------------------------------------------------------
1 | from embeddings import text_embeddings
2 | import nlp
3 | from helpers import io_helper
4 | from sts import simple_sts
5 | from sys import stdin
6 | import argparse
7 | import os
8 | from datetime import datetime
9 |
10 | supported_lang_strings = {"en" : "english", "fr" : "french", "de" : "german", "es" : "spanish", "it" : "italian"}
11 |
12 | parser = argparse.ArgumentParser(description='Performs text scaling (assigns a score to each text on a linear scale).')
13 | parser.add_argument('datadir', help='A path to the directory containing the input text files for scaling (one score will be assigned per file).')
14 | parser.add_argument('embs', help='A path to the file containing pre-trained word embeddings')
15 | parser.add_argument('output', help='A file path to which to store the scaling results.')
16 | parser.add_argument('pivot1', help='First pivot')
17 | parser.add_argument('pivot2', help='Second pivot')
18 | parser.add_argument('--stopwords', help='A path to the file containing stopwords')
19 |
20 | args = parser.parse_args()
21 |
22 | if not os.path.isdir(args.datadir):
23 | print("Error: Directory containing the input files not found.")
24 | exit(code = 1)
25 |
26 | if not os.path.isfile(args.embs):
27 | print("Error: File containing pre-trained word embeddings not found.")
28 | exit(code = 1)
29 |
30 | if not os.path.isdir(os.path.dirname(args.output)) and not os.path.dirname(args.output) == "":
31 | print("Error: Directory of the output file does not exist.")
32 | exit(code = 1)
33 |
34 | if not os.path.isdir(os.path.dirname(args.pivot1)) and not os.path.dirname(args.pivot1) == "":
35 | print("Error: pivot1 does not exist.")
36 | exit(code = 1)
37 |
38 | if not os.path.isdir(os.path.dirname(args.pivot2)) and not os.path.dirname(args.pivot2) == "":
39 | print("Error: pivot2 does not exist.")
40 | exit(code = 1)
41 |
42 | if args.stopwords and not os.path.isfile(args.stopwords):
43 | print("Error: File containing stopwords not found.")
44 | exit(code = 1)
45 |
46 | files = io_helper.load_all_files(args.datadir)
47 | if len(files) < 4:
48 | print("Error: There need to be at least 4 texts for a meaningful scaling.")
49 | exit(code = 1)
50 |
51 | filenames = [x[0] for x in files]
52 | texts = [x[1] for x in files]
53 |
54 | wrong_lang = False
55 | languages = [x.split("\n", 1)[0].strip().lower() for x in texts]
56 | texts = [x.split("\n", 1)[1].strip().lower() for x in texts]
57 | for i in range(len(languages)):
58 | if languages[i] not in supported_lang_strings.keys() and languages[i] not in supported_lang_strings.values():
59 | 		print("Error: Incorrect file format or unspecified/unsupported language in file: " + str(filenames[i]))
60 | wrong_lang = True
61 | if wrong_lang:
62 | exit(code = 2)
63 |
64 | langs = [(l if l in supported_lang_strings.values() else supported_lang_strings[l]) for l in languages]
65 |
66 | if args.stopwords:
67 | stopwords = io_helper.load_lines(args.stopwords)
68 | else:
69 | stopwords = []
70 |
71 | predictions_serialization_path = args.output
72 |
73 | pivot1 = args.pivot1
74 | pivot2 = args.pivot2
75 |
76 | embeddings = text_embeddings.Embeddings()
77 | embeddings.load_embeddings(args.embs, limit = 1000000, language = 'default', print_loading = True, skip_first_line = True)
78 | nlp.scale_supervised(filenames, texts, langs, embeddings, predictions_serialization_path,pivot1,pivot2, stopwords)
79 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Scaling completed.", flush = True)
80 |
--------------------------------------------------------------------------------
/wfcode/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codogogo/topfish/6b3f5723029616cb430d6226bc59c013fe79eb78/wfcode/__init__.py
--------------------------------------------------------------------------------
/wfcode/corpus.py:
--------------------------------------------------------------------------------
1 | import nltk
2 | import numpy as np
3 | import math
4 | from scipy import spatial
5 | import time
6 | from sys import stdin
7 | from datetime import datetime
8 |
9 | class Corpus(object):
10 |     """A corpus of (name, text) documents used for occurrence counting and WordFish scaling."""
11 | def __init__(self, documents, docpairs = None):
12 | print("Loading corpus, received: " + str(len(documents)) + " docs.")
13 | self.docs_raw = [d[1] for d in documents]
14 | self.docs_names = [d[0] for d in documents]
15 | self.punctuation = [".", ",", "!", ":", "?", ";", "-", ")", "(", "[", "]", "{", "}", "...", "/", "\\", u"``", "''", "\"", "'", "-", "$" ]
16 | self.doc_pairs = docpairs
17 | self.results = {}
18 |
19 | def tokenize(self, stopwords = None, freq_treshold = 5):
20 | self.stopwords = stopwords
21 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Preprocessing corpus...", flush = True)
22 | self.docs_tokens = [[tok.strip() for tok in nltk.word_tokenize(doc) if tok.strip() not in self.punctuation and len(tok.strip()) > 2] for doc in self.docs_raw]
23 | #self.docs_tokens = [[tok.strip() for tok in nltk.word_tokenize(doc)] for doc in self.docs_raw]
24 |
25 |
26 | self.freq_dicts = []
27 | if self.stopwords is not None:
28 | for i in range(len(self.docs_tokens)):
29 | self.docs_tokens[i] = [tok.strip() for tok in self.docs_tokens[i] if tok.strip().lower() not in self.stopwords]
30 |
31 | def build_occurrences(self):
32 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Building vocabulary...", flush = True)
33 | self.vocabulary = {}
34 | for dt in self.docs_tokens:
35 | for t in dt:
36 | if t not in self.vocabulary:
37 | self.vocabulary[t] = len(self.vocabulary)
38 |
39 |         print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " Building document-term occurrence matrix...", flush = True)
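   |         # Counts start at 1 rather than 0 (add-one smoothing), so taking logs of the
   |         # occurrence matrix in WordfishScaler.initialize() stays finite.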
40 | self.occurrences = np.ones((len(self.docs_tokens), len(self.vocabulary)), dtype = np.float32)
41 | cnt = 0
42 | for i in range(len(self.docs_tokens)):
43 | cnt += 1
44 | print(str(cnt) + "/" + str(len(self.docs_tokens)))
45 | for j in range(len(self.docs_tokens[i])):
46 | word = self.docs_tokens[i][j]
47 | self.occurrences[i][self.vocabulary[word]] += 1
48 | if np.isnan(self.occurrences).any():
49 | raise ValueError("NaN in self.occurrences")
50 |
51 | def set_doc_positions(self, positions):
52 | for i in range(len(self.docs_names)):
53 | self.results[self.docs_names[i]] = positions[i]
54 |
55 | #def compute_semantic_similarities_aggregation(self, aggreg_sim_func, embeddings):
56 | # sims = []
57 | # if self.doc_pairs is None:
58 | # for i in range(len(self.docs_names) - 1):
59 | # for j in range(i + 1, len(self.docs_names)):
60 | # score = aggreg_sim_func(self.freq_dicts[i], self.freq_dicts[j], embeddings, self.docs_langs[i], self.docs_langs[j])
61 | # sims.append((self.docs_names[i], self.docs_names[j], score))
62 | # else:
63 | # for dp in self.doc_pairs:
64 | # i = self.docs_names.index(dp[0])
65 | # j = self.docs_names.index(dp[1])
66 | # score = aggreg_sim_func(self.freq_dicts[i], self.freq_dicts[j], embeddings, self.docs_langs[i], self.docs_langs[j])
67 | # sims.append((self.docs_names[i], self.docs_names[j], score))
68 |
69 | # self.raw_sims = sims
70 | # print("\n Sorted semantic similarities, s1: ")
71 | # self.raw_sims.sort(key=lambda tup: tup[2])
72 | # for s in self.raw_sims:
73 | # print(s[0], s[1], str(s[2]))
74 |
75 | # def compute_semantic_similarities(self, doc_similarity_function, sent_similarity_function, embedding_similarity_function):
76 | # sims = []
77 | # #tasks = []
78 | # start_time = time.time()
79 | # if self.doc_pairs is None:
80 | # for i in range(len(self.docs_names) - 1):
81 | # for j in range(i + 1, len(self.docs_names)):
82 | # #tasks.append((i, j))
83 | # #print(self.docs_names[i], self.docs_names[j])
84 | # score = doc_similarity_function(self.freq_dicts[i], self.freq_dicts[j], sent_similarity_function, embedding_similarity_function, self.docs_langs[i], self.docs_langs[j])
85 | # #sst = SemSimThread(doc_similarity_function, sent_similarity_function, embedding_similarity_function, self.docs_tokens[i], self.docs_tokens[j], self.docs_langs[i], self.docs_langs[j], "Thread-" + self.docs_names[i] + "-" + self.docs_names[j], 1)
86 | # #sst.start()
87 | # #sst.join()
88 | # sims.append((self.docs_names[i], self.docs_names[j], score[0], score[1]))
89 | # #print("Similarity: " + str(sst.result))
90 | # else:
91 | # for dp in self.doc_pairs:
92 | # i = self.docs_names.index(dp[0])
93 | # j = self.docs_names.index(dp[1])
94 | # print("Measuring similarity: " + dp[0] + " :: " + dp[1])
95 | # score = doc_similarity_function(self.freq_dicts[i], self.freq_dicts[j], sent_similarity_function, embedding_similarity_function, self.docs_langs[i], self.docs_langs[j])
96 | # print("Score: " + str(score[0]) + "; " + str(score[1]) + "; " + str(score[2]))
97 | # sims.append((self.docs_names[i], self.docs_names[j], score))
98 |
99 | # end_time = time.time()
100 | # print("Time elapsed: " + str(end_time-start_time))
101 |
102 | # #num_parallel = 10
103 | # #num_batches = math.ceil((1.0*len(tasks)) / (1.0*num_parallel))
104 | # #for i in range(num_batches):
105 | # # start_time = time.time()
106 | # # print("Batch: " + str(i+1) + "/" + str(num_batches))
107 | # # start_range = i * num_parallel
108 | # # end_range = (i+1)*num_parallel if (i+1)*num_parallel < len(tasks) else len(tasks)
109 | # # threads = [SemSimThread(doc_similarity_function, sent_similarity_function, embedding_similarity_function, self.docs_tokens[task[0]], self.docs_tokens[task[1]], self.docs_langs[task[0]], self.docs_langs[task[1]], self.docs_names[task[0]], self.docs_names[task[1]]) for task in tasks[start_range:end_range]]
110 | # # for thr in threads:
111 | # # thr.start()
112 | # # for thr in threads:
113 | # # thr.join()
114 | # # print("Thread results: ")
115 | # # for thr in threads:
116 | # # print(thr.threadID + " " + str(thr.result))
117 | # # sims.append((thr.first_name, thr.second_name, thr.result))
118 | # # end_time = time.time()
119 | # # print("Time elapsed: " + str(end_time-start_time))
120 |
121 |
122 | # #sim = parallel(delayed(doc_similarity_function)) doc_similarity_function(self.docs_tokens[i], self.docs_tokens[j], sent_similarity_function, embedding_similarity_function, self.docs_langs[i], self.docs_langs[j])
123 | # #sims = [(self.docs_names[tasks[i][0]], self.docs_names[tasks[i][1]], threads[i].result) for i in range(len(tasks))]
124 |
125 | # #min_sim = min([x[2] for x in sims])
126 | # #max_sim = max([x[2] for x in sims])
127 | # self.raw_sims = sims
128 | # #self.similarities = [(x[0], x[1], (x[2] - min_sim)/(max_sim - min_sim)) for x in sims]
129 |
130 | # print("\n Sorted semantic similarities, s1: ")
131 | # #self.raw_sims.sort(key=lambda tup: tup[2][0])
132 | # for s in self.raw_sims:
133 | # print(s[0], s[1], str(s[2][0]), str(s[2][1]), str(s[2][2]))
134 |
135 | # def compute_term_similarities(self):
136 | # sims = []
137 | # for i in range(len(self.docs_names) - 1):
138 | # for j in range(i + 1, len(self.docs_names)):
139 | # print(self.docs_names[i], self.docs_names[j])
140 | # sim = 1 - spatial.distance.cosine(self.tf_idf_vectors[i], self.tf_idf_vectors[j])
141 | # print("Term-based similarity: " + str(sim))
142 | # sims.append((self.docs_names[i], self.docs_names[j], sim))
143 | # min_sim = min([x[2] for x in sims])
144 | # max_sim = max([x[2] for x in sims])
145 | # self.raw_sims = sims
146 | # self.similarities = [(x[0], x[1], (x[2] - min_sim)/(max_sim - min_sim)) for x in sims]
147 |
148 | # print("\n Sorted tf-idf similarities: ")
149 | # self.raw_sims.sort(key=lambda tup: tup[2])
150 | # for s in self.raw_sims:
151 | # print(s[0], s[1], str(s[2]))
152 |
153 | #def most_dissimilar_vector(nodes, edges):
154 | # vec = []
155 | # min_score = min([x[2] for x in edges])
156 | # min_pair = ([x for x in edges if x[2] == min_score])[0]
157 | # first_added = False
158 | # for i in range(len(nodes)):
159 | # if nodes[i] == min_pair[0] or nodes[i] == min_pair[1]:
160 | # vec.append(-1 if first_added else 1)
161 | # if not first_added:
162 | # first_added = True
163 | # else:
164 | # vec.append(0)
165 | # return vec
166 |
167 |
168 |
169 |
170 |
171 |
172 |
173 |
--------------------------------------------------------------------------------
/wfcode/scaler.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import math
3 |
4 | class LinearScaler(object):
5 | def __init__(self, items, sims):
6 | self.items = items
7 | self.sims = sims
8 |
9 | def scale(self):
10 | minsim = min([x[2] for x in self.sims])
11 | minsim_edge = ([x for x in self.sims if x[2] == minsim])[0]
12 | ind_first = self.items.index(minsim_edge[0])
13 | ind_second = self.items.index(minsim_edge[1])
14 |
15 | # linear interpolation scaling
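   |         # The two most dissimilar items are pinned to +1 and -1; every other item i gets
   |         # (sim_plus - sim_minus) / (sim_plus + sim_minus), i.e. its position is interpolated
   |         # from its relative similarity to the two pivot items.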
16 | scales = {}
17 | for i in range(len(self.items)):
18 | if i != ind_first and i != ind_second:
19 | sim_minus = ([x[2] for x in self.sims if (x[0] == minsim_edge[1] and x[1] == self.items[i]) or (x[1] == minsim_edge[1] and x[0] == self.items[i])])[0]
20 | sim_plus = ([x[2] for x in self.sims if (x[0] == minsim_edge[0] and x[1] == self.items[i]) or (x[1] == minsim_edge[0] and x[0] == self.items[i])])[0]
21 | scales[self.items[i]] = (-1 * sim_minus) / (sim_minus + sim_plus) + sim_plus / (sim_minus + sim_plus)
22 | elif i == ind_first:
23 | scales[self.items[i]] = 1
24 | elif i == ind_second:
25 | scales[self.items[i]] = -1
26 | return scales
27 |
28 | class WordfishScaler(object):
29 | """implementation of a WordFish-like scaling"""
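   |     # WordFish models the count of word j in document i as Poisson-distributed with
   |     # log-rate alpha_i + psi_j + beta_j * theta_i, where alpha are document fixed effects,
   |     # psi word fixed effects, beta word weights and theta the latent document positions
   |     # that form the scale (see log_expectation below).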
30 | def __init__(self, corpus):
31 | self.corpus = corpus
32 | self.num_docs = len(self.corpus.docs_raw)
33 | self.num_words = len(self.corpus.vocabulary)
34 |
35 | self.alpha_docs = np.zeros(self.num_docs)
36 | self.theta_docs = np.zeros(self.num_docs)
37 | self.beta_words = np.zeros(self.num_words)
38 | self.psi_words = np.zeros(self.num_words)
39 | self.log_expectations = np.zeros((self.num_docs, self.num_words))
40 |
41 | def initialize(self):
42 | print("Initializing...")
43 | # Setting initial values for word fixed effects (psi)
44 | self.psi_words = np.log(np.average(self.corpus.occurrences, axis = 0))
45 |
46 | # Setting initial values for document fixed effects (Alphas)
47 | counts = np.sum(self.corpus.occurrences, axis = 1)
48 | self.alpha_docs = np.log(np.multiply(counts, 1.0 / counts[0]))
49 | print("Alpha docs: ")
50 | print(self.alpha_docs)
51 |
52 |         # Setting initial values for betas and thetas via SVD of the residual log-count matrix
53 | matrix = np.log(np.transpose(self.corpus.occurrences)) - np.transpose(np.repeat(np.expand_dims(self.psi_words, 0), self.num_docs, axis = 0)) - np.repeat(np.expand_dims(self.alpha_docs, 0), self.num_words, axis = 0)
54 | u, s, v = np.linalg.svd(matrix, full_matrices = False, compute_uv = True)
55 | self.beta_words = u[:,0]
56 | self.theta_docs = v[0,:]
57 |
58 | def normalize_positions(self):
59 | self.alpha_docs[0] = 0
60 | self.theta_docs = np.divide((self.theta_docs - np.full((1, self.num_docs), np.mean(self.theta_docs))), np.full((1, self.num_docs), np.std(self.theta_docs)))
61 | self.theta_docs = self.theta_docs[0]
62 |
63 | def train(self, learning_rate, num_iters):
64 | print("Training...")
65 | # Computing the objective and also refreshing lambdas (log-likelihoods) for all pairs of word-document
66 | self.normalize_positions()
67 | obj_score = self.objective()
68 | print("Initial objective score: " + str(obj_score))
69 |
70 | for i in range(num_iters):
71 | # Updating document parameters
72 | alpha_grads, theta_grads = self.gradients_docs()
73 | self.alpha_docs = self.alpha_docs - np.multiply(alpha_grads, learning_rate / self.num_words)
74 | self.theta_docs = self.theta_docs - np.multiply(theta_grads, learning_rate / self.num_words)
75 |
76 | self.normalize_positions()
77 |
78 | #obj_score = self.objective()
79 | #if i % 100 == 0: print("Iteration (primary) " + str(i+1) + ": " + str(obj_score))
80 |
81 | # Updating word parameters
82 | beta_grads, psi_grads = self.gradients_words()
83 | self.beta_words = self.beta_words - np.multiply(beta_grads, learning_rate / self.num_docs)
84 | self.psi_words = self.psi_words - np.multiply(psi_grads, learning_rate / self.num_docs)
85 |
86 | obj_score = self.objective()
87 | if i % 100 == 0:
88 | print("Iteration (secondary) " + str(i+1) + ": " + str(obj_score))
89 |
90 | self.normalize_positions()
91 | self.corpus.set_doc_positions(self.theta_docs)
92 |
93 | def objective(self):
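   |         # Negative Poisson log-likelihood of the observed counts under the current
   |         # parameters (the constant log(n_ij!) term is dropped).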
94 | self.log_expectations = self.log_expectation()
95 | return -1 * np.sum(np.multiply(self.corpus.occurrences, self.log_expectations) - np.exp(self.log_expectations))
96 |
97 | def log_expectation(self):
98 | return np.transpose(np.repeat(np.expand_dims(self.alpha_docs, 0), self.num_words, axis = 0)) + np.repeat(np.expand_dims(self.psi_words, 0), self.num_docs, axis = 0) + np.outer(self.theta_docs, self.beta_words)
99 |
100 | def gradients_words(self):
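   |         # Gradients of the negative log-likelihood w.r.t. the word parameters:
   |         # d/dpsi_j = sum_i (lambda_ij - n_ij), d/dbeta_j = sum_i (lambda_ij - n_ij) * theta_i;
   |         # gradients_docs below is the analogue for alpha_i and theta_i, summing over words.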
101 | psi_grads = np.sum(np.exp(self.log_expectations) - self.corpus.occurrences, axis = 0)
102 | beta_grads = np.sum(np.multiply(np.exp(self.log_expectations) - self.corpus.occurrences, np.transpose(np.repeat(np.expand_dims(self.theta_docs, 0), self.num_words, axis = 0))), axis = 0)
103 | return [beta_grads, psi_grads]
104 |
105 | def gradients_docs(self):
106 | alpha_grads = np.sum(np.exp(self.log_expectations) - self.corpus.occurrences, axis = 1)
107 | theta_grads = np.sum(np.multiply(np.exp(self.log_expectations) - self.corpus.occurrences, np.repeat(np.expand_dims(self.beta_words, 0), self.num_docs, axis = 0)), axis = 1)
108 | return [alpha_grads, theta_grads]
--------------------------------------------------------------------------------
/wordfish.py:
--------------------------------------------------------------------------------
1 | from helpers import io_helper
2 | from wfcode import corpus
3 | from wfcode import scaler
4 | import argparse
5 | import os
6 | from datetime import datetime
7 |
8 | parser = argparse.ArgumentParser(description='Performs WordFish scaling of input texts (assigns a score to each text on a linear scale).')
9 | parser.add_argument('datadir', help='A path to the directory containing the input text files for scaling (one score will be assigned per file).')
10 | parser.add_argument('output', help='A file path to which to store the scaling results.')
11 | parser.add_argument('--stopwords', help='A path to the file containing stopwords')
12 | parser.add_argument('-f', '--freqthold', type=int, help='A frequency threshold -- all words appearing fewer than this many times will be ignored (default 2)')
13 | parser.add_argument('-l', '--learnrate', type=float, help='Learning rate value (default = 0.00001)')
14 | parser.add_argument('-t', '--trainiters', type=int, help='Number of optimization iterations (default = 5000)')
15 |
16 | args = parser.parse_args()
17 |
18 | if args.trainiters:
19 | niter = args.trainiters
20 | else:
21 | niter = 5000
22 |
23 | if args.learnrate:
24 | lr = args.learnrate
25 | else:
26 | lr = 0.00001
27 |
28 | if args.freqthold:
29 | ft = args.freqthold
30 | else:
31 | ft = 2
32 |
33 | if not os.path.isdir(args.datadir):
34 | print("Error: Directory containing the input files not found.")
35 | exit(code = 1)
36 |
37 | if not os.path.isdir(os.path.dirname(args.output)) and not os.path.dirname(args.output) == "":
38 | print("Error: Directory of the output file does not exist.")
39 | exit(code = 1)
40 |
41 | if args.stopwords and not os.path.isfile(args.stopwords):
42 | print("Error: File containing stopwords not found.")
43 | exit(code = 1)
44 |
45 | if args.stopwords:
46 | stopwords = io_helper.load_file_lines(args.stopwords)
47 | else:
48 | stopwords = None
49 |
50 | files = io_helper.load_all_files(args.datadir)
51 | corp = corpus.Corpus(files)
52 | corp.tokenize(stopwords = stopwords, freq_treshold = ft)
53 | corp.build_occurrences()
54 |
55 | wf_scaler = scaler.WordfishScaler(corp)
56 | wf_scaler.initialize()
57 | wf_scaler.train(learning_rate = lr, num_iters = niter)
58 |
59 | print(datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " WordFish scaling completed.", flush = True)
60 |
61 | scale = []
62 | for x in corp.results:
63 | scale.append(str(x) + "\t" + str(corp.results[x]))
64 | io_helper.write_list(args.output, scale)
65 |
--------------------------------------------------------------------------------