├── README.md
├── conf
│   └── fcepublic.conf
├── conlleval.py
├── evaluator.py
├── experiment.py
├── labeler.py
└── print_output.py

/README.md:
--------------------------------------------------------------------------------
Sequence labeler
=========================

This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. It can be used for named entity recognition, POS-tagging, error detection, chunking, CCG supertagging, etc.

The main model implements a bidirectional LSTM for sequence tagging. In addition, you can incorporate character-level information -- either by concatenating a character-based representation with the word embedding, or by combining the two with an attention/gating mechanism.
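As a rough sketch of the attention/gating combination (the "attention" option described below; see `construct_network` in labeler.py and Rei et al. (2016)): writing x for the word embedding, c for the character-based representation, and W1, W2 as shorthand for the two dense layers, the model computes

    z = sigmoid(W2 · tanh(W1 · [x; c]))
    combined = z ⊙ x + (1 − z) ⊙ c

where [x; c] is the concatenation of the two vectors and ⊙ is elementwise multiplication, so the gate z decides, per dimension, how much to trust the word embedding versus the character-based representation.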
Run with:

    python experiment.py config.conf

Preferably with TensorFlow set up to use CUDA, so the process can run on a GPU. The script will train the model on the training data, test it on the test data, and print various evaluation metrics.

Note: The original sequence labeler was implemented in Theano, but since Theano support is being discontinued, I have reimplemented it in TensorFlow. I also used the opportunity to refactor the code a bit, and it should be better in every way. However, if you need the specific code used in previously published papers, you'll need to refer to older commits.

Requirements
-------------------------

* python (tested with 2.7.12 and 3.5.2)
* numpy (tested with 1.13.3 and 1.14.0)
* tensorflow (tested with 1.3.0 and 1.4.1)


Data format
-------------------------

The training and test data is expected in the standard CoNLL-type tab-separated format: one word per line, separate columns for the token and the label, and an empty line between sentences.

For error detection, this would be something like:

    I       c
    saws    i
    the     c
    show    c


The first column is assumed to be the token and the last column is the label. There can be other columns in the middle, which are currently not used. For example:

    EU      NNP     I-NP    S-ORG
    rejects VBZ     I-VP    O
    German  JJ      I-NP    S-MISC
    call    NN      I-NP    O
    to      TO      I-VP    O
    boycott VB      I-VP    O
    British JJ      I-NP    S-MISC
    lamb    NN      I-NP    O
    .       .       O       O


Configuration
-------------------------

Edit the values in config.conf as needed:

* **path_train** - Path to the training data, in CoNLL tab-separated format. One word per line, first column is the word, last column is the label. Empty lines between sentences.
* **path_dev** - Path to the development data, used for choosing the best epoch.
* **path_test** - Path to the test file. Can contain multiple files, colon-separated.
* **conll_eval** - Whether the standard CoNLL NER evaluation should be run.
* **main_label** - The output label for which precision/recall/F-measure are calculated. Does not affect accuracy or the measures from the CoNLL eval.
* **model_selector** - What is measured on the dev set for model selection: "dev_conll_f:high" for NER and chunking, "dev_acc:high" for POS-tagging, "dev_f05:high" for error detection.
* **preload_vectors** - Path to the pretrained word embeddings, in word2vec plain-text format. If your embeddings are in binary, you can use [convertvec](https://github.com/marekrei/convertvec) to convert them to plain text.
* **word_embedding_size** - Size of the word embeddings used in the model.
* **crf_on_top** - If True, use a CRF as the output layer. If False, use softmax instead.
* **emb_initial_zero** - Whether word embeddings should be initialised with zeros instead of the chosen initializer.
* **train_embeddings** - Whether word embeddings should be updated during training.
* **char_embedding_size** - Size of the character embeddings.
* **word_recurrent_size** - Size of the word-level LSTM hidden layers.
* **char_recurrent_size** - Size of the char-level LSTM hidden layers.
* **hidden_layer_size** - Size of the extra hidden layer on top of the bi-LSTM.
* **char_hidden_layer_size** - Size of the extra hidden layer on top of the character-based component.
* **lowercase** - Whether words should be lowercased when mapping to word embeddings.
* **replace_digits** - Whether all digits should be replaced by 0.
* **min_word_freq** - Minimal frequency of words to be included in the vocabulary. Others will be considered OOV.
* **singletons_prob** - The probability of mapping words that appear only once in the training data to OOV during training.
* **allowed_word_length** - Maximum allowed word length; longer words are clipped to this length. Can be necessary if the text contains unreasonably long tokens, e.g. URLs.
* **max_train_sent_length** - Discard sentences longer than this limit when training.
* **vocab_include_devtest** - Load words from the dev and test sets into the vocabulary as well. If they don't appear in the training set, they will keep the default representations from the preloaded embeddings.
* **vocab_only_embedded** - Whether the vocabulary should contain only words in the pretrained embedding set.
* **initializer** - The method used to initialize weight matrices in the network.
* **opt_strategy** - The method used for weight updates.
* **learningrate** - Learning rate.
* **clip** - Gradient clipping threshold (global norm); 0.0 disables clipping.
* **batch_equal_size** - Whether to create batches in which all sentences have equal length.
* **max_batch_size** - Maximum number of sentences in a batch. If negative, batches are constructed so that each contains at most abs(max_batch_size) words.
* **epochs** - Maximum number of epochs to run.
* **stop_if_no_improvement_for_epochs** - Training is stopped if there has been no improvement for this many epochs.
* **learningrate_decay** - If performance hasn't improved for 3 epochs, multiply the learning rate by this value.
* **dropout_input** - The keep probability when applying dropout to the word representations; 1.0 means no dropout.
* **dropout_word_lstm** - The keep probability when applying dropout to the word-level LSTM outputs.
* **tf_per_process_gpu_memory_fraction** - The fraction of GPU memory that the process is allowed to use.
* **tf_allow_growth** - Whether the GPU memory usage can grow dynamically.
* **main_cost** - Weight of the main sequence labeling objective.
* **lmcost_max_vocab_size** - Maximum vocabulary size for the language modeling loss. The remaining words are mapped to a single entry.
* **lmcost_hidden_layer_size** - Hidden layer size for the language modeling loss.
* **lmcost_lstm_gamma**, **lmcost_joint_lstm_gamma**, **lmcost_char_gamma**, **lmcost_joint_char_gamma** - Weights for the language modeling objectives, applied to the word-level LSTM or the character-level component, predicting the two directions either separately or jointly; 0.0 disables the corresponding objective.
* **char_integration_method** - How character information is integrated. Options are: "none" (not integrated), "concat" (concatenated), "attention" (the method proposed in Rei et al. (2016)).
* **save** - Path for saving the model.
* **load** - Path for loading the model.
* **garbage_collection** - Whether garbage collection is called explicitly. Makes things slower, but can help fit bigger models into memory.
* **lstm_use_peepholes** - Whether to use the LSTM implementation with peephole connections.
* **random_seed** - Random seed for initialisation and data shuffling. This can affect results, so for robust conclusions I recommend running multiple experiments with different seeds and averaging the metrics.
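The configuration file itself is a plain INI-style file with a single `[config]` section, read with Python's configparser (see `parse_config` in experiment.py). `conf/fcepublic.conf`, included below, is a complete example; an abridged fragment looks like this:

    [config]
    path_train = fce-public.train.original.tsv
    path_dev = fce-public.dev.original.tsv
    path_test = fce-public.dev.original.tsv:fce-public.test.original.tsv
    model_selector = dev_f05:high
    preload_vectors = embeddings/glove/glove.6B.300d.txt
    char_integration_method = concat
    epochs = 200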

Printing output
-------------------------

There is now a separate script for loading a saved model and using it to print output for a given input file. Use the **save** option in the config file for saving the model. The input file needs to be in the same format as the training data (one word per line, labels in a separate column). A label column is still expected in the input; if you don't know the correct labels, just put any valid label in that column.

To print the output, run:

    python print_output.py labels model_file input_file

This will print the input file to standard output, with an extra column at the end that shows the prediction.

You can also use:

    python print_output.py probs model_file input_file

This will print the individual probabilities for each of the possible labels.
If the model is using a CRF, the *probs* option will output unnormalised state scores, without taking the transitions into account.


References
-------------------------

The main sequence labeling model is described here:

[**Compositional Sequence Labeling Models for Error Detection in Learner Writing**](http://aclweb.org/anthology/P/P16/P16-1112.pdf)
Marek Rei and Helen Yannakoudakis
*In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016)*


The character-level component is described here:

[**Attending to characters in neural sequence labeling models**](https://aclweb.org/anthology/C/C16/C16-1030.pdf)
Marek Rei, Gamal K.O. Crichton and Sampo Pyysalo
*In Proceedings of the 26th International Conference on Computational Linguistics (COLING-2016)*

The language modeling objective is described here:

[**Semi-supervised Multitask Learning for Sequence Labeling**](https://arxiv.org/abs/1704.07156)
Marek Rei
*In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-2017)*

The CRF implementation is based on:

[**Neural Architectures for Named Entity Recognition**](https://arxiv.org/abs/1603.01360)
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami and Chris Dyer
*In Proceedings of NAACL-HLT 2016*


The conlleval.py script is from: https://github.com/spyysalo/conlleval.py


License
---------------------------

The code is distributed under the Affero General Public License 3 (AGPL-3.0) by default.
If you wish to use it under a different license, feel free to get in touch.

Copyright (c) 2018 Marek Rei

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the 176 | GNU Affero General Public License for more details. 177 | -------------------------------------------------------------------------------- /conf/fcepublic.conf: -------------------------------------------------------------------------------- 1 | [config] 2 | dataset = fcepublic 3 | path_train = fce-public.train.original.tsv 4 | path_dev = fce-public.dev.original.tsv 5 | path_test = fce-public.dev.original.tsv:fce-public.test.original.tsv:nucle.test0.original.tsv:nucle.test1.original.tsv 6 | conll_eval = False 7 | main_label = i 8 | model_selector = dev_f05:high 9 | preload_vectors = embeddings/glove/glove.6B.300d.txt 10 | word_embedding_size = 300 11 | crf_on_top = False 12 | emb_initial_zero = False 13 | train_embeddings = True 14 | char_embedding_size = 100 15 | word_recurrent_size = 300 16 | char_recurrent_size = 100 17 | hidden_layer_size = 50 18 | char_hidden_layer_size = 50 19 | lowercase = True 20 | replace_digits = True 21 | min_word_freq = -1 22 | singletons_prob = 0.1 23 | allowed_word_length = -1 24 | max_train_sent_length = -1 25 | vocab_include_devtest = True 26 | vocab_only_embedded = False 27 | initializer = glorot 28 | opt_strategy = adadelta 29 | learningrate = 1.0 30 | clip = 0.0 31 | batch_equal_size = False 32 | max_batch_size = 32 33 | epochs = 200 34 | stop_if_no_improvement_for_epochs = 7 35 | learningrate_decay = 0.9 36 | dropout_input = 0.5 37 | dropout_word_lstm = 0.5 38 | tf_per_process_gpu_memory_fraction = 1.0 39 | tf_allow_growth = True 40 | main_cost = 1.0 41 | lmcost_max_vocab_size = 7500 42 | lmcost_hidden_layer_size = 50 43 | lmcost_lstm_gamma = 0.1 44 | lmcost_joint_lstm_gamma = 0.0 45 | lmcost_char_gamma = 0.0 46 | lmcost_joint_char_gamma = 0.0 47 | char_attention_cosine_cost = 1.0 48 | char_integration_method = concat 49 | save = 50 | load = 51 | garbage_collection = False 52 | lstm_use_peepholes = False 53 | random_seed = 100 54 | -------------------------------------------------------------------------------- /conlleval.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Python version of the evaluation script from CoNLL'00- 4 | # Originates from: https://github.com/spyysalo/conlleval.py 5 | 6 | 7 | # Intentional differences: 8 | # - accept any space as delimiter by default 9 | # - optional file argument (default STDIN) 10 | # - option to set boundary (-b argument) 11 | # - LaTeX output (-l argument) not supported 12 | # - raw tags (-r argument) not supported 13 | 14 | import sys 15 | import re 16 | 17 | from collections import defaultdict, namedtuple 18 | 19 | ANY_SPACE = '' 20 | 21 | class FormatError(Exception): 22 | pass 23 | 24 | Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore') 25 | 26 | class EvalCounts(object): 27 | def __init__(self): 28 | self.correct_chunk = 0 # number of correctly identified chunks 29 | self.correct_tags = 0 # number of correct chunk tags 30 | self.found_correct = 0 # number of chunks in corpus 31 | self.found_guessed = 0 # number of identified chunks 32 | self.token_counter = 0 # token counter (ignores sentence breaks) 33 | 34 | # counts by type 35 | self.t_correct_chunk = defaultdict(int) 36 | self.t_found_correct = defaultdict(int) 37 | self.t_found_guessed = defaultdict(int) 38 | 39 | def parse_args(argv): 40 | import argparse 41 | parser = argparse.ArgumentParser( 42 | description='evaluate tagging results using CoNLL criteria', 43 | formatter_class=argparse.ArgumentDefaultsHelpFormatter 44 | ) 45 | arg = parser.add_argument 
46 | arg('-b', '--boundary', metavar='STR', default='-X-', 47 | help='sentence boundary') 48 | arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE, 49 | help='character delimiting items in input') 50 | arg('-o', '--otag', metavar='CHAR', default='O', 51 | help='alternative outside tag') 52 | arg('file', nargs='?', default=None) 53 | return parser.parse_args(argv) 54 | 55 | def parse_tag(t): 56 | m = re.match(r'^([^-]*)-(.*)$', t) 57 | return m.groups() if m else (t, '') 58 | 59 | def evaluate(iterable, options=None): 60 | if options is None: 61 | options = parse_args([]) # use defaults 62 | 63 | counts = EvalCounts() 64 | num_features = None # number of features per line 65 | in_correct = False # currently processed chunks is correct until now 66 | last_correct = 'O' # previous chunk tag in corpus 67 | last_correct_type = '' # type of previously identified chunk tag 68 | last_guessed = 'O' # previously identified chunk tag 69 | last_guessed_type = '' # type of previous chunk tag in corpus 70 | 71 | for line in iterable: 72 | line = line.rstrip('\r\n') 73 | 74 | if options.delimiter == ANY_SPACE: 75 | features = line.split() 76 | else: 77 | features = line.split(options.delimiter) 78 | 79 | if num_features is None: 80 | num_features = len(features) 81 | elif num_features != len(features) and len(features) != 0: 82 | raise FormatError('unexpected number of features: %d (%d)' % 83 | (len(features), num_features)) 84 | 85 | if len(features) == 0 or features[0] == options.boundary: 86 | features = [options.boundary, 'O', 'O'] 87 | if len(features) < 3: 88 | raise FormatError('unexpected number of features in line %s' % line) 89 | 90 | guessed, guessed_type = parse_tag(features.pop()) 91 | correct, correct_type = parse_tag(features.pop()) 92 | first_item = features.pop(0) 93 | 94 | if first_item == options.boundary: 95 | guessed = 'O' 96 | 97 | end_correct = end_of_chunk(last_correct, correct, 98 | last_correct_type, correct_type) 99 | end_guessed = end_of_chunk(last_guessed, guessed, 100 | last_guessed_type, guessed_type) 101 | start_correct = start_of_chunk(last_correct, correct, 102 | last_correct_type, correct_type) 103 | start_guessed = start_of_chunk(last_guessed, guessed, 104 | last_guessed_type, guessed_type) 105 | 106 | if in_correct: 107 | if (end_correct and end_guessed and 108 | last_guessed_type == last_correct_type): 109 | in_correct = False 110 | counts.correct_chunk += 1 111 | counts.t_correct_chunk[last_correct_type] += 1 112 | elif (end_correct != end_guessed or guessed_type != correct_type): 113 | in_correct = False 114 | 115 | if start_correct and start_guessed and guessed_type == correct_type: 116 | in_correct = True 117 | 118 | if start_correct: 119 | counts.found_correct += 1 120 | counts.t_found_correct[correct_type] += 1 121 | if start_guessed: 122 | counts.found_guessed += 1 123 | counts.t_found_guessed[guessed_type] += 1 124 | if first_item != options.boundary: 125 | if correct == guessed and guessed_type == correct_type: 126 | counts.correct_tags += 1 127 | counts.token_counter += 1 128 | 129 | last_guessed = guessed 130 | last_correct = correct 131 | last_guessed_type = guessed_type 132 | last_correct_type = correct_type 133 | 134 | if in_correct: 135 | counts.correct_chunk += 1 136 | counts.t_correct_chunk[last_correct_type] += 1 137 | 138 | return counts 139 | 140 | def uniq(iterable): 141 | seen = set() 142 | return [i for i in iterable if not (i in seen or seen.add(i))] 143 | 144 | def calculate_metrics(correct, guessed, total): 145 | tp, fp, fn = correct, 
guessed-correct, total-correct 146 | p = 0 if tp + fp == 0 else 1.*tp / (tp + fp) 147 | r = 0 if tp + fn == 0 else 1.*tp / (tp + fn) 148 | f = 0 if p + r == 0 else 2 * p * r / (p + r) 149 | return Metrics(tp, fp, fn, p, r, f) 150 | 151 | def metrics(counts): 152 | c = counts 153 | overall = calculate_metrics( 154 | c.correct_chunk, c.found_guessed, c.found_correct 155 | ) 156 | by_type = {} 157 | for t in uniq(list(c.t_found_correct.keys()) + list(c.t_found_guessed.keys())): 158 | by_type[t] = calculate_metrics( 159 | c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t] 160 | ) 161 | return overall, by_type 162 | 163 | def report(counts, out=None): 164 | if out is None: 165 | out = sys.stdout 166 | 167 | overall, by_type = metrics(counts) 168 | 169 | c = counts 170 | out.write('processed %d tokens with %d phrases; ' % 171 | (c.token_counter, c.found_correct)) 172 | out.write('found: %d phrases; correct: %d.\n' % 173 | (c.found_guessed, c.correct_chunk)) 174 | 175 | if c.token_counter > 0: 176 | out.write('accuracy: %6.2f%%; ' % 177 | (100.*c.correct_tags/c.token_counter)) 178 | out.write('precision: %6.2f%%; ' % (100.*overall.prec)) 179 | out.write('recall: %6.2f%%; ' % (100.*overall.rec)) 180 | out.write('FB1: %6.2f\n' % (100.*overall.fscore)) 181 | 182 | for i, m in sorted(by_type.items()): 183 | out.write('%17s: ' % i) 184 | out.write('precision: %6.2f%%; ' % (100.*m.prec)) 185 | out.write('recall: %6.2f%%; ' % (100.*m.rec)) 186 | out.write('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i])) 187 | 188 | def end_of_chunk(prev_tag, tag, prev_type, type_): 189 | # check if a chunk ended between the previous and current word 190 | # arguments: previous and current chunk tags, previous and current types 191 | chunk_end = False 192 | 193 | if prev_tag == 'E': chunk_end = True 194 | if prev_tag == 'S': chunk_end = True 195 | 196 | if prev_tag == 'B' and tag == 'B': chunk_end = True 197 | if prev_tag == 'B' and tag == 'S': chunk_end = True 198 | if prev_tag == 'B' and tag == 'O': chunk_end = True 199 | if prev_tag == 'I' and tag == 'B': chunk_end = True 200 | if prev_tag == 'I' and tag == 'S': chunk_end = True 201 | if prev_tag == 'I' and tag == 'O': chunk_end = True 202 | 203 | if prev_tag != 'O' and prev_tag != '.' and prev_type != type_: 204 | chunk_end = True 205 | 206 | # these chunks are assumed to have length 1 207 | if prev_tag == ']': chunk_end = True 208 | if prev_tag == '[': chunk_end = True 209 | 210 | return chunk_end 211 | 212 | def start_of_chunk(prev_tag, tag, prev_type, type_): 213 | # check if a chunk started between the previous and current word 214 | # arguments: previous and current chunk tags, previous and current types 215 | chunk_start = False 216 | 217 | if tag == 'B': chunk_start = True 218 | if tag == 'S': chunk_start = True 219 | 220 | if prev_tag == 'E' and tag == 'E': chunk_start = True 221 | if prev_tag == 'E' and tag == 'I': chunk_start = True 222 | if prev_tag == 'S' and tag == 'E': chunk_start = True 223 | if prev_tag == 'S' and tag == 'I': chunk_start = True 224 | if prev_tag == 'O' and tag == 'E': chunk_start = True 225 | if prev_tag == 'O' and tag == 'I': chunk_start = True 226 | 227 | if tag != 'O' and tag != '.' 
and prev_type != type_: 228 | chunk_start = True 229 | 230 | # these chunks are assumed to have length 1 231 | if tag == '[': chunk_start = True 232 | if tag == ']': chunk_start = True 233 | 234 | return chunk_start 235 | 236 | def main(argv): 237 | args = parse_args(argv[1:]) 238 | 239 | if args.file is None: 240 | counts = evaluate(sys.stdin, args) 241 | else: 242 | with open(args.file) as f: 243 | counts = evaluate(f, args) 244 | report(counts) 245 | 246 | if __name__ == '__main__': 247 | sys.exit(main(sys.argv)) 248 | -------------------------------------------------------------------------------- /evaluator.py: -------------------------------------------------------------------------------- 1 | import time 2 | import collections 3 | import numpy 4 | import conlleval 5 | 6 | class SequenceLabelingEvaluator(object): 7 | def __init__(self, main_label, label2id, conll_eval=False): 8 | self.main_label = main_label 9 | self.label2id = label2id 10 | self.conll_eval = conll_eval 11 | self.main_label_id = self.label2id[self.main_label] 12 | 13 | self.cost_sum = 0.0 14 | self.correct_sum = 0.0 15 | self.main_predicted_count = 0 16 | self.main_total_count = 0 17 | self.main_correct_count = 0 18 | self.token_count = 0 19 | self.start_time = time.time() 20 | 21 | self.id2label = collections.OrderedDict() 22 | for label in self.label2id: 23 | self.id2label[self.label2id[label]] = label 24 | 25 | self.conll_format = [] 26 | 27 | def append_data(self, cost, batch, predicted_labels): 28 | self.cost_sum += cost 29 | for i in range(len(batch)): 30 | for j in range(len(batch[i])): 31 | token = batch[i][j][0] 32 | gold_label = batch[i][j][-1] 33 | predicted_label = self.id2label[predicted_labels[i][j]] 34 | 35 | self.token_count += 1 36 | if gold_label == predicted_label: 37 | self.correct_sum += 1 38 | if predicted_label == self.main_label: 39 | self.main_predicted_count += 1 40 | if gold_label == self.main_label: 41 | self.main_total_count += 1 42 | if predicted_label == gold_label and gold_label == self.main_label: 43 | self.main_correct_count += 1 44 | 45 | self.conll_format.append(token + "\t" + gold_label + "\t" + predicted_label) 46 | self.conll_format.append("") 47 | 48 | 49 | def get_results(self, name): 50 | p = (float(self.main_correct_count) / float(self.main_predicted_count)) if (self.main_predicted_count > 0) else 0.0 51 | r = (float(self.main_correct_count) / float(self.main_total_count)) if (self.main_total_count > 0) else 0.0 52 | f = (2.0 * p * r / (p + r)) if (p+r > 0.0) else 0.0 53 | f05 = ((1.0 + 0.5*0.5) * p * r / ((0.5*0.5 * p) + r)) if (p+r > 0.0) else 0.0 54 | 55 | results = collections.OrderedDict() 56 | results[name + "_cost_avg"] = self.cost_sum / float(self.token_count) 57 | results[name + "_cost_sum"] = self.cost_sum 58 | results[name + "_main_predicted_count"] = self.main_predicted_count 59 | results[name + "_main_total_count"] = self.main_total_count 60 | results[name + "_main_correct_count"] = self.main_correct_count 61 | results[name + "_p"] = p 62 | results[name + "_r"] = r 63 | results[name + "_f"] = f 64 | results[name + "_f05"] = f05 65 | results[name + "_accuracy"] = self.correct_sum / float(self.token_count) 66 | results[name + "_token_count"] = self.token_count 67 | results[name + "_time"] = float(time.time()) - float(self.start_time) 68 | 69 | if self.label2id is not None and self.conll_eval == True: 70 | conll_counts = conlleval.evaluate(self.conll_format) 71 | conll_metrics_overall, conll_metrics_by_type = conlleval.metrics(conll_counts) 72 | results[name + 
"_conll_accuracy"] = float(conll_counts.correct_tags) / float(conll_counts.token_counter) 73 | results[name + "_conll_p"] = conll_metrics_overall.prec 74 | results[name + "_conll_r"] = conll_metrics_overall.rec 75 | results[name + "_conll_f"] = conll_metrics_overall.fscore 76 | # for i, m in sorted(conll_metrics_by_type.items()): 77 | # results[name + "_conll_p_" + str(i)] = m.prec 78 | # results[name + "_conll_r_" + str(i)] = m.rec 79 | # results[name + "_conll_f_" + str(i)] = m.fscore #str(m.fscore) + " " + str(conll_counts.t_found_guessed[i]) 80 | 81 | return results 82 | 83 | 84 | 85 | -------------------------------------------------------------------------------- /experiment.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import collections 3 | import numpy 4 | import random 5 | import math 6 | import os 7 | import gc 8 | 9 | try: 10 | import ConfigParser as configparser 11 | except: 12 | import configparser 13 | 14 | 15 | from labeler import SequenceLabeler 16 | from evaluator import SequenceLabelingEvaluator 17 | 18 | def read_input_files(file_paths, max_sentence_length=-1): 19 | """ 20 | Reads input files in whitespace-separated format. 21 | Will split file_paths on comma, reading from multiple files. 22 | The format assumes the first column is the word, the last column is the label. 23 | """ 24 | sentences = [] 25 | line_length = None 26 | for file_path in file_paths.strip().split(","): 27 | with open(file_path, "r") as f: 28 | sentence = [] 29 | for line in f: 30 | line = line.strip() 31 | if len(line) > 0: 32 | line_parts = line.split() 33 | assert(len(line_parts) >= 2) 34 | assert(len(line_parts) == line_length or line_length == None) 35 | line_length = len(line_parts) 36 | sentence.append(line_parts) 37 | elif len(line) == 0 and len(sentence) > 0: 38 | if max_sentence_length <= 0 or len(sentence) <= max_sentence_length: 39 | sentences.append(sentence) 40 | sentence = [] 41 | if len(sentence) > 0: 42 | if max_sentence_length <= 0 or len(sentence) <= max_sentence_length: 43 | sentences.append(sentence) 44 | return sentences 45 | 46 | 47 | 48 | def parse_config(config_section, config_path): 49 | """ 50 | Reads configuration from the file and returns a dictionary. 51 | Tries to guess the correct datatype for each of the config values. 52 | """ 53 | config_parser = configparser.SafeConfigParser(allow_no_value=True) 54 | config_parser.read(config_path) 55 | config = collections.OrderedDict() 56 | for key, value in config_parser.items(config_section): 57 | if value is None or len(value.strip()) == 0: 58 | config[key] = None 59 | elif value.lower() in ["true", "false"]: 60 | config[key] = config_parser.getboolean(config_section, key) 61 | elif value.isdigit(): 62 | config[key] = config_parser.getint(config_section, key) 63 | elif is_float(value): 64 | config[key] = config_parser.getfloat(config_section, key) 65 | else: 66 | config[key] = config_parser.get(config_section, key) 67 | return config 68 | 69 | 70 | def is_float(value): 71 | """ 72 | Check in value is of type float() 73 | """ 74 | try: 75 | float(value) 76 | return True 77 | except ValueError: 78 | return False 79 | 80 | 81 | def create_batches_of_sentence_ids(sentences, batch_equal_size, max_batch_size): 82 | """ 83 | Groups together sentences into batches 84 | If batch_equal_size is True, make all sentences in a batch be equal length. 85 | If max_batch_size is positive, this value determines the maximum number of sentences in each batch. 
86 | If max_batch_size has a negative value, the function dynamically creates the batches such that each batch contains abs(max_batch_size) words. 87 | Returns a list of lists with sentences ids. 88 | """ 89 | batches_of_sentence_ids = [] 90 | if batch_equal_size == True: 91 | sentence_ids_by_length = collections.OrderedDict() 92 | sentence_length_sum = 0.0 93 | for i in range(len(sentences)): 94 | length = len(sentences[i]) 95 | if length not in sentence_ids_by_length: 96 | sentence_ids_by_length[length] = [] 97 | sentence_ids_by_length[length].append(i) 98 | 99 | for sentence_length in sentence_ids_by_length: 100 | if max_batch_size > 0: 101 | batch_size = max_batch_size 102 | else: 103 | batch_size = int((-1.0 * max_batch_size) / sentence_length) 104 | 105 | for i in range(0, len(sentence_ids_by_length[sentence_length]), batch_size): 106 | batches_of_sentence_ids.append(sentence_ids_by_length[sentence_length][i:i + batch_size]) 107 | else: 108 | current_batch = [] 109 | max_sentence_length = 0 110 | for i in range(len(sentences)): 111 | current_batch.append(i) 112 | if len(sentences[i]) > max_sentence_length: 113 | max_sentence_length = len(sentences[i]) 114 | if (max_batch_size > 0 and len(current_batch) >= max_batch_size) \ 115 | or (max_batch_size <= 0 and len(current_batch)*max_sentence_length >= (-1 * max_batch_size)): 116 | batches_of_sentence_ids.append(current_batch) 117 | current_batch = [] 118 | max_sentence_length = 0 119 | if len(current_batch) > 0: 120 | batches_of_sentence_ids.append(current_batch) 121 | return batches_of_sentence_ids 122 | 123 | 124 | 125 | def process_sentences(data, labeler, is_training, learningrate, config, name): 126 | """ 127 | Process all the sentences with the labeler, return evaluation metrics. 128 | """ 129 | evaluator = SequenceLabelingEvaluator(config["main_label"], labeler.label2id, config["conll_eval"]) 130 | batches_of_sentence_ids = create_batches_of_sentence_ids(data, config["batch_equal_size"], config["max_batch_size"]) 131 | if is_training == True: 132 | random.shuffle(batches_of_sentence_ids) 133 | 134 | for sentence_ids_in_batch in batches_of_sentence_ids: 135 | batch = [data[i] for i in sentence_ids_in_batch] 136 | cost, predicted_labels, predicted_probs = labeler.process_batch(batch, is_training, learningrate) 137 | 138 | evaluator.append_data(cost, batch, predicted_labels) 139 | 140 | word_ids, char_ids, char_mask, label_ids = None, None, None, None 141 | while config["garbage_collection"] == True and gc.collect() > 0: 142 | pass 143 | 144 | results = evaluator.get_results(name) 145 | for key in results: 146 | print(key + ": " + str(results[key])) 147 | 148 | return results 149 | 150 | 151 | 152 | def run_experiment(config_path): 153 | config = parse_config("config", config_path) 154 | temp_model_path = config_path + ".model" 155 | if "random_seed" in config: 156 | random.seed(config["random_seed"]) 157 | numpy.random.seed(config["random_seed"]) 158 | 159 | for key, val in config.items(): 160 | print(str(key) + ": " + str(val)) 161 | 162 | data_train, data_dev, data_test = None, None, None 163 | if config["path_train"] != None and len(config["path_train"]) > 0: 164 | data_train = read_input_files(config["path_train"], config["max_train_sent_length"]) 165 | if config["path_dev"] != None and len(config["path_dev"]) > 0: 166 | data_dev = read_input_files(config["path_dev"]) 167 | if config["path_test"] != None and len(config["path_test"]) > 0: 168 | data_test = [] 169 | for path_test in config["path_test"].strip().split(":"): 170 | 
data_test += read_input_files(path_test) 171 | 172 | if config["load"] != None and len(config["load"]) > 0: 173 | labeler = SequenceLabeler.load(config["load"]) 174 | else: 175 | labeler = SequenceLabeler(config) 176 | labeler.build_vocabs(data_train, data_dev, data_test, config["preload_vectors"]) 177 | labeler.construct_network() 178 | labeler.initialize_session() 179 | if config["preload_vectors"] != None: 180 | labeler.preload_word_embeddings(config["preload_vectors"]) 181 | 182 | print("parameter_count: " + str(labeler.get_parameter_count())) 183 | print("parameter_count_without_word_embeddings: " + str(labeler.get_parameter_count_without_word_embeddings())) 184 | 185 | if data_train != None: 186 | model_selector = config["model_selector"].split(":")[0] 187 | model_selector_type = config["model_selector"].split(":")[1] 188 | best_selector_value = 0.0 189 | best_epoch = -1 190 | learningrate = config["learningrate"] 191 | for epoch in range(config["epochs"]): 192 | print("EPOCH: " + str(epoch)) 193 | print("current_learningrate: " + str(learningrate)) 194 | random.shuffle(data_train) 195 | 196 | results_train = process_sentences(data_train, labeler, is_training=True, learningrate=learningrate, config=config, name="train") 197 | 198 | if data_dev != None: 199 | results_dev = process_sentences(data_dev, labeler, is_training=False, learningrate=0.0, config=config, name="dev") 200 | 201 | if math.isnan(results_dev["dev_cost_sum"]) or math.isinf(results_dev["dev_cost_sum"]): 202 | sys.stderr.write("ERROR: Cost is NaN or Inf. Exiting.\n") 203 | break 204 | 205 | if (epoch == 0 or (model_selector_type == "high" and results_dev[model_selector] > best_selector_value) 206 | or (model_selector_type == "low" and results_dev[model_selector] < best_selector_value)): 207 | best_epoch = epoch 208 | best_selector_value = results_dev[model_selector] 209 | labeler.saver.save(labeler.session, temp_model_path, latest_filename=os.path.basename(temp_model_path)+".checkpoint") 210 | print("best_epoch: " + str(best_epoch)) 211 | 212 | if config["stop_if_no_improvement_for_epochs"] > 0 and (epoch - best_epoch) >= config["stop_if_no_improvement_for_epochs"]: 213 | break 214 | 215 | if (epoch - best_epoch) > 3: 216 | learningrate *= config["learningrate_decay"] 217 | 218 | while config["garbage_collection"] == True and gc.collect() > 0: 219 | pass 220 | 221 | if data_dev != None and best_epoch >= 0: 222 | # loading the best model so far 223 | labeler.saver.restore(labeler.session, temp_model_path) 224 | 225 | os.remove(temp_model_path+".checkpoint") 226 | os.remove(temp_model_path+".data-00000-of-00001") 227 | os.remove(temp_model_path+".index") 228 | os.remove(temp_model_path+".meta") 229 | 230 | if config["save"] is not None and len(config["save"]) > 0: 231 | labeler.save(config["save"]) 232 | 233 | if config["path_test"] is not None: 234 | i = 0 235 | for path_test in config["path_test"].strip().split(":"): 236 | data_test = read_input_files(path_test) 237 | results_test = process_sentences(data_test, labeler, is_training=False, learningrate=0.0, config=config, name="test"+str(i)) 238 | i += 1 239 | 240 | 241 | if __name__ == "__main__": 242 | run_experiment(sys.argv[1]) 243 | 244 | -------------------------------------------------------------------------------- /labeler.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import tensorflow as tf 3 | import re 4 | import numpy 5 | from tensorflow.python.framework import ops 6 | from 
tensorflow.python.ops import math_ops 7 | 8 | try: 9 | import cPickle as pickle 10 | except: 11 | import pickle 12 | 13 | class SequenceLabeler(object): 14 | def __init__(self, config): 15 | self.config = config 16 | 17 | self.UNK = "" 18 | self.CUNK = "" 19 | 20 | self.word2id = None 21 | self.char2id = None 22 | self.label2id = None 23 | self.singletons = None 24 | 25 | 26 | def build_vocabs(self, data_train, data_dev, data_test, embedding_path=None): 27 | data_source = list(data_train) 28 | if self.config["vocab_include_devtest"]: 29 | if data_dev != None: 30 | data_source += data_dev 31 | if data_test != None: 32 | data_source += data_test 33 | 34 | char_counter = collections.Counter() 35 | for sentence in data_source: 36 | for word in sentence: 37 | char_counter.update(word[0]) 38 | self.char2id = collections.OrderedDict([(self.CUNK, 0)]) 39 | for char, count in char_counter.most_common(): 40 | if char not in self.char2id: 41 | self.char2id[char] = len(self.char2id) 42 | 43 | word_counter = collections.Counter() 44 | for sentence in data_source: 45 | for word in sentence: 46 | w = word[0] 47 | if self.config["lowercase"] == True: 48 | w = w.lower() 49 | if self.config["replace_digits"] == True: 50 | w = re.sub(r'\d', '0', w) 51 | word_counter[w] += 1 52 | self.word2id = collections.OrderedDict([(self.UNK, 0)]) 53 | for word, count in word_counter.most_common(): 54 | if self.config["min_word_freq"] <= 0 or count >= self.config["min_word_freq"]: 55 | if word not in self.word2id: 56 | self.word2id[word] = len(self.word2id) 57 | 58 | self.singletons = set([word for word in word_counter if word_counter[word] == 1]) 59 | 60 | label_counter = collections.Counter() 61 | for sentence in data_train: #this one only based on training data 62 | for word in sentence: 63 | label_counter[word[-1]] += 1 64 | self.label2id = collections.OrderedDict() 65 | for label, count in label_counter.most_common(): 66 | if label not in self.label2id: 67 | self.label2id[label] = len(self.label2id) 68 | 69 | if embedding_path != None and self.config["vocab_only_embedded"] == True: 70 | self.embedding_vocab = set([self.UNK]) 71 | with open(embedding_path, 'r') as f: 72 | for line in f: 73 | line_parts = line.strip().split() 74 | if len(line_parts) <= 2: 75 | continue 76 | w = line_parts[0] 77 | if self.config["lowercase"] == True: 78 | w = w.lower() 79 | if self.config["replace_digits"] == True: 80 | w = re.sub(r'\d', '0', w) 81 | self.embedding_vocab.add(w) 82 | word2id_revised = collections.OrderedDict() 83 | for word in self.word2id: 84 | if word in embedding_vocab and word not in word2id_revised: 85 | word2id_revised[word] = len(word2id_revised) 86 | self.word2id = word2id_revised 87 | 88 | print("n_words: " + str(len(self.word2id))) 89 | print("n_chars: " + str(len(self.char2id))) 90 | print("n_labels: " + str(len(self.label2id))) 91 | print("n_singletons: " + str(len(self.singletons))) 92 | 93 | 94 | def construct_network(self): 95 | self.word_ids = tf.placeholder(tf.int32, [None, None], name="word_ids") 96 | self.char_ids = tf.placeholder(tf.int32, [None, None, None], name="char_ids") 97 | self.sentence_lengths = tf.placeholder(tf.int32, [None], name="sentence_lengths") 98 | self.word_lengths = tf.placeholder(tf.int32, [None, None], name="word_lengths") 99 | self.label_ids = tf.placeholder(tf.int32, [None, None], name="label_ids") 100 | self.learningrate = tf.placeholder(tf.float32, name="learningrate") 101 | self.is_training = tf.placeholder(tf.int32, name="is_training") 102 | 103 | self.loss = 0.0 104 | 
input_tensor = None 105 | input_vector_size = 0 106 | 107 | self.initializer = None 108 | if self.config["initializer"] == "normal": 109 | self.initializer = tf.random_normal_initializer(mean=0.0, stddev=0.1) 110 | elif self.config["initializer"] == "glorot": 111 | self.initializer = tf.glorot_uniform_initializer() 112 | elif self.config["initializer"] == "xavier": 113 | self.initializer = tf.glorot_normal_initializer() 114 | else: 115 | raise ValueError("Unknown initializer") 116 | 117 | self.word_embeddings = tf.get_variable("word_embeddings", 118 | shape=[len(self.word2id), self.config["word_embedding_size"]], 119 | initializer=(tf.zeros_initializer() if self.config["emb_initial_zero"] == True else self.initializer), 120 | trainable=(True if self.config["train_embeddings"] == True else False)) 121 | input_tensor = tf.nn.embedding_lookup(self.word_embeddings, self.word_ids) 122 | input_vector_size = self.config["word_embedding_size"] 123 | 124 | if self.config["char_embedding_size"] > 0 and self.config["char_recurrent_size"] > 0: 125 | with tf.variable_scope("chars"), tf.control_dependencies([tf.assert_equal(tf.shape(self.char_ids)[2], tf.reduce_max(self.word_lengths), message="Char dimensions don't match")]): 126 | self.char_embeddings = tf.get_variable("char_embeddings", 127 | shape=[len(self.char2id), self.config["char_embedding_size"]], 128 | initializer=self.initializer, 129 | trainable=True) 130 | char_input_tensor = tf.nn.embedding_lookup(self.char_embeddings, self.char_ids) 131 | 132 | s = tf.shape(char_input_tensor) 133 | char_input_tensor = tf.reshape(char_input_tensor, shape=[s[0]*s[1], s[2], self.config["char_embedding_size"]]) 134 | _word_lengths = tf.reshape(self.word_lengths, shape=[s[0]*s[1]]) 135 | 136 | char_lstm_cell_fw = tf.nn.rnn_cell.LSTMCell(self.config["char_recurrent_size"], 137 | use_peepholes=self.config["lstm_use_peepholes"], 138 | state_is_tuple=True, 139 | initializer=self.initializer, 140 | reuse=False) 141 | char_lstm_cell_bw = tf.nn.rnn_cell.LSTMCell(self.config["char_recurrent_size"], 142 | use_peepholes=self.config["lstm_use_peepholes"], 143 | state_is_tuple=True, 144 | initializer=self.initializer, 145 | reuse=False) 146 | 147 | char_lstm_outputs = tf.nn.bidirectional_dynamic_rnn(char_lstm_cell_fw, char_lstm_cell_bw, char_input_tensor, sequence_length=_word_lengths, dtype=tf.float32, time_major=False) 148 | _, ((_, char_output_fw), (_, char_output_bw)) = char_lstm_outputs 149 | char_output_tensor = tf.concat([char_output_fw, char_output_bw], axis=-1) 150 | char_output_tensor = tf.reshape(char_output_tensor, shape=[s[0], s[1], 2 * self.config["char_recurrent_size"]]) 151 | char_output_vector_size = 2 * self.config["char_recurrent_size"] 152 | 153 | if self.config["lmcost_char_gamma"] > 0.0: 154 | self.loss += self.config["lmcost_char_gamma"] * self.construct_lmcost(char_output_tensor, char_output_tensor, self.sentence_lengths, self.word_ids, "separate", "lmcost_char_separate") 155 | if self.config["lmcost_joint_char_gamma"] > 0.0: 156 | self.loss += self.config["lmcost_joint_char_gamma"] * self.construct_lmcost(char_output_tensor, char_output_tensor, self.sentence_lengths, self.word_ids, "joint", "lmcost_char_joint") 157 | 158 | if self.config["char_hidden_layer_size"] > 0: 159 | char_hidden_layer_size = self.config["word_embedding_size"] if self.config["char_integration_method"] == "attention" else self.config["char_hidden_layer_size"] 160 | char_output_tensor = tf.layers.dense(char_output_tensor, char_hidden_layer_size, activation=tf.tanh, 
kernel_initializer=self.initializer) 161 | char_output_vector_size = char_hidden_layer_size 162 | 163 | if self.config["char_integration_method"] == "concat": 164 | input_tensor = tf.concat([input_tensor, char_output_tensor], axis=-1) 165 | input_vector_size += char_output_vector_size 166 | elif self.config["char_integration_method"] == "attention": 167 | assert(char_output_vector_size == self.config["word_embedding_size"]), "This method requires the char representation to have the same size as word embeddings" 168 | static_input_tensor = tf.stop_gradient(input_tensor) 169 | is_unk = tf.equal(self.word_ids, self.word2id[self.UNK]) 170 | char_output_tensor_normalised = tf.nn.l2_normalize(char_output_tensor, 2) 171 | static_input_tensor_normalised = tf.nn.l2_normalize(static_input_tensor, 2) 172 | cosine_cost = 1.0 - tf.reduce_sum(tf.multiply(char_output_tensor_normalised, static_input_tensor_normalised), axis=2) 173 | is_padding = tf.logical_not(tf.sequence_mask(self.sentence_lengths, maxlen=tf.shape(self.word_ids)[1])) 174 | cosine_cost_unk = tf.where(tf.logical_or(is_unk, is_padding), x=tf.zeros_like(cosine_cost), y=cosine_cost) 175 | self.loss += self.config["char_attention_cosine_cost"] * tf.reduce_sum(cosine_cost_unk) 176 | attention_evidence_tensor = tf.concat([input_tensor, char_output_tensor], axis=2) 177 | attention_output = tf.layers.dense(attention_evidence_tensor, self.config["word_embedding_size"], activation=tf.tanh, kernel_initializer=self.initializer) 178 | attention_output = tf.layers.dense(attention_output, self.config["word_embedding_size"], activation=tf.sigmoid, kernel_initializer=self.initializer) 179 | input_tensor = tf.multiply(input_tensor, attention_output) + tf.multiply(char_output_tensor, (1.0 - attention_output)) 180 | elif self.config["char_integration_method"] == "none": 181 | input_tensor = input_tensor 182 | else: 183 | raise ValueError("Unknown char integration method") 184 | 185 | dropout_input = self.config["dropout_input"] * tf.cast(self.is_training, tf.float32) + (1.0 - tf.cast(self.is_training, tf.float32)) 186 | input_tensor = tf.nn.dropout(input_tensor, dropout_input, name="dropout_word") 187 | 188 | word_lstm_cell_fw = tf.nn.rnn_cell.LSTMCell(self.config["word_recurrent_size"], 189 | use_peepholes=self.config["lstm_use_peepholes"], 190 | state_is_tuple=True, 191 | initializer=self.initializer, 192 | reuse=False) 193 | word_lstm_cell_bw = tf.nn.rnn_cell.LSTMCell(self.config["word_recurrent_size"], 194 | use_peepholes=self.config["lstm_use_peepholes"], 195 | state_is_tuple=True, 196 | initializer=self.initializer, 197 | reuse=False) 198 | 199 | with tf.control_dependencies([tf.assert_equal(tf.shape(self.word_ids)[1], tf.reduce_max(self.sentence_lengths), message="Sentence dimensions don't match")]): 200 | (lstm_outputs_fw, lstm_outputs_bw), _ = tf.nn.bidirectional_dynamic_rnn(word_lstm_cell_fw, word_lstm_cell_bw, input_tensor, sequence_length=self.sentence_lengths, dtype=tf.float32, time_major=False) 201 | 202 | dropout_word_lstm = self.config["dropout_word_lstm"] * tf.cast(self.is_training, tf.float32) + (1.0 - tf.cast(self.is_training, tf.float32)) 203 | lstm_outputs_fw = tf.nn.dropout(lstm_outputs_fw, dropout_word_lstm) 204 | lstm_outputs_bw = tf.nn.dropout(lstm_outputs_bw, dropout_word_lstm) 205 | 206 | if self.config["lmcost_lstm_gamma"] > 0.0: 207 | self.loss += self.config["lmcost_lstm_gamma"] * self.construct_lmcost(lstm_outputs_fw, lstm_outputs_bw, self.sentence_lengths, self.word_ids, "separate", "lmcost_lstm_separate") 208 | if 
self.config["lmcost_joint_lstm_gamma"] > 0.0: 209 | self.loss += self.config["lmcost_joint_lstm_gamma"] * self.construct_lmcost(lstm_outputs_fw, lstm_outputs_bw, self.sentence_lengths, self.word_ids, "joint", "lmcost_lstm_joint") 210 | 211 | processed_tensor = tf.concat([lstm_outputs_fw, lstm_outputs_bw], 2) 212 | processed_tensor_size = self.config["word_recurrent_size"] * 2 213 | 214 | if self.config["hidden_layer_size"] > 0: 215 | processed_tensor = tf.layers.dense(processed_tensor, self.config["hidden_layer_size"], activation=tf.tanh, kernel_initializer=self.initializer) 216 | processed_tensor_size = self.config["hidden_layer_size"] 217 | 218 | self.scores = tf.layers.dense(processed_tensor, len(self.label2id), activation=None, kernel_initializer=self.initializer, name="output_ff") 219 | 220 | if self.config["crf_on_top"] == True: 221 | crf_num_tags = self.scores.get_shape()[2].value 222 | self.crf_transition_params = tf.get_variable("output_crf_transitions", [crf_num_tags, crf_num_tags], initializer=self.initializer) 223 | log_likelihood, self.crf_transition_params = tf.contrib.crf.crf_log_likelihood(self.scores, self.label_ids, self.sentence_lengths, transition_params=self.crf_transition_params) 224 | self.loss += self.config["main_cost"] * tf.reduce_sum(-log_likelihood) 225 | else: 226 | self.probabilities = tf.nn.softmax(self.scores) 227 | self.predictions = tf.argmax(self.probabilities, 2) 228 | loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.scores, labels=self.label_ids) 229 | mask = tf.sequence_mask(self.sentence_lengths, maxlen=tf.shape(self.word_ids)[1]) 230 | loss_ = tf.boolean_mask(loss_, mask) 231 | self.loss += self.config["main_cost"] * tf.reduce_sum(loss_) 232 | 233 | self.train_op = self.construct_optimizer(self.config["opt_strategy"], self.loss, self.learningrate, self.config["clip"]) 234 | 235 | 236 | def construct_lmcost(self, input_tensor_fw, input_tensor_bw, sentence_lengths, target_ids, lmcost_type, name): 237 | with tf.variable_scope(name): 238 | lmcost_max_vocab_size = min(len(self.word2id), self.config["lmcost_max_vocab_size"]) 239 | target_ids = tf.where(tf.greater_equal(target_ids, lmcost_max_vocab_size-1), x=(lmcost_max_vocab_size-1)+tf.zeros_like(target_ids), y=target_ids) 240 | cost = 0.0 241 | if lmcost_type == "separate": 242 | lmcost_fw_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,1:] 243 | lmcost_bw_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,:-1] 244 | lmcost_fw = self._construct_lmcost(input_tensor_fw[:,:-1,:], lmcost_max_vocab_size, lmcost_fw_mask, target_ids[:,1:], name=name+"_fw") 245 | lmcost_bw = self._construct_lmcost(input_tensor_bw[:,1:,:], lmcost_max_vocab_size, lmcost_bw_mask, target_ids[:,:-1], name=name+"_bw") 246 | cost += lmcost_fw + lmcost_bw 247 | elif lmcost_type == "joint": 248 | joint_input_tensor = tf.concat([input_tensor_fw[:,:-2,:], input_tensor_bw[:,2:,:]], axis=-1) 249 | lmcost_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,1:-1] 250 | cost += self._construct_lmcost(joint_input_tensor, lmcost_max_vocab_size, lmcost_mask, target_ids[:,1:-1], name=name+"_joint") 251 | else: 252 | raise ValueError("Unknown lmcost_type: " + str(lmcost_type)) 253 | return cost 254 | 255 | 256 | def _construct_lmcost(self, input_tensor, lmcost_max_vocab_size, lmcost_mask, target_ids, name): 257 | with tf.variable_scope(name): 258 | lmcost_hidden_layer = tf.layers.dense(input_tensor, self.config["lmcost_hidden_layer_size"], 
activation=tf.tanh, kernel_initializer=self.initializer) 259 | lmcost_output = tf.layers.dense(lmcost_hidden_layer, lmcost_max_vocab_size, activation=None, kernel_initializer=self.initializer) 260 | lmcost_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lmcost_output, labels=target_ids) 261 | lmcost_loss = tf.where(lmcost_mask, lmcost_loss, tf.zeros_like(lmcost_loss)) 262 | return tf.reduce_sum(lmcost_loss) 263 | 264 | 265 | def construct_optimizer(self, opt_strategy, loss, learningrate, clip): 266 | optimizer = None 267 | if opt_strategy == "adadelta": 268 | optimizer = tf.train.AdadeltaOptimizer(learning_rate=learningrate) 269 | elif opt_strategy == "adam": 270 | optimizer = tf.train.AdamOptimizer(learning_rate=learningrate) 271 | elif opt_strategy == "sgd": 272 | optimizer = tf.train.GradientDescentOptimizer(learning_rate=learningrate) 273 | else: 274 | raise ValueError("Unknown optimisation strategy: " + str(opt_strategy)) 275 | 276 | if clip > 0.0: 277 | grads, vs = zip(*optimizer.compute_gradients(loss)) 278 | grads, gnorm = tf.clip_by_global_norm(grads, clip) 279 | train_op = optimizer.apply_gradients(zip(grads, vs)) 280 | else: 281 | train_op = optimizer.minimize(loss) 282 | return train_op 283 | 284 | 285 | def preload_word_embeddings(self, embedding_path): 286 | loaded_embeddings = set() 287 | embedding_matrix = self.session.run(self.word_embeddings) 288 | with open(embedding_path, 'r') as f: 289 | for line in f: 290 | line_parts = line.strip().split() 291 | if len(line_parts) <= 2: 292 | continue 293 | w = line_parts[0] 294 | if self.config["lowercase"] == True: 295 | w = w.lower() 296 | if self.config["replace_digits"] == True: 297 | w = re.sub(r'\d', '0', w) 298 | if w in self.word2id and w not in loaded_embeddings: 299 | word_id = self.word2id[w] 300 | embedding = numpy.array(line_parts[1:]) 301 | embedding_matrix[word_id] = embedding 302 | loaded_embeddings.add(w) 303 | self.session.run(self.word_embeddings.assign(embedding_matrix)) 304 | print("n_preloaded_embeddings: " + str(len(loaded_embeddings))) 305 | 306 | 307 | def translate2id(self, token, token2id, unk_token, lowercase=False, replace_digits=False, singletons=None, singletons_prob=0.0): 308 | if lowercase == True: 309 | token = token.lower() 310 | if replace_digits == True: 311 | token = re.sub(r'\d', '0', token) 312 | 313 | token_id = None 314 | if singletons != None and token in singletons and token in token2id and unk_token != None and numpy.random.uniform() < singletons_prob: 315 | token_id = token2id[unk_token] 316 | elif token in token2id: 317 | token_id = token2id[token] 318 | elif unk_token != None: 319 | token_id = token2id[unk_token] 320 | else: 321 | raise ValueError("Unable to handle value, no UNK token: " + str(token)) 322 | return token_id 323 | 324 | 325 | def create_input_dictionary_for_batch(self, batch, is_training, learningrate): 326 | sentence_lengths = numpy.array([len(sentence) for sentence in batch]) 327 | max_sentence_length = sentence_lengths.max() 328 | max_word_length = numpy.array([numpy.array([len(word[0]) for word in sentence]).max() for sentence in batch]).max() 329 | if self.config["allowed_word_length"] > 0 and self.config["allowed_word_length"] < max_word_length: 330 | max_word_length = min(max_word_length, self.config["allowed_word_length"]) 331 | 332 | word_ids = numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 333 | char_ids = numpy.zeros((len(batch), max_sentence_length, max_word_length), dtype=numpy.int32) 334 | word_lengths = 
numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 335 | label_ids = numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 336 | 337 | singletons = self.singletons if is_training == True else None 338 | singletons_prob = self.config["singletons_prob"] if is_training == True else 0.0 339 | for i in range(len(batch)): 340 | for j in range(len(batch[i])): 341 | word_ids[i][j] = self.translate2id(batch[i][j][0], self.word2id, self.UNK, lowercase=self.config["lowercase"], replace_digits=self.config["replace_digits"], singletons=singletons, singletons_prob=singletons_prob) 342 | label_ids[i][j] = self.translate2id(batch[i][j][-1], self.label2id, None) 343 | word_lengths[i][j] = min(len(batch[i][j][0]), max_word_length) 344 | for k in range(min(len(batch[i][j][0]), max_word_length)): 345 | char_ids[i][j][k] = self.translate2id(batch[i][j][0][k], self.char2id, self.CUNK) 346 | 347 | input_dictionary = {self.word_ids: word_ids, self.char_ids: char_ids, self.sentence_lengths: sentence_lengths, self.word_lengths: word_lengths, self.label_ids: label_ids, self.learningrate: learningrate, self.is_training: is_training} 348 | return input_dictionary 349 | 350 | 351 | def viterbi_decode(self, score, transition_params): 352 | trellis = numpy.zeros_like(score) 353 | backpointers = numpy.zeros_like(score, dtype=numpy.int32) 354 | trellis[0] = score[0] 355 | 356 | for t in range(1, score.shape[0]): 357 | v = numpy.expand_dims(trellis[t - 1], 1) + transition_params 358 | trellis[t] = score[t] + numpy.max(v, 0) 359 | backpointers[t] = numpy.argmax(v, 0) 360 | 361 | viterbi = [numpy.argmax(trellis[-1])] 362 | for bp in reversed(backpointers[1:]): 363 | viterbi.append(bp[viterbi[-1]]) 364 | viterbi.reverse() 365 | 366 | viterbi_score = numpy.max(trellis[-1]) 367 | return viterbi, viterbi_score, trellis 368 | 369 | 370 | def process_batch(self, batch, is_training, learningrate): 371 | feed_dict = self.create_input_dictionary_for_batch(batch, is_training, learningrate) 372 | 373 | if self.config["crf_on_top"] == True: 374 | cost, scores = self.session.run([self.loss, self.scores] + ([self.train_op] if is_training == True else []), feed_dict=feed_dict)[:2] 375 | predicted_labels = [] 376 | predicted_probs = [] 377 | for i in range(len(batch)): 378 | sentence_length = len(batch[i]) 379 | viterbi_seq, viterbi_score, viterbi_trellis = self.viterbi_decode(scores[i], self.session.run(self.crf_transition_params)) 380 | predicted_labels.append(viterbi_seq[:sentence_length]) 381 | predicted_probs.append(viterbi_trellis[:sentence_length]) 382 | else: 383 | cost, predicted_labels_, predicted_probs_ = self.session.run([self.loss, self.predictions, self.probabilities] + ([self.train_op] if is_training == True else []), feed_dict=feed_dict)[:3] 384 | predicted_labels = [] 385 | predicted_probs = [] 386 | for i in range(len(batch)): 387 | sentence_length = len(batch[i]) 388 | predicted_labels.append(predicted_labels_[i][:sentence_length]) 389 | predicted_probs.append(predicted_probs_[i][:sentence_length]) 390 | 391 | return cost, predicted_labels, predicted_probs 392 | 393 | 394 | def initialize_session(self): 395 | tf.set_random_seed(self.config["random_seed"]) 396 | session_config = tf.ConfigProto() 397 | session_config.gpu_options.allow_growth = self.config["tf_allow_growth"] 398 | session_config.gpu_options.per_process_gpu_memory_fraction = self.config["tf_per_process_gpu_memory_fraction"] 399 | self.session = tf.Session(config=session_config) 400 | 
self.session.run(tf.global_variables_initializer()) 401 | self.saver = tf.train.Saver(max_to_keep=1) 402 | 403 | 404 | def get_parameter_count(self): 405 | total_parameters = 0 406 | for variable in tf.trainable_variables(): 407 | shape = variable.get_shape() 408 | variable_parameters = 1 409 | for dim in shape: 410 | variable_parameters *= dim.value 411 | total_parameters += variable_parameters 412 | return total_parameters 413 | 414 | 415 | def get_parameter_count_without_word_embeddings(self): 416 | shape = self.word_embeddings.get_shape() 417 | variable_parameters = 1 418 | for dim in shape: 419 | variable_parameters *= dim.value 420 | return self.get_parameter_count() - variable_parameters 421 | 422 | 423 | def save(self, filename): 424 | dump = {} 425 | dump["config"] = self.config 426 | dump["UNK"] = self.UNK 427 | dump["CUNK"] = self.CUNK 428 | dump["word2id"] = self.word2id 429 | dump["char2id"] = self.char2id 430 | dump["label2id"] = self.label2id 431 | dump["singletons"] = self.singletons 432 | 433 | dump["params"] = {} 434 | for variable in tf.global_variables(): 435 | assert(variable.name not in dump["params"]), "Error: variable with this name already exists" + str(variable.name) 436 | dump["params"][variable.name] = self.session.run(variable) 437 | with open(filename, 'wb') as f: 438 | pickle.dump(dump, f, protocol=pickle.HIGHEST_PROTOCOL) 439 | 440 | 441 | @staticmethod 442 | def load(filename): 443 | with open(filename, 'rb') as f: 444 | dump = pickle.load(f) 445 | 446 | # for safety, so we don't overwrite old models 447 | dump["config"]["save"] = None 448 | 449 | labeler = SequenceLabeler(dump["config"]) 450 | labeler.UNK = dump["UNK"] 451 | labeler.CUNK = dump["CUNK"] 452 | labeler.word2id = dump["word2id"] 453 | labeler.char2id = dump["char2id"] 454 | labeler.label2id = dump["label2id"] 455 | labeler.singletons = dump["singletons"] 456 | 457 | labeler.construct_network() 458 | labeler.initialize_session() 459 | labeler.load_params(filename) 460 | 461 | return labeler 462 | 463 | 464 | def load_params(self, filename): 465 | with open(filename, 'rb') as f: 466 | dump = pickle.load(f) 467 | 468 | for variable in tf.global_variables(): 469 | assert(variable.name in dump["params"]), "Variable not in dump: " + str(variable.name) 470 | assert(variable.shape == dump["params"][variable.name].shape), "Variable shape not as expected: " + str(variable.name) + " " + str(variable.shape) + " " + str(dump["params"][variable.name].shape) 471 | value = numpy.asarray(dump["params"][variable.name]) 472 | self.session.run(variable.assign(value)) 473 | 474 | -------------------------------------------------------------------------------- /print_output.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import labeler 3 | import experiment 4 | import numpy 5 | import collections 6 | import time 7 | 8 | def print_predictions(print_probs, model_path, input_file): 9 | time_loading = time.time() 10 | model = labeler.SequenceLabeler.load(model_path) 11 | 12 | time_noloading = time.time() 13 | config = model.config 14 | predictions_cache = {} 15 | 16 | id2label = collections.OrderedDict() 17 | for label in model.label2id: 18 | id2label[model.label2id[label]] = label 19 | 20 | sentences_test = experiment.read_input_files(input_file) 21 | batches_of_sentence_ids = experiment.create_batches_of_sentence_ids(sentences_test, config["batch_equal_size"], config['max_batch_size']) 22 | 23 | for sentence_ids_in_batch in batches_of_sentence_ids: 24 | batch = 
[sentences_test[i] for i in sentence_ids_in_batch] 25 | cost, predicted_labels, predicted_probs = model.process_batch(batch, is_training=False, learningrate=0.0) 26 | 27 | assert(len(sentence_ids_in_batch) == len(predicted_labels)) 28 | 29 | for i in range(len(sentence_ids_in_batch)): 30 | key = str(sentence_ids_in_batch[i]) 31 | predictions = [] 32 | if print_probs == False: 33 | for j in range(len(predicted_labels[i])): 34 | predictions.append(id2label[predicted_labels[i][j]]) 35 | elif print_probs == True: 36 | for j in range(len(predicted_probs[i])): 37 | p_ = "" 38 | for k in range(len(predicted_probs[i][j])): 39 | p_ += str(id2label[k]) + ":" + str(predicted_probs[i][j][k]) + "\t" 40 | predictions.append(p_.strip()) 41 | predictions_cache[key] = predictions 42 | 43 | sentence_id = 0 44 | word_id = 0 45 | with open(input_file, "r") as f: 46 | for line in f: 47 | if len(line.strip()) == 0: 48 | print("") 49 | if word_id == 0: 50 | continue 51 | assert(len(predictions_cache[str(sentence_id)]) == word_id), str(len(predictions_cache[str(sentence_id)])) + " " + str(word_id) 52 | sentence_id += 1 53 | word_id = 0 54 | continue 55 | assert(str(sentence_id) in predictions_cache) 56 | assert(len(predictions_cache[str(sentence_id)]) > word_id) 57 | print(line.strip() + "\t" + predictions_cache[str(sentence_id)][word_id].strip()) 58 | word_id += 1 59 | 60 | sys.stderr.write("Processed: " + input_file + "\n") 61 | sys.stderr.write("Elapsed time with loading: " + str(time.time() - time_loading) + "\n") 62 | sys.stderr.write("Elapsed time without loading: " + str(time.time() - time_noloading) + "\n") 63 | 64 | 65 | 66 | 67 | if __name__ == "__main__": 68 | if sys.argv[1] == "labels": 69 | print_probs = False 70 | elif sys.argv[1] == "probs": 71 | print_probs = True 72 | else: 73 | raise ValueError("Unknown value") 74 | 75 | print_predictions(print_probs, sys.argv[2], sys.argv[3]) 76 | 77 | 78 | --------------------------------------------------------------------------------
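A minimal sketch of using a saved model programmatically, mirroring what print_output.py above does. This snippet is not part of the repository, and "model.pkl" and "input.tsv" are placeholder names; it only calls functions defined in labeler.py and experiment.py.

    # Load a previously saved model and print one token/label pair per line.
    import labeler
    import experiment

    model = labeler.SequenceLabeler.load("model.pkl")
    id2label = {label_id: label for label, label_id in model.label2id.items()}

    # Input must be in the CoNLL-style format described in the README,
    # including a (possibly dummy) label column.
    sentences = experiment.read_input_files("input.tsv")
    batches = experiment.create_batches_of_sentence_ids(
        sentences, model.config["batch_equal_size"], model.config["max_batch_size"])

    for sentence_ids in batches:
        batch = [sentences[i] for i in sentence_ids]
        _, predicted_labels, _ = model.process_batch(batch, is_training=False, learningrate=0.0)
        for sentence, labels in zip(batch, predicted_labels):
            for token, label_id in zip(sentence, labels):
                print(token[0] + "\t" + id2label[label_id])
            print("")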