├── README.md
├── conf
│   └── fcepublic.conf
├── conlleval.py
├── evaluator.py
├── experiment.py
├── labeler.py
└── print_output.py

/README.md:
--------------------------------------------------------------------------------
Sequence labeler
=========================

This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. It can be used for named entity recognition, POS-tagging, error detection, chunking, CCG supertagging, etc.

The main model implements a bidirectional LSTM for sequence tagging. In addition, you can incorporate character-level information -- either by concatenating a character-based representation with the word embedding, or by combining the two with an attention/gating mechanism.
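As a rough sketch of the attention/gating combination (the "attention" option described below; see `construct_network` in labeler.py and Rei et al. (2016)): writing x for the word embedding, c for the character-based representation, and W1, W2 as shorthand for the two dense layers, the model computes

    z = sigmoid(W2 · tanh(W1 · [x; c]))
    combined = z ⊙ x + (1 − z) ⊙ c

where [x; c] is the concatenation of the two vectors and ⊙ is elementwise multiplication, so the gate z decides, per dimension, how much to trust the word embedding versus the character-based representation.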
Run with:

    python experiment.py config.conf

Preferably with TensorFlow set up to use CUDA, so the process can run on a GPU. The script will train the model on the training data, test it on the test data, and print various evaluation metrics.

Note: The original sequence labeler was implemented in Theano, but since Theano support is being discontinued, I have reimplemented it in TensorFlow. I also used the opportunity to refactor the code a bit, and it should be better in every way. However, if you need the specific code used in previously published papers, you'll need to refer to older commits.

Requirements
-------------------------

* python (tested with 2.7.12 and 3.5.2)
* numpy (tested with 1.13.3 and 1.14.0)
* tensorflow (tested with 1.3.0 and 1.4.1)


Data format
-------------------------

The training and test data is expected in the standard CoNLL-type tab-separated format: one word per line, separate columns for the token and the label, and an empty line between sentences.

For error detection, this would be something like:

    I       c
    saws    i
    the     c
    show    c


The first column is assumed to be the token and the last column is the label. There can be other columns in the middle, which are currently not used. For example:

    EU      NNP     I-NP    S-ORG
    rejects VBZ     I-VP    O
    German  JJ      I-NP    S-MISC
    call    NN      I-NP    O
    to      TO      I-VP    O
    boycott VB      I-VP    O
    British JJ      I-NP    S-MISC
    lamb    NN      I-NP    O
    .       .       O       O


Configuration
-------------------------

Edit the values in config.conf as needed:

* **path_train** - Path to the training data, in CoNLL tab-separated format. One word per line, first column is the word, last column is the label. Empty lines between sentences.
* **path_dev** - Path to the development data, used for choosing the best epoch.
* **path_test** - Path to the test file. Can contain multiple files, colon-separated.
* **conll_eval** - Whether the standard CoNLL NER evaluation should be run.
* **main_label** - The output label for which precision/recall/F-measure are calculated. Does not affect accuracy or the measures from the CoNLL eval.
* **model_selector** - What is measured on the dev set for model selection: "dev_conll_f:high" for NER and chunking, "dev_acc:high" for POS-tagging, "dev_f05:high" for error detection.
* **preload_vectors** - Path to the pretrained word embeddings, in word2vec plain-text format. If your embeddings are in binary, you can use [convertvec](https://github.com/marekrei/convertvec) to convert them to plain text.
* **word_embedding_size** - Size of the word embeddings used in the model.
* **crf_on_top** - If True, use a CRF as the output layer. If False, use softmax instead.
* **emb_initial_zero** - Whether word embeddings should be initialised with zeros instead of the chosen initializer.
* **train_embeddings** - Whether word embeddings should be updated during training.
* **char_embedding_size** - Size of the character embeddings.
* **word_recurrent_size** - Size of the word-level LSTM hidden layers.
* **char_recurrent_size** - Size of the char-level LSTM hidden layers.
* **hidden_layer_size** - Size of the extra hidden layer on top of the bi-LSTM.
* **char_hidden_layer_size** - Size of the extra hidden layer on top of the character-based component.
* **lowercase** - Whether words should be lowercased when mapping to word embeddings.
* **replace_digits** - Whether all digits should be replaced by 0.
* **min_word_freq** - Minimal frequency of words to be included in the vocabulary. Others will be considered OOV.
* **singletons_prob** - The probability of mapping words that appear only once in the training data to OOV during training.
* **allowed_word_length** - Maximum allowed word length; longer words are clipped to this length. Can be necessary if the text contains unreasonably long tokens, e.g. URLs.
* **max_train_sent_length** - Discard sentences longer than this limit when training.
* **vocab_include_devtest** - Load words from the dev and test sets into the vocabulary as well. If they don't appear in the training set, they will keep the default representations from the preloaded embeddings.
* **vocab_only_embedded** - Whether the vocabulary should contain only words in the pretrained embedding set.
* **initializer** - The method used to initialize weight matrices in the network.
* **opt_strategy** - The method used for weight updates.
* **learningrate** - Learning rate.
* **clip** - Gradient clipping threshold (global norm); 0.0 disables clipping.
* **batch_equal_size** - Whether to create batches in which all sentences have equal length.
* **max_batch_size** - Maximum number of sentences in a batch. If negative, batches are constructed so that each contains at most abs(max_batch_size) words.
* **epochs** - Maximum number of epochs to run.
* **stop_if_no_improvement_for_epochs** - Training is stopped if there has been no improvement for this many epochs.
* **learningrate_decay** - If performance hasn't improved for 3 epochs, multiply the learning rate by this value.
* **dropout_input** - The keep probability when applying dropout to the word representations; 1.0 means no dropout.
* **dropout_word_lstm** - The keep probability when applying dropout to the word-level LSTM outputs.
* **tf_per_process_gpu_memory_fraction** - The fraction of GPU memory that the process is allowed to use.
* **tf_allow_growth** - Whether the GPU memory usage can grow dynamically.
* **main_cost** - Weight of the main sequence labeling objective.
* **lmcost_max_vocab_size** - Maximum vocabulary size for the language modeling loss. The remaining words are mapped to a single entry.
* **lmcost_hidden_layer_size** - Hidden layer size for the language modeling loss.
* **lmcost_lstm_gamma**, **lmcost_joint_lstm_gamma**, **lmcost_char_gamma**, **lmcost_joint_char_gamma** - Weights for the language modeling objectives, applied to the word-level LSTM or the character-level component, predicting the two directions either separately or jointly; 0.0 disables the corresponding objective.
* **char_integration_method** - How character information is integrated. Options are: "none" (not integrated), "concat" (concatenated), "attention" (the method proposed in Rei et al. (2016)).
* **save** - Path for saving the model.
* **load** - Path for loading the model.
* **garbage_collection** - Whether garbage collection is called explicitly. Makes things slower, but can help fit bigger models into memory.
* **lstm_use_peepholes** - Whether to use the LSTM implementation with peephole connections.
* **random_seed** - Random seed for initialisation and data shuffling. This can affect results, so for robust conclusions I recommend running multiple experiments with different seeds and averaging the metrics.
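The configuration file itself is a plain INI-style file with a single `[config]` section, read with Python's configparser (see `parse_config` in experiment.py). `conf/fcepublic.conf`, included below, is a complete example; an abridged fragment looks like this:

    [config]
    path_train = fce-public.train.original.tsv
    path_dev = fce-public.dev.original.tsv
    path_test = fce-public.dev.original.tsv:fce-public.test.original.tsv
    model_selector = dev_f05:high
    preload_vectors = embeddings/glove/glove.6B.300d.txt
    char_integration_method = concat
    epochs = 200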

Printing output
-------------------------

There is now a separate script for loading a saved model and using it to print output for a given input file. Use the **save** option in the config file for saving the model. The input file needs to be in the same format as the training data (one word per line, labels in a separate column). A label column is still expected in the input; if you don't know the correct labels, just put any valid label in that column.

To print the output, run:

    python print_output.py labels model_file input_file

This will print the input file to standard output, with an extra column at the end that shows the prediction.

You can also use:

    python print_output.py probs model_file input_file

This will print the individual probabilities for each of the possible labels.
If the model is using a CRF, the *probs* option will output unnormalised state scores, without taking the transitions into account.


References
-------------------------

The main sequence labeling model is described here:

[**Compositional Sequence Labeling Models for Error Detection in Learner Writing**](http://aclweb.org/anthology/P/P16/P16-1112.pdf)
Marek Rei and Helen Yannakoudakis
*In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-2016)*


The character-level component is described here:

[**Attending to characters in neural sequence labeling models**](https://aclweb.org/anthology/C/C16/C16-1030.pdf)
Marek Rei, Gamal K.O. Crichton and Sampo Pyysalo
*In Proceedings of the 26th International Conference on Computational Linguistics (COLING-2016)*

The language modeling objective is described here:

[**Semi-supervised Multitask Learning for Sequence Labeling**](https://arxiv.org/abs/1704.07156)
Marek Rei
*In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-2017)*

The CRF implementation is based on:

[**Neural Architectures for Named Entity Recognition**](https://arxiv.org/abs/1603.01360)
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami and Chris Dyer
*In Proceedings of NAACL-HLT 2016*


The conlleval.py script is from: https://github.com/spyysalo/conlleval.py


License
---------------------------

The code is distributed under the Affero General Public License 3 (AGPL-3.0) by default.
If you wish to use it under a different license, feel free to get in touch.

Copyright (c) 2018 Marek Rei

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the 176 | GNU Affero General Public License for more details. 177 | -------------------------------------------------------------------------------- /conf/fcepublic.conf: -------------------------------------------------------------------------------- 1 | [config] 2 | dataset = fcepublic 3 | path_train = fce-public.train.original.tsv 4 | path_dev = fce-public.dev.original.tsv 5 | path_test = fce-public.dev.original.tsv:fce-public.test.original.tsv:nucle.test0.original.tsv:nucle.test1.original.tsv 6 | conll_eval = False 7 | main_label = i 8 | model_selector = dev_f05:high 9 | preload_vectors = embeddings/glove/glove.6B.300d.txt 10 | word_embedding_size = 300 11 | crf_on_top = False 12 | emb_initial_zero = False 13 | train_embeddings = True 14 | char_embedding_size = 100 15 | word_recurrent_size = 300 16 | char_recurrent_size = 100 17 | hidden_layer_size = 50 18 | char_hidden_layer_size = 50 19 | lowercase = True 20 | replace_digits = True 21 | min_word_freq = -1 22 | singletons_prob = 0.1 23 | allowed_word_length = -1 24 | max_train_sent_length = -1 25 | vocab_include_devtest = True 26 | vocab_only_embedded = False 27 | initializer = glorot 28 | opt_strategy = adadelta 29 | learningrate = 1.0 30 | clip = 0.0 31 | batch_equal_size = False 32 | max_batch_size = 32 33 | epochs = 200 34 | stop_if_no_improvement_for_epochs = 7 35 | learningrate_decay = 0.9 36 | dropout_input = 0.5 37 | dropout_word_lstm = 0.5 38 | tf_per_process_gpu_memory_fraction = 1.0 39 | tf_allow_growth = True 40 | main_cost = 1.0 41 | lmcost_max_vocab_size = 7500 42 | lmcost_hidden_layer_size = 50 43 | lmcost_lstm_gamma = 0.1 44 | lmcost_joint_lstm_gamma = 0.0 45 | lmcost_char_gamma = 0.0 46 | lmcost_joint_char_gamma = 0.0 47 | char_attention_cosine_cost = 1.0 48 | char_integration_method = concat 49 | save = 50 | load = 51 | garbage_collection = False 52 | lstm_use_peepholes = False 53 | random_seed = 100 54 | -------------------------------------------------------------------------------- /conlleval.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Python version of the evaluation script from CoNLL'00- 4 | # Originates from: https://github.com/spyysalo/conlleval.py 5 | 6 | 7 | # Intentional differences: 8 | # - accept any space as delimiter by default 9 | # - optional file argument (default STDIN) 10 | # - option to set boundary (-b argument) 11 | # - LaTeX output (-l argument) not supported 12 | # - raw tags (-r argument) not supported 13 | 14 | import sys 15 | import re 16 | 17 | from collections import defaultdict, namedtuple 18 | 19 | ANY_SPACE = '' 20 | 21 | class FormatError(Exception): 22 | pass 23 | 24 | Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore') 25 | 26 | class EvalCounts(object): 27 | def __init__(self): 28 | self.correct_chunk = 0 # number of correctly identified chunks 29 | self.correct_tags = 0 # number of correct chunk tags 30 | self.found_correct = 0 # number of chunks in corpus 31 | self.found_guessed = 0 # number of identified chunks 32 | self.token_counter = 0 # token counter (ignores sentence breaks) 33 | 34 | # counts by type 35 | self.t_correct_chunk = defaultdict(int) 36 | self.t_found_correct = defaultdict(int) 37 | self.t_found_guessed = defaultdict(int) 38 | 39 | def parse_args(argv): 40 | import argparse 41 | parser = argparse.ArgumentParser( 42 | description='evaluate tagging results using CoNLL criteria', 43 | formatter_class=argparse.ArgumentDefaultsHelpFormatter 44 | ) 45 | arg = parser.add_argument 
46 | arg('-b', '--boundary', metavar='STR', default='-X-', 47 | help='sentence boundary') 48 | arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE, 49 | help='character delimiting items in input') 50 | arg('-o', '--otag', metavar='CHAR', default='O', 51 | help='alternative outside tag') 52 | arg('file', nargs='?', default=None) 53 | return parser.parse_args(argv) 54 | 55 | def parse_tag(t): 56 | m = re.match(r'^([^-]*)-(.*)$', t) 57 | return m.groups() if m else (t, '') 58 | 59 | def evaluate(iterable, options=None): 60 | if options is None: 61 | options = parse_args([]) # use defaults 62 | 63 | counts = EvalCounts() 64 | num_features = None # number of features per line 65 | in_correct = False # currently processed chunks is correct until now 66 | last_correct = 'O' # previous chunk tag in corpus 67 | last_correct_type = '' # type of previously identified chunk tag 68 | last_guessed = 'O' # previously identified chunk tag 69 | last_guessed_type = '' # type of previous chunk tag in corpus 70 | 71 | for line in iterable: 72 | line = line.rstrip('\r\n') 73 | 74 | if options.delimiter == ANY_SPACE: 75 | features = line.split() 76 | else: 77 | features = line.split(options.delimiter) 78 | 79 | if num_features is None: 80 | num_features = len(features) 81 | elif num_features != len(features) and len(features) != 0: 82 | raise FormatError('unexpected number of features: %d (%d)' % 83 | (len(features), num_features)) 84 | 85 | if len(features) == 0 or features[0] == options.boundary: 86 | features = [options.boundary, 'O', 'O'] 87 | if len(features) < 3: 88 | raise FormatError('unexpected number of features in line %s' % line) 89 | 90 | guessed, guessed_type = parse_tag(features.pop()) 91 | correct, correct_type = parse_tag(features.pop()) 92 | first_item = features.pop(0) 93 | 94 | if first_item == options.boundary: 95 | guessed = 'O' 96 | 97 | end_correct = end_of_chunk(last_correct, correct, 98 | last_correct_type, correct_type) 99 | end_guessed = end_of_chunk(last_guessed, guessed, 100 | last_guessed_type, guessed_type) 101 | start_correct = start_of_chunk(last_correct, correct, 102 | last_correct_type, correct_type) 103 | start_guessed = start_of_chunk(last_guessed, guessed, 104 | last_guessed_type, guessed_type) 105 | 106 | if in_correct: 107 | if (end_correct and end_guessed and 108 | last_guessed_type == last_correct_type): 109 | in_correct = False 110 | counts.correct_chunk += 1 111 | counts.t_correct_chunk[last_correct_type] += 1 112 | elif (end_correct != end_guessed or guessed_type != correct_type): 113 | in_correct = False 114 | 115 | if start_correct and start_guessed and guessed_type == correct_type: 116 | in_correct = True 117 | 118 | if start_correct: 119 | counts.found_correct += 1 120 | counts.t_found_correct[correct_type] += 1 121 | if start_guessed: 122 | counts.found_guessed += 1 123 | counts.t_found_guessed[guessed_type] += 1 124 | if first_item != options.boundary: 125 | if correct == guessed and guessed_type == correct_type: 126 | counts.correct_tags += 1 127 | counts.token_counter += 1 128 | 129 | last_guessed = guessed 130 | last_correct = correct 131 | last_guessed_type = guessed_type 132 | last_correct_type = correct_type 133 | 134 | if in_correct: 135 | counts.correct_chunk += 1 136 | counts.t_correct_chunk[last_correct_type] += 1 137 | 138 | return counts 139 | 140 | def uniq(iterable): 141 | seen = set() 142 | return [i for i in iterable if not (i in seen or seen.add(i))] 143 | 144 | def calculate_metrics(correct, guessed, total): 145 | tp, fp, fn = correct, 
guessed-correct, total-correct 146 | p = 0 if tp + fp == 0 else 1.*tp / (tp + fp) 147 | r = 0 if tp + fn == 0 else 1.*tp / (tp + fn) 148 | f = 0 if p + r == 0 else 2 * p * r / (p + r) 149 | return Metrics(tp, fp, fn, p, r, f) 150 | 151 | def metrics(counts): 152 | c = counts 153 | overall = calculate_metrics( 154 | c.correct_chunk, c.found_guessed, c.found_correct 155 | ) 156 | by_type = {} 157 | for t in uniq(list(c.t_found_correct.keys()) + list(c.t_found_guessed.keys())): 158 | by_type[t] = calculate_metrics( 159 | c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t] 160 | ) 161 | return overall, by_type 162 | 163 | def report(counts, out=None): 164 | if out is None: 165 | out = sys.stdout 166 | 167 | overall, by_type = metrics(counts) 168 | 169 | c = counts 170 | out.write('processed %d tokens with %d phrases; ' % 171 | (c.token_counter, c.found_correct)) 172 | out.write('found: %d phrases; correct: %d.\n' % 173 | (c.found_guessed, c.correct_chunk)) 174 | 175 | if c.token_counter > 0: 176 | out.write('accuracy: %6.2f%%; ' % 177 | (100.*c.correct_tags/c.token_counter)) 178 | out.write('precision: %6.2f%%; ' % (100.*overall.prec)) 179 | out.write('recall: %6.2f%%; ' % (100.*overall.rec)) 180 | out.write('FB1: %6.2f\n' % (100.*overall.fscore)) 181 | 182 | for i, m in sorted(by_type.items()): 183 | out.write('%17s: ' % i) 184 | out.write('precision: %6.2f%%; ' % (100.*m.prec)) 185 | out.write('recall: %6.2f%%; ' % (100.*m.rec)) 186 | out.write('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i])) 187 | 188 | def end_of_chunk(prev_tag, tag, prev_type, type_): 189 | # check if a chunk ended between the previous and current word 190 | # arguments: previous and current chunk tags, previous and current types 191 | chunk_end = False 192 | 193 | if prev_tag == 'E': chunk_end = True 194 | if prev_tag == 'S': chunk_end = True 195 | 196 | if prev_tag == 'B' and tag == 'B': chunk_end = True 197 | if prev_tag == 'B' and tag == 'S': chunk_end = True 198 | if prev_tag == 'B' and tag == 'O': chunk_end = True 199 | if prev_tag == 'I' and tag == 'B': chunk_end = True 200 | if prev_tag == 'I' and tag == 'S': chunk_end = True 201 | if prev_tag == 'I' and tag == 'O': chunk_end = True 202 | 203 | if prev_tag != 'O' and prev_tag != '.' and prev_type != type_: 204 | chunk_end = True 205 | 206 | # these chunks are assumed to have length 1 207 | if prev_tag == ']': chunk_end = True 208 | if prev_tag == '[': chunk_end = True 209 | 210 | return chunk_end 211 | 212 | def start_of_chunk(prev_tag, tag, prev_type, type_): 213 | # check if a chunk started between the previous and current word 214 | # arguments: previous and current chunk tags, previous and current types 215 | chunk_start = False 216 | 217 | if tag == 'B': chunk_start = True 218 | if tag == 'S': chunk_start = True 219 | 220 | if prev_tag == 'E' and tag == 'E': chunk_start = True 221 | if prev_tag == 'E' and tag == 'I': chunk_start = True 222 | if prev_tag == 'S' and tag == 'E': chunk_start = True 223 | if prev_tag == 'S' and tag == 'I': chunk_start = True 224 | if prev_tag == 'O' and tag == 'E': chunk_start = True 225 | if prev_tag == 'O' and tag == 'I': chunk_start = True 226 | 227 | if tag != 'O' and tag != '.' 
and prev_type != type_: 228 | chunk_start = True 229 | 230 | # these chunks are assumed to have length 1 231 | if tag == '[': chunk_start = True 232 | if tag == ']': chunk_start = True 233 | 234 | return chunk_start 235 | 236 | def main(argv): 237 | args = parse_args(argv[1:]) 238 | 239 | if args.file is None: 240 | counts = evaluate(sys.stdin, args) 241 | else: 242 | with open(args.file) as f: 243 | counts = evaluate(f, args) 244 | report(counts) 245 | 246 | if __name__ == '__main__': 247 | sys.exit(main(sys.argv)) 248 | -------------------------------------------------------------------------------- /evaluator.py: -------------------------------------------------------------------------------- 1 | import time 2 | import collections 3 | import numpy 4 | import conlleval 5 | 6 | class SequenceLabelingEvaluator(object): 7 | def __init__(self, main_label, label2id, conll_eval=False): 8 | self.main_label = main_label 9 | self.label2id = label2id 10 | self.conll_eval = conll_eval 11 | self.main_label_id = self.label2id[self.main_label] 12 | 13 | self.cost_sum = 0.0 14 | self.correct_sum = 0.0 15 | self.main_predicted_count = 0 16 | self.main_total_count = 0 17 | self.main_correct_count = 0 18 | self.token_count = 0 19 | self.start_time = time.time() 20 | 21 | self.id2label = collections.OrderedDict() 22 | for label in self.label2id: 23 | self.id2label[self.label2id[label]] = label 24 | 25 | self.conll_format = [] 26 | 27 | def append_data(self, cost, batch, predicted_labels): 28 | self.cost_sum += cost 29 | for i in range(len(batch)): 30 | for j in range(len(batch[i])): 31 | token = batch[i][j][0] 32 | gold_label = batch[i][j][-1] 33 | predicted_label = self.id2label[predicted_labels[i][j]] 34 | 35 | self.token_count += 1 36 | if gold_label == predicted_label: 37 | self.correct_sum += 1 38 | if predicted_label == self.main_label: 39 | self.main_predicted_count += 1 40 | if gold_label == self.main_label: 41 | self.main_total_count += 1 42 | if predicted_label == gold_label and gold_label == self.main_label: 43 | self.main_correct_count += 1 44 | 45 | self.conll_format.append(token + "\t" + gold_label + "\t" + predicted_label) 46 | self.conll_format.append("") 47 | 48 | 49 | def get_results(self, name): 50 | p = (float(self.main_correct_count) / float(self.main_predicted_count)) if (self.main_predicted_count > 0) else 0.0 51 | r = (float(self.main_correct_count) / float(self.main_total_count)) if (self.main_total_count > 0) else 0.0 52 | f = (2.0 * p * r / (p + r)) if (p+r > 0.0) else 0.0 53 | f05 = ((1.0 + 0.5*0.5) * p * r / ((0.5*0.5 * p) + r)) if (p+r > 0.0) else 0.0 54 | 55 | results = collections.OrderedDict() 56 | results[name + "_cost_avg"] = self.cost_sum / float(self.token_count) 57 | results[name + "_cost_sum"] = self.cost_sum 58 | results[name + "_main_predicted_count"] = self.main_predicted_count 59 | results[name + "_main_total_count"] = self.main_total_count 60 | results[name + "_main_correct_count"] = self.main_correct_count 61 | results[name + "_p"] = p 62 | results[name + "_r"] = r 63 | results[name + "_f"] = f 64 | results[name + "_f05"] = f05 65 | results[name + "_accuracy"] = self.correct_sum / float(self.token_count) 66 | results[name + "_token_count"] = self.token_count 67 | results[name + "_time"] = float(time.time()) - float(self.start_time) 68 | 69 | if self.label2id is not None and self.conll_eval == True: 70 | conll_counts = conlleval.evaluate(self.conll_format) 71 | conll_metrics_overall, conll_metrics_by_type = conlleval.metrics(conll_counts) 72 | results[name + 
"_conll_accuracy"] = float(conll_counts.correct_tags) / float(conll_counts.token_counter) 73 | results[name + "_conll_p"] = conll_metrics_overall.prec 74 | results[name + "_conll_r"] = conll_metrics_overall.rec 75 | results[name + "_conll_f"] = conll_metrics_overall.fscore 76 | # for i, m in sorted(conll_metrics_by_type.items()): 77 | # results[name + "_conll_p_" + str(i)] = m.prec 78 | # results[name + "_conll_r_" + str(i)] = m.rec 79 | # results[name + "_conll_f_" + str(i)] = m.fscore #str(m.fscore) + " " + str(conll_counts.t_found_guessed[i]) 80 | 81 | return results 82 | 83 | 84 | 85 | -------------------------------------------------------------------------------- /experiment.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import collections 3 | import numpy 4 | import random 5 | import math 6 | import os 7 | import gc 8 | 9 | try: 10 | import ConfigParser as configparser 11 | except: 12 | import configparser 13 | 14 | 15 | from labeler import SequenceLabeler 16 | from evaluator import SequenceLabelingEvaluator 17 | 18 | def read_input_files(file_paths, max_sentence_length=-1): 19 | """ 20 | Reads input files in whitespace-separated format. 21 | Will split file_paths on comma, reading from multiple files. 22 | The format assumes the first column is the word, the last column is the label. 23 | """ 24 | sentences = [] 25 | line_length = None 26 | for file_path in file_paths.strip().split(","): 27 | with open(file_path, "r") as f: 28 | sentence = [] 29 | for line in f: 30 | line = line.strip() 31 | if len(line) > 0: 32 | line_parts = line.split() 33 | assert(len(line_parts) >= 2) 34 | assert(len(line_parts) == line_length or line_length == None) 35 | line_length = len(line_parts) 36 | sentence.append(line_parts) 37 | elif len(line) == 0 and len(sentence) > 0: 38 | if max_sentence_length <= 0 or len(sentence) <= max_sentence_length: 39 | sentences.append(sentence) 40 | sentence = [] 41 | if len(sentence) > 0: 42 | if max_sentence_length <= 0 or len(sentence) <= max_sentence_length: 43 | sentences.append(sentence) 44 | return sentences 45 | 46 | 47 | 48 | def parse_config(config_section, config_path): 49 | """ 50 | Reads configuration from the file and returns a dictionary. 51 | Tries to guess the correct datatype for each of the config values. 52 | """ 53 | config_parser = configparser.SafeConfigParser(allow_no_value=True) 54 | config_parser.read(config_path) 55 | config = collections.OrderedDict() 56 | for key, value in config_parser.items(config_section): 57 | if value is None or len(value.strip()) == 0: 58 | config[key] = None 59 | elif value.lower() in ["true", "false"]: 60 | config[key] = config_parser.getboolean(config_section, key) 61 | elif value.isdigit(): 62 | config[key] = config_parser.getint(config_section, key) 63 | elif is_float(value): 64 | config[key] = config_parser.getfloat(config_section, key) 65 | else: 66 | config[key] = config_parser.get(config_section, key) 67 | return config 68 | 69 | 70 | def is_float(value): 71 | """ 72 | Check in value is of type float() 73 | """ 74 | try: 75 | float(value) 76 | return True 77 | except ValueError: 78 | return False 79 | 80 | 81 | def create_batches_of_sentence_ids(sentences, batch_equal_size, max_batch_size): 82 | """ 83 | Groups together sentences into batches 84 | If batch_equal_size is True, make all sentences in a batch be equal length. 85 | If max_batch_size is positive, this value determines the maximum number of sentences in each batch. 
86 | If max_batch_size has a negative value, the function dynamically creates the batches such that each batch contains abs(max_batch_size) words. 87 | Returns a list of lists with sentences ids. 88 | """ 89 | batches_of_sentence_ids = [] 90 | if batch_equal_size == True: 91 | sentence_ids_by_length = collections.OrderedDict() 92 | sentence_length_sum = 0.0 93 | for i in range(len(sentences)): 94 | length = len(sentences[i]) 95 | if length not in sentence_ids_by_length: 96 | sentence_ids_by_length[length] = [] 97 | sentence_ids_by_length[length].append(i) 98 | 99 | for sentence_length in sentence_ids_by_length: 100 | if max_batch_size > 0: 101 | batch_size = max_batch_size 102 | else: 103 | batch_size = int((-1.0 * max_batch_size) / sentence_length) 104 | 105 | for i in range(0, len(sentence_ids_by_length[sentence_length]), batch_size): 106 | batches_of_sentence_ids.append(sentence_ids_by_length[sentence_length][i:i + batch_size]) 107 | else: 108 | current_batch = [] 109 | max_sentence_length = 0 110 | for i in range(len(sentences)): 111 | current_batch.append(i) 112 | if len(sentences[i]) > max_sentence_length: 113 | max_sentence_length = len(sentences[i]) 114 | if (max_batch_size > 0 and len(current_batch) >= max_batch_size) \ 115 | or (max_batch_size <= 0 and len(current_batch)*max_sentence_length >= (-1 * max_batch_size)): 116 | batches_of_sentence_ids.append(current_batch) 117 | current_batch = [] 118 | max_sentence_length = 0 119 | if len(current_batch) > 0: 120 | batches_of_sentence_ids.append(current_batch) 121 | return batches_of_sentence_ids 122 | 123 | 124 | 125 | def process_sentences(data, labeler, is_training, learningrate, config, name): 126 | """ 127 | Process all the sentences with the labeler, return evaluation metrics. 128 | """ 129 | evaluator = SequenceLabelingEvaluator(config["main_label"], labeler.label2id, config["conll_eval"]) 130 | batches_of_sentence_ids = create_batches_of_sentence_ids(data, config["batch_equal_size"], config["max_batch_size"]) 131 | if is_training == True: 132 | random.shuffle(batches_of_sentence_ids) 133 | 134 | for sentence_ids_in_batch in batches_of_sentence_ids: 135 | batch = [data[i] for i in sentence_ids_in_batch] 136 | cost, predicted_labels, predicted_probs = labeler.process_batch(batch, is_training, learningrate) 137 | 138 | evaluator.append_data(cost, batch, predicted_labels) 139 | 140 | word_ids, char_ids, char_mask, label_ids = None, None, None, None 141 | while config["garbage_collection"] == True and gc.collect() > 0: 142 | pass 143 | 144 | results = evaluator.get_results(name) 145 | for key in results: 146 | print(key + ": " + str(results[key])) 147 | 148 | return results 149 | 150 | 151 | 152 | def run_experiment(config_path): 153 | config = parse_config("config", config_path) 154 | temp_model_path = config_path + ".model" 155 | if "random_seed" in config: 156 | random.seed(config["random_seed"]) 157 | numpy.random.seed(config["random_seed"]) 158 | 159 | for key, val in config.items(): 160 | print(str(key) + ": " + str(val)) 161 | 162 | data_train, data_dev, data_test = None, None, None 163 | if config["path_train"] != None and len(config["path_train"]) > 0: 164 | data_train = read_input_files(config["path_train"], config["max_train_sent_length"]) 165 | if config["path_dev"] != None and len(config["path_dev"]) > 0: 166 | data_dev = read_input_files(config["path_dev"]) 167 | if config["path_test"] != None and len(config["path_test"]) > 0: 168 | data_test = [] 169 | for path_test in config["path_test"].strip().split(":"): 170 | 
data_test += read_input_files(path_test) 171 | 172 | if config["load"] != None and len(config["load"]) > 0: 173 | labeler = SequenceLabeler.load(config["load"]) 174 | else: 175 | labeler = SequenceLabeler(config) 176 | labeler.build_vocabs(data_train, data_dev, data_test, config["preload_vectors"]) 177 | labeler.construct_network() 178 | labeler.initialize_session() 179 | if config["preload_vectors"] != None: 180 | labeler.preload_word_embeddings(config["preload_vectors"]) 181 | 182 | print("parameter_count: " + str(labeler.get_parameter_count())) 183 | print("parameter_count_without_word_embeddings: " + str(labeler.get_parameter_count_without_word_embeddings())) 184 | 185 | if data_train != None: 186 | model_selector = config["model_selector"].split(":")[0] 187 | model_selector_type = config["model_selector"].split(":")[1] 188 | best_selector_value = 0.0 189 | best_epoch = -1 190 | learningrate = config["learningrate"] 191 | for epoch in range(config["epochs"]): 192 | print("EPOCH: " + str(epoch)) 193 | print("current_learningrate: " + str(learningrate)) 194 | random.shuffle(data_train) 195 | 196 | results_train = process_sentences(data_train, labeler, is_training=True, learningrate=learningrate, config=config, name="train") 197 | 198 | if data_dev != None: 199 | results_dev = process_sentences(data_dev, labeler, is_training=False, learningrate=0.0, config=config, name="dev") 200 | 201 | if math.isnan(results_dev["dev_cost_sum"]) or math.isinf(results_dev["dev_cost_sum"]): 202 | sys.stderr.write("ERROR: Cost is NaN or Inf. Exiting.\n") 203 | break 204 | 205 | if (epoch == 0 or (model_selector_type == "high" and results_dev[model_selector] > best_selector_value) 206 | or (model_selector_type == "low" and results_dev[model_selector] < best_selector_value)): 207 | best_epoch = epoch 208 | best_selector_value = results_dev[model_selector] 209 | labeler.saver.save(labeler.session, temp_model_path, latest_filename=os.path.basename(temp_model_path)+".checkpoint") 210 | print("best_epoch: " + str(best_epoch)) 211 | 212 | if config["stop_if_no_improvement_for_epochs"] > 0 and (epoch - best_epoch) >= config["stop_if_no_improvement_for_epochs"]: 213 | break 214 | 215 | if (epoch - best_epoch) > 3: 216 | learningrate *= config["learningrate_decay"] 217 | 218 | while config["garbage_collection"] == True and gc.collect() > 0: 219 | pass 220 | 221 | if data_dev != None and best_epoch >= 0: 222 | # loading the best model so far 223 | labeler.saver.restore(labeler.session, temp_model_path) 224 | 225 | os.remove(temp_model_path+".checkpoint") 226 | os.remove(temp_model_path+".data-00000-of-00001") 227 | os.remove(temp_model_path+".index") 228 | os.remove(temp_model_path+".meta") 229 | 230 | if config["save"] is not None and len(config["save"]) > 0: 231 | labeler.save(config["save"]) 232 | 233 | if config["path_test"] is not None: 234 | i = 0 235 | for path_test in config["path_test"].strip().split(":"): 236 | data_test = read_input_files(path_test) 237 | results_test = process_sentences(data_test, labeler, is_training=False, learningrate=0.0, config=config, name="test"+str(i)) 238 | i += 1 239 | 240 | 241 | if __name__ == "__main__": 242 | run_experiment(sys.argv[1]) 243 | 244 | -------------------------------------------------------------------------------- /labeler.py: -------------------------------------------------------------------------------- 1 | import collections 2 | import tensorflow as tf 3 | import re 4 | import numpy 5 | from tensorflow.python.framework import ops 6 | from 
tensorflow.python.ops import math_ops 7 | 8 | try: 9 | import cPickle as pickle 10 | except: 11 | import pickle 12 | 13 | class SequenceLabeler(object): 14 | def __init__(self, config): 15 | self.config = config 16 | 17 | self.UNK = "" 18 | self.CUNK = "" 19 | 20 | self.word2id = None 21 | self.char2id = None 22 | self.label2id = None 23 | self.singletons = None 24 | 25 | 26 | def build_vocabs(self, data_train, data_dev, data_test, embedding_path=None): 27 | data_source = list(data_train) 28 | if self.config["vocab_include_devtest"]: 29 | if data_dev != None: 30 | data_source += data_dev 31 | if data_test != None: 32 | data_source += data_test 33 | 34 | char_counter = collections.Counter() 35 | for sentence in data_source: 36 | for word in sentence: 37 | char_counter.update(word[0]) 38 | self.char2id = collections.OrderedDict([(self.CUNK, 0)]) 39 | for char, count in char_counter.most_common(): 40 | if char not in self.char2id: 41 | self.char2id[char] = len(self.char2id) 42 | 43 | word_counter = collections.Counter() 44 | for sentence in data_source: 45 | for word in sentence: 46 | w = word[0] 47 | if self.config["lowercase"] == True: 48 | w = w.lower() 49 | if self.config["replace_digits"] == True: 50 | w = re.sub(r'\d', '0', w) 51 | word_counter[w] += 1 52 | self.word2id = collections.OrderedDict([(self.UNK, 0)]) 53 | for word, count in word_counter.most_common(): 54 | if self.config["min_word_freq"] <= 0 or count >= self.config["min_word_freq"]: 55 | if word not in self.word2id: 56 | self.word2id[word] = len(self.word2id) 57 | 58 | self.singletons = set([word for word in word_counter if word_counter[word] == 1]) 59 | 60 | label_counter = collections.Counter() 61 | for sentence in data_train: #this one only based on training data 62 | for word in sentence: 63 | label_counter[word[-1]] += 1 64 | self.label2id = collections.OrderedDict() 65 | for label, count in label_counter.most_common(): 66 | if label not in self.label2id: 67 | self.label2id[label] = len(self.label2id) 68 | 69 | if embedding_path != None and self.config["vocab_only_embedded"] == True: 70 | self.embedding_vocab = set([self.UNK]) 71 | with open(embedding_path, 'r') as f: 72 | for line in f: 73 | line_parts = line.strip().split() 74 | if len(line_parts) <= 2: 75 | continue 76 | w = line_parts[0] 77 | if self.config["lowercase"] == True: 78 | w = w.lower() 79 | if self.config["replace_digits"] == True: 80 | w = re.sub(r'\d', '0', w) 81 | self.embedding_vocab.add(w) 82 | word2id_revised = collections.OrderedDict() 83 | for word in self.word2id: 84 | if word in embedding_vocab and word not in word2id_revised: 85 | word2id_revised[word] = len(word2id_revised) 86 | self.word2id = word2id_revised 87 | 88 | print("n_words: " + str(len(self.word2id))) 89 | print("n_chars: " + str(len(self.char2id))) 90 | print("n_labels: " + str(len(self.label2id))) 91 | print("n_singletons: " + str(len(self.singletons))) 92 | 93 | 94 | def construct_network(self): 95 | self.word_ids = tf.placeholder(tf.int32, [None, None], name="word_ids") 96 | self.char_ids = tf.placeholder(tf.int32, [None, None, None], name="char_ids") 97 | self.sentence_lengths = tf.placeholder(tf.int32, [None], name="sentence_lengths") 98 | self.word_lengths = tf.placeholder(tf.int32, [None, None], name="word_lengths") 99 | self.label_ids = tf.placeholder(tf.int32, [None, None], name="label_ids") 100 | self.learningrate = tf.placeholder(tf.float32, name="learningrate") 101 | self.is_training = tf.placeholder(tf.int32, name="is_training") 102 | 103 | self.loss = 0.0 104 | 
input_tensor = None 105 | input_vector_size = 0 106 | 107 | self.initializer = None 108 | if self.config["initializer"] == "normal": 109 | self.initializer = tf.random_normal_initializer(mean=0.0, stddev=0.1) 110 | elif self.config["initializer"] == "glorot": 111 | self.initializer = tf.glorot_uniform_initializer() 112 | elif self.config["initializer"] == "xavier": 113 | self.initializer = tf.glorot_normal_initializer() 114 | else: 115 | raise ValueError("Unknown initializer") 116 | 117 | self.word_embeddings = tf.get_variable("word_embeddings", 118 | shape=[len(self.word2id), self.config["word_embedding_size"]], 119 | initializer=(tf.zeros_initializer() if self.config["emb_initial_zero"] == True else self.initializer), 120 | trainable=(True if self.config["train_embeddings"] == True else False)) 121 | input_tensor = tf.nn.embedding_lookup(self.word_embeddings, self.word_ids) 122 | input_vector_size = self.config["word_embedding_size"] 123 | 124 | if self.config["char_embedding_size"] > 0 and self.config["char_recurrent_size"] > 0: 125 | with tf.variable_scope("chars"), tf.control_dependencies([tf.assert_equal(tf.shape(self.char_ids)[2], tf.reduce_max(self.word_lengths), message="Char dimensions don't match")]): 126 | self.char_embeddings = tf.get_variable("char_embeddings", 127 | shape=[len(self.char2id), self.config["char_embedding_size"]], 128 | initializer=self.initializer, 129 | trainable=True) 130 | char_input_tensor = tf.nn.embedding_lookup(self.char_embeddings, self.char_ids) 131 | 132 | s = tf.shape(char_input_tensor) 133 | char_input_tensor = tf.reshape(char_input_tensor, shape=[s[0]*s[1], s[2], self.config["char_embedding_size"]]) 134 | _word_lengths = tf.reshape(self.word_lengths, shape=[s[0]*s[1]]) 135 | 136 | char_lstm_cell_fw = tf.nn.rnn_cell.LSTMCell(self.config["char_recurrent_size"], 137 | use_peepholes=self.config["lstm_use_peepholes"], 138 | state_is_tuple=True, 139 | initializer=self.initializer, 140 | reuse=False) 141 | char_lstm_cell_bw = tf.nn.rnn_cell.LSTMCell(self.config["char_recurrent_size"], 142 | use_peepholes=self.config["lstm_use_peepholes"], 143 | state_is_tuple=True, 144 | initializer=self.initializer, 145 | reuse=False) 146 | 147 | char_lstm_outputs = tf.nn.bidirectional_dynamic_rnn(char_lstm_cell_fw, char_lstm_cell_bw, char_input_tensor, sequence_length=_word_lengths, dtype=tf.float32, time_major=False) 148 | _, ((_, char_output_fw), (_, char_output_bw)) = char_lstm_outputs 149 | char_output_tensor = tf.concat([char_output_fw, char_output_bw], axis=-1) 150 | char_output_tensor = tf.reshape(char_output_tensor, shape=[s[0], s[1], 2 * self.config["char_recurrent_size"]]) 151 | char_output_vector_size = 2 * self.config["char_recurrent_size"] 152 | 153 | if self.config["lmcost_char_gamma"] > 0.0: 154 | self.loss += self.config["lmcost_char_gamma"] * self.construct_lmcost(char_output_tensor, char_output_tensor, self.sentence_lengths, self.word_ids, "separate", "lmcost_char_separate") 155 | if self.config["lmcost_joint_char_gamma"] > 0.0: 156 | self.loss += self.config["lmcost_joint_char_gamma"] * self.construct_lmcost(char_output_tensor, char_output_tensor, self.sentence_lengths, self.word_ids, "joint", "lmcost_char_joint") 157 | 158 | if self.config["char_hidden_layer_size"] > 0: 159 | char_hidden_layer_size = self.config["word_embedding_size"] if self.config["char_integration_method"] == "attention" else self.config["char_hidden_layer_size"] 160 | char_output_tensor = tf.layers.dense(char_output_tensor, char_hidden_layer_size, activation=tf.tanh, 
kernel_initializer=self.initializer) 161 | char_output_vector_size = char_hidden_layer_size 162 | 163 | if self.config["char_integration_method"] == "concat": 164 | input_tensor = tf.concat([input_tensor, char_output_tensor], axis=-1) 165 | input_vector_size += char_output_vector_size 166 | elif self.config["char_integration_method"] == "attention": 167 | assert(char_output_vector_size == self.config["word_embedding_size"]), "This method requires the char representation to have the same size as word embeddings" 168 | static_input_tensor = tf.stop_gradient(input_tensor) 169 | is_unk = tf.equal(self.word_ids, self.word2id[self.UNK]) 170 | char_output_tensor_normalised = tf.nn.l2_normalize(char_output_tensor, 2) 171 | static_input_tensor_normalised = tf.nn.l2_normalize(static_input_tensor, 2) 172 | cosine_cost = 1.0 - tf.reduce_sum(tf.multiply(char_output_tensor_normalised, static_input_tensor_normalised), axis=2) 173 | is_padding = tf.logical_not(tf.sequence_mask(self.sentence_lengths, maxlen=tf.shape(self.word_ids)[1])) 174 | cosine_cost_unk = tf.where(tf.logical_or(is_unk, is_padding), x=tf.zeros_like(cosine_cost), y=cosine_cost) 175 | self.loss += self.config["char_attention_cosine_cost"] * tf.reduce_sum(cosine_cost_unk) 176 | attention_evidence_tensor = tf.concat([input_tensor, char_output_tensor], axis=2) 177 | attention_output = tf.layers.dense(attention_evidence_tensor, self.config["word_embedding_size"], activation=tf.tanh, kernel_initializer=self.initializer) 178 | attention_output = tf.layers.dense(attention_output, self.config["word_embedding_size"], activation=tf.sigmoid, kernel_initializer=self.initializer) 179 | input_tensor = tf.multiply(input_tensor, attention_output) + tf.multiply(char_output_tensor, (1.0 - attention_output)) 180 | elif self.config["char_integration_method"] == "none": 181 | input_tensor = input_tensor 182 | else: 183 | raise ValueError("Unknown char integration method") 184 | 185 | dropout_input = self.config["dropout_input"] * tf.cast(self.is_training, tf.float32) + (1.0 - tf.cast(self.is_training, tf.float32)) 186 | input_tensor = tf.nn.dropout(input_tensor, dropout_input, name="dropout_word") 187 | 188 | word_lstm_cell_fw = tf.nn.rnn_cell.LSTMCell(self.config["word_recurrent_size"], 189 | use_peepholes=self.config["lstm_use_peepholes"], 190 | state_is_tuple=True, 191 | initializer=self.initializer, 192 | reuse=False) 193 | word_lstm_cell_bw = tf.nn.rnn_cell.LSTMCell(self.config["word_recurrent_size"], 194 | use_peepholes=self.config["lstm_use_peepholes"], 195 | state_is_tuple=True, 196 | initializer=self.initializer, 197 | reuse=False) 198 | 199 | with tf.control_dependencies([tf.assert_equal(tf.shape(self.word_ids)[1], tf.reduce_max(self.sentence_lengths), message="Sentence dimensions don't match")]): 200 | (lstm_outputs_fw, lstm_outputs_bw), _ = tf.nn.bidirectional_dynamic_rnn(word_lstm_cell_fw, word_lstm_cell_bw, input_tensor, sequence_length=self.sentence_lengths, dtype=tf.float32, time_major=False) 201 | 202 | dropout_word_lstm = self.config["dropout_word_lstm"] * tf.cast(self.is_training, tf.float32) + (1.0 - tf.cast(self.is_training, tf.float32)) 203 | lstm_outputs_fw = tf.nn.dropout(lstm_outputs_fw, dropout_word_lstm) 204 | lstm_outputs_bw = tf.nn.dropout(lstm_outputs_bw, dropout_word_lstm) 205 | 206 | if self.config["lmcost_lstm_gamma"] > 0.0: 207 | self.loss += self.config["lmcost_lstm_gamma"] * self.construct_lmcost(lstm_outputs_fw, lstm_outputs_bw, self.sentence_lengths, self.word_ids, "separate", "lmcost_lstm_separate") 208 | if 
self.config["lmcost_joint_lstm_gamma"] > 0.0: 209 | self.loss += self.config["lmcost_joint_lstm_gamma"] * self.construct_lmcost(lstm_outputs_fw, lstm_outputs_bw, self.sentence_lengths, self.word_ids, "joint", "lmcost_lstm_joint") 210 | 211 | processed_tensor = tf.concat([lstm_outputs_fw, lstm_outputs_bw], 2) 212 | processed_tensor_size = self.config["word_recurrent_size"] * 2 213 | 214 | if self.config["hidden_layer_size"] > 0: 215 | processed_tensor = tf.layers.dense(processed_tensor, self.config["hidden_layer_size"], activation=tf.tanh, kernel_initializer=self.initializer) 216 | processed_tensor_size = self.config["hidden_layer_size"] 217 | 218 | self.scores = tf.layers.dense(processed_tensor, len(self.label2id), activation=None, kernel_initializer=self.initializer, name="output_ff") 219 | 220 | if self.config["crf_on_top"] == True: 221 | crf_num_tags = self.scores.get_shape()[2].value 222 | self.crf_transition_params = tf.get_variable("output_crf_transitions", [crf_num_tags, crf_num_tags], initializer=self.initializer) 223 | log_likelihood, self.crf_transition_params = tf.contrib.crf.crf_log_likelihood(self.scores, self.label_ids, self.sentence_lengths, transition_params=self.crf_transition_params) 224 | self.loss += self.config["main_cost"] * tf.reduce_sum(-log_likelihood) 225 | else: 226 | self.probabilities = tf.nn.softmax(self.scores) 227 | self.predictions = tf.argmax(self.probabilities, 2) 228 | loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.scores, labels=self.label_ids) 229 | mask = tf.sequence_mask(self.sentence_lengths, maxlen=tf.shape(self.word_ids)[1]) 230 | loss_ = tf.boolean_mask(loss_, mask) 231 | self.loss += self.config["main_cost"] * tf.reduce_sum(loss_) 232 | 233 | self.train_op = self.construct_optimizer(self.config["opt_strategy"], self.loss, self.learningrate, self.config["clip"]) 234 | 235 | 236 | def construct_lmcost(self, input_tensor_fw, input_tensor_bw, sentence_lengths, target_ids, lmcost_type, name): 237 | with tf.variable_scope(name): 238 | lmcost_max_vocab_size = min(len(self.word2id), self.config["lmcost_max_vocab_size"]) 239 | target_ids = tf.where(tf.greater_equal(target_ids, lmcost_max_vocab_size-1), x=(lmcost_max_vocab_size-1)+tf.zeros_like(target_ids), y=target_ids) 240 | cost = 0.0 241 | if lmcost_type == "separate": 242 | lmcost_fw_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,1:] 243 | lmcost_bw_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,:-1] 244 | lmcost_fw = self._construct_lmcost(input_tensor_fw[:,:-1,:], lmcost_max_vocab_size, lmcost_fw_mask, target_ids[:,1:], name=name+"_fw") 245 | lmcost_bw = self._construct_lmcost(input_tensor_bw[:,1:,:], lmcost_max_vocab_size, lmcost_bw_mask, target_ids[:,:-1], name=name+"_bw") 246 | cost += lmcost_fw + lmcost_bw 247 | elif lmcost_type == "joint": 248 | joint_input_tensor = tf.concat([input_tensor_fw[:,:-2,:], input_tensor_bw[:,2:,:]], axis=-1) 249 | lmcost_mask = tf.sequence_mask(sentence_lengths, maxlen=tf.shape(target_ids)[1])[:,1:-1] 250 | cost += self._construct_lmcost(joint_input_tensor, lmcost_max_vocab_size, lmcost_mask, target_ids[:,1:-1], name=name+"_joint") 251 | else: 252 | raise ValueError("Unknown lmcost_type: " + str(lmcost_type)) 253 | return cost 254 | 255 | 256 | def _construct_lmcost(self, input_tensor, lmcost_max_vocab_size, lmcost_mask, target_ids, name): 257 | with tf.variable_scope(name): 258 | lmcost_hidden_layer = tf.layers.dense(input_tensor, self.config["lmcost_hidden_layer_size"], 
activation=tf.tanh, kernel_initializer=self.initializer) 259 | lmcost_output = tf.layers.dense(lmcost_hidden_layer, lmcost_max_vocab_size, activation=None, kernel_initializer=self.initializer) 260 | lmcost_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=lmcost_output, labels=target_ids) 261 | lmcost_loss = tf.where(lmcost_mask, lmcost_loss, tf.zeros_like(lmcost_loss)) 262 | return tf.reduce_sum(lmcost_loss) 263 | 264 | 265 | def construct_optimizer(self, opt_strategy, loss, learningrate, clip): 266 | optimizer = None 267 | if opt_strategy == "adadelta": 268 | optimizer = tf.train.AdadeltaOptimizer(learning_rate=learningrate) 269 | elif opt_strategy == "adam": 270 | optimizer = tf.train.AdamOptimizer(learning_rate=learningrate) 271 | elif opt_strategy == "sgd": 272 | optimizer = tf.train.GradientDescentOptimizer(learning_rate=learningrate) 273 | else: 274 | raise ValueError("Unknown optimisation strategy: " + str(opt_strategy)) 275 | 276 | if clip > 0.0: 277 | grads, vs = zip(*optimizer.compute_gradients(loss)) 278 | grads, gnorm = tf.clip_by_global_norm(grads, clip) 279 | train_op = optimizer.apply_gradients(zip(grads, vs)) 280 | else: 281 | train_op = optimizer.minimize(loss) 282 | return train_op 283 | 284 | 285 | def preload_word_embeddings(self, embedding_path): 286 | loaded_embeddings = set() 287 | embedding_matrix = self.session.run(self.word_embeddings) 288 | with open(embedding_path, 'r') as f: 289 | for line in f: 290 | line_parts = line.strip().split() 291 | if len(line_parts) <= 2: 292 | continue 293 | w = line_parts[0] 294 | if self.config["lowercase"] == True: 295 | w = w.lower() 296 | if self.config["replace_digits"] == True: 297 | w = re.sub(r'\d', '0', w) 298 | if w in self.word2id and w not in loaded_embeddings: 299 | word_id = self.word2id[w] 300 | embedding = numpy.array(line_parts[1:]) 301 | embedding_matrix[word_id] = embedding 302 | loaded_embeddings.add(w) 303 | self.session.run(self.word_embeddings.assign(embedding_matrix)) 304 | print("n_preloaded_embeddings: " + str(len(loaded_embeddings))) 305 | 306 | 307 | def translate2id(self, token, token2id, unk_token, lowercase=False, replace_digits=False, singletons=None, singletons_prob=0.0): 308 | if lowercase == True: 309 | token = token.lower() 310 | if replace_digits == True: 311 | token = re.sub(r'\d', '0', token) 312 | 313 | token_id = None 314 | if singletons != None and token in singletons and token in token2id and unk_token != None and numpy.random.uniform() < singletons_prob: 315 | token_id = token2id[unk_token] 316 | elif token in token2id: 317 | token_id = token2id[token] 318 | elif unk_token != None: 319 | token_id = token2id[unk_token] 320 | else: 321 | raise ValueError("Unable to handle value, no UNK token: " + str(token)) 322 | return token_id 323 | 324 | 325 | def create_input_dictionary_for_batch(self, batch, is_training, learningrate): 326 | sentence_lengths = numpy.array([len(sentence) for sentence in batch]) 327 | max_sentence_length = sentence_lengths.max() 328 | max_word_length = numpy.array([numpy.array([len(word[0]) for word in sentence]).max() for sentence in batch]).max() 329 | if self.config["allowed_word_length"] > 0 and self.config["allowed_word_length"] < max_word_length: 330 | max_word_length = min(max_word_length, self.config["allowed_word_length"]) 331 | 332 | word_ids = numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 333 | char_ids = numpy.zeros((len(batch), max_sentence_length, max_word_length), dtype=numpy.int32) 334 | word_lengths = 
numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 335 | label_ids = numpy.zeros((len(batch), max_sentence_length), dtype=numpy.int32) 336 | 337 | singletons = self.singletons if is_training == True else None 338 | singletons_prob = self.config["singletons_prob"] if is_training == True else 0.0 339 | for i in range(len(batch)): 340 | for j in range(len(batch[i])): 341 | word_ids[i][j] = self.translate2id(batch[i][j][0], self.word2id, self.UNK, lowercase=self.config["lowercase"], replace_digits=self.config["replace_digits"], singletons=singletons, singletons_prob=singletons_prob) 342 | label_ids[i][j] = self.translate2id(batch[i][j][-1], self.label2id, None) 343 | word_lengths[i][j] = min(len(batch[i][j][0]), max_word_length) 344 | for k in range(min(len(batch[i][j][0]), max_word_length)): 345 | char_ids[i][j][k] = self.translate2id(batch[i][j][0][k], self.char2id, self.CUNK) 346 | 347 | input_dictionary = {self.word_ids: word_ids, self.char_ids: char_ids, self.sentence_lengths: sentence_lengths, self.word_lengths: word_lengths, self.label_ids: label_ids, self.learningrate: learningrate, self.is_training: is_training} 348 | return input_dictionary 349 | 350 | 351 | def viterbi_decode(self, score, transition_params): 352 | trellis = numpy.zeros_like(score) 353 | backpointers = numpy.zeros_like(score, dtype=numpy.int32) 354 | trellis[0] = score[0] 355 | 356 | for t in range(1, score.shape[0]): 357 | v = numpy.expand_dims(trellis[t - 1], 1) + transition_params 358 | trellis[t] = score[t] + numpy.max(v, 0) 359 | backpointers[t] = numpy.argmax(v, 0) 360 | 361 | viterbi = [numpy.argmax(trellis[-1])] 362 | for bp in reversed(backpointers[1:]): 363 | viterbi.append(bp[viterbi[-1]]) 364 | viterbi.reverse() 365 | 366 | viterbi_score = numpy.max(trellis[-1]) 367 | return viterbi, viterbi_score, trellis 368 | 369 | 370 | def process_batch(self, batch, is_training, learningrate): 371 | feed_dict = self.create_input_dictionary_for_batch(batch, is_training, learningrate) 372 | 373 | if self.config["crf_on_top"] == True: 374 | cost, scores = self.session.run([self.loss, self.scores] + ([self.train_op] if is_training == True else []), feed_dict=feed_dict)[:2] 375 | predicted_labels = [] 376 | predicted_probs = [] 377 | for i in range(len(batch)): 378 | sentence_length = len(batch[i]) 379 | viterbi_seq, viterbi_score, viterbi_trellis = self.viterbi_decode(scores[i], self.session.run(self.crf_transition_params)) 380 | predicted_labels.append(viterbi_seq[:sentence_length]) 381 | predicted_probs.append(viterbi_trellis[:sentence_length]) 382 | else: 383 | cost, predicted_labels_, predicted_probs_ = self.session.run([self.loss, self.predictions, self.probabilities] + ([self.train_op] if is_training == True else []), feed_dict=feed_dict)[:3] 384 | predicted_labels = [] 385 | predicted_probs = [] 386 | for i in range(len(batch)): 387 | sentence_length = len(batch[i]) 388 | predicted_labels.append(predicted_labels_[i][:sentence_length]) 389 | predicted_probs.append(predicted_probs_[i][:sentence_length]) 390 | 391 | return cost, predicted_labels, predicted_probs 392 | 393 | 394 | def initialize_session(self): 395 | tf.set_random_seed(self.config["random_seed"]) 396 | session_config = tf.ConfigProto() 397 | session_config.gpu_options.allow_growth = self.config["tf_allow_growth"] 398 | session_config.gpu_options.per_process_gpu_memory_fraction = self.config["tf_per_process_gpu_memory_fraction"] 399 | self.session = tf.Session(config=session_config) 400 | 
self.session.run(tf.global_variables_initializer()) 401 | self.saver = tf.train.Saver(max_to_keep=1) 402 | 403 | 404 | def get_parameter_count(self): 405 | total_parameters = 0 406 | for variable in tf.trainable_variables(): 407 | shape = variable.get_shape() 408 | variable_parameters = 1 409 | for dim in shape: 410 | variable_parameters *= dim.value 411 | total_parameters += variable_parameters 412 | return total_parameters 413 | 414 | 415 | def get_parameter_count_without_word_embeddings(self): 416 | shape = self.word_embeddings.get_shape() 417 | variable_parameters = 1 418 | for dim in shape: 419 | variable_parameters *= dim.value 420 | return self.get_parameter_count() - variable_parameters 421 | 422 | 423 | def save(self, filename): 424 | dump = {} 425 | dump["config"] = self.config 426 | dump["UNK"] = self.UNK 427 | dump["CUNK"] = self.CUNK 428 | dump["word2id"] = self.word2id 429 | dump["char2id"] = self.char2id 430 | dump["label2id"] = self.label2id 431 | dump["singletons"] = self.singletons 432 | 433 | dump["params"] = {} 434 | for variable in tf.global_variables(): 435 | assert(variable.name not in dump["params"]), "Error: variable with this name already exists" + str(variable.name) 436 | dump["params"][variable.name] = self.session.run(variable) 437 | with open(filename, 'wb') as f: 438 | pickle.dump(dump, f, protocol=pickle.HIGHEST_PROTOCOL) 439 | 440 | 441 | @staticmethod 442 | def load(filename): 443 | with open(filename, 'rb') as f: 444 | dump = pickle.load(f) 445 | 446 | # for safety, so we don't overwrite old models 447 | dump["config"]["save"] = None 448 | 449 | labeler = SequenceLabeler(dump["config"]) 450 | labeler.UNK = dump["UNK"] 451 | labeler.CUNK = dump["CUNK"] 452 | labeler.word2id = dump["word2id"] 453 | labeler.char2id = dump["char2id"] 454 | labeler.label2id = dump["label2id"] 455 | labeler.singletons = dump["singletons"] 456 | 457 | labeler.construct_network() 458 | labeler.initialize_session() 459 | labeler.load_params(filename) 460 | 461 | return labeler 462 | 463 | 464 | def load_params(self, filename): 465 | with open(filename, 'rb') as f: 466 | dump = pickle.load(f) 467 | 468 | for variable in tf.global_variables(): 469 | assert(variable.name in dump["params"]), "Variable not in dump: " + str(variable.name) 470 | assert(variable.shape == dump["params"][variable.name].shape), "Variable shape not as expected: " + str(variable.name) + " " + str(variable.shape) + " " + str(dump["params"][variable.name].shape) 471 | value = numpy.asarray(dump["params"][variable.name]) 472 | self.session.run(variable.assign(value)) 473 | 474 | -------------------------------------------------------------------------------- /print_output.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import labeler 3 | import experiment 4 | import numpy 5 | import collections 6 | import time 7 | 8 | def print_predictions(print_probs, model_path, input_file): 9 | time_loading = time.time() 10 | model = labeler.SequenceLabeler.load(model_path) 11 | 12 | time_noloading = time.time() 13 | config = model.config 14 | predictions_cache = {} 15 | 16 | id2label = collections.OrderedDict() 17 | for label in model.label2id: 18 | id2label[model.label2id[label]] = label 19 | 20 | sentences_test = experiment.read_input_files(input_file) 21 | batches_of_sentence_ids = experiment.create_batches_of_sentence_ids(sentences_test, config["batch_equal_size"], config['max_batch_size']) 22 | 23 | for sentence_ids_in_batch in batches_of_sentence_ids: 24 | batch = 
[sentences_test[i] for i in sentence_ids_in_batch] 25 | cost, predicted_labels, predicted_probs = model.process_batch(batch, is_training=False, learningrate=0.0) 26 | 27 | assert(len(sentence_ids_in_batch) == len(predicted_labels)) 28 | 29 | for i in range(len(sentence_ids_in_batch)): 30 | key = str(sentence_ids_in_batch[i]) 31 | predictions = [] 32 | if print_probs == False: 33 | for j in range(len(predicted_labels[i])): 34 | predictions.append(id2label[predicted_labels[i][j]]) 35 | elif print_probs == True: 36 | for j in range(len(predicted_probs[i])): 37 | p_ = "" 38 | for k in range(len(predicted_probs[i][j])): 39 | p_ += str(id2label[k]) + ":" + str(predicted_probs[i][j][k]) + "\t" 40 | predictions.append(p_.strip()) 41 | predictions_cache[key] = predictions 42 | 43 | sentence_id = 0 44 | word_id = 0 45 | with open(input_file, "r") as f: 46 | for line in f: 47 | if len(line.strip()) == 0: 48 | print("") 49 | if word_id == 0: 50 | continue 51 | assert(len(predictions_cache[str(sentence_id)]) == word_id), str(len(predictions_cache[str(sentence_id)])) + " " + str(word_id) 52 | sentence_id += 1 53 | word_id = 0 54 | continue 55 | assert(str(sentence_id) in predictions_cache) 56 | assert(len(predictions_cache[str(sentence_id)]) > word_id) 57 | print(line.strip() + "\t" + predictions_cache[str(sentence_id)][word_id].strip()) 58 | word_id += 1 59 | 60 | sys.stderr.write("Processed: " + input_file + "\n") 61 | sys.stderr.write("Elapsed time with loading: " + str(time.time() - time_loading) + "\n") 62 | sys.stderr.write("Elapsed time without loading: " + str(time.time() - time_noloading) + "\n") 63 | 64 | 65 | 66 | 67 | if __name__ == "__main__": 68 | if sys.argv[1] == "labels": 69 | print_probs = False 70 | elif sys.argv[1] == "probs": 71 | print_probs = True 72 | else: 73 | raise ValueError("Unknown value") 74 | 75 | print_predictions(print_probs, sys.argv[2], sys.argv[3]) 76 | 77 | 78 | --------------------------------------------------------------------------------
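A minimal sketch of using a saved model programmatically, mirroring what print_output.py above does. This snippet is not part of the repository, and "model.pkl" and "input.tsv" are placeholder names; it only calls functions defined in labeler.py and experiment.py.

    # Load a previously saved model and print one token/label pair per line.
    import labeler
    import experiment

    model = labeler.SequenceLabeler.load("model.pkl")
    id2label = {label_id: label for label, label_id in model.label2id.items()}

    # Input must be in the CoNLL-style format described in the README,
    # including a (possibly dummy) label column.
    sentences = experiment.read_input_files("input.tsv")
    batches = experiment.create_batches_of_sentence_ids(
        sentences, model.config["batch_equal_size"], model.config["max_batch_size"])

    for sentence_ids in batches:
        batch = [sentences[i] for i in sentence_ids]
        _, predicted_labels, _ = model.process_batch(batch, is_training=False, learningrate=0.0)
        for sentence, labels in zip(batch, predicted_labels):
            for token, label_id in zip(sentence, labels):
                print(token[0] + "\t" + id2label[label_id])
            print("")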