├── .gitignore ├── Data Facts.ipynb ├── LICENSE ├── README.md ├── crf_baseline ├── README.md ├── code │ ├── __init__.py │ ├── feature_extraction_supporting_functions_words.py │ ├── feature_extraction_words.py │ └── utils.py ├── main_finetune.py ├── main_threeTasks.py └── validation.py ├── dataset.tar.gz ├── dataset ├── clean_test.txt ├── clean_train.txt └── clean_valid.txt ├── keras ├── README.md ├── code │ ├── __init__.py │ ├── models.py │ └── utils.py ├── main_multiTaskLearning.py └── main_threeTasks.py └── tensorflow ├── README.md ├── cv_model.py ├── play_with.py ├── ref_model.py └── utils ├── __init__.py ├── data_utils.py └── general_utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | .gitignore 3 | .idea/ 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Digital Humanities Laboratory 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Reference Parsing 2 | 3 | This repository contains the code for the following article: 4 | 5 | @article{alves_deep_2018, 6 | author = {{Rodrigues Alves, Danny and Giovanni Colavizza and Frédéric Kaplan}}, 7 | title = {{Deep Reference Mining from Scholarly Literature in the Arts and Humanities}}, 8 | journal = {{Frontiers in Research Metrics & Analytics}}, 9 | volume = 3, 10 | number = 21, 11 | year = 2018, 12 | doi = {10.3389/frma.2018.00021} 13 | } 14 | 15 | ## Task definition 16 | 17 | We focus on the task of reference mining, instantiated into three tasks: reference components detection (task 1), reference typology detection (task 2) and reference span detection (task 3). 18 | 19 | * Sequence: *G. Ostrogorsky, History of the Byzantine State, Rutgers University Press, 1986.* 20 | * Task 1: *author author title title title title title publisher publisher publisher year* 21 | * Task 2: *b-secondary i-secondary ... e-secondary* 22 | * Task 3: *b-r i-r ... e-r* 23 | 24 | ## Contents 25 | 26 | * `LICENSE` MIT. 27 | * `README.md` this file. 28 | * `dataset/` 29 | * [train](dataset/clean_test.txt) Train split, CoNLL format. 
30 | * [test](dataset/clean_train.txt) Test split, CoNLL format. 31 | * [validation](dataset/clean_valid.txt) Validation split, CoNLL format. 32 | * [compressed dataset](dataset.tar.gz) Compressed dataset. 33 | * [data facts](Data%20Facts.ipynb) a Python notebook to explore the dataset (number of references, tag distributions). 34 | * [crf_baseline](crf_baseline) CRF baseline implementation details. 35 | * [keras](keras) Keras implementation details. 36 | * [tensorflow](tensorflow) TF implementation details. 37 | 38 | ## Dataset 39 | 40 | Example of a dataset entry (beginning of the validation dataset, first sequence), in the format `Token Task1tag Task2tag Task3tag`: 41 | 42 | -DOCSTART- -X- -X- o 43 | 44 | C author b-secondary b-r 45 | . author i-secondary i-r 46 | Agnoletti author i-secondary i-r 47 | , author i-secondary i-r 48 | Treviso title i-secondary i-r 49 | e title i-secondary i-r 50 | le title i-secondary i-r 51 | sue title i-secondary i-r 52 | pievi title i-secondary i-r 53 | . title i-secondary i-r 54 | Illustrazione title i-secondary i-r 55 | storica title i-secondary i-r 56 | , title i-secondary i-r 57 | Treviso publicationplace i-secondary i-r 58 | 1898 year i-secondary i-r 59 | , year i-secondary i-r 60 | 2 publicationspecifications i-secondary i-r 61 | v publicationspecifications e-secondary i-r 62 | . publicationspecifications e-secondary e-r 63 | 64 | Pre-trained word vectors can be downloaded from Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1175213.svg)](https://doi.org/10.5281/zenodo.1175213) 65 | 66 | ## Implementations 67 | 68 | ### CRF baseline 69 | 70 | See internal [readme](crf_baseline/README.md) for details. 71 | 72 | ### Keras 73 | 74 | See internal [readme](keras/README.md) for details. 75 | 76 | ### TensorFlow 77 | 78 | See internal [readme](tensorflow/README.md) for details. 79 | 80 | This implementation borrows from [Guillaume Genthial's Sequence Tagging with Tensorflow](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html). 81 | 82 | -------------------------------------------------------------------------------- /crf_baseline/README.md: -------------------------------------------------------------------------------- 1 | # CRF baseline implementation 2 | 3 | ## How to 4 | The directory contains the code to run the CRF model used as a baseline. Code to train, fine-tune and validate the models is provided. 5 | 6 | Running the Python script *main_threeTasks.py* trains one model per task and stores each model in the *models* folder. To fine-tune the two CRF model parameters *c1* and *c2*, run the Python script *main_finetune.py*: a plot gathering the results is saved in the *plots* folder and the best model is saved in the *models* folder. The script *validation.py* expects the previously generated models *crf_t1.pkl*, *crf_t2.pkl* and *crf_t3.pkl*, and prints the classification scores on the validation dataset. 7 | 8 | The data is expected to be in a *dataset* folder, in the main repository directory, with three files inside it: *clean_train.txt* for the training dataset, *clean_test.txt* for the testing dataset, and *clean_valid.txt* for the validation dataset. 9 | 10 | python main_finetune.py 11 | python main_threeTasks.py 12 | python validation.py 13 | 14 | 15 | ## Contents 16 | * `README.md` this file.
17 | * `code/` 18 | * [feature_extraction_supporting_functions_words](code/feature_extraction_supporting_functions_words.py) helper functions to extract features from words. 19 | * [feature_extraction_words](code/feature_extraction_words.py) functions to extract features from words. 20 | * [utils](code/utils.py) utility functions to load data and redirected log files. 21 | * [main_finetune](main_finetune.py) python script to fine-tune two model parameters. 22 | * [main_threeTasks](main_threeTasks.py) python script to train one CRF model for each task. 23 | * [validation](validation.py) python script to compute classification score on validation dataset for the three tasks. 24 | 25 | ## Dependencies 26 | * Numpy: 1.13.3 27 | * Sklearn : 0.19.1 28 | * [Sklearn crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html) Sklearn crfsuite : 0.3.6 29 | * Python 3.5 -------------------------------------------------------------------------------- /crf_baseline/code/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | -------------------------------------------------------------------------------- /crf_baseline/code/feature_extraction_supporting_functions_words.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Extraction of Features from words, used to parse references 4 | 5 | Inspired from CRFSuite by Naoaki Okazaki: http://www.chokkan.org/software/crfsuite/. 6 | """ 7 | __author__ = """Naoaki Okazaki, Giovanni Colavizza""" 8 | 9 | import string, re 10 | 11 | def get_shape(token): 12 | r = '' 13 | for c in token: 14 | if c.isupper(): 15 | r += 'U' 16 | elif c.islower(): 17 | r += 'L' 18 | elif c.isdigit(): 19 | r += 'D' 20 | elif c in ('.', ','): 21 | r += '.' 
22 | elif c in (';', ':', '?', '!'): 23 | r += ';' 24 | elif c in ('+', '-', '*', '/', '=', '|', '_'): 25 | r += '-' 26 | elif c in ('(', '{', '[', '<'): 27 | r += '(' 28 | elif c in (')', '}', ']', '>'): 29 | r += ')' 30 | else: 31 | r += c 32 | return r 33 | 34 | def degenerate(src): 35 | dst = '' 36 | for c in src: 37 | if not dst or dst[-1] != c: 38 | dst += c 39 | return dst 40 | 41 | def get_type(token): 42 | T = ( 43 | 'AllUpper', 'AllDigit', 'AllSymbol', 44 | 'AllUpperDigit', 'AllUpperSymbol', 'AllDigitSymbol', 45 | 'AllUpperDigitSymbol', 46 | 'InitUpper', 47 | 'AllLetter', 48 | 'AllAlnum', 49 | ) 50 | R = set(T) 51 | if not token: 52 | return 'EMPTY' 53 | 54 | for i in range(len(token)): 55 | c = token[i] 56 | if c.isupper(): 57 | R.discard('AllDigit') 58 | R.discard('AllSymbol') 59 | R.discard('AllDigitSymbol') 60 | elif c.isdigit() or c in (',', '.'): 61 | R.discard('AllUpper') 62 | R.discard('AllSymbol') 63 | R.discard('AllUpperSymbol') 64 | R.discard('AllLetter') 65 | elif c.islower(): 66 | R.discard('AllUpper') 67 | R.discard('AllDigit') 68 | R.discard('AllSymbol') 69 | R.discard('AllUpperDigit') 70 | R.discard('AllUpperSymbol') 71 | R.discard('AllDigitSymbol') 72 | R.discard('AllUpperDigitSymbol') 73 | else: 74 | R.discard('AllUpper') 75 | R.discard('AllDigit') 76 | R.discard('AllUpperDigit') 77 | R.discard('AllLetter') 78 | R.discard('AllAlnum') 79 | 80 | if i == 0 and not c.isupper(): 81 | R.discard('InitUpper') 82 | 83 | for tag in T: 84 | if tag in R: 85 | return tag 86 | return 'NO' 87 | 88 | def get_2d(token): 89 | return len(token) == 2 and token.isdigit() 90 | 91 | def get_4d(token): 92 | return len(token) == 4 and token.isdigit() 93 | 94 | def get_parYear(token): 95 | if token[0] == '(' and token[-1] == ')': 96 | if get_4d(token[1:-1]) or get_2d(token[1:-1]): 97 | return True 98 | return False 99 | 100 | # if both digit and alphabetic 101 | def get_da(token): 102 | bd = False 103 | ba = False 104 | for c in token: 105 | if c.isdigit(): 106 | bd = True 107 | elif c.isalpha(): 108 | ba = True 109 | else: 110 | return False 111 | return bd and ba 112 | 113 | def get_dand(token, p): 114 | bd = False 115 | bdd = False 116 | for c in token: 117 | if c.isdigit(): 118 | bd = True 119 | elif c == p: 120 | bdd = True 121 | else: 122 | return False 123 | return bd and bdd 124 | 125 | def get_all_other(token): 126 | for c in token: 127 | if c.isalnum(): 128 | return False 129 | return True 130 | 131 | def get_capperiod(token): 132 | return len(token) == 2 and token[0].isupper() and token[1] == '.' 133 | 134 | def contains_upper(token): 135 | b = False 136 | for c in token: 137 | b |= c.isupper() 138 | return b 139 | 140 | def contains_lower(token): 141 | b = False 142 | for c in token: 143 | b |= c.islower() 144 | return b 145 | 146 | def contains_alpha(token): 147 | b = False 148 | for c in token: 149 | b |= c.isalpha() 150 | return b 151 | 152 | def contains_digit(token): 153 | b = False 154 | for c in token: 155 | b |= c.isdigit() 156 | return b 157 | 158 | def contains_symbol(token): 159 | b = False 160 | for c in token: 161 | b |= ~c.isalnum() 162 | return b 163 | 164 | # abbreviations 165 | def is_abbr(token): 166 | b = False 167 | if "." 
in token: 168 | for p in string.punctuation: 169 | token = token.replace(p, '') 170 | if len(token) < 2 & len(token) > 0: 171 | b = True 172 | elif len(token) == 2: 173 | if token[0] == token[1]: 174 | b = True 175 | return b 176 | 177 | # alternative 178 | # average frequency of abbreviations 179 | # pattern from http://stackoverflow.com/questions/17779771/finding-acronyms-using-regex-in-python 180 | def abbr_pattern(token): 181 | b = False 182 | 183 | if token is None or len(token) < 1: 184 | return b 185 | 186 | pattern = r'(?:(?<=\.|\s)[A-Z]\.)+' 187 | counter = re.search(pattern, token) 188 | if counter: 189 | return True 190 | return b 191 | 192 | # New 193 | def is_roman(token): 194 | b = True 195 | for p in string.punctuation: 196 | token = token.replace(p, '') 197 | for c in token: 198 | b &= c.lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd'] 199 | return b 200 | 201 | # Return true if a sequence of at least 2 characters matches with roman numbers 202 | def contains_roman(token): 203 | for n,c in enumerate(token[1:]): 204 | if c.isupper() and c.lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd']: 205 | if token[n-1].isupper() and token[n-1].lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd']: 206 | return True 207 | return False 208 | 209 | # is interval, e.g. 1900-10 210 | def is_interval(token): 211 | b = True 212 | if "-" in token: 213 | for x in token.split("-"): 214 | try: 215 | int(x) 216 | except: 217 | b = False 218 | return b 219 | 220 | # measure the punctuation frequency of a piece of text 221 | def punctuation(text, norm=True): 222 | 223 | if text is None or len(text) < 1: 224 | return 0 225 | 226 | counter = 0 227 | 228 | for w in range(len(text)): 229 | if text[w] in string.punctuation: 230 | counter += 1 231 | 232 | return counter/len(text) if norm else counter 233 | 234 | # measure the number frequency of a piece of text 235 | def numbers(text, norm=True): 236 | 237 | if text is None or len(text) < 1: 238 | return 0 239 | 240 | counter = 0 241 | for w in text.split(): 242 | for p in string.punctuation: 243 | w = w.replace(p,"") 244 | try: 245 | int(w) 246 | except: 247 | continue 248 | counter += 1 249 | 250 | return counter/len(text) if norm else counter 251 | 252 | # frequency of upper case letters 253 | def upper_case(text, norm=True): 254 | 255 | if text is None or len(text) < 1: 256 | return 0 257 | 258 | counter = 0 259 | for n in range(len(text)): 260 | if text[n].isupper(): 261 | counter += 1 262 | 263 | return counter/len(text) if norm else counter 264 | 265 | # frequency of lower case letters 266 | def lower_case(text, norm=True): 267 | 268 | if text is None or len(text) < 1: 269 | return 0 270 | 271 | counter = 0 272 | for n in range(len(text)): 273 | if text[n].islower(): 274 | counter += 1 275 | 276 | return counter/len(text) if norm else counter 277 | 278 | # number of chars (with whitespace) 279 | def chars(text): 280 | 281 | if text is None or len(text) < 1: 282 | return 0 283 | 284 | return len(text) 285 | 286 | # is abbreviation? 
(any word of len <= 3 with a dot at the end) 287 | def abbr(text): 288 | 289 | if text is None or len(text) < 1: 290 | return 0 291 | 292 | if len(text) <= 4 and text[-1] == ".": 293 | return True 294 | else: 295 | return False 296 | 297 | # Boolean generation functions based on generic input 298 | def b(v): 299 | return 'yes' if v else 'no' -------------------------------------------------------------------------------- /crf_baseline/code/feature_extraction_words.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Generator of features, relies on feature_extraction_supporting_functions_words 4 | 5 | Compatible with sklearn_crfsuite 6 | """ 7 | __author__ = """Giovanni Colavizza""" 8 | 9 | from code.feature_extraction_supporting_functions_words import * 10 | 11 | def generate_featuresFull(word,n_id,defval=''): 12 | """ 13 | Creates a set of features for a given token. 14 | 15 | :param word: token 16 | :param n_id: reference number of token in feature window 17 | :param def_val: default value for missing features 18 | :return: The token and its features 19 | """ 20 | 21 | v = {'w%d'%n_id: word} 22 | # Lowercased token. 23 | v['wl%d'%n_id] = v['w%d'%n_id].lower() 24 | # Token shape. 25 | v['shape%d'%n_id] = get_shape(v['w%d'%n_id]) 26 | # Token shape degenerated. 27 | v['shaped%d'%n_id] = degenerate(v['shape%d'%n_id]) 28 | # Token type. 29 | v['type%d'%n_id] = get_type(v['w%d'%n_id]) 30 | 31 | # Prefixes (length between one to four). 32 | v['p1%d'%n_id] = v['w%d'%n_id][0] if len(v['w%d'%n_id]) >= 1 else defval 33 | v['p2%d'%n_id] = v['w%d'%n_id][:2] if len(v['w%d'%n_id]) >= 2 else defval 34 | v['p3%d'%n_id] = v['w%d'%n_id][:3] if len(v['w%d'%n_id]) >= 3 else defval 35 | v['p4%d'%n_id] = v['w%d'%n_id][:4] if len(v['w%d'%n_id]) >= 4 else defval 36 | 37 | # Suffixes (length between one to four). 38 | v['s1%d'%n_id] = v['w%d'%n_id][-1] if len(v['w%d'%n_id]) >= 1 else defval 39 | v['s2%d'%n_id] = v['w%d'%n_id][-2:] if len(v['w%d'%n_id]) >= 2 else defval 40 | v['s3%d'%n_id] = v['w%d'%n_id][-3:] if len(v['w%d'%n_id]) >= 3 else defval 41 | v['s4%d'%n_id] = v['w%d'%n_id][-4:] if len(v['w%d'%n_id]) >= 4 else defval 42 | 43 | # Two digits 44 | v['2d%d'%n_id] = b(get_2d(v['w%d'%n_id])) 45 | # Four digits. 46 | v['4d%d'%n_id] = b(get_4d(v['w%d'%n_id])) 47 | # Has a number with parentheses 48 | v['4d%d'%n_id] = b(get_parYear(v['w%d'%n_id])) 49 | # Alphanumeric token. 50 | v['d&a%d'%n_id] = b(get_da(v['w%d'%n_id])) 51 | # Digits and '-'. 52 | v['d&-%d'%n_id] = b(get_dand(v['w%d'%n_id], '-')) 53 | # Digits and '/'. 54 | v['d&/%d'%n_id] = b(get_dand(v['w%d'%n_id], '/')) 55 | # Digits and ','. 56 | v['d&,%d'%n_id] = b(get_dand(v['w%d'%n_id], ',')) 57 | # Digits and '.'. 58 | v['d&.%d'%n_id] = b(get_dand(v['w%d'%n_id], '.')) 59 | # A uppercase letter followed by '.' 60 | v['up%d'%n_id] = b(get_capperiod(v['w%d'%n_id])) 61 | 62 | # An initial uppercase letter. 63 | v['iu%d'%n_id] = b(v['w%d'%n_id] and v['w%d'%n_id][0].isupper()) 64 | # All uppercase letters. 65 | v['au%d'%n_id] = b(v['w%d'%n_id].isupper()) 66 | # All lowercase letters. 67 | v['al%d'%n_id] = b(v['w%d'%n_id].islower()) 68 | # All digit letters. 69 | v['ad%d'%n_id] = b(v['w%d'%n_id].isdigit()) 70 | # All other (non-alphanumeric) letters. 71 | v['ao%d'%n_id] = b(get_all_other(v['w%d'%n_id])) 72 | 73 | # Contains a uppercase letter. 74 | v['cu%d'%n_id] = b(contains_upper(v['w%d'%n_id])) 75 | # Contains a lowercase letter. 
76 | v['cl%d'%n_id] = b(contains_lower(v['w%d'%n_id])) 77 | # Contains a alphabet letter. 78 | v['ca%d'%n_id] = b(contains_alpha(v['w%d'%n_id])) 79 | # Contains a digit. 80 | v['cd%d'%n_id] = b(contains_digit(v['w%d'%n_id])) 81 | # Contains a symbol. 82 | v['cs%d'%n_id] = b(contains_symbol(v['w%d'%n_id])) 83 | 84 | # Is abbreviation. 85 | v['ab%d'%n_id] = b(is_abbr(v['w%d'%n_id])) 86 | # Is abbreviation 2 87 | v['ab2%d'%n_id] = b(abbr(v['w%d'%n_id])) 88 | # Is Roman number. 89 | v['ro%d'%n_id] = b(is_roman(v['w%d'%n_id])) 90 | v['cont_ro%d'%n_id] = b(contains_roman(v['w%d'%n_id])) 91 | # Is Interval. 92 | v['int%d'%n_id] = b(is_interval(v['w%d'%n_id])) 93 | 94 | return v 95 | 96 | def generate_featuresLight(word,n_id,defval=''): 97 | """ 98 | Lightweight version of the above. 99 | 100 | :param word: token 101 | :param n_id: reference number of token in feature window 102 | :param def_val: default value for missing features 103 | :return: The token and its features 104 | """ 105 | 106 | v = {'w%d'%n_id: word} 107 | # Lowercased token. 108 | v['wl%d'%n_id] = v['w%d'%n_id].lower() 109 | # Token shape. 110 | v['shape%d'%n_id] = get_shape(v['w%d'%n_id]) 111 | # Token shape degenerated. 112 | v['shaped%d'%n_id] = degenerate(v['shape%d'%n_id]) 113 | # Token type. 114 | v['type%d'%n_id] = get_type(v['w%d'%n_id]) 115 | 116 | # Prefixes (length between one to four). 117 | v['p1%d'%n_id] = v['w%d'%n_id][0] if len(v['w%d'%n_id]) >= 1 else defval 118 | v['p2%d'%n_id] = v['w%d'%n_id][:2] if len(v['w%d'%n_id]) >= 2 else defval 119 | 120 | # Suffixes (length between one to four). 121 | v['s1%d'%n_id] = v['w%d'%n_id][-1] if len(v['w%d'%n_id]) >= 1 else defval 122 | v['s2%d'%n_id] = v['w%d'%n_id][-2:] if len(v['w%d'%n_id]) >= 2 else defval 123 | 124 | # Two digits 125 | v['2d%d'%n_id] = b(get_2d(v['w%d'%n_id])) 126 | # Four digits. 127 | v['4d%d'%n_id] = b(get_4d(v['w%d'%n_id])) 128 | # Alphanumeric token. 129 | v['d&a%d'%n_id] = b(get_da(v['w%d'%n_id])) 130 | # Digits and '-'. 131 | v['d&-%d'%n_id] = b(get_dand(v['w%d'%n_id], '-')) 132 | # Digits and '/'. 133 | v['d&/%d'%n_id] = b(get_dand(v['w%d'%n_id], '/')) 134 | # Digits and ','. 135 | v['d&,%d'%n_id] = b(get_dand(v['w%d'%n_id], ',')) 136 | # Digits and '.'. 137 | v['d&.%d'%n_id] = b(get_dand(v['w%d'%n_id], '.')) 138 | # A uppercase letter followed by '.' 139 | v['up%d'%n_id] = b(get_capperiod(v['w%d'%n_id])) 140 | 141 | # An initial uppercase letter. 142 | v['iu%d'%n_id] = b(v['w%d'%n_id] and v['w%d'%n_id][0].isupper()) 143 | # All uppercase letters. 144 | v['au%d'%n_id] = b(v['w%d'%n_id].isupper()) 145 | # All lowercase letters. 146 | v['al%d'%n_id] = b(v['w%d'%n_id].islower()) 147 | # All digit letters. 148 | v['ad%d'%n_id] = b(v['w%d'%n_id].isdigit()) 149 | # All other (non-alphanumeric) letters. 150 | v['ao%d'%n_id] = b(get_all_other(v['w%d'%n_id])) 151 | 152 | # Contains a uppercase letter. 153 | v['cu%d'%n_id] = b(contains_upper(v['w%d'%n_id])) 154 | # Contains a lowercase letter. 155 | v['cl%d'%n_id] = b(contains_lower(v['w%d'%n_id])) 156 | # Contains a alphabet letter. 157 | v['ca%d'%n_id] = b(contains_alpha(v['w%d'%n_id])) 158 | # Contains a digit. 159 | v['cd%d'%n_id] = b(contains_digit(v['w%d'%n_id])) 160 | # Contains a symbol. 161 | v['cs%d'%n_id] = b(contains_symbol(v['w%d'%n_id])) 162 | 163 | # Is abbreviation. 164 | v['ab%d'%n_id] = b(is_abbr(v['w%d'%n_id])) 165 | # Is abbreviation 2 166 | v['ab2%d'%n_id] = b(abbr(v['w%d'%n_id])) 167 | # Is Roman number. 168 | v['ro%d'%n_id] = b(is_roman(v['w%d'%n_id])) 169 | # Is Interval. 
170 | v['int%d'%n_id] = b(is_interval(v['w%d'%n_id])) 171 | 172 | # remove tags and lowercase: language independence 173 | del v['w%d'%n_id] 174 | del v['wl%d'%n_id] 175 | 176 | return v 177 | 178 | 179 | 180 | def word2features(sequence, i, extra_labels=[], window=2, feature_function=generate_featuresFull): 181 | """ 182 | Takes a dataset from a specific document and exports its features for parsing. 183 | 184 | :param sequence: a list of tokens 185 | :param i: index of token in sequence 186 | :param extra_labels: list of labels to assign to the text 187 | :param window: window to consider of preceding and following tokens (e.g. 2 means features for tokens -2 to 2 included will be generated) 188 | :return: dictionary of token features 189 | """ 190 | 191 | """ 192 | Template of data coming in (4 sequences): 193 | 194 | ['piGNATTi', 'T', '.,', 'Le', 'pitture', 'di', 'Paolo', 'Vero', '-'], 195 | ['nese', 'nella', 'chiesa', 'di', 'S', '.', 'Sebastiano', 'in'], 196 | ['Venezia', ',', 'Milano', '1966', '.'], 197 | ['piGNATTi', 't', '.,', 'Paolo', 'Veronese', ',', 'Milano']] 198 | 199 | """ 200 | 201 | if len(extra_labels) > 0: 202 | assert len(text) == len(extra_labels) 203 | 204 | word = sequence[i] 205 | position_in_sequence = i 206 | 207 | 208 | features = feature_function(word, 0) 209 | features.update({ 210 | 'position': position_in_sequence, 211 | }) 212 | 213 | # Extra Labels to add 214 | if len(extra_labels) > 0: 215 | features.update({'tag': extra_labels[i]}) 216 | 217 | 218 | if i == 0: 219 | features['BOS'] = True # Begin of Sequence 220 | else: 221 | for n in range(-window,0): 222 | if i+n >= 0: 223 | word = sequence[i+n] 224 | features.update(feature_function(word,n)) 225 | # Extra labels 226 | if len(extra_labels) > 0: 227 | features.update({"tag%s"%n:extra_labels[i+n]}) 228 | 229 | if i == len(sequence)-1: 230 | features['EOS'] = True # End of sequence 231 | for n in range(1,window+1): 232 | if i+n < len(sequence)-1: 233 | word = sequence[i+n] 234 | features.update(feature_function(word,n)) 235 | # Extra labels 236 | if len(extra_labels) > 0: 237 | features.update({"tag%s"%n:extra_labels[i+n]}) 238 | 239 | 240 | return features 241 | -------------------------------------------------------------------------------- /crf_baseline/code/utils.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | 4 | def setPrintToFile(filename): 5 | stdout_original = sys.stdout 6 | f = open(filename, 'w') 7 | sys.stdout = f 8 | return f,stdout_original 9 | 10 | 11 | def closePrintToFile(f, stdout_original): 12 | sys.stdout = stdout_original 13 | f.close() 14 | 15 | 16 | def load_data(file): 17 | words = [] 18 | tags_1 = [] 19 | tags_2 = [] 20 | tags_3 = [] 21 | tags_4 = [] 22 | 23 | word = tags1 = tags2 = tags3 = tags4 = [] 24 | with open (file, "r") as file: 25 | for line in file: 26 | if 'DOCSTART' not in line: #Do not take the first line into consideration 27 | # Check if empty line 28 | if line in ['\n', '\r\n']: 29 | # Append line 30 | words.append(word) 31 | tags_1.append(tags1) 32 | tags_2.append(tags2) 33 | tags_3.append(tags3) 34 | tags_4.append(tags4) 35 | 36 | # Reset 37 | word = [] 38 | tags1 = [] 39 | tags2 = [] 40 | tags3 = [] 41 | tags4 = [] 42 | 43 | else: 44 | # Split the line into words, tag #1, tag #2, tag #3 45 | w = line[:-1].split(" ") 46 | word.append(w[0]) 47 | tags1.append(w[1]) 48 | tags2.append(w[2]) 49 | tags3.append(w[3]) 50 | tags4.append(w[4]) 51 | 52 | return words,tags_1,tags_2,tags_3,tags_4 
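The scripts below (*main_finetune.py*, *main_threeTasks.py* and *validation.py*) combine `load_data` with `word2features` to build the input format expected by sklearn-crfsuite. The following is a minimal usage sketch of that pattern on a single hand-written sequence (the example sentence from the top-level README); it assumes the `crf_baseline` directory is the working directory so that the local `code` package is importable.

    # Sketch: build sklearn-crfsuite features for one sequence, mirroring the
    # pattern used in main_threeTasks.py (window of +/-2 tokens per position).
    from code.feature_extraction_words import word2features

    sequence = ["G", ".", "Ostrogorsky", ",", "History", "of", "the",
                "Byzantine", "State", ",", "Rutgers", "University", "Press",
                ",", "1986", "."]

    # One feature dictionary per token; sklearn-crfsuite expects a list of such
    # lists, one inner list per sequence.
    X = [[word2features(sequence, i, window=2) for i in range(len(sequence))]]

    print(X[0][2]["w0"])   # current token: 'Ostrogorsky'
    print(X[0][2]["iu0"])  # initial-uppercase feature: 'yes'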
-------------------------------------------------------------------------------- /crf_baseline/main_finetune.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import time 5 | 6 | # Plot 7 | import matplotlib 8 | matplotlib.use('agg') 9 | import matplotlib.pyplot as plt 10 | 11 | # CRF 12 | import sklearn_crfsuite 13 | from sklearn_crfsuite import scorers, metrics 14 | from sklearn.metrics import make_scorer, confusion_matrix 15 | from sklearn.externals import joblib 16 | from sklearn.model_selection import RandomizedSearchCV 17 | 18 | # For model validation 19 | import scipy 20 | 21 | 22 | # Utils functions 23 | from code.feature_extraction_supporting_functions_words import * 24 | from code.feature_extraction_words import * 25 | from code.utils import * 26 | 27 | 28 | # Load entire data 29 | X_train_w, train_t1, train_t2, train_t3 = load_data("../dataset/clean_train.txt") 30 | X_test_w, test_t1, test_t2, test_t3= load_data("../dataset/clean_test.txt") 31 | 32 | 33 | for task in ["t3", "t2", "t1"]: #Ordered according to increase computation time 34 | 35 | print("=========================== Task {0} ========================= Start:{1}".format(task, time.strftime("%D %H:%M:%S"))) 36 | # Set file 37 | file, stdout_original = setPrintToFile("results/CRF_model_task_{0}.txt".format(task)) 38 | 39 | # Task data 40 | y_train = eval("train_"+task) 41 | y_test = eval("test_"+task) 42 | 43 | # Build CRF data format 44 | window = 2 # the window of dependance for the CRFs 45 | X_train = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_train_w] 46 | X_test = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_test_w] 47 | 48 | 49 | print("Training data - number of lines: ", len(X_train)) 50 | print("Testing data - number of lines: ", len(X_test)) 51 | print('----') 52 | print("Training data - number of tokens: ", len([x for y in X_train for x in y])) 53 | print("Testing data - number of tokens: ", len([x for y in X_test for x in y])) 54 | print() 55 | print() 56 | 57 | 58 | 59 | # CRF Model : Fine-tuning c1 and c2 60 | 61 | # Parameters search (Based on https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization) 62 | crf = sklearn_crfsuite.CRF( 63 | max_iterations=100, 64 | algorithm = 'lbfgs', 65 | all_possible_transitions=False 66 | ) 67 | 68 | params_space = { 69 | 'c1': scipy.stats.expon(scale=0.5), 70 | 'c2': scipy.stats.expon(scale=0.05) 71 | } 72 | 73 | scorer = make_scorer(metrics.flat_f1_score, average='weighted') 74 | 75 | # search 76 | rs = RandomizedSearchCV(crf, params_space, 77 | cv=3, 78 | verbose=1, 79 | n_jobs=-10, 80 | n_iter=5, 81 | scoring=scorer) 82 | rs.fit(X_train, y_train) 83 | 84 | print('best params:', rs.best_params_) 85 | print('best CV score:', rs.best_score_) 86 | 87 | 88 | # Create score plot 89 | _x = [s.parameters['c1'] for s in rs.grid_scores_] 90 | _y = [s.parameters['c2'] for s in rs.grid_scores_] 91 | _c = [s.mean_validation_score for s in rs.grid_scores_] 92 | 93 | fig = plt.figure() 94 | fig.set_size_inches(12, 12) 95 | ax = plt.gca() 96 | ax.set_xlabel('C1') 97 | ax.set_ylabel('C2') 98 | ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(min(_c), max(_c))) 99 | ax.scatter(_x, _y, c=_c, s=60, cmap="bwr_r") 100 | print("F1 scores: Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c))) 101 | 
plt.savefig("plots/plot_fine_tuning_task_{0}".format(task)) 102 | #Save plot 103 | 104 | 105 | 106 | 107 | # Testing with best parameters 108 | y_pred = rs.best_estimator_.predict(X_test) 109 | print('best params:', rs.best_params_) 110 | print('best CV score:', rs.best_score_) 111 | print(metrics.flat_classification_report( 112 | y_test, y_pred, digits=3 113 | )) 114 | 115 | 116 | # Save best model 117 | joblib.dump(rs.best_estimator_,'models/crf_{0}.pkl'.format(task)) 118 | 119 | 120 | # Close file 121 | closePrintToFile(file, stdout_original) 122 | print("=========================== Task {0} ========================= End:{1}".format(task, time.strftime("%D %H:%M:%S"))) 123 | -------------------------------------------------------------------------------- /crf_baseline/main_threeTasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | import matplotlib 6 | import time 7 | 8 | # CRF 9 | import sklearn_crfsuite 10 | from sklearn_crfsuite import scorers, metrics 11 | from sklearn.metrics import make_scorer, confusion_matrix 12 | from sklearn.externals import joblib 13 | 14 | 15 | # Utils functions 16 | from code.feature_extraction_supporting_functions_words import * 17 | from code.feature_extraction_words import * 18 | from code.utils import * 19 | 20 | 21 | # Load entire data 22 | X_train_w, train_t1, train_t2, train_t3 = load_data("../dataset/clean_train.txt") 23 | X_test_w, test_t1, test_t2, test_t3= load_data("../dataset/clean_test.txt") 24 | 25 | 26 | for task in ["t1", "t2", "t3"]: 27 | 28 | print("=========================== Task {0} ========================= Start:{1}".format(task, time.strftime("%D %H:%M:%S"))) 29 | # Set file 30 | file, stdout_original = setPrintToFile("results/CRF_model_task_{0}.txt".format(task)) 31 | 32 | # Task data 33 | y_train = eval("train_"+task) 34 | y_test = eval("test_"+task) 35 | 36 | # Build CRF data format 37 | window = 2 # the window of dependance for the CRFs 38 | X_train = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_train_w] 39 | X_test = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_test_w] 40 | 41 | 42 | print("Training data - number of lines: ", len(X_train)) 43 | print("Testing data - number of lines: ", len(X_test)) 44 | print('----') 45 | print("Training data - number of tokens: ", len([x for y in X_train for x in y])) 46 | print("Testing data - number of tokens: ", len([x for y in X_test for x in y])) 47 | print() 48 | print() 49 | 50 | 51 | 52 | # CRF Model 53 | 54 | crf = sklearn_crfsuite.CRF( 55 | algorithm='lbfgs', 56 | c1=0.1, 57 | c2=0.1, 58 | max_iterations=100, 59 | all_possible_transitions=False 60 | ) 61 | crf.fit(X_train, y_train) 62 | 63 | # Save CRF model 64 | joblib.dump(crf,'models/crf_{0}.pkl'.format(task)) 65 | 66 | 67 | 68 | # Testing 69 | y_pred = crf.predict(X_test) 70 | print(metrics.flat_classification_report( 71 | y_test, y_pred, digits=3 72 | )) 73 | 74 | 75 | # Close file 76 | closePrintToFile(file, stdout_original) 77 | print("=========================== Task {0} ========================= End:{1}".format(task, time.strftime("%D %H:%M:%S"))) -------------------------------------------------------------------------------- /crf_baseline/validation.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import time 5 | 6 | # Python objects 7 | import 
pickle 8 | 9 | 10 | # Plot 11 | import matplotlib 12 | matplotlib.use('agg') 13 | import matplotlib.pyplot as plt 14 | 15 | # CRF 16 | import sklearn_crfsuite 17 | from sklearn_crfsuite import scorers, metrics 18 | from sklearn.metrics import make_scorer, confusion_matrix 19 | from sklearn.externals import joblib 20 | from sklearn.model_selection import RandomizedSearchCV 21 | 22 | # For model validation 23 | import scipy 24 | 25 | 26 | # Utils functions 27 | from code.feature_extraction_supporting_functions_words import * 28 | from code.feature_extraction_words import * 29 | from code.utils import * 30 | 31 | 32 | 33 | # Load validation data 34 | window = 2 35 | X_valid_w, valid_t1, valid_t2, valid_t3 = load_data("../dataset/clean_valid.txt") 36 | X_valid = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_valid_w] 37 | 38 | 39 | 40 | # TASK 1 41 | y_valid = valid_t1 42 | crf = pickle.load(open("models/crf_t1.pkl", "rb" )) 43 | print(crf) 44 | y_pred = crf.predict(X_valid) 45 | print(metrics.flat_classification_report( 46 | y_valid, y_pred, digits=6 47 | )) 48 | 49 | # Task 2 50 | y_valid = valid_t2 51 | crf = pickle.load(open("models/crf_t2.pkl", "rb" )) 52 | print(crf) 53 | y_pred = crf.predict(X_valid) 54 | print(metrics.flat_classification_report( 55 | y_valid, y_pred, digits=6 56 | )) 57 | 58 | 59 | # Task 3 60 | y_valid = valid_t3 61 | crf = pickle.load(open("models/crf_t3.pkl", "rb" )) 62 | print(crf) 63 | y_pred = crf.predict(X_valid) 64 | print(metrics.flat_classification_report( 65 | y_valid, y_pred, digits=6 66 | )) 67 | -------------------------------------------------------------------------------- /dataset.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dhlab-epfl/LinkedBooksDeepReferenceParsing/9411db4e918baffa361895c50ae9ce2046fafc3c/dataset.tar.gz -------------------------------------------------------------------------------- /keras/README.md: -------------------------------------------------------------------------------- 1 | ## Keras 2 | 3 | The directory contains code to run the models with a Keras implementation and a Tensorlow backend. Code for both single and multitask models are given. For single tasks, one model for each of the three tasks will be computed by running the python script *main_threeTasks.py*. The multitask learning model can be computed with the script *main_multiTaskLearning.py*. 4 | 5 | The data is expected to be in a *dataset* folder, in the main repository directory, with three files inside it: *clean_train.txt* for the training dataset, *clean_test.txt* for the testing dataset, and *clean_valid.txt* for the validation dataset. Inside the dataset folder, a *pretrained_vectors* folder is expected, with two files inside it: *vecs_100.txt* and *vecs_300.txt*. 6 | 7 | The results will be stored into the *model_results* folder, with one directory created for each model. 8 | 9 | python main_threeTasks.py 10 | python main_multiTaskLearning.py 11 | 12 | ## Contents 13 | * `README.md` this file. 14 | * `code/` 15 | * [models](code/models.py) code to create, train and validated NN models. 16 | * [utils](code/utils.py) utility functions to run the models. 17 | * [main_multiTaskLearning](main_multiTaskLearning.py) python script to run the multi-task model. 18 | * [main_threeTasks](main_threeTasks.py) python script to train one NN model for each task. 
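As a quick sanity check before launching either script, the data layout described above can be verified with a short snippet such as the one below (a sketch only; the file names are the ones listed in this README, and the paths are assumed to be resolved relative to the directory from which the scripts are run):

    # Sketch: check that the expected dataset files are in place
    # (file names taken from the description above; adjust the base path to
    # wherever the dataset folder sits relative to your working directory).
    import os

    expected = [
        "dataset/clean_train.txt",
        "dataset/clean_test.txt",
        "dataset/clean_valid.txt",
        "dataset/pretrained_vectors/vecs_100.txt",
        "dataset/pretrained_vectors/vecs_300.txt",
    ]

    missing = [p for p in expected if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("Missing expected files: " + ", ".join(missing))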
19 | 20 | ## Dependencies 21 | * Keras : version 2.1.1 22 | * TensorFlow: 1.4.0 23 | * Numpy: 1.13.3 24 | * [Keras contrib](https://github.com/keras-team/keras-contrib) Keras contrib : 0.0.2 25 | * Sklearn : 0.19.1 26 | * [Sklearn crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html) Sklearn crfsuite : 0.3.6 27 | * Python 3.5 28 | -------------------------------------------------------------------------------- /keras/code/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | -------------------------------------------------------------------------------- /keras/code/models.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """ 4 | Functions for building Keras models 5 | """ 6 | 7 | import os 8 | import random 9 | import numpy as np 10 | import tensorflow 11 | random.seed(42) 12 | np.random.seed(42) 13 | tensorflow.set_random_seed(42) 14 | 15 | # Keras function 16 | from keras.callbacks import EarlyStopping 17 | from keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, Input, TimeDistributed, Flatten, Convolution1D, MaxPooling1D, concatenate 18 | from keras.models import Sequential, Model 19 | from keras.optimizers import Adam, RMSprop 20 | from keras_contrib.layers import CRF 21 | from keras_contrib.utils import save_load_utils 22 | 23 | from sklearn.metrics import confusion_matrix 24 | from sklearn_crfsuite import metrics 25 | 26 | # Utils script 27 | from code.utils import * 28 | 29 | 30 | 31 | def BiLSTM_model(filename, train, output, 32 | X_train, X_test, word2ind, maxWords, 33 | y_train, y_test, ind2label, 34 | validation=False, X_valid=None, y_valid=None, 35 | word_embeddings=True, pretrained_embedding="", word_embedding_size=100, 36 | maxChar=0, char_embedding_type="", char2ind="", char_embedding_size=50, 37 | lstm_hidden=32, nbr_epochs=5, batch_size=32, dropout=0, optimizer='rmsprop', early_stopping_patience=-1, 38 | folder_path="model_results", gen_confusion_matrix=False, printPadding=False 39 | ): 40 | """ 41 | Build, train and test a BiLSTM Keras model. Works for multi-tasking learning. 42 | The model architecture looks like: 43 | 44 | - Words representations: 45 | - Word embeddings 46 | - Character-level representation [Optional] 47 | - Dropout 48 | - Bidirectional LSTM 49 | - Dropout 50 | - Softmax/CRF for predictions 51 | 52 | 53 | :param filename: File to redirect the printing 54 | :param train: Boolean if the model must be trained or not. If False, the model's wieght are expected to be stored in "folder_path/filename/filename.h5" 55 | :param otput: "crf" or "softmax". Type of prediction layer to use 56 | 57 | :param X_train: Data to train the model 58 | :param X_test: Data to test the model 59 | :param word2ind: Dictionary containing all words in the training data and a unique integer per word 60 | :param maxWords: Maximum number of words in a sequence 61 | 62 | :param y_train: Labels to train the model for the prediction task 63 | :param y_test: Labels to test the model for the prediction task 64 | :param ind2label: Dictionary where all labels for task 1 are mapped into a unique integer 65 | 66 | :param validation: Boolean. If true, the validation score will be computed from 'X_valid' and 'y_valid' 67 | :param X_valid: Optional. Validation dataset 68 | :param y_valid: Optional. Validation dataset labels 69 | 70 | :param word_embeddings: Boolean value. 
Add word embeddings into the model. 71 | :param pretrained_embedding: Use the pretrained word embeddings. 72 | Three values: 73 | - "": Do not use pre-trained word embeddings (Default) 74 | - False: Use the pre-trained embedding vectors as the weights in the Embedding layer 75 | - True: Use the pre-trained embedding vectors as weight initialiers. The Embedding layer will still be trained. 76 | :param word_embedding_size: Size of the pre-trained word embedding to use (100 or 300) 77 | 78 | :param maxChar: The maximum numbers of characters in a word. If set to 0, the model will not use character-level representations of the words 79 | :param char_embedding_type: Type of model to use in order to compute the character-level representation of words: Two values: "CNN" or "BILSTM" 80 | :param char2ind: A dictionary where each character is maped into a unique integer 81 | :param char_embedding_size: size of the character-level word representations 82 | 83 | :param lstm_hidden: Dimentionality of the LSTM output space 84 | :param nbr_epochs: Number of epochs to train the model 85 | :param batch_size: Size of batches while training the model 86 | :param dropout: Rate to apply for each Dropout layer in the model 87 | :param optimizer: Optimizer to use while compiling the model 88 | :param early_stopping_patience: Number of continuous tolerated epochs without improvement during training. 89 | 90 | :param folder_path: Path to the directory storing all to-be-generated files 91 | :param gen_confusion_matrix: Boolean value. Generated confusion matrices or not. 92 | :param printPadding: Boolean. Prints the classification matrix taking padding as a possible label. 93 | 94 | 95 | :return: The classification scores for both tasks. 96 | """ 97 | print("====== {0} start ======".format(filename)) 98 | end_string = "====== {0} end ======".format(filename) 99 | 100 | # Create directory to store results 101 | os.makedirs(folder_path+"/"+filename) 102 | filepath = folder_path+"/"+filename+"/"+filename 103 | 104 | # Set print outputs file 105 | file, stdout_original = setPrintToFile("{0}.txt".format(filepath)) 106 | 107 | # Model params 108 | nbr_words = len(word2ind)+1 109 | out_size = len(ind2label)+1 110 | best_results = "" 111 | 112 | embeddings_list = [] 113 | inputs = [] 114 | 115 | # Input - Word Embeddings 116 | if word_embeddings: 117 | word_input = Input((maxWords,)) 118 | inputs.append(word_input) 119 | if pretrained_embedding=="": 120 | word_embedding = Embedding(nbr_words, word_embedding_size)(word_input) 121 | else: 122 | # Retrieve embeddings 123 | embedding_matrix = word2VecEmbeddings(word2ind, word_embedding_size) 124 | word_embedding = Embedding(nbr_words, word_embedding_size, weights=[embedding_matrix], trainable=pretrained_embedding, mask_zero=False)(word_input) 125 | embeddings_list.append(word_embedding) 126 | 127 | # Input - Characters Embeddings 128 | if maxChar!=0: 129 | character_input = Input((maxWords,maxChar,)) 130 | char_embedding = character_embedding_layer(char_embedding_type, character_input, maxChar, len(char2ind)+1, char_embedding_size) 131 | embeddings_list.append(char_embedding) 132 | inputs.append(character_input) 133 | 134 | # Model - Inner Layers - BiLSTM with Dropout 135 | embeddings = concatenate(embeddings_list) if len(embeddings_list)==2 else embeddings_list[0] 136 | model = Dropout(dropout)(embeddings) 137 | model = Bidirectional(LSTM(lstm_hidden, return_sequences=True, dropout=dropout))(model) 138 | model = Dropout(dropout)(model) 139 | 140 | 141 | if output == "crf": 142 
| # Output - CRF 143 | crfs = [[CRF(out_size),out_size] for out_size in [len(x)+1 for x in ind2label]] 144 | outputs = [x[0](Dense(x[1])(model)) for x in crfs] 145 | model_loss = [x[0].loss_function for x in crfs] 146 | model_metrics = [x[0].viterbi_acc for x in crfs] 147 | 148 | if output == "softmax": 149 | outputs = [Dense(out_size, activation='softmax')(model) for out_size in [len(x)+1 for x in ind2label]] 150 | model_loss = ['categorical_crossentropy' for x in outputs] 151 | model_metrics = None 152 | 153 | # Model 154 | model = Model(inputs=inputs, outputs=outputs) 155 | model.compile(loss=model_loss, metrics=model_metrics, optimizer=get_optimizer(optimizer)) 156 | print(model.summary(line_length=150),"\n\n\n\n") 157 | 158 | 159 | # Training Callbacks: 160 | callbacks = [] 161 | value_to_monitor = 'val_f1' 162 | best_model_weights_path = "{0}.h5".format(filepath) 163 | 164 | # 1) Classifition scores 165 | classification_scores = Classification_Scores([X_train, y_train], ind2label, best_model_weights_path) 166 | callbacks.append(classification_scores) 167 | 168 | # 2) EarlyStopping 169 | if early_stopping_patience != -1: 170 | early_stopping = EarlyStopping(monitor=value_to_monitor, patience=early_stopping_patience, mode='max') 171 | callbacks.append(early_stopping) 172 | 173 | 174 | # Train 175 | if train: 176 | # Train the model. Keras's method argument 'validation_data' is referred as 'testing data' in this code. 177 | hist = model.fit(X_train, y_train, validation_data=[X_test, y_test], epochs=nbr_epochs, batch_size=batch_size, callbacks=callbacks, verbose=2) 178 | 179 | print() 180 | print('-------------------------------------------') 181 | print("Best F1 score:", early_stopping.best, " (epoch number {0})".format(1+np.argmax(hist.history[value_to_monitor]))) 182 | 183 | # Save Training scores 184 | save_model_training_scores("{0}".format(filepath), hist, classification_scores) 185 | 186 | # Print best testing classification report 187 | best_epoch = np.argmax(hist.history[value_to_monitor]) 188 | print(classification_scores.test_report[best_epoch]) 189 | 190 | 191 | # Best epoch results 192 | best_results = model_best_scores(classification_scores, best_epoch) 193 | 194 | # Load weigths from best training epoch into model 195 | save_load_utils.load_all_weights(model, best_model_weights_path) 196 | 197 | # Create confusion matrices 198 | if gen_confusion_matrix: 199 | for i, y_target in enumerate(y_test): 200 | # Compute predictions, flatten 201 | predictions, target = compute_predictions(model, X_test, y_target, ind2label[i]) 202 | # Generate confusion matrices 203 | save_confusion_matrix(target, predictions, list(ind2label[i].values()), "{0}_task_{1}_confusion_matrix_test".format(filepath,str(i+1))) 204 | 205 | 206 | # Validation dataset 207 | if validation: 208 | print() 209 | print("Validation dataset") 210 | print("======================") 211 | # Compute classification report 212 | for i, y_target in enumerate(y_valid): 213 | # Compute predictions, flatten 214 | predictions, target = compute_predictions(model, X_valid, y_target, ind2label[i], nbrTask=i) 215 | 216 | # Only for multi-task 217 | if len(y_train) > 1: 218 | print("For task "+str(i+1)+"\n") 219 | print("====================================================================================") 220 | 221 | print("") 222 | if printPadding: 223 | print("With padding into account") 224 | print(metrics.flat_classification_report([target], [predictions], digits=4)) 225 | print("") 226 | 
print('----------------------------------------------') 227 | print("") 228 | print("Without the padding:") 229 | print(metrics.flat_classification_report([target], [predictions], digits=4, labels=list(ind2label[i].values()))) 230 | 231 | # Generate confusion matrices 232 | save_confusion_matrix(target, predictions, list(ind2label[i].values()), "{0}_task_{1}_confusion_matrix_validation".format(filepath,str(i+1))) 233 | 234 | 235 | # Close file 236 | closePrintToFile(file, stdout_original) 237 | print(end_string) 238 | 239 | return best_results 240 | 241 | 242 | 243 | 244 | def character_embedding_layer(layer_type, character_input, maxChar, nbr_chars, char_embedding_size, 245 | cnn_kernel_size=2, cnn_filters=30, lstm_units=50): 246 | """ 247 | Return layer for computing the character-level representations of words. 248 | 249 | There is two type of architectures: 250 | 251 | Architecture CNN: 252 | - Character Embeddings 253 | - Flatten 254 | - Convolution 255 | - MaxPool 256 | 257 | Architecture BILSTM: 258 | - Character Embeddings 259 | - Flatten 260 | - Bidirectional LSTM 261 | 262 | :param layer_type: Model architecture to use "CNN" or "BILSTM" 263 | :param character_input: Keras Input layer, size of the input 264 | :param maxChar: The maximum numbers of characters in a word. If set to 0, the model will not use character-level representations of the words 265 | :param nbr_chars: Numbers of unique characters present in the data 266 | :param char_embedding_size: size of the character-level word representations 267 | :param cnn_kernel_size: For the CNN architecture, size of the kernel in the Convolution layer 268 | :param cnn_filters: For the CNN architecture, number of filters in the Convolution layer 269 | :param lstm_units: For the BILSTM architecture, dimensionality of the output LSTM space (half of the Bidirectinal LSTM output space) 270 | 271 | :return: Character-level representation layers 272 | """ 273 | 274 | embed_char_out = TimeDistributed(Embedding(nbr_chars, char_embedding_size), name='char_embedding')(character_input) 275 | embed_char = TimeDistributed(Flatten())(embed_char_out) 276 | 277 | if layer_type == "CNN": 278 | conv1d_out = TimeDistributed(Convolution1D(kernel_size=cnn_kernel_size, filters=cnn_filters, padding='same'))(embed_char) 279 | char_emb = TimeDistributed(MaxPooling1D(maxChar))(conv1d_out) 280 | 281 | if layer_type == "BILSTM": 282 | char_emb = Bidirectional(LSTM(lstm_units,return_sequences=True))(embed_char) 283 | 284 | return char_emb 285 | 286 | 287 | 288 | 289 | def get_optimizer(type, learning_rate=0.001, decay=0.0): 290 | """ 291 | Return the optimizer needeed to compile Keras models. 292 | 293 | :param type: Type of optimizer. Two types supported: 'ADAM' and 'RMSprop' 294 | :param learning_rate: float >= 0. Learning rate. 295 | :pram decay:float >= 0. Learning rate decay over each update 296 | 297 | :return: The optimizer to use directly into keras model compiling function. 
298 | """ 299 | 300 | if type == "adam": 301 | return Adam(lr=learning_rate, decay=decay) 302 | 303 | if type == "rmsprop": 304 | return RMSprop(lr=learning_rate, decay=decay) 305 | 306 | -------------------------------------------------------------------------------- /keras/code/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """ 4 | Support functions for dealing with data and building models 5 | """ 6 | 7 | import random 8 | import numpy as np 9 | import tensorflow 10 | random.seed(42) 11 | np.random.seed(42) 12 | tensorflow.set_random_seed(42) 13 | 14 | import sys 15 | import csv 16 | import itertools 17 | 18 | from keras.callbacks import Callback 19 | from keras.preprocessing.sequence import pad_sequences 20 | from keras_contrib.utils import save_load_utils 21 | from sklearn.metrics import precision_recall_fscore_support, confusion_matrix 22 | from sklearn_crfsuite import metrics 23 | 24 | # Plot 25 | import matplotlib 26 | matplotlib.use('agg') 27 | import matplotlib.pyplot as plt 28 | 29 | 30 | 31 | def load_data(filepath): 32 | """ 33 | Load and return the data stored in the given path. 34 | The data is structured as follows: 35 | Each line contains four columns separated by a single space. 36 | Each word has been put on a separate line and there is an empty line after each sentence. 37 | The first item on each line is a word, the second, third and fourth are tags related to the word. 38 | Example: 39 | The sentence "L. Antonielli, Iprefetti dell' Italia napoleonica, Bologna 1983." is represented in the dataset as: 40 | L author b-secondary b-r 41 | . author i-secondary i-r 42 | Antonielli author i-secondary i-r 43 | , author i-secondary i-r 44 | Iprefetti title i-secondary i-r 45 | dell title i-secondary i-r 46 | ’ title i-secondary i-r 47 | Italia title i-secondary i-r 48 | napoleonica title i-secondary i-r 49 | , title i-secondary i-r 50 | Bologna publicationplace i-secondary i-r 51 | 1983 year e-secondary i-r 52 | . year e-secondary e-r 53 | 54 | :param filepath: Path to the data 55 | :return: Four arrays: The first one contains sentences (one array of words per sentence) and the other threes are arrays of tags. 
56 | 57 | """ 58 | 59 | # Arrays to return 60 | words = [] 61 | tags_1 = [] 62 | tags_2 = [] 63 | tags_3 = [] 64 | 65 | word = tags1 = tags2 = tags3 = [] 66 | with open (filepath, "r") as file: 67 | for line in file: 68 | if 'DOCSTART' not in line: #Do not take the first line into consideration 69 | # Check if empty line 70 | if line in ['\n', '\r\n']: 71 | # Append line 72 | words.append(word) 73 | tags_1.append(tags1) 74 | tags_2.append(tags2) 75 | tags_3.append(tags3) 76 | 77 | # Reset 78 | word = [] 79 | tags1 = [] 80 | tags2 = [] 81 | tags3 = [] 82 | 83 | else: 84 | # Split the line into words, tag #1, tag #2, tag #3 85 | w = line[:-1].split(" ") 86 | word.append(w[0]) 87 | tags1.append(w[1]) 88 | tags2.append(w[2]) 89 | tags3.append(w[3]) 90 | 91 | return words,tags_1,tags_2,tags_3 92 | 93 | 94 | 95 | 96 | def setPrintToFile(filename): 97 | """ 98 | Redirect all prints into a file 99 | 100 | :param filename: File to redirect all prints 101 | :return: the file and the original print "direction" 102 | """ 103 | 104 | # Retrieve current print direction 105 | stdout_original = sys.stdout 106 | # Create file 107 | f = open(filename, 'w') 108 | # Set the new print redirection 109 | sys.stdout = f 110 | return f,stdout_original 111 | 112 | 113 | def closePrintToFile(f, stdout_original): 114 | """ 115 | Change the print direction and closes a file. 116 | 117 | :param filename: File to close 118 | :param stdout_original: Print direction 119 | """ 120 | sys.stdout = stdout_original 121 | f.close() 122 | 123 | 124 | 125 | 126 | def mergeDigits(datas, digits_word): 127 | """ 128 | All digits in the given data will be mapped to the same word 129 | 130 | :param datas: The data to transform 131 | :param digits_word: Word to map digits to 132 | :return: The data transformed data 133 | """ 134 | return [[[digits_word if x.isdigit() else x for x in w ] for w in data] for data in datas] 135 | 136 | 137 | 138 | def indexData_x(x, ukn_words): 139 | """ 140 | Map each word in the given data to a unique integer. A special index will be kept for "out-of-vocabulary" words. 141 | 142 | :param x: The data 143 | :return: Two dictionaries: one where words are keys and indexes values, another one "reversed" (keys->index, values->words) 144 | """ 145 | 146 | # Retrieve all words used in the data (with duplicates) 147 | all_text = [w for e in x for w in e] 148 | # Compute the unique words (remove duplicates) 149 | words = list(set(all_text)) 150 | print("Number of entries: ",len(all_text)) 151 | print("Individual entries: ",len(words)) 152 | 153 | # Assign an integer index for each individual word 154 | word2ind = {word: index for index, word in enumerate(words, 2)} 155 | ind2word = {index: word for index, word in enumerate(words, 2)} 156 | 157 | # To deal with out-of-vocabulary words 158 | word2ind.update({ukn_words:1}) 159 | ind2word.update({1:ukn_words}) 160 | 161 | # The index '0' is kept free in both dictionaries 162 | 163 | return word2ind, ind2word 164 | 165 | 166 | def indexData_y(y): 167 | """ 168 | Map each word in the given data to a unique integer. 
169 | 170 | :param y: The data 171 | :return: Two dictionaries: one where words are keys and indexes values, another one "reversed" (keys->index, values->words) 172 | """ 173 | 174 | # Unique attributes in the data, sort alphabetically 175 | labels_t1 = list(set([w for e in y for w in e])) 176 | labels_t1 = sorted(labels_t1, key=str.lower) 177 | print("Number of labels: ", len(labels_t1)) 178 | 179 | # Assign an integer index for each individual label 180 | label2ind = {label: index for index, label in enumerate(labels_t1, 1)} 181 | ind2label = {index: label for index, label in enumerate(labels_t1, 1)} 182 | 183 | # The index '0' is kept free in both dictionaries 184 | 185 | return label2ind, ind2label 186 | 187 | 188 | def encodePadData_x(x, word2ind, maxlen, ukn_words, padding_style): 189 | """ 190 | Transform a data of words in a data of integers, where each entrie as the same length. 191 | 192 | :param x: The data to transform 193 | :param word2ind: Dictionary to retrieve the integer for each word in the data 194 | :param maxlen: The length of each entry in the returned data 195 | :param ukn_words: Key, in the dictionary words-index, to use for words not present in the dictionary 196 | :param padding_style: Padding style to use for having each entry in the data with the same length 197 | :return: The tranformed data 198 | """ 199 | print ('Maximum sequence length - general :', maxlen) 200 | print ('Maximum sequence length - data :', max([len(xx) for xx in x])) 201 | 202 | # Encode: Map each words to the corresponding integer 203 | X_enc = [[word2ind[c] if c in word2ind.keys() else word2ind[ukn_words] for c in xx ] for xx in x] 204 | 205 | # Pad: Each entry in the data must have the same length 206 | X_encode = pad_sequences(X_enc, maxlen=maxlen, padding=padding_style) 207 | 208 | return X_encode 209 | 210 | 211 | def encodePadData_y(y, label2ind, maxlen, padding_style): 212 | """ 213 | Apply one-hot-encoding to each label in the dataset. 
Each entry will have the same length. 214 | 215 | Example: 216 | Input: label2ind={Label_A:1, Label_B:2, Label_C:3}, maxlen=4 217 | y=[ [Label_A, Label_C] , [Label_A, Label_B, Label_C] ] 218 | Output: [ [[1,0,0,0], [1,0,0,0], [0,1,0,0], [0,0,0,1]] , [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] ] (padding is added at the front, and one-hot position 0 marks padding) 219 | 220 | :param y: The data to encode 221 | :param label2ind: Dictionary where each value in the data is mapped to a unique integer 222 | :param maxlen: The length of each entry in the returned data 223 | :param padding_style: Padding style to use for having each entry in the data with the same length 224 | :return: The transformed data 225 | """ 226 | 227 | print ('Maximum sequence length - labels :', maxlen) 228 | 229 | # Encode y (with pad) 230 | def encode(x, n): 231 | """ 232 | Return an array of zeros, except for an entry set to 1 (one-hot-encode) 233 | :param x: Index entry to set to 1 234 | :param n: Length of the array to return 235 | :return: The created array 236 | """ 237 | result = np.zeros(n) 238 | result[x] = 1 239 | return result 240 | 241 | # Transform each label into its index in the data 242 | y_pad = [[0] * (maxlen - len(ey)) + [label2ind[c] for c in ey] for ey in y] 243 | # One-hot-encode label 244 | max_label = max(label2ind.values()) + 1 245 | y_enc = [[encode(c, max_label) for c in ey] for ey in y_pad] 246 | 247 | # Repad (to have numpy array) 248 | y_encode = pad_sequences(y_enc, maxlen=maxlen, padding=padding_style) 249 | 250 | return y_encode 251 | 252 | 253 | 254 | def characterLevelIndex(X, digits_word): 255 | """ 256 | Map each character present in the dataset to a unique integer. All digits are regrouped under a single entry. 257 | 258 | :param X: Data to retrieve characters from 259 | :param digits_word: Word regrouping all digits 260 | :return: A dictionary where each character is mapped to a unique integer, the maximum number of words in a sequence, and the maximum number of characters in a word 261 | """ 262 | 263 | # Create a set of all characters 264 | all_chars = list(set([c for s in X for w in s for c in w])) 265 | 266 | # Create an index for each character 267 | # The index 1 is reserved for the digits, regrouped under the word param `digits_word` 268 | char2ind = {char: index for index, char in enumerate(all_chars, 2)} 269 | ind2char = {index: char for index, char in enumerate(all_chars, 2)} 270 | 271 | # Reserve index 1 for the word regrouping all digits 272 | char2ind.update({digits_word:1}) 273 | ind2char.update({1:digits_word}) 274 | 275 | # For padding 276 | maxWords = max([len(s) for s in X]) 277 | maxChar = max([len(w) for s in X for w in s]) 278 | print("Maximum number of words in a sequence :", maxWords) 279 | print("Maximum number of characters in a word :", maxChar) 280 | 281 | return char2ind, maxWords, maxChar 282 | 283 | 284 | def characterLevelData(X, char2ind, maxWords, maxChar, digits_word, padding_style): 285 | """ 286 | For each word in the data, transform it into an array of characters. All character arrays will have the same length, and all sequences will have the same number of words. 287 | All digits will be mapped to the same character array. 288 | If a character is not present in the character-index dictionary, it is discarded. 289 | 290 | :param X: The data 291 | :param char2ind: Dictionary where each character is mapped to a unique integer 292 | :param maxWords: Maximum number of words in a sequence 293 | :param maxChar: Maximum number of characters in a word 294 | :param digits_word: Word regrouping all digits.
295 | :param padding_style: Padding style to use for having each entry in the data with the same length 296 | :return: The transformed array 297 | """ 298 | 299 | # Transform each word into an array of characters (discards those oov) 300 | X_char = [[[char2ind[c] for c in w if c in char2ind.keys()] if w!=digits_word else [1] for w in s] for s in X] 301 | 302 | # Pad words - Each words has the same number of characters 303 | X_char = pad_sequences([pad_sequences(s, maxChar, padding=padding_style) for s in X_char], maxWords, padding=padding_style) 304 | return X_char 305 | 306 | 307 | 308 | def word2VecEmbeddings(word2ind, num_features_embedding): 309 | """ 310 | Convert a file of pre-computed word embeddings into dictionary: {word -> embedding vector}. Only return words of interest. 311 | If the word isn't in the embedding, returned a zero-vector instead. 312 | 313 | :param word2ind: Dictionary {words -> index}. The keys represented the words for each embeddings will be retrieved. 314 | :param num_features_embedding: Size of the embedding vectors 315 | :return: Array of embeddings vectors. The embeddings vector at position i corresponds to the word with value i in the dictionary param `word2ind` 316 | """ 317 | 318 | # Pre-trained embeddings filepath 319 | file_path = "dataset/pretrained_vectors/vecs_{0}.txt".format(num_features_embedding) 320 | ukn_index = "$UKN$" 321 | 322 | # Read the embeddings file 323 | embeddings_all = {} 324 | with open (file_path, "r") as file: 325 | for line in file: 326 | l = line.split(' ') 327 | embeddings_all[l[0]] = l[1:] 328 | 329 | # Compute the embedding for each word in the dataset 330 | embedding_matrix = np.zeros((len(word2ind)+1, num_features_embedding)) 331 | for word, i in word2ind.items(): 332 | if word in embeddings_all: 333 | embedding_matrix[i] = embeddings_all[word] 334 | # else: 335 | # embedding_matrix[i] = embeddings_all[ukn_index] 336 | 337 | # Delete the word2vec dictionary from memory 338 | del embeddings_all 339 | 340 | return embedding_matrix 341 | 342 | 343 | 344 | class Classification_Scores(Callback): 345 | """ 346 | Add the F1 score on the testing data at the end of each epoch. 347 | In case of multi-outputs, compute the F1 score for each output layer and the mean of all F1 scores. 348 | Compute the training F1 score for each epoch. Store the results internally. 349 | Internally, the accuracy and recall scores will also be stored, both for training and testing dataset. 350 | The model's weigths for the best epoch will be save in a given folder. 351 | """ 352 | 353 | def __init__(self, train_data, ind2label, model_save_path): 354 | """ 355 | :param train_data: The data used to compute training accuracy. 
One array of two arrays => [X_train, y_train] 356 | :param ind2label: Dictionary mapping indexes to labels, used to add tag labels to the results 357 | :param model_save_path: Path to save the best model's weights 358 | """ 359 | self.train_data = train_data 360 | self.ind2label = ind2label 361 | self.model_save_path = model_save_path 362 | self.score_name = 'val_f1' 363 | 364 | 365 | 366 | def on_train_begin(self, logs={}): 367 | self.test_report = [] 368 | self.test_f1s = [] 369 | self.test_acc = [] 370 | self.test_recall = [] 371 | self.train_f1s = [] 372 | self.train_acc = [] 373 | self.train_recall = [] 374 | 375 | self.best_score = -1 376 | 377 | # Add F1-score as a metric to print at end of each epoch 378 | self.params['metrics'].append("val_f1") 379 | 380 | # In case of multiple outputs 381 | if len(self.model.output_layers) > 1: 382 | for output_layer in self.model.output_layers: 383 | self.params['metrics'].append("val_"+output_layer.name+"_f1") 384 | 385 | 386 | 387 | def compute_scores(self, pred, targ): 388 | """ 389 | Compute the precision, recall and F1 scores between the two given arrays pred and targ (targ is the ground truth) 390 | """ 391 | val_predict = np.argmax(pred, axis=-1) 392 | val_targ = np.argmax(targ, axis=-1) 393 | 394 | # Flatten arrays for sklearn 395 | predict_flat = np.ravel(val_predict) 396 | targ_flat = np.ravel(val_targ) 397 | 398 | # Compute scores (precision, recall, F1), ignoring the padding label 0 399 | return precision_recall_fscore_support(targ_flat, predict_flat, average='weighted', labels=[x for x in np.unique(targ_flat) if x!=0])[:3] 400 | 401 | 402 | def compute_epoch_training_F1(self): 403 | """ 404 | Compute and save the F1 score for the training data 405 | """ 406 | in_length = len(self.model.input_layers) 407 | out_length = len(self.model.output_layers) 408 | predictions = self.model.predict(self.train_data[0]) 409 | if len(predictions) != out_length: 410 | predictions = [predictions] 411 | 412 | vals_acc = [] 413 | vals_recall = [] 414 | vals_f1 = [] 415 | for i,pred in enumerate(predictions): 416 | _val_acc, _val_recall, _val_f1 = self.compute_scores(np.asarray(pred), self.train_data[1][i]) 417 | vals_acc.append(_val_acc) 418 | vals_recall.append(_val_recall) 419 | vals_f1.append(_val_f1) 420 | 421 | self.train_acc.append(sum(vals_acc)/len(vals_acc)) 422 | self.train_recall.append(sum(vals_recall)/len(vals_recall)) 423 | self.train_f1s.append(sum(vals_f1)/len(vals_f1)) 424 | 425 | 426 | def classification_report(self, i, pred, targ, printPadding=False): 427 | """ 428 | Compute the classification report for the given predictions. 429 | """ 430 | 431 | # Hold all classification reports 432 | reports = [] 433 | 434 | # The model predicts probabilities for each tag. Retrieve the id of the most probable tag.
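# For instance (illustrative values, not from the dataset): if pred[0] were [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
# np.argmax(pred, axis=-1)[0] would be [1, 0]: one tag index per token, mapped back to tag names below via ind2label.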
435 | pred_index = np.argmax(pred, axis=-1) 436 | # Reverse the one-hot encoding for target 437 | true_index = np.argmax(targ, axis=-1) 438 | 439 | # Index 0 in the predictions referes to padding 440 | ind2labelNew = self.ind2label[i].copy() 441 | ind2labelNew.update({0: "null"}) 442 | 443 | # Compute the labels for each prediction 444 | pred_label = [[ind2labelNew[x] for x in a] for a in pred_index] 445 | true_label = [[ind2labelNew[x] for x in b] for b in true_index] 446 | 447 | # CLASSIFICATION REPORTS 448 | reports.append("") 449 | if printPadding: 450 | reports.append("With padding into account") 451 | reports.append(metrics.flat_classification_report(true_label, pred_label, digits=4)) 452 | reports.append("") 453 | reports.append('----------------------------------------------') 454 | reports.append("") 455 | reports.append("Without the padding:") 456 | reports.append(metrics.flat_classification_report(true_label, pred_label, digits=4, labels=list(self.ind2label[i].values()))) 457 | return '\n'.join(reports) 458 | 459 | 460 | def on_epoch_end(self, epoch, logs={}): 461 | """ 462 | At the end of each epoch, compute the F1 score for the validation data. 463 | In case of multi-outputs model, compute one value per output and average all to return the overall F1 score. 464 | Same model's weights for the best epoch. 465 | """ 466 | self.compute_epoch_training_F1() 467 | in_length = len(self.model.input_layers) # X data - to predict from 468 | out_length = len(self.model.output_layers) # Number of tasks 469 | 470 | # Compute the model predictions 471 | predictions = self.model.predict(self.validation_data[:in_length]) 472 | # In case of single output 473 | if len(predictions) != out_length: 474 | predictions = [predictions] 475 | 476 | 477 | vals_acc = [] 478 | vals_recall = [] 479 | vals_f1 = [] 480 | reports = "" 481 | # Iterate over all output predictions 482 | for i,pred in enumerate(predictions): 483 | _val_acc, _val_recall, _val_f1 = self.compute_scores(np.asarray(pred), self.validation_data[in_length+i]) 484 | 485 | # Classification report 486 | reports += "For task "+str(i+1)+"\n" 487 | reports += "====================================================================================" 488 | reports += self.classification_report(i,np.asarray(pred), self.validation_data[in_length+i]) + "\n\n\n" 489 | 490 | # Add scores internally 491 | vals_acc.append(_val_acc) 492 | vals_recall.append(_val_recall) 493 | vals_f1.append(_val_f1) 494 | 495 | # Add F1 score to be log 496 | f1_name = "val_"+self.model.output_layers[i].name+"_f1" 497 | logs[f1_name] = _val_f1 498 | 499 | 500 | # Add classification reports for all the predicitions/tasks 501 | self.test_report.append(reports) 502 | 503 | # Add internally 504 | self.test_acc.append(sum(vals_acc)/len(vals_acc)) 505 | self.test_recall.append(sum(vals_recall)/len(vals_recall)) 506 | self.test_f1s.append(sum(vals_f1)/len(vals_f1)) 507 | 508 | # Add to log 509 | f1_mean = sum(vals_f1)/len(vals_f1) 510 | logs["val_f1"] = f1_mean 511 | 512 | # Save best model's weights 513 | if f1_mean > self.best_score: 514 | self.best_score = f1_mean 515 | save_load_utils.save_all_weights(self.model, self.model_save_path) 516 | 517 | 518 | 519 | def write_to_csv(filename, columns, rows): 520 | """ 521 | Create a .csv file with the data given 522 | 523 | :param filename: Path and name of the .csv file, without csv extension 524 | :param columns: Columns of the csv file (First row of the file) 525 | :param rows: Data to write into the csv file, given per row 526 | 527 
| """ 528 | with open(filename+'.csv', 'w') as csvfile: 529 | wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL) 530 | wr.writerow(columns) 531 | for n in rows: 532 | wr.writerow(n) 533 | 534 | 535 | def save_model_training_scores(filename, hist, classification_scores): 536 | """ 537 | Create a .csv file containg the model training metrics for each epoch 538 | 539 | :param filename: Path and name of the .csv file without csv extension 540 | :param hist: Default model training history returned by Keras 541 | :param classification_scores: Classification_Scores instance used as callback in the model's training 542 | 543 | :return: Nothing. 544 | """ 545 | csv_values = [] 546 | 547 | csv_columns = ["Epoch", "Training Accuracy", "Training Recall", "Training F1", "Testing Accuracy", "Testing Recall", "Testing F1"] 548 | 549 | csv_values.append(hist.epoch) # Epoch column 550 | 551 | # Training metrics 552 | csv_values.append(classification_scores.train_acc) # Training Accuracy column 553 | csv_values.append(classification_scores.train_recall) # Training Recall column 554 | csv_values.append(classification_scores.train_f1s) # Training F1 column 555 | 556 | # Testing metrics 557 | csv_values.append(classification_scores.test_acc) # Testing Accuracy column 558 | csv_values.append(classification_scores.test_recall) # Testing Accuracy column 559 | csv_values.append(classification_scores.test_f1s) # Testing Accuracy column 560 | 561 | # Creste file 562 | write_to_csv(filename, csv_columns, zip(*csv_values)) 563 | return 564 | 565 | 566 | def model_best_scores(classification_scores, best_epoch): 567 | """ 568 | Return the metrics from best epoch 569 | 570 | :param classification_scores: Classification_Scores instance used as callback in the model's training 571 | :param best_epoch: Best training epoch index 572 | 573 | :return Best epoch training metrics: ["Best epoch", "Training Accuracy", "Training Recall", "Training F1", "Testing Accuracy", "Testing Recall", "Testing F1"] 574 | """ 575 | best_values = [] 576 | best_values.append(1 + best_epoch) 577 | 578 | best_values.append(classification_scores.train_acc[best_epoch]) 579 | best_values.append(classification_scores.train_recall[best_epoch]) 580 | best_values.append(classification_scores.train_f1s[best_epoch]) 581 | 582 | best_values.append(classification_scores.test_acc[best_epoch]) 583 | best_values.append(classification_scores.test_recall[best_epoch]) 584 | best_values.append(classification_scores.test_f1s[best_epoch]) 585 | 586 | return best_values 587 | 588 | 589 | 590 | def compute_predictions(model, X, y, ind2label, nbrTask=-1): 591 | """ 592 | Compute the predictions and ground truth 593 | 594 | :param model: The model making predictions 595 | :param X: Data 596 | :param y: Ground truth 597 | :param ind2label: Dictionaries of index to labels. Used to return have labels to predictions. 598 | 599 | :return: The predictions and groud truth ready to be compared, flatten (1-d array). 
600 | """ 601 | 602 | # Compute training score 603 | pred = model.predict(X) 604 | if len(model.outputs)>1: # For multi-task 605 | pred = pred[nbrTask] 606 | pred = np.asarray(pred) 607 | # Compute validation score 608 | pred_index = np.argmax(pred, axis=-1) 609 | 610 | # Reverse the one-hot encoding 611 | true_index = np.argmax(y, axis=-1) 612 | 613 | # Index 0 in the predictions referes to padding 614 | ind2labelNew = ind2label.copy() 615 | ind2labelNew.update({0: "null"}) 616 | 617 | # Compute the labels for each prediction 618 | pred_label = [[ind2labelNew[x] for x in a] for a in pred_index] 619 | true_label = [[ind2labelNew[x] for x in b] for b in true_index] 620 | 621 | # Flatten data 622 | predict_flat = np.ravel(pred_label) 623 | targ_flat = np.ravel(true_label) 624 | 625 | return predict_flat, targ_flat 626 | 627 | 628 | 629 | def save_confusion_matrix(y_target, y_predictions, labels, figure_path, figure_size=(20,20)): 630 | """ 631 | Generate two confusion matrices plots: with and without normalization. 632 | 633 | :param y_target: Tags groud truth 634 | :param y_predictions: Tags predictions 635 | :param labels: Predictions classes to use 636 | :param figure_path: Path the save figures 637 | :param figure_size: Size of the generated figures 638 | 639 | :return: Nothing 640 | """ 641 | 642 | # Compute confusion matrices 643 | cnf_matrix = confusion_matrix(y_target, y_predictions) 644 | 645 | # Confusion matrix 646 | plt.figure(figsize=figure_size) 647 | plot_confusion_matrix(cnf_matrix, classes=labels, title='Confusion matrix, without normalization') 648 | plt.savefig("{0}.png".format(figure_path)) 649 | 650 | # Confusion matrix with normalization 651 | plt.figure(figsize=figure_size) 652 | plot_confusion_matrix(cnf_matrix, classes=labels, normalize=True, title='Normalized confusion matrix') 653 | plt.savefig("{0}_normalized.png".format(figure_path)) 654 | 655 | return 656 | 657 | 658 | def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues, printToFile=False): 659 | """ 660 | FROM: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 661 | This function prints and plots the confusion matrix. 662 | Normalization can be applied by setting `normalize=True`. 663 | """ 664 | if normalize: 665 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 666 | if printToFile: print("Normalized confusion matrix") 667 | else: 668 | if printToFile: print('Confusion matrix, without normalization') 669 | 670 | if printToFile: print(cm) 671 | 672 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 673 | plt.title(title) 674 | plt.colorbar() 675 | tick_marks = np.arange(len(classes)) 676 | plt.xticks(tick_marks, classes, rotation=90) 677 | plt.yticks(tick_marks, classes) 678 | 679 | fmt = '.2f' if normalize else 'd' 680 | thresh = cm.max() / 2. 
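# (illustrative) cells whose value exceeds half of the maximum are labelled in white, the others in black,
# so the numbers printed in the next loop stay readable against the colormap.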
681 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 682 | plt.text(j, i, format(cm[i, j], fmt), 683 | horizontalalignment="center", 684 | color="white" if cm[i, j] > thresh else "black") 685 | 686 | plt.tight_layout() 687 | plt.ylabel('True label') 688 | plt.xlabel('Predicted label') 689 | -------------------------------------------------------------------------------- /keras/main_multiTaskLearning.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow 4 | random.seed(42) 5 | np.random.seed(42) 6 | tensorflow.set_random_seed(42) 7 | 8 | # Models and Utils scripts 9 | from code.models import * 10 | from code.utils import * 11 | 12 | 13 | # Load entire data 14 | X_train_w, y_train1_w, y_train2_w, y_train3_w = load_data("dataset/clean_train.txt") # Training data 15 | X_test_w, y_test1_w, y_test2_w, y_test3_w = load_data("dataset/clean_test.txt") # Testing data 16 | X_valid_w, y_valid1_w, y_valid2_w, y_valid3_w = load_data("dataset/clean_valid.txt") # Validation data 17 | 18 | 19 | # Merge digits under the same word 20 | digits_word = "$NUM$" 21 | X_train_w, X_test_w, X_valid_w = mergeDigits([X_train_w, X_test_w, X_valid_w], digits_word) 22 | 23 | # Compute indexes for words+labels in the training data 24 | ukn_words = "out-of-vocabulary" # Out-of-vocabulary words entry in the "words to index" dictionary 25 | word2ind, ind2word = indexData_x(X_train_w, ukn_words) 26 | label2ind1, ind2label1 = indexData_y(y_train1_w) 27 | label2ind2, ind2label2 = indexData_y(y_train2_w) 28 | label2ind3, ind2label3 = indexData_y(y_train3_w) 29 | 30 | print(ind2label1) 31 | print(ind2label2) 32 | print(ind2label3) 33 | 34 | 35 | 36 | # Convert data into indexes data 37 | maxlen = max([len(xx) for xx in X_train_w]) 38 | padding_style = 'pre' # 'pre' or 'post': Style of the padding, in order to have sequence of the same size 39 | X_train = encodePadData_x(X_train_w, word2ind, maxlen, ukn_words, padding_style) 40 | X_test = encodePadData_x(X_test_w, word2ind, maxlen, ukn_words, padding_style) 41 | X_valid = encodePadData_x(X_valid_w, word2ind, maxlen, ukn_words, padding_style) 42 | 43 | y_train1 = encodePadData_y(y_train1_w, label2ind1, maxlen, padding_style) 44 | y_test1 = encodePadData_y(y_test1_w, label2ind1, maxlen, padding_style) 45 | y_valid1 = encodePadData_y(y_valid1_w, label2ind1, maxlen, padding_style) 46 | 47 | y_train2 = encodePadData_y(y_train2_w, label2ind2, maxlen, padding_style) 48 | y_test2 = encodePadData_y(y_test2_w, label2ind2, maxlen, padding_style) 49 | y_valid2 = encodePadData_y(y_valid2_w, label2ind2, maxlen, padding_style) 50 | 51 | y_train3 = encodePadData_y(y_train3_w, label2ind3, maxlen, padding_style) 52 | y_test3 = encodePadData_y(y_test3_w, label2ind3, maxlen, padding_style) 53 | y_valid3 = encodePadData_y(y_valid3_w, label2ind3, maxlen, padding_style) 54 | 55 | 56 | 57 | # Create the character level data 58 | char2ind, maxWords, maxChar = characterLevelIndex(X_train_w, digits_word) 59 | X_train_char = characterLevelData(X_train_w, char2ind, maxWords, maxChar, digits_word, padding_style) 60 | X_test_char = characterLevelData(X_test_w, char2ind, maxWords, maxChar, digits_word, padding_style) 61 | X_valid_char = characterLevelData(X_valid_w, char2ind, maxWords, maxChar, digits_word, padding_style) 62 | 63 | 64 | 65 | 66 | # Model parameters 67 | epoch = 25 68 | batch = 100 69 | dropout = 0.5 70 | lstm_size = 200 71 | 72 | 73 | y_train = [y_train1, y_train2, y_train3] 74 | 
y_test = [y_test1, y_test2, y_test3] 75 | y_valid = [y_valid1, y_valid2, y_valid3] 76 | ind2label = [ind2label1, ind2label2, ind2label3] 77 | 78 | model_name = "multi_task" 79 | 80 | BiLSTM_model(model_name, True, "crf", 81 | [X_train, X_train_char], [X_test, X_test_char], word2ind, maxWords, 82 | y_train, y_test, ind2label, 83 | validation=True, X_valid=[X_valid, X_valid_char], y_valid=y_valid, 84 | pretrained_embedding=True, word_embedding_size=300, 85 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 86 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 87 | gen_confusion_matrix=True, early_stopping_patience=5 88 | ) 89 | 90 | 91 | print("FINITO") -------------------------------------------------------------------------------- /keras/main_threeTasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow 4 | 5 | # Seed 6 | random.seed(42) 7 | np.random.seed(42) 8 | tensorflow.set_random_seed(42) 9 | 10 | # Models and Utils scripts 11 | from code.models import * 12 | from code.utils import * 13 | 14 | 15 | # Load entire data 16 | X_train_w, y_train1_w, y_train2_w, y_train3_w = load_data("dataset/clean_train.txt") # Training data 17 | X_test_w, y_test1_w, y_test2_w, y_test3_w = load_data("dataset/clean_test.txt") # Testing data 18 | X_valid_w, y_valid1_w, y_valid2_w, y_valid3_w = load_data("dataset/clean_valid.txt") # Validation data 19 | 20 | 21 | # Merge digits under the same word 22 | digits_word = "$NUM$" 23 | X_train_w, X_test_w, X_valid_w = mergeDigits([X_train_w, X_test_w, X_valid_w], digits_word) 24 | 25 | # Compute indexes for words+labels in the training data 26 | ukn_words = "out-of-vocabulary" # Out-of-vocabulary words entry in the "words to index" dictionary 27 | word2ind, ind2word = indexData_x(X_train_w, ukn_words) 28 | label2ind1, ind2label1 = indexData_y(y_train1_w) 29 | label2ind2, ind2label2 = indexData_y(y_train2_w) 30 | label2ind3, ind2label3 = indexData_y(y_train3_w) 31 | 32 | print(ind2label1) 33 | print(ind2label2) 34 | print(ind2label3) 35 | 36 | 37 | 38 | # Convert data into indexes data 39 | maxlen = max([len(xx) for xx in X_train_w]) 40 | padding_style = 'pre' # 'pre' or 'post': Style of the padding, in order to have sequence of the same size 41 | X_train = encodePadData_x(X_train_w, word2ind, maxlen, ukn_words, padding_style) 42 | X_test = encodePadData_x(X_test_w, word2ind, maxlen, ukn_words, padding_style) 43 | X_valid = encodePadData_x(X_valid_w, word2ind, maxlen, ukn_words, padding_style) 44 | 45 | y_train1 = encodePadData_y(y_train1_w, label2ind1, maxlen, padding_style) 46 | y_test1 = encodePadData_y(y_test1_w, label2ind1, maxlen, padding_style) 47 | y_valid1 = encodePadData_y(y_valid1_w, label2ind1, maxlen, padding_style) 48 | 49 | y_train2 = encodePadData_y(y_train2_w, label2ind2, maxlen, padding_style) 50 | y_test2 = encodePadData_y(y_test2_w, label2ind2, maxlen, padding_style) 51 | y_valid2 = encodePadData_y(y_valid2_w, label2ind2, maxlen, padding_style) 52 | 53 | y_train3 = encodePadData_y(y_train3_w, label2ind3, maxlen, padding_style) 54 | y_test3 = encodePadData_y(y_test3_w, label2ind3, maxlen, padding_style) 55 | y_valid3 = encodePadData_y(y_valid3_w, label2ind3, maxlen, padding_style) 56 | 57 | 58 | 59 | # Create the character level data 60 | char2ind, maxWords, maxChar = characterLevelIndex(X_train_w, digits_word) 61 | X_train_char = characterLevelData(X_train_w, char2ind, maxWords, 
maxChar, digits_word, padding_style) 62 | X_test_char = characterLevelData(X_test_w, char2ind, maxWords, maxChar, digits_word, padding_style) 63 | X_valid_char = characterLevelData(X_valid_w, char2ind, maxWords, maxChar, digits_word, padding_style) 64 | 65 | 66 | # Training, Testing and Validation data for the model (word emb + char features) 67 | X_training = [X_train, X_train_char] 68 | X_testing = [X_test, X_test_char] 69 | X_validation = [X_valid, X_valid_char] 70 | 71 | 72 | # Model parameters 73 | epoch = 25 74 | batch = 100 75 | dropout = 0.5 76 | lstm_size = 200 77 | 78 | 79 | 80 | model_name = "task1" 81 | BiLSTM_model(model_name, True, "crf", 82 | X_training, X_testing, word2ind, maxWords, 83 | [y_train1], [y_test1], [ind2label1], 84 | validation=True, X_valid=X_validation, y_valid=[y_valid1], 85 | pretrained_embedding=True, word_embedding_size=300, 86 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 87 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 88 | gen_confusion_matrix=True, early_stopping_patience=5 89 | ) 90 | 91 | print("=====") 92 | 93 | model_name = "task2" 94 | BiLSTM_model(model_name, True, "crf", 95 | X_training, X_testing, word2ind, maxWords, 96 | [y_train2], [y_test2], [ind2label2], 97 | validation=True, X_valid=X_validation, y_valid=[y_valid2], 98 | pretrained_embedding=True, word_embedding_size=300, 99 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 100 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 101 | gen_confusion_matrix=True, early_stopping_patience=5 102 | ) 103 | 104 | print("=====") 105 | 106 | model_name = "task3" 107 | BiLSTM_model(model_name, True, "crf", 108 | X_training, X_testing, word2ind, maxWords, 109 | [y_train3], [y_test3], [ind2label3], 110 | validation=True, X_valid=X_validation, y_valid=[y_valid3], 111 | pretrained_embedding=True, word_embedding_size=300, 112 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 113 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 114 | gen_confusion_matrix=True, early_stopping_patience=5 115 | ) 116 | 117 | 118 | print("Done.") 119 | -------------------------------------------------------------------------------- /tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Tensorflow implementation 2 | 3 | ## How to 4 | A model can be trained using `ref_model.py`. All parameters can be tuned there too (see comments). In order to do this, both the dataset and the pretrained vectors need to be stored in the same locations as for the Keras implementation. 5 | 6 | python ref_model.py 7 | 8 | Once a model is trained, it can be used interactively by calling `play_with.py`. Model selection can be done using `cv_model.py`, which implements a grid search. 9 | 10 | ## Contents 11 | * `README.md` this file. 12 | * `utils/` 13 | * [data utils](utils/data_utils.py) general data utility functions. 14 | * [general utils](utils/general_utils.py) other utility functions. 15 | * [reference parsing model](ref_model.py) contains the main RefModel class discussed in the paper, and can be run to train an instance (assumes the dataset and pretrained vectors are available). 16 | * [cross validation](cv_model.py) contains code to fit multiple models for model selection or fine-tuning (assumes the dataset and pretrained vectors are available).
17 | * [play with](play_with.py) contains code to load a model and use it with an interactive terminal. 18 | 19 | ## Dependencies 20 | * TensorFlow: 1.4.0 21 | * Numpy: 1.13.3 22 | * Sklearn : 0.19.1 23 | * Python 3.5 24 | 25 | ## Future work 26 | * Add a conf file, ideally shared with the implementation in Keras. 27 | * Add a multitask implementation. -------------------------------------------------------------------------------- /tensorflow/cv_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Cross validation for model selection 3 | """ 4 | 5 | import numpy as np 6 | import itertools as it 7 | from collections import OrderedDict 8 | import os 9 | from model.data_utils import build_data, load_vocab, get_processing_word,\ 10 | coNLLDataset_full 11 | from ref_model import RefModel 12 | 13 | # GLOBALS 14 | 15 | # dataset locations and basic configs 16 | filename_dev = "../dataset/clean_test.txt" 17 | filename_test = "../dataset/clean_valid.txt" 18 | filename_train = "../dataset/clean_train.txt" 19 | which_tags = -3 # -1, -2, -3: Ackerman author b-s b-secondary b-r 20 | task_dir = "cv_%d"%which_tags 21 | use_chars = True # parameter to change globally 22 | use_pretrained = True # parameter to change globally 23 | max_iter = None # if not None, max number of examples in Dataset 24 | n_epocs = 25 25 | dim_words = [100,300] # pretrained word embeddings, be they exist! 26 | 27 | # vocabs (created with build_data) 28 | filename_words = "working_dir/words.txt" 29 | filename_words_ext = "working_dir/words_ext.txt" 30 | filename_tags = "working_dir/tags.txt" 31 | filename_chars = "working_dir/chars.txt" 32 | 33 | # build data for all possible models 34 | build_data(filename_dev,filename_test,filename_train,dim_words,filename_words, 35 | filename_words_ext,filename_tags,filename_chars, 36 | filename_word="../pretrained_vectors/vecs_{}.txt", 37 | filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz", 38 | which_tags=which_tags) 39 | 40 | # load vocabs 41 | vocab_words = load_vocab(filename_words) 42 | if use_pretrained: 43 | vocab_words = load_vocab(filename_words_ext) 44 | vocab_tags = load_vocab(filename_tags) 45 | vocab_chars = load_vocab(filename_chars) 46 | nwords = len(vocab_words) 47 | nchars = len(vocab_chars) 48 | ntags = len(vocab_tags) 49 | 50 | # load data 51 | processing_word = get_processing_word(vocab_words, 52 | vocab_chars, lowercase=True, chars=use_chars) 53 | processing_tag = get_processing_word(vocab_tags, 54 | lowercase=False, allow_unk=False) 55 | X_dev, y_dev = coNLLDataset_full(filename_dev, processing_word, processing_tag, max_iter, which_tags) 56 | X_train, y_train = coNLLDataset_full(filename_train, processing_word, processing_tag, max_iter, which_tags) 57 | X_valid, y_valid = coNLLDataset_full(filename_test, processing_word, processing_tag, max_iter, which_tags) 58 | 59 | print("Size of train, test and valid sets (in number of sentences): ") 60 | print(len(X_train), " ", len(y_train), " ", len(X_dev), " ", len(y_dev), " ", len(X_valid), " ", len(y_valid)) 61 | 62 | def train_model(config,conf_id): 63 | """Train, evaluates and reports on a single model 64 | 65 | :param config: (dict) parameter configuration 66 | :param conf_id: (int) id of the configuration to fit 67 | :return: None 68 | """ 69 | 70 | # general config 71 | model_name = str(config) 72 | print("Model configuration:",model_name) 73 | dir_output = "results/%s/%s_%s_%d"%(task_dir,str(use_pretrained),str(use_chars),conf_id) 74 | print("Model 
directory:", dir_output) 75 | os.makedirs(dir_output, exist_ok=True) 76 | os.makedirs(dir_output, exist_ok=True) 77 | with open(os.path.join(dir_output, "config_%s_%s_%d.txt"%(str(use_pretrained),str(use_chars),c)), "w") as f: 78 | f.write(model_name) 79 | dir_model = os.path.join(dir_output, "model.weights") 80 | 81 | model = RefModel(processing_word=processing_word, processing_tag=processing_tag, vocab_chars=vocab_chars, 82 | vocab_words=vocab_words, vocab_tags=vocab_tags, nwords=nwords, nchars=nchars, 83 | ntags=ntags, dir_output=dir_output, dir_model=dir_model, dim_word=config["dim_word"],dim_char=config["dim_char"], 84 | use_pretrained=config["use_pretrained"],train_embeddings=config["train_embeddings"], 85 | dropout=config["dropout"],batch_size=config["batch_size"],lr_method=config["lr_method"],lr=config["lr"], 86 | lr_decay=config["lr_decay"],clip=config["clip"],nepoch_no_imprv=config["nepoch_no_imprv"],l2_reg_lambda=config["l2_reg_lambda"], 87 | hidden_size_char=config["hidden_size_char"],hidden_size_lstm=config["hidden_size_lstm"], 88 | use_crf=config["use_crf"],use_chars=config["use_chars"],use_cnn=config["use_cnn"],random_state=config["random_state"]) 89 | 90 | # fit 91 | fitted = model.fit(X_train, y_train, X_dev, y_dev, n_epocs) 92 | print("Test final f1 score: ", fitted.best_score) 93 | ev_msg = fitted.evaluate(X_valid, y_valid) 94 | 95 | # report 96 | with open(os.path.join("results/%s"%(task_dir),"cv_report.txt"),"a") as f: 97 | f.write("------------\n") 98 | f.write("Model: %s\n"%model_name) 99 | 100 | f.write("Test final f1 score: %f\n"%fitted.best_score) 101 | 102 | # evaluate 103 | f.write("Evaluation: %s\n"%str(ev_msg)) 104 | with open(os.path.join("results/%s"%(task_dir), "cv_report.csv"), "a") as f: 105 | f.write("Model_%s_%s_%d"%(str(use_pretrained),str(use_chars),c)+";"+str(fitted.best_score)+";"+str(ev_msg["f1"])+";"+str(ev_msg["acc"])+";"+str(ev_msg["p"])+";"+str(ev_msg["r"])+"\n") 106 | 107 | if __name__ == "__main__": 108 | 109 | # Param search 110 | # NB use chars or not is decided above, as is the task (which_tags). 
111 | param_distribs = OrderedDict({ 112 | "dim_word" : [100,300], 113 | "dim_char" : [100,300], 114 | "use_pretrained" : [use_pretrained], # see above 115 | "train_embeddings" : [True,False], # only used if use_pretrained is True 116 | "dropout" : [0.5], 117 | "batch_size" : [50], 118 | "lr_method" : ["adam"], 119 | "lr" : [0.001], 120 | "lr_decay" : [0.9], 121 | "clip" : [-1], 122 | "nepoch_no_imprv" : [5], 123 | "l2_reg_lambda" : [0], 124 | "hidden_size_char" : [100], 125 | "hidden_size_lstm" : [300], 126 | "use_crf" : [True,False], 127 | "use_chars" : [use_chars], # see above 128 | "use_cnn" : [True,False], 129 | "random_state" : [0] # reproducibility 130 | }) 131 | 132 | # create a list of configurations 133 | n_configs = np.prod([len(v) for v in param_distribs.values()]) 134 | print("Total number of configurations to try:",n_configs) 135 | 136 | allNames = sorted(param_distribs) 137 | combinations = it.product(*(param_distribs[Name] for Name in allNames)) 138 | combinations = list(combinations) 139 | assert len(combinations)==n_configs 140 | 141 | # initialize report csv file 142 | os.makedirs("results/%s"%(task_dir),exist_ok=True) 143 | if not os.path.isfile(os.path.join("results/%s"%(task_dir), "cv_report.csv")): 144 | with open(os.path.join("results/%s"%(task_dir), "cv_report.csv"), "w") as f: 145 | f.write("model_name;best_f1_test_score;f1_validation;accuracy_validation;precision_validation;recall_validation\n") 146 | 147 | for n,c in enumerate(combinations): 148 | config = {k:v for k,v in zip(allNames,c)} 149 | train_model(config,n) -------------------------------------------------------------------------------- /tensorflow/play_with.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use a pretrained model for predictions 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import os 7 | 8 | from ref_model import RefModel 9 | from model.data_utils import load_vocab, get_processing_word, coNLLDataset_full 10 | 11 | def interactive_shell(model): 12 | """Creates interactive shell to play with model 13 | 14 | :param model: instance of RefModel 15 | """ 16 | print(""" 17 | This is an interactive mode. 18 | To exit, enter 'exit'. 
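The predicted tag for each token is printed underneath your sentence.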
19 | You can enter a sentence like 20 | input> I love Paris""") 21 | 22 | while True: 23 | sentence = input("input> ") 24 | 25 | words_raw = sentence.strip().split() 26 | 27 | if words_raw and words_raw[0] in ["exit","quit","bye","q","stop"]: 28 | break 29 | 30 | preds = model.predict(words_raw) 31 | 32 | print(" ".join(words_raw)) 33 | print(" ".join(preds)) 34 | 35 | # dataset locations and basic configs 36 | filename_dev = "../dataset/clean_test.txt" 37 | filename_test = "../dataset/clean_valid.txt" 38 | filename_train = "../dataset/clean_train.txt" 39 | which_tags = -3 # -1, -2, -3: Ackerman author b-secondary b-r 40 | use_chars = True 41 | max_iter = None # if None, max number of examples in Dataset 42 | 43 | # general config: trained model directory 44 | dir_output = "results/test_run" 45 | dir_model = os.path.join(dir_output, "model.weights") 46 | 47 | # vocabs (created with build_data) 48 | filename_words = "working_dir/words.txt" 49 | filename_words_ext = "working_dir/words_ext.txt" 50 | filename_tags = "working_dir/tags.txt" 51 | filename_chars = "working_dir/chars.txt" 52 | 53 | # load vocabs 54 | vocab_words = load_vocab(filename_words) 55 | vocab_tags = load_vocab(filename_tags) 56 | vocab_chars = load_vocab(filename_chars) 57 | nwords = len(vocab_words) 58 | nchars = len(vocab_chars) 59 | ntags = len(vocab_tags) 60 | 61 | # load data 62 | processing_word = get_processing_word(vocab_words, 63 | vocab_chars, lowercase=True, chars=use_chars) 64 | processing_tag = get_processing_word(vocab_tags, 65 | lowercase=False, allow_unk=False) 66 | model = RefModel(processing_word=processing_word,processing_tag=processing_tag,vocab_chars=vocab_chars, 67 | vocab_words=vocab_words,vocab_tags=vocab_tags,nwords=nwords,nchars=nchars, 68 | ntags=ntags,dir_output=dir_output,dir_model=dir_model,use_chars=use_chars,random_state=0, 69 | use_pretrained=True, hidden_size_char=50, batch_size=100, lr_decay=1, l2_reg_lambda=0, 70 | use_crf=True, use_cnn=False, dim_word=300, hidden_size_lstm=200, lr=0.001, 71 | train_embeddings=True, dim_char=100, lr_method="rmsprop") 72 | 73 | model.restore_session() 74 | interactive_shell(model) -------------------------------------------------------------------------------- /tensorflow/ref_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Reference parsing model 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | import os 9 | from collections import OrderedDict 10 | 11 | from sklearn.base import BaseEstimator, ClassifierMixin 12 | 13 | from model.data_utils import minibatches, pad_sequences, get_chunks, build_data, \ 14 | export_trimmed_word_vectors, load_vocab, get_processing_word, CoNLLDataset, coNLLDataset_full, conv1d 15 | from model.general_utils import Progbar 16 | 17 | class RefModel(BaseEstimator, ClassifierMixin): 18 | """Model for reference parsing""" 19 | 20 | def __init__(self,processing_word,processing_tag,vocab_chars,vocab_words,vocab_tags, 21 | nwords,nchars,ntags,dir_output,dir_model,dim_word=300,dim_char=100,use_pretrained=False,train_embeddings=False, 22 | dropout=0.5,batch_size=50,lr_method="adam",lr=0.001,lr_decay=0.9, 23 | clip=-1,nepoch_no_imprv=10,l2_reg_lambda=0.0,hidden_size_char=100,hidden_size_lstm=300, 24 | use_crf=True,use_chars=True,use_cnn=False,random_state=None): 25 | """ 26 | Initialize the RefModel by simply storing all the hyperparameters. 
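No TensorFlow graph is created at this point: build() is only called later, from fit() or restore_session(), so instances stay cheap to construct (e.g. during the grid search in cv_model.py).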
27 | 28 | :param processing_word: (function) to process words 29 | :param processing_tag: (function) to process tags 30 | :param vocab_chars: (dictionary) of characters 31 | :param vocab_words: (dictionary) of words 32 | :param vocab_tags: (dictionary) of tags 33 | :param nwords: (int) number of words 34 | :param nchars: (int) number of characters 35 | :param ntags: (int) number of tags 36 | :param dir_output: (string) output directory 37 | :param dir_model: (string) model output directory 38 | :param dim_word: (int) dimensionality of word embeddings 39 | :param dim_char: (int) dimensionality of character embeddings 40 | :param use_pretrained: (bool) if to use pretrained embeddings 41 | :param train_embeddings: (bool) if to further train embeddings 42 | :param dropout: (float between 0 and 1) propout percentage 43 | :param batch_size: (int) batch size 44 | :param lr_method: (string) learning method (adagrad, sgd, rmsprop) 45 | :param lr: (float) learning rate 46 | :param lr_decay: (float between 0 and 1) learning rate 47 | :param clip: (float) clip rate 48 | :param nepoch_no_imprv: (int) early stopping number of epochs before interrupting without improvements 49 | :param l2_reg_lambda: (float) lambda for l2 regularization 50 | :param hidden_size_char: (int) size of hidden character lstm layer 51 | :param hidden_size_lstm: (int) size of hidden lstm layer 52 | :param use_crf: (bool) if to use crf prediction 53 | :param use_chars: (bool) if to use characters 54 | :param use_cnn: (bool) if to use cnn over lstm for character embeddings 55 | :param random_state: (int) random state 56 | """ 57 | 58 | # externals 59 | self.processing_word = processing_word 60 | self.processing_tag = processing_tag 61 | self.vocab_chars = vocab_chars 62 | self.vocab_words = vocab_words 63 | self.vocab_tags = vocab_tags 64 | self.nwords = nwords 65 | self.nchars = nchars 66 | self.ntags = ntags 67 | self.dir_output = dir_output 68 | self.dir_model = dir_model 69 | 70 | # embeddings 71 | self.dim_word = dim_word 72 | self.dim_char = dim_char 73 | self.use_pretrained = use_pretrained 74 | self.idx_to_tag = {idx: tag for tag, idx in 75 | self.vocab_tags.items()} 76 | 77 | # training 78 | self.train_embeddings = train_embeddings 79 | self._dropout = dropout 80 | self.batch_size = batch_size 81 | self.lr_method = lr_method 82 | self._lr = lr 83 | self.lr_decay = lr_decay 84 | self.clip = clip # if negative, no clipping 85 | self.nepoch_no_imprv = nepoch_no_imprv 86 | self.l2_reg_lambda = l2_reg_lambda # if 0, no l2 regularization 87 | 88 | # model hyperparameters 89 | self.hidden_size_char = hidden_size_char # lstm on chars 90 | self.hidden_size_lstm = hidden_size_lstm # lstm on word embeddings 91 | 92 | # NOTE: if both chars and crf, only 1.6x slower on GPU 93 | self.use_crf = use_crf # if crf, training is 1.7x slower on CPU 94 | self.use_chars = use_chars # if char embedding, training is 3.5x slower on CPU 95 | self.use_cnn = use_cnn # if to use CNN char embeddings, if not use bi-LSTM 96 | 97 | # embedding files 98 | self._filename_emb = "../pretrained_vectors/vecs_{}.txt".format(self.dim_word) 99 | # trimmed embeddings (created with build_data.py) 100 | self._filename_trimmed = "../pretrained_vectors/vecs_{}.trimmed.npz".format(self.dim_word) 101 | self.embeddings = (export_trimmed_word_vectors(self._filename_trimmed) 102 | if self.use_pretrained else None) 103 | 104 | # extra 105 | self.random_state = random_state 106 | self._session = None 107 | 108 | 109 | def _add_placeholders(self): 110 | """Define placeholder 
entries to computational graph""" 111 | # shape = (batch size, max length of sentence in batch) 112 | self.word_ids = tf.placeholder(tf.int32, shape=[None, None], 113 | name="word_ids") 114 | 115 | # shape = (batch size) 116 | self.sequence_lengths = tf.placeholder(tf.int32, shape=[None], 117 | name="sequence_lengths") 118 | 119 | # shape = (batch size, max length of sentence, max length of word) 120 | self.char_ids = tf.placeholder(tf.int32, shape=[None, None, None], 121 | name="char_ids") 122 | 123 | # shape = (batch_size, max_length of sentence) 124 | self.word_lengths = tf.placeholder(tf.int32, shape=[None, None], 125 | name="word_lengths") 126 | 127 | # shape = (batch size, max length of sentence in batch) 128 | self.labels = tf.placeholder(tf.int32, shape=[None, None], 129 | name="labels") 130 | 131 | # hyper parameters 132 | self.dropout = tf.placeholder(dtype=tf.float32, shape=[], 133 | name="dropout") 134 | self.lr = tf.placeholder(dtype=tf.float32, shape=[], 135 | name="lr") 136 | 137 | # l2 regularization 138 | self.l2_loss = tf.constant(0.0, name="l2_loss") 139 | 140 | 141 | def _get_feed_dict(self, words, labels=None, lr=None, dropout=None): 142 | """ 143 | Given some data, pad it and build a feed dictionary 144 | 145 | :param words: (list) of sentences. A sentence is a list of ids of a list of words. A word is a list of ids 146 | :param labels: (list) of ids 147 | :param lr: (float) learning rate 148 | :param dropout: (float) keep prob 149 | :return: dict {placeholder: value} 150 | """ 151 | 152 | # perform padding of the given data 153 | if self.use_chars: 154 | words = [zip(*w) for w in words] 155 | char_ids,word_ids = zip(*words) 156 | word_ids, sequence_lengths = pad_sequences(word_ids, 0) 157 | char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0, 158 | nlevels=2) 159 | else: 160 | word_ids, sequence_lengths = pad_sequences(words, 0) 161 | 162 | # build feed dictionary 163 | feed = { 164 | self.word_ids: word_ids, 165 | self.sequence_lengths: sequence_lengths 166 | } 167 | 168 | if self.use_chars: 169 | feed[self.char_ids] = char_ids 170 | feed[self.word_lengths] = word_lengths 171 | 172 | if labels is not None: 173 | labels, _ = pad_sequences(labels, 0) 174 | feed[self.labels] = labels 175 | 176 | if lr is not None: 177 | feed[self.lr] = lr 178 | 179 | if dropout is not None: 180 | feed[self.dropout] = dropout 181 | 182 | return feed, sequence_lengths 183 | 184 | 185 | def _add_word_embeddings_op(self): 186 | """Defines self.word_embeddings 187 | 188 | If self.embeddings is not None and is a np array initialized 189 | with pre-trained word vectors, the word embeddings is just a look-up 190 | and we train the vectors if config train_embeddings is True. 191 | Otherwise, a random matrix with the correct shape is initialized. 192 | 193 | Note: add a DropoutWrapper to have dropout within cells. 
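For example (illustrative sizes): with dim_word=300 and hidden_size_char=100, the bi-LSTM character option appends a 2*100 = 200-dimensional character vector to each 300-dimensional word vector, so self.word_embeddings carries 500-dimensional token representations (before dropout).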
194 | """ 195 | 196 | with tf.variable_scope("words"): 197 | if self.embeddings is None: 198 | _word_embeddings = tf.get_variable( 199 | name="_word_embeddings", 200 | dtype=tf.float32, 201 | shape=[self.nwords, self.dim_word]) 202 | else: 203 | _word_embeddings = tf.Variable( 204 | self.embeddings, 205 | name="_word_embeddings", 206 | dtype=tf.float32, 207 | trainable=self.train_embeddings) 208 | 209 | word_embeddings = tf.nn.embedding_lookup(_word_embeddings, 210 | self.word_ids, name="word_embeddings") 211 | 212 | with tf.variable_scope("chars"): 213 | if self.use_chars: 214 | # get char embeddings matrix 215 | _char_embeddings = tf.get_variable( 216 | name="_char_embeddings", 217 | dtype=tf.float32, 218 | shape=[self.nchars, self.dim_char]) 219 | char_embeddings = tf.nn.embedding_lookup(_char_embeddings, 220 | self.char_ids, name="char_embeddings") 221 | 222 | # put the time dimension on axis=1 223 | s = tf.shape(char_embeddings) 224 | # now becomes batch size * max sentence length, char in word, dim_char 225 | char_embeddings = tf.reshape(char_embeddings, 226 | shape=[s[0] * s[1], s[-2], self.dim_char]) 227 | 228 | if self.use_cnn: 229 | widths = [2,3,5] 230 | strides = [1] 231 | outputs = list() 232 | for w in widths: 233 | for st in strides: 234 | with tf.name_scope("conv-maxpool-%d-%d" % (w,st)): 235 | output = conv1d(char_embeddings, self.hidden_size_char, width=w, stride=st) 236 | output = tf.reduce_max(tf.nn.relu(output), 1) # activation and max pooling to have 1 feature vector per word 237 | outputs.append(output) 238 | 239 | # concat output 240 | output = tf.concat(outputs, axis=-1) 241 | 242 | # shape = (batch size, max sentence length, len(widths)*len(strides) * char hidden size) 243 | output = tf.reshape(output, 244 | shape=[s[0], s[1], len(widths)*len(strides) * self.hidden_size_char]) 245 | output = tf.nn.dropout(output, self.dropout) 246 | 247 | else: 248 | # bi-LSTM to learn character embeddings 249 | # reshape word lengths 250 | word_lengths = tf.reshape(self.word_lengths, shape=[s[0]*s[1]]) 251 | 252 | # bi lstm on chars 253 | cell_fw = tf.contrib.rnn.LSTMCell(self.hidden_size_char, 254 | state_is_tuple=True) 255 | cell_bw = tf.contrib.rnn.LSTMCell(self.hidden_size_char, 256 | state_is_tuple=True) 257 | _output = tf.nn.bidirectional_dynamic_rnn( 258 | cell_fw, cell_bw, char_embeddings, 259 | sequence_length=word_lengths, dtype=tf.float32) 260 | 261 | # read and concat output 262 | _, ((_, output_fw), (_, output_bw)) = _output 263 | output = tf.concat([output_fw, output_bw], axis=-1) 264 | 265 | # shape = (batch size, max sentence length, 2*char hidden size) 266 | output = tf.reshape(output, 267 | shape=[s[0], s[1], 2*self.hidden_size_char]) 268 | output = tf.nn.dropout(output, self.dropout) 269 | 270 | word_embeddings = tf.concat([word_embeddings, output], axis=-1) 271 | 272 | self.word_embeddings = tf.nn.dropout(word_embeddings, self.dropout) 273 | 274 | 275 | def _add_logits_op(self): 276 | """Defines self.logits 277 | 278 | For each word in each sentence of the batch, it corresponds to a vector 279 | of scores, of dimension equal to the number of tags. 280 | 281 | Note: add a DropoutWrapper to have dropout within cells. 
282 | """ 283 | 284 | with tf.variable_scope("bi-lstm"): 285 | cell_fw = tf.contrib.rnn.LSTMCell(self.hidden_size_lstm) 286 | cell_bw = tf.contrib.rnn.LSTMCell(self.hidden_size_lstm) 287 | (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn( 288 | cell_fw, cell_bw, self.word_embeddings, 289 | sequence_length=self.sequence_lengths, dtype=tf.float32) 290 | output = tf.concat([output_fw, output_bw], axis=-1) 291 | output = tf.nn.dropout(output, self.dropout) 292 | 293 | # act here to expand to multiple outputs and to add attention 294 | with tf.variable_scope("pred"): 295 | W = tf.get_variable("W", dtype=tf.float32, 296 | shape=[2*self.hidden_size_lstm, self.ntags]) 297 | 298 | b = tf.get_variable("b", shape=[self.ntags], 299 | dtype=tf.float32, initializer=tf.zeros_initializer()) 300 | # l2 regularization 301 | self.l2_loss += tf.nn.l2_loss(W) 302 | self.l2_loss += tf.nn.l2_loss(b) 303 | 304 | nsteps = tf.shape(output)[1] 305 | output = tf.reshape(output, [-1, 2*self.hidden_size_lstm]) 306 | pred = tf.matmul(output, W) + b 307 | self.logits = tf.reshape(pred, [-1, nsteps, self.ntags]) 308 | 309 | 310 | def _add_pred_op(self): 311 | """Defines self.labels_pred 312 | 313 | This op is defined only in the case where we don't use a CRF since in 314 | that case we can make the prediction "in the graph" (thanks to tf 315 | functions in other words). With CRF, as the inference is coded 316 | in python and not in pure tensorflow, we have to make the prediction 317 | outside the graph. 318 | 319 | Note: this is no longer the case, see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf. 320 | """ 321 | 322 | if not self.use_crf: 323 | self.labels_pred = tf.cast(tf.argmax(self.logits, axis=-1), tf.int32) 324 | 325 | 326 | def _add_loss_op(self): 327 | """Defines the loss""" 328 | 329 | if self.use_crf: 330 | log_likelihood, trans_params = tf.contrib.crf.crf_log_likelihood( 331 | self.logits, self.labels, self.sequence_lengths) 332 | self.trans_params = trans_params # need to evaluate it for decoding 333 | self.loss = tf.reduce_mean(-log_likelihood) + self.l2_reg_lambda * self.l2_loss 334 | else: 335 | losses = tf.nn.sparse_softmax_cross_entropy_with_logits( 336 | logits=self.logits, labels=self.labels) 337 | mask = tf.sequence_mask(self.sequence_lengths) 338 | losses = tf.boolean_mask(losses, mask) 339 | self.loss = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss 340 | 341 | 342 | def _add_train_op(self, lr_method, lr, loss, clip=-1): 343 | """ 344 | Defines self.train_op that performs an update on a batch 345 | 346 | :param lr_method: (string) sgd method, for example "adam" 347 | :param lr: (tf.placeholder) tf.float32, learning rate 348 | :param loss: (tensor) tf.float32 loss to minimize 349 | :param clip: (python float) clipping of gradient. 
If < 0, no clipping 350 | :return: None 351 | """ 352 | 353 | _lr_m = lr_method.lower() # lower to make sure 354 | 355 | with tf.variable_scope("train_step"): 356 | if _lr_m == 'adam': # sgd method 357 | optimizer = tf.train.AdamOptimizer(lr) 358 | elif _lr_m == 'adagrad': 359 | optimizer = tf.train.AdagradOptimizer(lr) 360 | elif _lr_m == 'sgd': 361 | optimizer = tf.train.GradientDescentOptimizer(lr) 362 | elif _lr_m == 'rmsprop': 363 | optimizer = tf.train.RMSPropOptimizer(lr) 364 | else: 365 | raise NotImplementedError("Unknown method {}".format(_lr_m)) 366 | 367 | if clip > 0: # gradient clipping if clip is positive 368 | grads, vs = zip(*optimizer.compute_gradients(loss)) 369 | grads, gnorm = tf.clip_by_global_norm(grads, clip) 370 | self.train_op = optimizer.apply_gradients(zip(grads, vs)) 371 | else: 372 | self.train_op = optimizer.minimize(loss) 373 | 374 | 375 | def _predict_batch(self, words): 376 | """ 377 | Predict for a batch of data 378 | 379 | :param words: (list) of sentences 380 | :return: (list) of labels for each sentence 381 | sequence_length 382 | """ 383 | 384 | fd, sequence_lengths = self._get_feed_dict(words, dropout=1.0) 385 | 386 | if self.use_crf: 387 | # get tag scores and transition params of CRF 388 | viterbi_sequences = [] 389 | logits, trans_params = self._session.run( 390 | [self.logits, self.trans_params], feed_dict=fd) 391 | 392 | # iterate over the sentences because no batching in viterbi_decode 393 | for logit, sequence_length in zip(logits, sequence_lengths): 394 | logit = logit[:sequence_length] # keep only the valid steps 395 | viterbi_seq, viterbi_score = tf.contrib.crf.viterbi_decode( 396 | logit, trans_params) 397 | viterbi_sequences += [viterbi_seq] 398 | 399 | return viterbi_sequences, sequence_lengths 400 | 401 | else: 402 | labels_pred = self._session.run(self.labels_pred, feed_dict=fd) 403 | 404 | return labels_pred, sequence_lengths 405 | 406 | 407 | def _run_epoch(self, X_train, y_train, X_dev, y_dev, epoch): 408 | """ 409 | Performs one complete pass over the train set and evaluate on dev 410 | 411 | :param X_train: (list) with training data 412 | :param y_train: (list) with training labels 413 | :param X_dev: (list) with testing data 414 | :param y_dev: (list) with testing labels 415 | :param epoch: (int) which epoch it is 416 | :return: (python float) score to select model on, higher is better 417 | """ 418 | 419 | # progbar stuff for logging 420 | batch_size = self.batch_size 421 | nbatches = (len(X_train) + batch_size - 1) // batch_size 422 | prog = Progbar(target=nbatches) 423 | 424 | rnd_idx = np.random.permutation(len(X_train)) 425 | for i, rnd_indices in enumerate(np.array_split(rnd_idx, len(X_train) // batch_size)): 426 | words, labels = [X_train[x] for x in list(rnd_indices)], [y_train[y] for y in list(rnd_indices)] 427 | fd, _ = self._get_feed_dict(words, labels, self._lr, self._dropout) 428 | 429 | _, train_loss = self._session.run( 430 | [self.train_op, self.loss], feed_dict=fd) 431 | 432 | prog.update(i + 1, [("train loss", train_loss)]) 433 | 434 | # tensorboard 435 | if i % 10 == 0: 436 | # loss 437 | loss_summary = self._loss_summary.eval(feed_dict=fd) 438 | self._file_writer.add_summary(loss_summary, epoch * nbatches + i) 439 | # train eval 440 | metrics = self._run_evaluate(words, labels) 441 | summary = tf.Summary() 442 | summary.value.add(tag='precision_train', simple_value=metrics["p"]) 443 | summary.value.add(tag='recall_train', simple_value=metrics["r"]) 444 | summary.value.add(tag='f1_train', 
simple_value=metrics["f1"]) 445 | summary.value.add(tag='accuracy_train', simple_value=metrics["acc"]) 446 | self._file_writer.add_summary(summary, epoch * nbatches + i) 447 | # test eval 448 | metrics = self._run_evaluate(X_dev, y_dev) 449 | summary = tf.Summary() 450 | summary.value.add(tag='precision_test', simple_value=metrics["p"]) 451 | summary.value.add(tag='recall_test', simple_value=metrics["r"]) 452 | summary.value.add(tag='f1_test', simple_value=metrics["f1"]) 453 | summary.value.add(tag='accuracy_test', simple_value=metrics["acc"]) 454 | self._file_writer.add_summary(summary, epoch) 455 | 456 | # final epoch test eval 457 | metrics = self._run_evaluate(X_dev, y_dev) 458 | msg = " - ".join(["{} {:04.2f}".format(k, v) 459 | for k, v in metrics.items()]) 460 | print(msg) 461 | 462 | return metrics["f1"] 463 | 464 | 465 | def _run_evaluate(self, X_dev, y_dev): 466 | """ 467 | Evaluates performance on test set 468 | 469 | :param X_dev:(list) with dev data 470 | :param y_dev: (list) with dev labels 471 | :return: (dict) metrics["acc"] = 98.4, ... 472 | """ 473 | 474 | accs = [] 475 | correct_preds, total_correct, total_preds = 0., 0., 0. 476 | 477 | rnd_idx = np.random.permutation(len(X_dev)) 478 | for rnd_indices in np.array_split(rnd_idx, len(X_dev) // self.batch_size): 479 | words, labels = [X_dev[x] for x in list(rnd_indices)], [y_dev[y] for y in list(rnd_indices)] 480 | labels_pred, sequence_lengths = self._predict_batch(words) 481 | 482 | for lab, lab_pred, length in zip(labels, labels_pred, 483 | sequence_lengths): 484 | lab = lab[:length] 485 | lab_pred = lab_pred[:length] 486 | accs += [a==b for (a, b) in zip(lab, lab_pred)] 487 | 488 | lab_chunks = set(get_chunks(lab, self.vocab_tags)) 489 | lab_pred_chunks = set(get_chunks(lab_pred, 490 | self.vocab_tags)) 491 | 492 | correct_preds += len(lab_chunks & lab_pred_chunks) 493 | total_preds += len(lab_pred_chunks) 494 | total_correct += len(lab_chunks) 495 | 496 | p = correct_preds / total_preds if correct_preds > 0 else 0 497 | r = correct_preds / total_correct if correct_preds > 0 else 0 498 | f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 499 | acc = np.mean(accs) 500 | 501 | return OrderedDict({"acc": 100*acc, "f1": 100*f1, "p": p, "r": r}) 502 | 503 | 504 | def _reinitialize_weights(self, scope_name): 505 | """Reinitializes the weights of a given layer 506 | 507 | :param scope_name: (string) scope of variables to reinitialize 508 | """ 509 | 510 | variables = tf.contrib.framework.get_variables(scope_name) 511 | init = tf.variables_initializer(variables) 512 | self._session.run(init) 513 | 514 | 515 | def _initialize(self): 516 | """Initialize the variables""" 517 | 518 | print("Initializing tf session") 519 | self._init = tf.global_variables_initializer() 520 | self._saver = tf.train.Saver() 521 | 522 | 523 | def restore_session(self): 524 | """Reload weights into session""" 525 | self._graph = tf.Graph() 526 | with self._graph.as_default(): 527 | self.build() 528 | self._session = tf.Session(graph=self._graph) 529 | self._saver.restore(self._session, self.dir_model) 530 | 531 | 532 | def save_session(self): 533 | """Saves session = weights""" 534 | if not os.path.exists(self.dir_model): 535 | os.makedirs(self.dir_model,exist_ok=True) 536 | self._saver.save(self._session, self.dir_model) 537 | 538 | 539 | def add_summary(self): 540 | """Defines variables for Tensorboard""" 541 | self._loss_summary = tf.summary.scalar('loss', self.loss) 542 | self._file_writer = tf.summary.FileWriter(self.dir_output, 543 | 
self._session.graph) 544 | 545 | 546 | def close_session(self): 547 | """Closes the session""" 548 | if self._session: 549 | self._session.close() 550 | 551 | 552 | def _get_model_params(self): 553 | """From: https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb 554 | Get all variable values (used for early stopping, faster than saving to disk)""" 555 | 556 | with self._graph.as_default(): 557 | gvars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES) 558 | return {gvar.op.name: value for gvar, value in zip(gvars, self._session.run(gvars))} 559 | 560 | 561 | def _restore_model_params(self, model_params): 562 | """From: https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb 563 | Set all variables to the given values (for early stopping, faster than loading from disk) 564 | 565 | :param model_params: (dict) parameters of the model to restore 566 | """ 567 | 568 | gvar_names = list(model_params.keys()) 569 | assign_ops = {gvar_name: self._graph.get_operation_by_name(gvar_name + "/Assign") 570 | for gvar_name in gvar_names} 571 | init_values = {gvar_name: assign_op.inputs[1] for gvar_name, assign_op in assign_ops.items()} 572 | fd = {init_values[gvar_name]: model_params[gvar_name] for gvar_name in gvar_names} 573 | self._session.run(assign_ops, feed_dict=fd) 574 | 575 | 576 | def build(self): 577 | """Builds the computational graph""" 578 | 579 | if self.random_state is not None: 580 | tf.set_random_seed(self.random_state) 581 | np.random.seed(self.random_state) 582 | 583 | # specific functions 584 | self._add_placeholders() 585 | self._add_word_embeddings_op() 586 | self._add_logits_op() 587 | self._add_pred_op() 588 | self._add_loss_op() 589 | 590 | # generic functions that add training op and initialize vars 591 | self._add_train_op(self.lr_method, self.lr, self.loss, self.clip) 592 | self._initialize() # initialize vars and saver, session is still not there 593 | 594 | 595 | def fit(self, X, y, X_valid=None, y_valid=None, nepochs=100): 596 | """ 597 | Performs training with early stopping and lr exponential decay 598 | 599 | :param X: (list) data 600 | :param y: (list) labels 601 | :param X_valid: (list) validation data 602 | :param y_valid: (list) validation data 603 | :param nepochs: (int) number of epochs to run for 604 | :return: self (model, instance of RefModel) 605 | """ 606 | 607 | self.close_session() 608 | self._graph = tf.Graph() 609 | with self._graph.as_default(): 610 | self.build() 611 | 612 | self.best_score = 0 613 | nepoch_no_imprv = 5 # for early stopping, this should be passed as a parameter 614 | best_params = None 615 | 616 | self._session = tf.Session(graph=self._graph) 617 | with self._session.as_default(): 618 | self._init.run() 619 | self.add_summary() # tensorboard 620 | for epoch in range(nepochs): 621 | print("Epoch {:} out of {:}".format(epoch + 1, nepochs)) 622 | 623 | score = self._run_epoch(X, y, X_valid, y_valid, epoch) 624 | self._lr *= self.lr_decay # decay learning rate 625 | 626 | # early stopping and saving best parameters 627 | if score >= self.best_score: 628 | best_params = self._get_model_params() 629 | nepoch_no_imprv = 0 630 | self.best_score = score 631 | self.save_session() # save new params 632 | print("- new best score!") 633 | else: 634 | nepoch_no_imprv += 1 635 | if nepoch_no_imprv >= self.nepoch_no_imprv: 636 | print("- early stopping {} epochs without "\ 637 | "improvement".format(nepoch_no_imprv)) 638 | break 639 | 640 | # If we used early stopping then rollback to the best model found 641 | if 
best_params:
642 |             self._restore_model_params(best_params)
643 |         return self
644 | 
645 | 
646 |     def predict(self, words_raw):
647 |         """
648 |         Returns list of predicted tags
649 | 
650 |         :param words_raw: (list) of words (string), just one sentence (no batch)
651 |         :return preds: (list) of tags (string), one for each word in the sentence
652 |         """
653 | 
654 |         words = [[self.processing_word(w)] for w in words_raw]
655 |         pred_ids, _ = self._predict_batch(words)
656 |         preds = [self.idx_to_tag[idx[0]] for idx in pred_ids]
657 | 
658 |         return preds
659 | 
660 | 
661 |     def evaluate(self, X_dev, y_dev):
662 |         """
663 |         Evaluates the model on a held-out set
664 | 
665 |         :param X_dev: (list) dev data
666 |         :param y_dev: (list) dev labels
667 |         :return: (dict) of metrics
668 |         """
669 | 
670 |         metrics = self._run_evaluate(X_dev, y_dev)
671 |         return metrics
672 | 
673 | 
674 | if __name__ == "__main__":
675 | 
676 |     # Example of usage
677 | 
678 |     # dataset locations and basic configs
679 |     filename_dev = "../dataset/clean_test.txt"
680 |     filename_test = "../dataset/clean_valid.txt"
681 |     filename_train = "../dataset/clean_train.txt"
682 |     which_tags = -3  # -3, -2, -1 select the task 1, 2, 3 tag column (token line: Ackerman author b-secondary b-r)
683 |     use_chars = True
684 |     max_iter = None  # if None, use all examples in the dataset
685 | 
686 |     # general config: trained model directory
687 |     dir_output = "results/test_run"
688 |     dir_model = os.path.join(dir_output, "model.weights")
689 | 
690 |     # vocabs (created by the build_data call below)
691 |     filename_words = "working_dir/words.txt"
692 |     filename_words_ext = "working_dir/words_ext.txt"
693 |     filename_tags = "working_dir/tags.txt"
694 |     filename_chars = "working_dir/chars.txt"
695 | 
696 |     # build data (just to test model)
697 |     build_data(filename_dev, filename_test, filename_train, [300], filename_words,
698 |                filename_words_ext, filename_tags, filename_chars,
699 |                filename_word_vec="../pretrained_vectors/vecs_{}.txt",
700 |                filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz",
701 |                which_tags=which_tags)
702 | 
703 |     vocab_words = load_vocab(filename_words)
704 |     vocab_tags = load_vocab(filename_tags)
705 |     vocab_chars = load_vocab(filename_chars)
706 |     nwords = len(vocab_words)
707 |     nchars = len(vocab_chars)
708 |     ntags = len(vocab_tags)
709 | 
710 |     # load data
711 |     processing_word = get_processing_word(vocab_words,
712 |                                           vocab_chars, lowercase=True, chars=use_chars)
713 |     processing_tag = get_processing_word(vocab_tags,
714 |                                          lowercase=False, allow_unk=False)
715 |     X_dev, y_dev = coNLLDataset_full(filename_dev, processing_word, processing_tag, max_iter, which_tags)
716 |     X_train, y_train = coNLLDataset_full(filename_train, processing_word, processing_tag, max_iter, which_tags)
717 |     X_valid, y_valid = coNLLDataset_full(filename_test, processing_word, processing_tag, max_iter, which_tags)
718 | 
719 |     print("Size of train, test and valid sets (in number of sentences): ")
720 |     print(len(X_train), " ", len(y_train), " ", len(X_dev), " ", len(y_dev), " ", len(X_valid), " ", len(y_valid))
721 | 
722 |     model = RefModel(processing_word=processing_word, processing_tag=processing_tag, vocab_chars=vocab_chars,
723 |                      vocab_words=vocab_words, vocab_tags=vocab_tags, nwords=nwords, nchars=nchars,
724 |                      ntags=ntags, dir_output=dir_output, dir_model=dir_model, use_chars=use_chars, random_state=0,
725 |                      use_pretrained=True, hidden_size_char=50, batch_size=100, lr_decay=1, l2_reg_lambda=0,
726 |                      use_crf=True, use_cnn=False, dim_word=300, hidden_size_lstm=200, lr=0.001,
727 |                      train_embeddings=True, dim_char=100, 
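                     # note: lr_decay=1 disables the per-epoch learning-rate decay applied in fit(),
                     # and use_crf=True adds the CRF layer whose transitions are Viterbi-decoded in _predict_batch()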
lr_method="rmsprop") 728 | 729 | fitted = model.fit(X_train, y_train, X_dev, y_dev, 50) 730 | print("Final f1 score: ",fitted.best_score) 731 | print("\nValidation:") 732 | print(str(fitted.evaluate(X_valid, y_valid))) -------------------------------------------------------------------------------- /tensorflow/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dhlab-epfl/LinkedBooksDeepReferenceParsing/9411db4e918baffa361895c50ae9ce2046fafc3c/tensorflow/utils/__init__.py -------------------------------------------------------------------------------- /tensorflow/utils/data_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utilities for dealing with data 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | from collections import OrderedDict 9 | 10 | # shared global variables 11 | UNK = "$UNK$" 12 | NUM = "$NUM$" 13 | NONE = "o" 14 | 15 | # special error message 16 | class MyIOError(Exception): 17 | def __init__(self, filename): 18 | # custom error message 19 | message = """ 20 | ERROR: Unable to locate file {}. 21 | 22 | FIX: Check that build_data has been called before training. 23 | """.format(filename) 24 | super(MyIOError, self).__init__(message) 25 | 26 | 27 | def build_data(filename_dev,filename_test,filename_train,dim_words,filename_words, 28 | filename_words_ext,filename_tags,filename_chars, 29 | filename_word_vec="../pretrained_vectors/vecs_{}.txt", 30 | filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz", 31 | which_tags=-1): 32 | """ 33 | Prepares the dataset before training a model. 34 | 35 | :param filename_dev: the file with test data (dev) 36 | :param filename_test: the file with validation data 37 | :param filename_train: the file with train data 38 | :param dim_words: dimensionality of word embeddings 39 | :param filename_words: filename where to put exported word vocabulary 40 | :param filename_words_ext: filename where to put exported word vocabulary 41 | :param filename_tags: filename where to put exported tag vocabulary 42 | :param filename_chars: filename where to put exported char vocabulary 43 | :param filename_word_vec: filename of word vectors 44 | :param filename_word_vec_trimmed: filename where to put exported trimmed word vectors 45 | :param which_tags: which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 46 | :return: None 47 | """ 48 | 49 | processing_word = get_processing_word(lowercase=True) 50 | 51 | # Generators 52 | dev = CoNLLDataset(filename_dev, processing_word, which_tags=which_tags) 53 | test = CoNLLDataset(filename_test, processing_word, which_tags=which_tags) 54 | train = CoNLLDataset(filename_train, processing_word, which_tags=which_tags) 55 | 56 | # Build Word, Char and Tag vocab 57 | vocab_words, vocab_tags = get_vocabs([train, dev, test]) 58 | vocab = vocab_words 59 | vocab.add(UNK) 60 | vocab.add(NUM) 61 | vocab_chars = get_char_vocab(train) 62 | write_vocab(vocab, filename_words) 63 | write_vocab(vocab_tags, filename_tags) 64 | write_vocab(vocab_chars, filename_chars) 65 | 66 | # Export extended vocab 67 | vocab_vec = get_vec_vocab(filename_word_vec.format(dim_words[0])) # pick any, words are the same 68 | vocab = vocab & vocab_vec 69 | write_vocab(vocab, filename_words_ext) 70 | 71 | # Trim vectors 72 | vocab = load_vocab(filename_words) 73 | for dim_word in dim_words: 74 | 
export_trimmed_word_vectors(vocab, filename_word_vec.format(dim_word), 75 | filename_word_vec_trimmed.format(dim_word), dim_word) 76 | 77 | class CoNLLDataset(object): 78 | """Class that iterates over CoNLL Dataset 79 | 80 | __iter__ method yields a tuple (words, tags) 81 | words: list of raw words 82 | tags: list of raw tags 83 | 84 | If processing_word and processing_tag are not None, 85 | optional preprocessing is appplied 86 | 87 | Example: 88 | ```python 89 | data = CoNLLDataset(filename) 90 | for sentence, tags in data: 91 | pass 92 | ``` 93 | 94 | """ 95 | def __init__(self, filename, processing_word=None, processing_tag=None, 96 | max_iter=None, which_tags=-1): 97 | """ 98 | :param filename: path to the file 99 | :param processing_word: (optional) function that takes a word as input 100 | :param processing_tag: (optional) function that takes a tag as input 101 | :param max_iter: (optional) max number of sentences to yield 102 | :param which_tags: (optional) which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 103 | """ 104 | self.filename = filename 105 | self.processing_word = processing_word 106 | self.processing_tag = processing_tag 107 | self.max_iter = max_iter 108 | self.which_tags = which_tags 109 | self.length = None 110 | 111 | 112 | def __iter__(self): 113 | niter = 0 114 | with open(self.filename) as f: 115 | words, tags = [], [] 116 | for line in f: 117 | line = line.strip() 118 | if (len(line) == 0 or line.startswith("-DOCSTART-")): 119 | if len(words) != 0: 120 | niter += 1 121 | if self.max_iter is not None and niter > self.max_iter: 122 | break 123 | yield words, tags 124 | words, tags = [], [] 125 | else: 126 | ls = line.split() 127 | word, tag = ls[0],ls[self.which_tags] 128 | if self.processing_word is not None: 129 | word = self.processing_word(word) 130 | if self.processing_tag is not None: 131 | tag = self.processing_tag(tag) 132 | words += [word] 133 | tags += [tag] 134 | 135 | 136 | def __len__(self): 137 | """Iterates once over the corpus to set and store length""" 138 | if self.length is None: 139 | self.length = 0 140 | for _ in self: 141 | self.length += 1 142 | 143 | return self.length 144 | 145 | 146 | def coNLLDataset_full(filename, processing_word=None, processing_tag=None, max_iter=None, which_tags=-1): 147 | """ 148 | Same as above but simply processes all datasets and returns full lists of X and y in memory (no yield). 
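    A sketch of typical usage (illustrative only; it assumes the vocabularies have already been
    written by build_data and loaded with load_vocab, as in the __main__ block of ref_model.py):

    ```python
    processing_word = get_processing_word(vocab_words, vocab_chars, lowercase=True, chars=True)
    processing_tag = get_processing_word(vocab_tags, lowercase=False, allow_unk=False)
    # which_tags=-3 selects the task 1 tag column
    X, y = coNLLDataset_full("../dataset/clean_train.txt", processing_word,
                             processing_tag, max_iter=None, which_tags=-3)
    # X[0] is the list of processed tokens of the first sequence, y[0] its task 1 tag ids
    ```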
149 | 150 | :param filename: path to the file 151 | :param processing_word: (optional) function that takes a word as input 152 | :param processing_tag: (optional) function that takes a tag as input 153 | :param max_iter: (optional) max number of sentences to yield 154 | :param which_tags: (optional) which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 155 | :return X,y: lists of words and tags in sequences 156 | """ 157 | 158 | 159 | X,y = [], [] 160 | 161 | niter = 0 162 | with open(filename) as f: 163 | words, tags = [], [] 164 | for line in f: 165 | line = line.strip() 166 | if (len(line) == 0 or line.startswith("-DOCSTART-")): 167 | if len(words) != 0: 168 | niter += 1 169 | if max_iter is not None and niter > max_iter: 170 | break 171 | X.append(words) 172 | y.append(tags) 173 | words, tags = [], [] 174 | else: 175 | ls = line.split() 176 | word, tag = ls[0],ls[which_tags] 177 | if processing_word is not None: 178 | word = processing_word(word) 179 | if processing_tag is not None: 180 | tag = processing_tag(tag) 181 | words += [word] 182 | tags += [tag] 183 | 184 | return X,y 185 | 186 | 187 | def get_vocabs(datasets): 188 | """ 189 | Build vocabulary from an iterable of datasets objects 190 | 191 | :param datasets: datasets: a list of dataset objects 192 | :return: a set of all the words in the dataset 193 | """ 194 | print("Building vocab...") 195 | vocab_words = set() 196 | vocab_tags = set() 197 | for dataset in datasets: 198 | for words, tags in dataset: 199 | vocab_words.update(words) 200 | vocab_tags.update(tags) 201 | print("- done. {} tokens".format(len(vocab_words))) 202 | return vocab_words, vocab_tags 203 | 204 | 205 | def get_char_vocab(dataset): 206 | """ 207 | Build char vocabulary from an iterable of datasets objects 208 | 209 | :param dataset: dataset: a iterator yielding tuples (sentence, tags) 210 | :return: a set of all the characters in the dataset 211 | """ 212 | vocab_char = set() 213 | for words, _ in dataset: 214 | for word in words: 215 | vocab_char.update(word) 216 | 217 | return vocab_char 218 | 219 | 220 | def get_vec_vocab(filename): 221 | """ 222 | Load vocab from file 223 | 224 | :param filename: filename: path to the word vectors 225 | :return: vocab: set() of strings 226 | """ 227 | print("Building vocab...") 228 | vocab = set() 229 | with open(filename) as f: 230 | for line in f: 231 | word = line.strip().split(' ')[0] 232 | vocab.add(word) 233 | print("- done. {} tokens".format(len(vocab))) 234 | return vocab 235 | 236 | 237 | def write_vocab(vocab, filename): 238 | """ 239 | Writes a vocab to a file, one word per line. 240 | 241 | :param vocab: iterable that yields word 242 | :param filename: path to vocab file 243 | :return: None (write a word per line) 244 | """ 245 | print("Writing vocab...") 246 | with open(filename, "w") as f: 247 | for i, word in enumerate(vocab): 248 | if i != len(vocab) - 1: 249 | f.write("{}\n".format(word)) 250 | else: 251 | f.write(word) 252 | print("- done. 
{} tokens".format(len(vocab))) 253 | 254 | 255 | def load_vocab(filename): 256 | """ 257 | Loads vocab from a file 258 | 259 | :param filename: (string) the format of the file must be one word per line 260 | :return: dict[word] = index 261 | """ 262 | try: 263 | d = OrderedDict() 264 | with open(filename) as f: 265 | for idx, word in enumerate(f): 266 | word = word.strip() 267 | d[word] = idx 268 | 269 | except IOError: 270 | raise MyIOError(filename) 271 | return d 272 | 273 | 274 | def export_trimmed_word_vectors(vocab, word_filename, trimmed_filename, dim): 275 | """ 276 | Saves word vectors in numpy array 277 | 278 | :param vocab: dictionary vocab[word] = index 279 | :param word_filename: a path to a word file 280 | :param trimmed_filename: a path where to store a matrix in npy 281 | :param dim: (int) dimension of embeddings 282 | :return: None 283 | """ 284 | embeddings = np.zeros([len(vocab), dim]) 285 | with open(word_filename) as f: 286 | for line in f: 287 | line = line.strip().split(' ') 288 | word = line[0] 289 | embedding = [float(x) for x in line[1:]] 290 | if word in vocab: 291 | word_idx = vocab[word] 292 | embeddings[word_idx] = np.asarray(embedding) 293 | 294 | np.savez_compressed(trimmed_filename, embeddings=embeddings) 295 | 296 | 297 | def get_trimmed_word_vectors(filename): 298 | """ 299 | Get word vectors 300 | 301 | :param filename: path to the npz file 302 | :return: matrix of embeddings (np array) 303 | """ 304 | try: 305 | with np.load(filename) as data: 306 | return data["embeddings"] 307 | 308 | except IOError: 309 | raise MyIOError(filename) 310 | 311 | 312 | def get_processing_word(vocab_words=None, vocab_chars=None, 313 | lowercase=False, chars=False, allow_unk=True): 314 | """ 315 | Return lambda function that transform a word (string) into list, 316 | or tuple of (list, id) of int corresponding to the ids of the word and 317 | its corresponding characters. 318 | Note that only known chars from train are used (i.e. chars for which we have learned an embedding, and only known words 319 | are used. Unknown words are featured with the UNK word vector. Note that this solution prevents learning new embeddings for them, 320 | because either a word was seen at training, or it is impossible do deal with properly..). 321 | 322 | :param vocab_words: dict[word] = idx 323 | :param vocab_chars: dict[char] = idx 324 | :param lowercase: if to transform to lowercase 325 | :param chars: if to export characters too 326 | :param allow_unk: if to allow for the use of the UNK token 327 | :return: f("cat") = ([12, 4, 32], 12345) 328 | = (list of char ids, word id) 329 | """ 330 | def f(word): 331 | # 0. get chars of words 332 | if vocab_chars is not None and chars == True: 333 | char_ids = [] 334 | for char in word: 335 | # ignore chars out of vocabulary 336 | if char in vocab_chars: 337 | char_ids += [vocab_chars[char]] 338 | 339 | # 1. preprocess word 340 | if lowercase: 341 | word = word.lower() 342 | if word.isdigit(): 343 | word = NUM 344 | 345 | # 2. get id of word 346 | if vocab_words is not None: 347 | if word in vocab_words: 348 | word = vocab_words[word] 349 | else: 350 | if allow_unk: 351 | word = vocab_words[UNK] 352 | else: 353 | raise Exception("Unknow key is not allowed. Check that "\ 354 | "your vocab (tags?) is correct") 355 | 356 | # 3. 
return tuple char ids, word id 357 | if vocab_chars is not None and chars == True: 358 | return char_ids, word 359 | else: 360 | return word 361 | 362 | return f 363 | 364 | 365 | def _pad_sequences(sequences, pad_tok, max_length): 366 | """ 367 | Pads to the right, at the end of the sequence. 368 | 369 | :param sequences: a generator of list or tuple 370 | :param pad_tok: the char to pad with 371 | :param max_length: the maximum length of a sequence 372 | :return: a list of list where each sublist has same length 373 | """ 374 | sequence_padded, sequence_length = [], [] 375 | 376 | for seq in sequences: 377 | seq = list(seq) 378 | seq_ = seq[:max_length] + [pad_tok]*max(max_length - len(seq), 0) 379 | sequence_padded += [seq_] 380 | sequence_length += [min(len(seq), max_length)] 381 | 382 | return sequence_padded, sequence_length 383 | 384 | 385 | def pad_sequences(sequences, pad_tok, nlevels=1): 386 | """ 387 | Pads to the right, at the end of the sequence, at levels 1 (just words) and 2 (both words and characters) 388 | 389 | :param sequences: a generator of list or tuple 390 | :param pad_tok: the char to pad with 391 | :param nlevels: "depth" of padding, for the case where we have characters ids 392 | :return: a list of list where each sublist has same length 393 | """ 394 | if nlevels == 1: 395 | max_length = max(map(lambda x : len(x), sequences)) 396 | sequence_padded, sequence_length = _pad_sequences(sequences, 397 | pad_tok, max_length) 398 | 399 | elif nlevels == 2: 400 | max_length_word = max([max(map(lambda x: len(x), seq)) 401 | for seq in sequences]) 402 | sequence_padded, sequence_length = [], [] 403 | for seq in sequences: 404 | # all words are same length now 405 | sp, sl = _pad_sequences(seq, pad_tok, max_length_word) 406 | sequence_padded += [sp] 407 | sequence_length += [sl] 408 | 409 | max_length_sentence = max(map(lambda x : len(x), sequences)) 410 | sequence_padded, _ = _pad_sequences(sequence_padded, 411 | [pad_tok]*max_length_word, max_length_sentence) 412 | sequence_length, _ = _pad_sequences(sequence_length, 0, 413 | max_length_sentence) 414 | 415 | return sequence_padded, sequence_length 416 | 417 | 418 | def minibatches(data, minibatch_size): 419 | """ 420 | Yields data in minimatches. 421 | 422 | :param data: generator of (sentence, tags) tuples 423 | :param minibatch_size: (int) 424 | :return: list of tuples 425 | """ 426 | x_batch, y_batch = [], [] 427 | for (x, y) in data: 428 | if len(x_batch) == minibatch_size: 429 | yield x_batch, y_batch 430 | x_batch, y_batch = [], [] 431 | 432 | if type(x[0]) == tuple: 433 | x = zip(*x) 434 | x_batch += [x] 435 | y_batch += [y] 436 | 437 | if len(x_batch) != 0: 438 | yield x_batch, y_batch 439 | 440 | 441 | def get_chunk_type(tok, idx_to_tag): 442 | """ 443 | Return chunk type 444 | 445 | :param tok: id of token, ex 4 446 | :param idx_to_tag: dictionary {4: "B-PER", ...} 447 | :return: tuple: "B", "PER" 448 | """ 449 | tag_name = idx_to_tag[tok] 450 | tag_class = tag_name.split('-')[0] 451 | tag_type = tag_name.split('-')[-1] 452 | return tag_class, tag_type 453 | 454 | 455 | def get_chunks(seq, tags): 456 | """ 457 | Given a sequence of tags, group entities and their position 458 | 459 | Example: 460 | seq = [4, 5, 0, 3] 461 | tags = {"B-PER": 4, "I-PER": 5, "B-LOC": 3} 462 | result = [("PER", 0, 2), ("LOC", 3, 4)] 463 | 464 | 465 | :param seq: [4, 4, 0, 0, ...] 
sequence of labels 466 | :param tags: dict["O"] = 4 467 | :return: list of (chunk_type, chunk_start, chunk_end) 468 | """ 469 | default = tags[NONE] 470 | idx_to_tag = {idx: tag for tag, idx in tags.items()} 471 | chunks = [] 472 | chunk_type, chunk_start = None, None 473 | for i, tok in enumerate(seq): 474 | # End of a chunk 1 475 | if tok == default and chunk_type is not None: 476 | # Add a chunk. 477 | chunk = (chunk_type, chunk_start, i) 478 | chunks.append(chunk) 479 | chunk_type, chunk_start = None, None 480 | 481 | # End of a chunk + start of a chunk! 482 | elif tok != default: 483 | tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag) 484 | if chunk_type is None: 485 | chunk_type, chunk_start = tok_chunk_type, i 486 | elif tok_chunk_type != chunk_type or tok_chunk_class == "b": 487 | chunk = (chunk_type, chunk_start, i) 488 | chunks.append(chunk) 489 | chunk_type, chunk_start = tok_chunk_type, i 490 | else: 491 | pass 492 | 493 | # end condition 494 | if chunk_type is not None: 495 | chunk = (chunk_type, chunk_start, len(seq)) 496 | chunks.append(chunk) 497 | 498 | return chunks 499 | 500 | 501 | def conv1d(input_, output_size, width=3, stride=1): 502 | """ 503 | 1d convolution for texts, from: https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f 504 | 505 | :param input_: A tensor of embedded tokens with shape [batch_size,max_length,embedding_size] 506 | :param output_size: The number of feature maps we'd like to calculate 507 | :param width: The filter width 508 | :param stride: The stride 509 | :return: A tensor of the convolved input with shape [batch_size,max_length,output_size] 510 | """ 511 | inputSize = input_.get_shape()[-1] # How many channels on the input (The size of our embedding for instance) 512 | 513 | # This is where we make our text an image of height 1 514 | input_ = tf.expand_dims(input_, axis=1) # Change the shape to [batch_size,1,max_length,embedding_size] 515 | 516 | # Make sure the height of the filter is 1 517 | filter_ = tf.get_variable("conv_filter_%d_%d" % (width,stride), shape=[1, width, inputSize, output_size]) 518 | 519 | # Run the convolution as if this were an image 520 | convolved = tf.nn.conv2d(input_, filter=filter_, strides=[1, 1, stride, 1], padding="SAME") 521 | 522 | # Remove the extra dimension, i.e. 
make the shape [batch_size,max_length,output_size] 523 | result = tf.squeeze(convolved, axis=1) 524 | return result -------------------------------------------------------------------------------- /tensorflow/utils/general_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utilities for dealing with general stuff 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import time 7 | import sys 8 | import logging 9 | import numpy as np 10 | 11 | 12 | def get_logger(filename): 13 | """ 14 | Return a logger instance that writes in filename 15 | 16 | :param filename: (string) path to log.txt 17 | :return: (instance of logger) 18 | """ 19 | logger = logging.getLogger('logger') 20 | logger.setLevel(logging.DEBUG) 21 | logging.basicConfig(format='%(message)s', level=logging.DEBUG) 22 | handler = logging.FileHandler(filename) 23 | handler.setLevel(logging.DEBUG) 24 | handler.setFormatter(logging.Formatter( 25 | '%(asctime)s:%(levelname)s: %(message)s')) 26 | logging.getLogger().addHandler(handler) 27 | 28 | return logger 29 | 30 | 31 | class Progbar(object): 32 | """Progbar class copied from keras (https://github.com/fchollet/keras/) 33 | 34 | Displays a progress bar. 35 | Small edit : added strict arg to update 36 | # Arguments 37 | target: Total number of steps expected. 38 | interval: Minimum visual progress update interval (in seconds). 39 | """ 40 | 41 | def __init__(self, target, width=30, verbose=1): 42 | self.width = width 43 | self.target = target 44 | self.sum_values = {} 45 | self.unique_values = [] 46 | self.start = time.time() 47 | self.total_width = 0 48 | self.seen_so_far = 0 49 | self.verbose = verbose 50 | 51 | def update(self, current, values=[], exact=[], strict=[]): 52 | """ 53 | Updates the progress bar. 54 | # Arguments 55 | current: Index of current step. 56 | values: List of tuples (name, value_for_last_step). 57 | The progress bar will display averages for these values. 58 | exact: List of tuples (name, value_for_last_step). 59 | The progress bar will display these values directly. 60 | 61 | :param current: Index of current step. 62 | :param values: List of tuples (name, value_for_last_step). 63 | The progress bar will display averages for these values. 64 | :param exact: List of tuples (name, value_for_last_step). 65 | The progress bar will display these values directly. 
66 | :param strict: 67 | :return: None 68 | """ 69 | 70 | for k, v in values: 71 | if k not in self.sum_values: 72 | self.sum_values[k] = [v * (current - self.seen_so_far), 73 | current - self.seen_so_far] 74 | self.unique_values.append(k) 75 | else: 76 | self.sum_values[k][0] += v * (current - self.seen_so_far) 77 | self.sum_values[k][1] += (current - self.seen_so_far) 78 | for k, v in exact: 79 | if k not in self.sum_values: 80 | self.unique_values.append(k) 81 | self.sum_values[k] = [v, 1] 82 | 83 | for k, v in strict: 84 | if k not in self.sum_values: 85 | self.unique_values.append(k) 86 | self.sum_values[k] = v 87 | 88 | self.seen_so_far = current 89 | 90 | now = time.time() 91 | if self.verbose == 1: 92 | prev_total_width = self.total_width 93 | sys.stdout.write("\b" * prev_total_width) 94 | sys.stdout.write("\r") 95 | 96 | numdigits = int(np.floor(np.log10(self.target))) + 1 97 | barstr = '%%%dd/%%%dd [' % (numdigits, numdigits) 98 | bar = barstr % (current, self.target) 99 | prog = float(current)/self.target 100 | prog_width = int(self.width*prog) 101 | if prog_width > 0: 102 | bar += ('='*(prog_width-1)) 103 | if current < self.target: 104 | bar += '>' 105 | else: 106 | bar += '=' 107 | bar += ('.'*(self.width-prog_width)) 108 | bar += ']' 109 | sys.stdout.write(bar) 110 | self.total_width = len(bar) 111 | 112 | if current: 113 | time_per_unit = (now - self.start) / current 114 | else: 115 | time_per_unit = 0 116 | eta = time_per_unit*(self.target - current) 117 | info = '' 118 | if current < self.target: 119 | info += ' - ETA: %ds' % eta 120 | else: 121 | info += ' - %ds' % (now - self.start) 122 | for k in self.unique_values: 123 | if type(self.sum_values[k]) is list: 124 | info += ' - %s: %.4f' % (k, 125 | self.sum_values[k][0] / max(1, self.sum_values[k][1])) 126 | else: 127 | info += ' - %s: %s' % (k, self.sum_values[k]) 128 | 129 | self.total_width += len(info) 130 | if prev_total_width > self.total_width: 131 | info += ((prev_total_width-self.total_width) * " ") 132 | 133 | sys.stdout.write(info) 134 | sys.stdout.flush() 135 | 136 | if current >= self.target: 137 | sys.stdout.write("\n") 138 | 139 | if self.verbose == 2: 140 | if current >= self.target: 141 | info = '%ds' % (now - self.start) 142 | for k in self.unique_values: 143 | info += ' - %s: %.4f' % (k, 144 | self.sum_values[k][0] / max(1, self.sum_values[k][1])) 145 | sys.stdout.write(info + "\n") 146 | 147 | def add(self, n, values=[]): 148 | self.update(self.seen_so_far+n, values) 149 | 150 | 151 | --------------------------------------------------------------------------------