├── .gitignore ├── Data Facts.ipynb ├── LICENSE ├── README.md ├── crf_baseline ├── README.md ├── code │ ├── __init__.py │ ├── feature_extraction_supporting_functions_words.py │ ├── feature_extraction_words.py │ └── utils.py ├── main_finetune.py ├── main_threeTasks.py └── validation.py ├── dataset.tar.gz ├── dataset ├── clean_test.txt ├── clean_train.txt └── clean_valid.txt ├── keras ├── README.md ├── code │ ├── __init__.py │ ├── models.py │ └── utils.py ├── main_multiTaskLearning.py └── main_threeTasks.py └── tensorflow ├── README.md ├── cv_model.py ├── play_with.py ├── ref_model.py └── utils ├── __init__.py ├── data_utils.py └── general_utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | .gitignore 3 | .idea/ 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Digital Humanities Laboratory 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Reference Parsing 2 | 3 | This repository contains the code for the following article: 4 | 5 | @article{alves_deep_2018, 6 | author = {{Rodrigues Alves, Danny and Giovanni Colavizza and Frédéric Kaplan}}, 7 | title = {{Deep Reference Mining from Scholarly Literature in the Arts and Humanities}}, 8 | journal = {{Frontiers in Research Metrics & Analytics}}, 9 | volume = 3, 10 | number = 21, 11 | year = 2018, 12 | doi = {10.3389/frma.2018.00021} 13 | } 14 | 15 | ## Task definition 16 | 17 | We focus on the task of reference mining, instantiated into three tasks: reference components detection (task 1), reference typology detection (task 2) and reference span detection (task 3). 18 | 19 | * Sequence: *G. Ostrogorsky, History of the Byzantine State, Rutgers University Press, 1986.* 20 | * Task 1: *author author title title title title title publisher publisher publisher year* 21 | * Task 2: *b-secondary i-secondary ... e-secondary* 22 | * Task 3: *b-r i-r ... e-r* 23 | 24 | ## Contents 25 | 26 | * `LICENSE` MIT. 27 | * `README.md` this file. 28 | * `dataset/` 29 | * [train](dataset/clean_test.txt) Train split, CoNLL format. 
30 | * [test](dataset/clean_train.txt) Test split, CoNLL format. 31 | * [validation](dataset/clean_valid.txt) Validation split, CoNLL format. 32 | * [compressed dataset](dataset.tar.gz) Compressed dataset. 33 | * [data facts](Data%20Facts.ipynb) a Python notebook to explore the dataset (number of references, tag distributions). 34 | * [crf_baseline](crf_baseline) CRF baseline implementation details. 35 | * [keras](keras) Keras implementation details. 36 | * [tensorflow](tensorflow) TF implementation details. 37 | 38 | ## Dataset 39 | 40 | Example of a dataset entry (beginning of the validation dataset, first sequence), in the format `Token Task1tag Task2tag Task3tag`: 41 | 42 | -DOCSTART- -X- -X- o 43 | 44 | C author b-secondary b-r 45 | . author i-secondary i-r 46 | Agnoletti author i-secondary i-r 47 | , author i-secondary i-r 48 | Treviso title i-secondary i-r 49 | e title i-secondary i-r 50 | le title i-secondary i-r 51 | sue title i-secondary i-r 52 | pievi title i-secondary i-r 53 | . title i-secondary i-r 54 | Illustrazione title i-secondary i-r 55 | storica title i-secondary i-r 56 | , title i-secondary i-r 57 | Treviso publicationplace i-secondary i-r 58 | 1898 year i-secondary i-r 59 | , year i-secondary i-r 60 | 2 publicationspecifications i-secondary i-r 61 | v publicationspecifications e-secondary i-r 62 | . publicationspecifications e-secondary e-r 63 | 64 | Pre-trained word vectors can be downloaded from Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1175213.svg)](https://doi.org/10.5281/zenodo.1175213) 65 | 66 | ## Implementations 67 | 68 | ### CRF baseline 69 | 70 | See internal [readme](crf_baseline/README.md) for details. 71 | 72 | ### Keras 73 | 74 | See internal [readme](keras/README.md) for details. 75 | 76 | ### TensorFlow 77 | 78 | See internal [readme](tensorflow/README.md) for details. 79 | 80 | This implementation borrows from [Guillaume Genthial's Sequence Tagging with Tensorflow](https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html). 81 | 82 | -------------------------------------------------------------------------------- /crf_baseline/README.md: -------------------------------------------------------------------------------- 1 | # CRF baseline implementation 2 | 3 | ## How to 4 | The directory contains the code to run the CRF model used as a baseline. Code to train, fine-tune and validate the models is provided. 5 | 6 | Running the Python script *main_threeTasks.py* trains one model per task and stores each model in the *models* folder. To fine-tune the two CRF model parameters *c1* and *c2*, run the Python script *main_finetune.py*: a plot gathering the results is saved in the *plots* folder and the best model is saved in the *models* folder. The script *validation.py* expects the previously generated models *crf_t1.pkl*, *crf_t2.pkl* and *crf_t3.pkl*, and prints the classification scores on the validation dataset. 7 | 8 | The data is expected to be in a *dataset* folder, in the main repository directory, with three files inside it: *clean_train.txt* for the training dataset, *clean_test.txt* for the testing dataset, and *clean_valid.txt* for the validation dataset. 9 | 10 | python main_finetune.py 11 | python main_threeTasks.py 12 | python validation.py 13 | 14 | 15 | ## Contents 16 | * `README.md` this file.
17 | * `code/` 18 | * [feature_extraction_supporting_functions_words](code/feature_extraction_supporting_functions_words.py) helper functions to extract features from words. 19 | * [feature_extraction_words](code/feature_extraction_words.py) functions to extract features from words. 20 | * [utils](code/utils.py) utility functions to load data and redirected log files. 21 | * [main_finetune](main_finetune.py) python script to fine-tune two model parameters. 22 | * [main_threeTasks](main_threeTasks.py) python script to train one CRF model for each task. 23 | * [validation](validation.py) python script to compute classification score on validation dataset for the three tasks. 24 | 25 | ## Dependencies 26 | * Numpy: 1.13.3 27 | * Sklearn : 0.19.1 28 | * [Sklearn crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html) Sklearn crfsuite : 0.3.6 29 | * Python 3.5 -------------------------------------------------------------------------------- /crf_baseline/code/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | -------------------------------------------------------------------------------- /crf_baseline/code/feature_extraction_supporting_functions_words.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Extraction of Features from words, used to parse references 4 | 5 | Inspired from CRFSuite by Naoaki Okazaki: http://www.chokkan.org/software/crfsuite/. 6 | """ 7 | __author__ = """Naoaki Okazaki, Giovanni Colavizza""" 8 | 9 | import string, re 10 | 11 | def get_shape(token): 12 | r = '' 13 | for c in token: 14 | if c.isupper(): 15 | r += 'U' 16 | elif c.islower(): 17 | r += 'L' 18 | elif c.isdigit(): 19 | r += 'D' 20 | elif c in ('.', ','): 21 | r += '.' 
22 | elif c in (';', ':', '?', '!'): 23 | r += ';' 24 | elif c in ('+', '-', '*', '/', '=', '|', '_'): 25 | r += '-' 26 | elif c in ('(', '{', '[', '<'): 27 | r += '(' 28 | elif c in (')', '}', ']', '>'): 29 | r += ')' 30 | else: 31 | r += c 32 | return r 33 | 34 | def degenerate(src): 35 | dst = '' 36 | for c in src: 37 | if not dst or dst[-1] != c: 38 | dst += c 39 | return dst 40 | 41 | def get_type(token): 42 | T = ( 43 | 'AllUpper', 'AllDigit', 'AllSymbol', 44 | 'AllUpperDigit', 'AllUpperSymbol', 'AllDigitSymbol', 45 | 'AllUpperDigitSymbol', 46 | 'InitUpper', 47 | 'AllLetter', 48 | 'AllAlnum', 49 | ) 50 | R = set(T) 51 | if not token: 52 | return 'EMPTY' 53 | 54 | for i in range(len(token)): 55 | c = token[i] 56 | if c.isupper(): 57 | R.discard('AllDigit') 58 | R.discard('AllSymbol') 59 | R.discard('AllDigitSymbol') 60 | elif c.isdigit() or c in (',', '.'): 61 | R.discard('AllUpper') 62 | R.discard('AllSymbol') 63 | R.discard('AllUpperSymbol') 64 | R.discard('AllLetter') 65 | elif c.islower(): 66 | R.discard('AllUpper') 67 | R.discard('AllDigit') 68 | R.discard('AllSymbol') 69 | R.discard('AllUpperDigit') 70 | R.discard('AllUpperSymbol') 71 | R.discard('AllDigitSymbol') 72 | R.discard('AllUpperDigitSymbol') 73 | else: 74 | R.discard('AllUpper') 75 | R.discard('AllDigit') 76 | R.discard('AllUpperDigit') 77 | R.discard('AllLetter') 78 | R.discard('AllAlnum') 79 | 80 | if i == 0 and not c.isupper(): 81 | R.discard('InitUpper') 82 | 83 | for tag in T: 84 | if tag in R: 85 | return tag 86 | return 'NO' 87 | 88 | def get_2d(token): 89 | return len(token) == 2 and token.isdigit() 90 | 91 | def get_4d(token): 92 | return len(token) == 4 and token.isdigit() 93 | 94 | def get_parYear(token): 95 | if token[0] == '(' and token[-1] == ')': 96 | if get_4d(token[1:-1]) or get_2d(token[1:-1]): 97 | return True 98 | return False 99 | 100 | # if both digit and alphabetic 101 | def get_da(token): 102 | bd = False 103 | ba = False 104 | for c in token: 105 | if c.isdigit(): 106 | bd = True 107 | elif c.isalpha(): 108 | ba = True 109 | else: 110 | return False 111 | return bd and ba 112 | 113 | def get_dand(token, p): 114 | bd = False 115 | bdd = False 116 | for c in token: 117 | if c.isdigit(): 118 | bd = True 119 | elif c == p: 120 | bdd = True 121 | else: 122 | return False 123 | return bd and bdd 124 | 125 | def get_all_other(token): 126 | for c in token: 127 | if c.isalnum(): 128 | return False 129 | return True 130 | 131 | def get_capperiod(token): 132 | return len(token) == 2 and token[0].isupper() and token[1] == '.' 133 | 134 | def contains_upper(token): 135 | b = False 136 | for c in token: 137 | b |= c.isupper() 138 | return b 139 | 140 | def contains_lower(token): 141 | b = False 142 | for c in token: 143 | b |= c.islower() 144 | return b 145 | 146 | def contains_alpha(token): 147 | b = False 148 | for c in token: 149 | b |= c.isalpha() 150 | return b 151 | 152 | def contains_digit(token): 153 | b = False 154 | for c in token: 155 | b |= c.isdigit() 156 | return b 157 | 158 | def contains_symbol(token): 159 | b = False 160 | for c in token: 161 | b |= ~c.isalnum() 162 | return b 163 | 164 | # abbreviations 165 | def is_abbr(token): 166 | b = False 167 | if "." 
in token: 168 | for p in string.punctuation: 169 | token = token.replace(p, '') 170 | if len(token) < 2 & len(token) > 0: 171 | b = True 172 | elif len(token) == 2: 173 | if token[0] == token[1]: 174 | b = True 175 | return b 176 | 177 | # alternative 178 | # average frequency of abbreviations 179 | # pattern from http://stackoverflow.com/questions/17779771/finding-acronyms-using-regex-in-python 180 | def abbr_pattern(token): 181 | b = False 182 | 183 | if token is None or len(token) < 1: 184 | return b 185 | 186 | pattern = r'(?:(?<=\.|\s)[A-Z]\.)+' 187 | counter = re.search(pattern, token) 188 | if counter: 189 | return True 190 | return b 191 | 192 | # New 193 | def is_roman(token): 194 | b = True 195 | for p in string.punctuation: 196 | token = token.replace(p, '') 197 | for c in token: 198 | b &= c.lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd'] 199 | return b 200 | 201 | # Return true if a sequence of at least 2 characters matches with roman numbers 202 | def contains_roman(token): 203 | for n,c in enumerate(token[1:]): 204 | if c.isupper() and c.lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd']: 205 | if token[n-1].isupper() and token[n-1].lower() in ['i', 'x', 'v', 'c', 'l', 'm', 'd']: 206 | return True 207 | return False 208 | 209 | # is interval, e.g. 1900-10 210 | def is_interval(token): 211 | b = True 212 | if "-" in token: 213 | for x in token.split("-"): 214 | try: 215 | int(x) 216 | except: 217 | b = False 218 | return b 219 | 220 | # measure the punctuation frequency of a piece of text 221 | def punctuation(text, norm=True): 222 | 223 | if text is None or len(text) < 1: 224 | return 0 225 | 226 | counter = 0 227 | 228 | for w in range(len(text)): 229 | if text[w] in string.punctuation: 230 | counter += 1 231 | 232 | return counter/len(text) if norm else counter 233 | 234 | # measure the number frequency of a piece of text 235 | def numbers(text, norm=True): 236 | 237 | if text is None or len(text) < 1: 238 | return 0 239 | 240 | counter = 0 241 | for w in text.split(): 242 | for p in string.punctuation: 243 | w = w.replace(p,"") 244 | try: 245 | int(w) 246 | except: 247 | continue 248 | counter += 1 249 | 250 | return counter/len(text) if norm else counter 251 | 252 | # frequency of upper case letters 253 | def upper_case(text, norm=True): 254 | 255 | if text is None or len(text) < 1: 256 | return 0 257 | 258 | counter = 0 259 | for n in range(len(text)): 260 | if text[n].isupper(): 261 | counter += 1 262 | 263 | return counter/len(text) if norm else counter 264 | 265 | # frequency of lower case letters 266 | def lower_case(text, norm=True): 267 | 268 | if text is None or len(text) < 1: 269 | return 0 270 | 271 | counter = 0 272 | for n in range(len(text)): 273 | if text[n].islower(): 274 | counter += 1 275 | 276 | return counter/len(text) if norm else counter 277 | 278 | # number of chars (with whitespace) 279 | def chars(text): 280 | 281 | if text is None or len(text) < 1: 282 | return 0 283 | 284 | return len(text) 285 | 286 | # is abbreviation? 
(any word of len <= 3 with a dot at the end) 287 | def abbr(text): 288 | 289 | if text is None or len(text) < 1: 290 | return 0 291 | 292 | if len(text) <= 4 and text[-1] == ".": 293 | return True 294 | else: 295 | return False 296 | 297 | # Boolean generation functions based on generic input 298 | def b(v): 299 | return 'yes' if v else 'no' -------------------------------------------------------------------------------- /crf_baseline/code/feature_extraction_words.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Generator of features, relies on feature_extraction_supporting_functions_words 4 | 5 | Compatible with sklearn_crfsuite 6 | """ 7 | __author__ = """Giovanni Colavizza""" 8 | 9 | from code.feature_extraction_supporting_functions_words import * 10 | 11 | def generate_featuresFull(word,n_id,defval=''): 12 | """ 13 | Creates a set of features for a given token. 14 | 15 | :param word: token 16 | :param n_id: reference number of token in feature window 17 | :param def_val: default value for missing features 18 | :return: The token and its features 19 | """ 20 | 21 | v = {'w%d'%n_id: word} 22 | # Lowercased token. 23 | v['wl%d'%n_id] = v['w%d'%n_id].lower() 24 | # Token shape. 25 | v['shape%d'%n_id] = get_shape(v['w%d'%n_id]) 26 | # Token shape degenerated. 27 | v['shaped%d'%n_id] = degenerate(v['shape%d'%n_id]) 28 | # Token type. 29 | v['type%d'%n_id] = get_type(v['w%d'%n_id]) 30 | 31 | # Prefixes (length between one to four). 32 | v['p1%d'%n_id] = v['w%d'%n_id][0] if len(v['w%d'%n_id]) >= 1 else defval 33 | v['p2%d'%n_id] = v['w%d'%n_id][:2] if len(v['w%d'%n_id]) >= 2 else defval 34 | v['p3%d'%n_id] = v['w%d'%n_id][:3] if len(v['w%d'%n_id]) >= 3 else defval 35 | v['p4%d'%n_id] = v['w%d'%n_id][:4] if len(v['w%d'%n_id]) >= 4 else defval 36 | 37 | # Suffixes (length between one to four). 38 | v['s1%d'%n_id] = v['w%d'%n_id][-1] if len(v['w%d'%n_id]) >= 1 else defval 39 | v['s2%d'%n_id] = v['w%d'%n_id][-2:] if len(v['w%d'%n_id]) >= 2 else defval 40 | v['s3%d'%n_id] = v['w%d'%n_id][-3:] if len(v['w%d'%n_id]) >= 3 else defval 41 | v['s4%d'%n_id] = v['w%d'%n_id][-4:] if len(v['w%d'%n_id]) >= 4 else defval 42 | 43 | # Two digits 44 | v['2d%d'%n_id] = b(get_2d(v['w%d'%n_id])) 45 | # Four digits. 46 | v['4d%d'%n_id] = b(get_4d(v['w%d'%n_id])) 47 | # Has a number with parentheses 48 | v['4d%d'%n_id] = b(get_parYear(v['w%d'%n_id])) 49 | # Alphanumeric token. 50 | v['d&a%d'%n_id] = b(get_da(v['w%d'%n_id])) 51 | # Digits and '-'. 52 | v['d&-%d'%n_id] = b(get_dand(v['w%d'%n_id], '-')) 53 | # Digits and '/'. 54 | v['d&/%d'%n_id] = b(get_dand(v['w%d'%n_id], '/')) 55 | # Digits and ','. 56 | v['d&,%d'%n_id] = b(get_dand(v['w%d'%n_id], ',')) 57 | # Digits and '.'. 58 | v['d&.%d'%n_id] = b(get_dand(v['w%d'%n_id], '.')) 59 | # A uppercase letter followed by '.' 60 | v['up%d'%n_id] = b(get_capperiod(v['w%d'%n_id])) 61 | 62 | # An initial uppercase letter. 63 | v['iu%d'%n_id] = b(v['w%d'%n_id] and v['w%d'%n_id][0].isupper()) 64 | # All uppercase letters. 65 | v['au%d'%n_id] = b(v['w%d'%n_id].isupper()) 66 | # All lowercase letters. 67 | v['al%d'%n_id] = b(v['w%d'%n_id].islower()) 68 | # All digit letters. 69 | v['ad%d'%n_id] = b(v['w%d'%n_id].isdigit()) 70 | # All other (non-alphanumeric) letters. 71 | v['ao%d'%n_id] = b(get_all_other(v['w%d'%n_id])) 72 | 73 | # Contains a uppercase letter. 74 | v['cu%d'%n_id] = b(contains_upper(v['w%d'%n_id])) 75 | # Contains a lowercase letter. 
76 | v['cl%d'%n_id] = b(contains_lower(v['w%d'%n_id])) 77 | # Contains a alphabet letter. 78 | v['ca%d'%n_id] = b(contains_alpha(v['w%d'%n_id])) 79 | # Contains a digit. 80 | v['cd%d'%n_id] = b(contains_digit(v['w%d'%n_id])) 81 | # Contains a symbol. 82 | v['cs%d'%n_id] = b(contains_symbol(v['w%d'%n_id])) 83 | 84 | # Is abbreviation. 85 | v['ab%d'%n_id] = b(is_abbr(v['w%d'%n_id])) 86 | # Is abbreviation 2 87 | v['ab2%d'%n_id] = b(abbr(v['w%d'%n_id])) 88 | # Is Roman number. 89 | v['ro%d'%n_id] = b(is_roman(v['w%d'%n_id])) 90 | v['cont_ro%d'%n_id] = b(contains_roman(v['w%d'%n_id])) 91 | # Is Interval. 92 | v['int%d'%n_id] = b(is_interval(v['w%d'%n_id])) 93 | 94 | return v 95 | 96 | def generate_featuresLight(word,n_id,defval=''): 97 | """ 98 | Lightweight version of the above. 99 | 100 | :param word: token 101 | :param n_id: reference number of token in feature window 102 | :param def_val: default value for missing features 103 | :return: The token and its features 104 | """ 105 | 106 | v = {'w%d'%n_id: word} 107 | # Lowercased token. 108 | v['wl%d'%n_id] = v['w%d'%n_id].lower() 109 | # Token shape. 110 | v['shape%d'%n_id] = get_shape(v['w%d'%n_id]) 111 | # Token shape degenerated. 112 | v['shaped%d'%n_id] = degenerate(v['shape%d'%n_id]) 113 | # Token type. 114 | v['type%d'%n_id] = get_type(v['w%d'%n_id]) 115 | 116 | # Prefixes (length between one to four). 117 | v['p1%d'%n_id] = v['w%d'%n_id][0] if len(v['w%d'%n_id]) >= 1 else defval 118 | v['p2%d'%n_id] = v['w%d'%n_id][:2] if len(v['w%d'%n_id]) >= 2 else defval 119 | 120 | # Suffixes (length between one to four). 121 | v['s1%d'%n_id] = v['w%d'%n_id][-1] if len(v['w%d'%n_id]) >= 1 else defval 122 | v['s2%d'%n_id] = v['w%d'%n_id][-2:] if len(v['w%d'%n_id]) >= 2 else defval 123 | 124 | # Two digits 125 | v['2d%d'%n_id] = b(get_2d(v['w%d'%n_id])) 126 | # Four digits. 127 | v['4d%d'%n_id] = b(get_4d(v['w%d'%n_id])) 128 | # Alphanumeric token. 129 | v['d&a%d'%n_id] = b(get_da(v['w%d'%n_id])) 130 | # Digits and '-'. 131 | v['d&-%d'%n_id] = b(get_dand(v['w%d'%n_id], '-')) 132 | # Digits and '/'. 133 | v['d&/%d'%n_id] = b(get_dand(v['w%d'%n_id], '/')) 134 | # Digits and ','. 135 | v['d&,%d'%n_id] = b(get_dand(v['w%d'%n_id], ',')) 136 | # Digits and '.'. 137 | v['d&.%d'%n_id] = b(get_dand(v['w%d'%n_id], '.')) 138 | # A uppercase letter followed by '.' 139 | v['up%d'%n_id] = b(get_capperiod(v['w%d'%n_id])) 140 | 141 | # An initial uppercase letter. 142 | v['iu%d'%n_id] = b(v['w%d'%n_id] and v['w%d'%n_id][0].isupper()) 143 | # All uppercase letters. 144 | v['au%d'%n_id] = b(v['w%d'%n_id].isupper()) 145 | # All lowercase letters. 146 | v['al%d'%n_id] = b(v['w%d'%n_id].islower()) 147 | # All digit letters. 148 | v['ad%d'%n_id] = b(v['w%d'%n_id].isdigit()) 149 | # All other (non-alphanumeric) letters. 150 | v['ao%d'%n_id] = b(get_all_other(v['w%d'%n_id])) 151 | 152 | # Contains a uppercase letter. 153 | v['cu%d'%n_id] = b(contains_upper(v['w%d'%n_id])) 154 | # Contains a lowercase letter. 155 | v['cl%d'%n_id] = b(contains_lower(v['w%d'%n_id])) 156 | # Contains a alphabet letter. 157 | v['ca%d'%n_id] = b(contains_alpha(v['w%d'%n_id])) 158 | # Contains a digit. 159 | v['cd%d'%n_id] = b(contains_digit(v['w%d'%n_id])) 160 | # Contains a symbol. 161 | v['cs%d'%n_id] = b(contains_symbol(v['w%d'%n_id])) 162 | 163 | # Is abbreviation. 164 | v['ab%d'%n_id] = b(is_abbr(v['w%d'%n_id])) 165 | # Is abbreviation 2 166 | v['ab2%d'%n_id] = b(abbr(v['w%d'%n_id])) 167 | # Is Roman number. 168 | v['ro%d'%n_id] = b(is_roman(v['w%d'%n_id])) 169 | # Is Interval. 
170 | v['int%d'%n_id] = b(is_interval(v['w%d'%n_id])) 171 | 172 | # remove tags and lowercase: language independence 173 | del v['w%d'%n_id] 174 | del v['wl%d'%n_id] 175 | 176 | return v 177 | 178 | 179 | 180 | def word2features(sequence, i, extra_labels=[], window=2, feature_function=generate_featuresFull): 181 | """ 182 | Takes a dataset from a specific document and exports its features for parsing. 183 | 184 | :param sequence: a list of tokens 185 | :param i: index of token in sequence 186 | :param extra_labels: list of labels to assign to the text 187 | :param window: window to consider of preceding and following tokens (e.g. 2 means features for tokens -2 to 2 included will be generated) 188 | :return: dictionary of token features 189 | """ 190 | 191 | """ 192 | Template of data coming in (4 sequences): 193 | 194 | ['piGNATTi', 'T', '.,', 'Le', 'pitture', 'di', 'Paolo', 'Vero', '-'], 195 | ['nese', 'nella', 'chiesa', 'di', 'S', '.', 'Sebastiano', 'in'], 196 | ['Venezia', ',', 'Milano', '1966', '.'], 197 | ['piGNATTi', 't', '.,', 'Paolo', 'Veronese', ',', 'Milano']] 198 | 199 | """ 200 | 201 | if len(extra_labels) > 0: 202 | assert len(text) == len(extra_labels) 203 | 204 | word = sequence[i] 205 | position_in_sequence = i 206 | 207 | 208 | features = feature_function(word, 0) 209 | features.update({ 210 | 'position': position_in_sequence, 211 | }) 212 | 213 | # Extra Labels to add 214 | if len(extra_labels) > 0: 215 | features.update({'tag': extra_labels[i]}) 216 | 217 | 218 | if i == 0: 219 | features['BOS'] = True # Begin of Sequence 220 | else: 221 | for n in range(-window,0): 222 | if i+n >= 0: 223 | word = sequence[i+n] 224 | features.update(feature_function(word,n)) 225 | # Extra labels 226 | if len(extra_labels) > 0: 227 | features.update({"tag%s"%n:extra_labels[i+n]}) 228 | 229 | if i == len(sequence)-1: 230 | features['EOS'] = True # End of sequence 231 | for n in range(1,window+1): 232 | if i+n < len(sequence)-1: 233 | word = sequence[i+n] 234 | features.update(feature_function(word,n)) 235 | # Extra labels 236 | if len(extra_labels) > 0: 237 | features.update({"tag%s"%n:extra_labels[i+n]}) 238 | 239 | 240 | return features 241 | -------------------------------------------------------------------------------- /crf_baseline/code/utils.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | 4 | def setPrintToFile(filename): 5 | stdout_original = sys.stdout 6 | f = open(filename, 'w') 7 | sys.stdout = f 8 | return f,stdout_original 9 | 10 | 11 | def closePrintToFile(f, stdout_original): 12 | sys.stdout = stdout_original 13 | f.close() 14 | 15 | 16 | def load_data(file): 17 | words = [] 18 | tags_1 = [] 19 | tags_2 = [] 20 | tags_3 = [] 21 | tags_4 = [] 22 | 23 | word = tags1 = tags2 = tags3 = tags4 = [] 24 | with open (file, "r") as file: 25 | for line in file: 26 | if 'DOCSTART' not in line: #Do not take the first line into consideration 27 | # Check if empty line 28 | if line in ['\n', '\r\n']: 29 | # Append line 30 | words.append(word) 31 | tags_1.append(tags1) 32 | tags_2.append(tags2) 33 | tags_3.append(tags3) 34 | tags_4.append(tags4) 35 | 36 | # Reset 37 | word = [] 38 | tags1 = [] 39 | tags2 = [] 40 | tags3 = [] 41 | tags4 = [] 42 | 43 | else: 44 | # Split the line into words, tag #1, tag #2, tag #3 45 | w = line[:-1].split(" ") 46 | word.append(w[0]) 47 | tags1.append(w[1]) 48 | tags2.append(w[2]) 49 | tags3.append(w[3]) 50 | tags4.append(w[4]) 51 | 52 | return words,tags_1,tags_2,tags_3,tags_4 
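The scripts below (*main_finetune.py*, *main_threeTasks.py* and *validation.py*) combine `load_data` with `word2features` to build the input format expected by sklearn-crfsuite. The following is a minimal usage sketch of that pattern on a single hand-written sequence (the example sentence from the top-level README); it assumes the `crf_baseline` directory is the working directory so that the local `code` package is importable.

    # Sketch: build sklearn-crfsuite features for one sequence, mirroring the
    # pattern used in main_threeTasks.py (window of +/-2 tokens per position).
    from code.feature_extraction_words import word2features

    sequence = ["G", ".", "Ostrogorsky", ",", "History", "of", "the",
                "Byzantine", "State", ",", "Rutgers", "University", "Press",
                ",", "1986", "."]

    # One feature dictionary per token; sklearn-crfsuite expects a list of such
    # lists, one inner list per sequence.
    X = [[word2features(sequence, i, window=2) for i in range(len(sequence))]]

    print(X[0][2]["w0"])   # current token: 'Ostrogorsky'
    print(X[0][2]["iu0"])  # initial-uppercase feature: 'yes'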
-------------------------------------------------------------------------------- /crf_baseline/main_finetune.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import time 5 | 6 | # Plot 7 | import matplotlib 8 | matplotlib.use('agg') 9 | import matplotlib.pyplot as plt 10 | 11 | # CRF 12 | import sklearn_crfsuite 13 | from sklearn_crfsuite import scorers, metrics 14 | from sklearn.metrics import make_scorer, confusion_matrix 15 | from sklearn.externals import joblib 16 | from sklearn.model_selection import RandomizedSearchCV 17 | 18 | # For model validation 19 | import scipy 20 | 21 | 22 | # Utils functions 23 | from code.feature_extraction_supporting_functions_words import * 24 | from code.feature_extraction_words import * 25 | from code.utils import * 26 | 27 | 28 | # Load entire data 29 | X_train_w, train_t1, train_t2, train_t3 = load_data("../dataset/clean_train.txt") 30 | X_test_w, test_t1, test_t2, test_t3= load_data("../dataset/clean_test.txt") 31 | 32 | 33 | for task in ["t3", "t2", "t1"]: #Ordered according to increase computation time 34 | 35 | print("=========================== Task {0} ========================= Start:{1}".format(task, time.strftime("%D %H:%M:%S"))) 36 | # Set file 37 | file, stdout_original = setPrintToFile("results/CRF_model_task_{0}.txt".format(task)) 38 | 39 | # Task data 40 | y_train = eval("train_"+task) 41 | y_test = eval("test_"+task) 42 | 43 | # Build CRF data format 44 | window = 2 # the window of dependance for the CRFs 45 | X_train = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_train_w] 46 | X_test = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_test_w] 47 | 48 | 49 | print("Training data - number of lines: ", len(X_train)) 50 | print("Testing data - number of lines: ", len(X_test)) 51 | print('----') 52 | print("Training data - number of tokens: ", len([x for y in X_train for x in y])) 53 | print("Testing data - number of tokens: ", len([x for y in X_test for x in y])) 54 | print() 55 | print() 56 | 57 | 58 | 59 | # CRF Model : Fine-tuning c1 and c2 60 | 61 | # Parameters search (Based on https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization) 62 | crf = sklearn_crfsuite.CRF( 63 | max_iterations=100, 64 | algorithm = 'lbfgs', 65 | all_possible_transitions=False 66 | ) 67 | 68 | params_space = { 69 | 'c1': scipy.stats.expon(scale=0.5), 70 | 'c2': scipy.stats.expon(scale=0.05) 71 | } 72 | 73 | scorer = make_scorer(metrics.flat_f1_score, average='weighted') 74 | 75 | # search 76 | rs = RandomizedSearchCV(crf, params_space, 77 | cv=3, 78 | verbose=1, 79 | n_jobs=-10, 80 | n_iter=5, 81 | scoring=scorer) 82 | rs.fit(X_train, y_train) 83 | 84 | print('best params:', rs.best_params_) 85 | print('best CV score:', rs.best_score_) 86 | 87 | 88 | # Create score plot 89 | _x = [s.parameters['c1'] for s in rs.grid_scores_] 90 | _y = [s.parameters['c2'] for s in rs.grid_scores_] 91 | _c = [s.mean_validation_score for s in rs.grid_scores_] 92 | 93 | fig = plt.figure() 94 | fig.set_size_inches(12, 12) 95 | ax = plt.gca() 96 | ax.set_xlabel('C1') 97 | ax.set_ylabel('C2') 98 | ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(min(_c), max(_c))) 99 | ax.scatter(_x, _y, c=_c, s=60, cmap="bwr_r") 100 | print("F1 scores: Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c))) 101 | 
plt.savefig("plots/plot_fine_tuning_task_{0}".format(task)) 102 | #Save plot 103 | 104 | 105 | 106 | 107 | # Testing with best parameters 108 | y_pred = rs.best_estimator_.predict(X_test) 109 | print('best params:', rs.best_params_) 110 | print('best CV score:', rs.best_score_) 111 | print(metrics.flat_classification_report( 112 | y_test, y_pred, digits=3 113 | )) 114 | 115 | 116 | # Save best model 117 | joblib.dump(rs.best_estimator_,'models/crf_{0}.pkl'.format(task)) 118 | 119 | 120 | # Close file 121 | closePrintToFile(file, stdout_original) 122 | print("=========================== Task {0} ========================= End:{1}".format(task, time.strftime("%D %H:%M:%S"))) 123 | -------------------------------------------------------------------------------- /crf_baseline/main_threeTasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | import matplotlib 6 | import time 7 | 8 | # CRF 9 | import sklearn_crfsuite 10 | from sklearn_crfsuite import scorers, metrics 11 | from sklearn.metrics import make_scorer, confusion_matrix 12 | from sklearn.externals import joblib 13 | 14 | 15 | # Utils functions 16 | from code.feature_extraction_supporting_functions_words import * 17 | from code.feature_extraction_words import * 18 | from code.utils import * 19 | 20 | 21 | # Load entire data 22 | X_train_w, train_t1, train_t2, train_t3 = load_data("../dataset/clean_train.txt") 23 | X_test_w, test_t1, test_t2, test_t3= load_data("../dataset/clean_test.txt") 24 | 25 | 26 | for task in ["t1", "t2", "t3"]: 27 | 28 | print("=========================== Task {0} ========================= Start:{1}".format(task, time.strftime("%D %H:%M:%S"))) 29 | # Set file 30 | file, stdout_original = setPrintToFile("results/CRF_model_task_{0}.txt".format(task)) 31 | 32 | # Task data 33 | y_train = eval("train_"+task) 34 | y_test = eval("test_"+task) 35 | 36 | # Build CRF data format 37 | window = 2 # the window of dependance for the CRFs 38 | X_train = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_train_w] 39 | X_test = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_test_w] 40 | 41 | 42 | print("Training data - number of lines: ", len(X_train)) 43 | print("Testing data - number of lines: ", len(X_test)) 44 | print('----') 45 | print("Training data - number of tokens: ", len([x for y in X_train for x in y])) 46 | print("Testing data - number of tokens: ", len([x for y in X_test for x in y])) 47 | print() 48 | print() 49 | 50 | 51 | 52 | # CRF Model 53 | 54 | crf = sklearn_crfsuite.CRF( 55 | algorithm='lbfgs', 56 | c1=0.1, 57 | c2=0.1, 58 | max_iterations=100, 59 | all_possible_transitions=False 60 | ) 61 | crf.fit(X_train, y_train) 62 | 63 | # Save CRF model 64 | joblib.dump(crf,'models/crf_{0}.pkl'.format(task)) 65 | 66 | 67 | 68 | # Testing 69 | y_pred = crf.predict(X_test) 70 | print(metrics.flat_classification_report( 71 | y_test, y_pred, digits=3 72 | )) 73 | 74 | 75 | # Close file 76 | closePrintToFile(file, stdout_original) 77 | print("=========================== Task {0} ========================= End:{1}".format(task, time.strftime("%D %H:%M:%S"))) -------------------------------------------------------------------------------- /crf_baseline/validation.py: -------------------------------------------------------------------------------- 1 | import random 2 | 3 | import numpy as np 4 | import time 5 | 6 | # Python objects 7 | import 
pickle 8 | 9 | 10 | # Plot 11 | import matplotlib 12 | matplotlib.use('agg') 13 | import matplotlib.pyplot as plt 14 | 15 | # CRF 16 | import sklearn_crfsuite 17 | from sklearn_crfsuite import scorers, metrics 18 | from sklearn.metrics import make_scorer, confusion_matrix 19 | from sklearn.externals import joblib 20 | from sklearn.model_selection import RandomizedSearchCV 21 | 22 | # For model validation 23 | import scipy 24 | 25 | 26 | # Utils functions 27 | from code.feature_extraction_supporting_functions_words import * 28 | from code.feature_extraction_words import * 29 | from code.utils import * 30 | 31 | 32 | 33 | # Load validation data 34 | window = 2 35 | X_valid_w, valid_t1, valid_t2, valid_t3 = load_data("../dataset/clean_valid.txt") 36 | X_valid = [[word2features(text, i, window=window) for i in range(len(text))] for text in X_valid_w] 37 | 38 | 39 | 40 | # TASK 1 41 | y_valid = valid_t1 42 | crf = pickle.load(open("models/crf_t1.pkl", "rb" )) 43 | print(crf) 44 | y_pred = crf.predict(X_valid) 45 | print(metrics.flat_classification_report( 46 | y_valid, y_pred, digits=6 47 | )) 48 | 49 | # Task 2 50 | y_valid = valid_t2 51 | crf = pickle.load(open("models/crf_t2.pkl", "rb" )) 52 | print(crf) 53 | y_pred = crf.predict(X_valid) 54 | print(metrics.flat_classification_report( 55 | y_valid, y_pred, digits=6 56 | )) 57 | 58 | 59 | # Task 3 60 | y_valid = valid_t3 61 | crf = pickle.load(open("models/crf_t3.pkl", "rb" )) 62 | print(crf) 63 | y_pred = crf.predict(X_valid) 64 | print(metrics.flat_classification_report( 65 | y_valid, y_pred, digits=6 66 | )) 67 | -------------------------------------------------------------------------------- /dataset.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dhlab-epfl/LinkedBooksDeepReferenceParsing/9411db4e918baffa361895c50ae9ce2046fafc3c/dataset.tar.gz -------------------------------------------------------------------------------- /keras/README.md: -------------------------------------------------------------------------------- 1 | ## Keras 2 | 3 | The directory contains code to run the models with a Keras implementation and a Tensorlow backend. Code for both single and multitask models are given. For single tasks, one model for each of the three tasks will be computed by running the python script *main_threeTasks.py*. The multitask learning model can be computed with the script *main_multiTaskLearning.py*. 4 | 5 | The data is expected to be in a *dataset* folder, in the main repository directory, with three files inside it: *clean_train.txt* for the training dataset, *clean_test.txt* for the testing dataset, and *clean_valid.txt* for the validation dataset. Inside the dataset folder, a *pretrained_vectors* folder is expected, with two files inside it: *vecs_100.txt* and *vecs_300.txt*. 6 | 7 | The results will be stored into the *model_results* folder, with one directory created for each model. 8 | 9 | python main_threeTasks.py 10 | python main_multiTaskLearning.py 11 | 12 | ## Contents 13 | * `README.md` this file. 14 | * `code/` 15 | * [models](code/models.py) code to create, train and validated NN models. 16 | * [utils](code/utils.py) utility functions to run the models. 17 | * [main_multiTaskLearning](main_multiTaskLearning.py) python script to run the multi-task model. 18 | * [main_threeTasks](main_threeTasks.py) python script to train one NN model for each task. 
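As a quick sanity check before launching either script, the data layout described above can be verified with a short snippet such as the one below (a sketch only; the file names are the ones listed in this README, and the paths are assumed to be resolved relative to the directory from which the scripts are run):

    # Sketch: check that the expected dataset files are in place
    # (file names taken from the description above; adjust the base path to
    # wherever the dataset folder sits relative to your working directory).
    import os

    expected = [
        "dataset/clean_train.txt",
        "dataset/clean_test.txt",
        "dataset/clean_valid.txt",
        "dataset/pretrained_vectors/vecs_100.txt",
        "dataset/pretrained_vectors/vecs_300.txt",
    ]

    missing = [p for p in expected if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("Missing expected files: " + ", ".join(missing))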
19 | 20 | ## Dependencies 21 | * Keras : version 2.1.1 22 | * TensorFlow: 1.4.0 23 | * Numpy: 1.13.3 24 | * [Keras contrib](https://github.com/keras-team/keras-contrib) Keras contrib : 0.0.2 25 | * Sklearn : 0.19.1 26 | * [Sklearn crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/index.html) Sklearn crfsuite : 0.3.6 27 | * Python 3.5 28 | -------------------------------------------------------------------------------- /keras/code/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | -------------------------------------------------------------------------------- /keras/code/models.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """ 4 | Functions for building Keras models 5 | """ 6 | 7 | import os 8 | import random 9 | import numpy as np 10 | import tensorflow 11 | random.seed(42) 12 | np.random.seed(42) 13 | tensorflow.set_random_seed(42) 14 | 15 | # Keras function 16 | from keras.callbacks import EarlyStopping 17 | from keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, Input, TimeDistributed, Flatten, Convolution1D, MaxPooling1D, concatenate 18 | from keras.models import Sequential, Model 19 | from keras.optimizers import Adam, RMSprop 20 | from keras_contrib.layers import CRF 21 | from keras_contrib.utils import save_load_utils 22 | 23 | from sklearn.metrics import confusion_matrix 24 | from sklearn_crfsuite import metrics 25 | 26 | # Utils script 27 | from code.utils import * 28 | 29 | 30 | 31 | def BiLSTM_model(filename, train, output, 32 | X_train, X_test, word2ind, maxWords, 33 | y_train, y_test, ind2label, 34 | validation=False, X_valid=None, y_valid=None, 35 | word_embeddings=True, pretrained_embedding="", word_embedding_size=100, 36 | maxChar=0, char_embedding_type="", char2ind="", char_embedding_size=50, 37 | lstm_hidden=32, nbr_epochs=5, batch_size=32, dropout=0, optimizer='rmsprop', early_stopping_patience=-1, 38 | folder_path="model_results", gen_confusion_matrix=False, printPadding=False 39 | ): 40 | """ 41 | Build, train and test a BiLSTM Keras model. Works for multi-tasking learning. 42 | The model architecture looks like: 43 | 44 | - Words representations: 45 | - Word embeddings 46 | - Character-level representation [Optional] 47 | - Dropout 48 | - Bidirectional LSTM 49 | - Dropout 50 | - Softmax/CRF for predictions 51 | 52 | 53 | :param filename: File to redirect the printing 54 | :param train: Boolean if the model must be trained or not. If False, the model's wieght are expected to be stored in "folder_path/filename/filename.h5" 55 | :param otput: "crf" or "softmax". Type of prediction layer to use 56 | 57 | :param X_train: Data to train the model 58 | :param X_test: Data to test the model 59 | :param word2ind: Dictionary containing all words in the training data and a unique integer per word 60 | :param maxWords: Maximum number of words in a sequence 61 | 62 | :param y_train: Labels to train the model for the prediction task 63 | :param y_test: Labels to test the model for the prediction task 64 | :param ind2label: Dictionary where all labels for task 1 are mapped into a unique integer 65 | 66 | :param validation: Boolean. If true, the validation score will be computed from 'X_valid' and 'y_valid' 67 | :param X_valid: Optional. Validation dataset 68 | :param y_valid: Optional. Validation dataset labels 69 | 70 | :param word_embeddings: Boolean value. 
Add word embeddings into the model. 71 | :param pretrained_embedding: Use the pretrained word embeddings. 72 | Three values: 73 | - "": Do not use pre-trained word embeddings (Default) 74 | - False: Use the pre-trained embedding vectors as the weights in the Embedding layer 75 | - True: Use the pre-trained embedding vectors as weight initialiers. The Embedding layer will still be trained. 76 | :param word_embedding_size: Size of the pre-trained word embedding to use (100 or 300) 77 | 78 | :param maxChar: The maximum numbers of characters in a word. If set to 0, the model will not use character-level representations of the words 79 | :param char_embedding_type: Type of model to use in order to compute the character-level representation of words: Two values: "CNN" or "BILSTM" 80 | :param char2ind: A dictionary where each character is maped into a unique integer 81 | :param char_embedding_size: size of the character-level word representations 82 | 83 | :param lstm_hidden: Dimentionality of the LSTM output space 84 | :param nbr_epochs: Number of epochs to train the model 85 | :param batch_size: Size of batches while training the model 86 | :param dropout: Rate to apply for each Dropout layer in the model 87 | :param optimizer: Optimizer to use while compiling the model 88 | :param early_stopping_patience: Number of continuous tolerated epochs without improvement during training. 89 | 90 | :param folder_path: Path to the directory storing all to-be-generated files 91 | :param gen_confusion_matrix: Boolean value. Generated confusion matrices or not. 92 | :param printPadding: Boolean. Prints the classification matrix taking padding as a possible label. 93 | 94 | 95 | :return: The classification scores for both tasks. 96 | """ 97 | print("====== {0} start ======".format(filename)) 98 | end_string = "====== {0} end ======".format(filename) 99 | 100 | # Create directory to store results 101 | os.makedirs(folder_path+"/"+filename) 102 | filepath = folder_path+"/"+filename+"/"+filename 103 | 104 | # Set print outputs file 105 | file, stdout_original = setPrintToFile("{0}.txt".format(filepath)) 106 | 107 | # Model params 108 | nbr_words = len(word2ind)+1 109 | out_size = len(ind2label)+1 110 | best_results = "" 111 | 112 | embeddings_list = [] 113 | inputs = [] 114 | 115 | # Input - Word Embeddings 116 | if word_embeddings: 117 | word_input = Input((maxWords,)) 118 | inputs.append(word_input) 119 | if pretrained_embedding=="": 120 | word_embedding = Embedding(nbr_words, word_embedding_size)(word_input) 121 | else: 122 | # Retrieve embeddings 123 | embedding_matrix = word2VecEmbeddings(word2ind, word_embedding_size) 124 | word_embedding = Embedding(nbr_words, word_embedding_size, weights=[embedding_matrix], trainable=pretrained_embedding, mask_zero=False)(word_input) 125 | embeddings_list.append(word_embedding) 126 | 127 | # Input - Characters Embeddings 128 | if maxChar!=0: 129 | character_input = Input((maxWords,maxChar,)) 130 | char_embedding = character_embedding_layer(char_embedding_type, character_input, maxChar, len(char2ind)+1, char_embedding_size) 131 | embeddings_list.append(char_embedding) 132 | inputs.append(character_input) 133 | 134 | # Model - Inner Layers - BiLSTM with Dropout 135 | embeddings = concatenate(embeddings_list) if len(embeddings_list)==2 else embeddings_list[0] 136 | model = Dropout(dropout)(embeddings) 137 | model = Bidirectional(LSTM(lstm_hidden, return_sequences=True, dropout=dropout))(model) 138 | model = Dropout(dropout)(model) 139 | 140 | 141 | if output == "crf": 142 
| # Output - CRF 143 | crfs = [[CRF(out_size),out_size] for out_size in [len(x)+1 for x in ind2label]] 144 | outputs = [x[0](Dense(x[1])(model)) for x in crfs] 145 | model_loss = [x[0].loss_function for x in crfs] 146 | model_metrics = [x[0].viterbi_acc for x in crfs] 147 | 148 | if output == "softmax": 149 | outputs = [Dense(out_size, activation='softmax')(model) for out_size in [len(x)+1 for x in ind2label]] 150 | model_loss = ['categorical_crossentropy' for x in outputs] 151 | model_metrics = None 152 | 153 | # Model 154 | model = Model(inputs=inputs, outputs=outputs) 155 | model.compile(loss=model_loss, metrics=model_metrics, optimizer=get_optimizer(optimizer)) 156 | print(model.summary(line_length=150),"\n\n\n\n") 157 | 158 | 159 | # Training Callbacks: 160 | callbacks = [] 161 | value_to_monitor = 'val_f1' 162 | best_model_weights_path = "{0}.h5".format(filepath) 163 | 164 | # 1) Classifition scores 165 | classification_scores = Classification_Scores([X_train, y_train], ind2label, best_model_weights_path) 166 | callbacks.append(classification_scores) 167 | 168 | # 2) EarlyStopping 169 | if early_stopping_patience != -1: 170 | early_stopping = EarlyStopping(monitor=value_to_monitor, patience=early_stopping_patience, mode='max') 171 | callbacks.append(early_stopping) 172 | 173 | 174 | # Train 175 | if train: 176 | # Train the model. Keras's method argument 'validation_data' is referred as 'testing data' in this code. 177 | hist = model.fit(X_train, y_train, validation_data=[X_test, y_test], epochs=nbr_epochs, batch_size=batch_size, callbacks=callbacks, verbose=2) 178 | 179 | print() 180 | print('-------------------------------------------') 181 | print("Best F1 score:", early_stopping.best, " (epoch number {0})".format(1+np.argmax(hist.history[value_to_monitor]))) 182 | 183 | # Save Training scores 184 | save_model_training_scores("{0}".format(filepath), hist, classification_scores) 185 | 186 | # Print best testing classification report 187 | best_epoch = np.argmax(hist.history[value_to_monitor]) 188 | print(classification_scores.test_report[best_epoch]) 189 | 190 | 191 | # Best epoch results 192 | best_results = model_best_scores(classification_scores, best_epoch) 193 | 194 | # Load weigths from best training epoch into model 195 | save_load_utils.load_all_weights(model, best_model_weights_path) 196 | 197 | # Create confusion matrices 198 | if gen_confusion_matrix: 199 | for i, y_target in enumerate(y_test): 200 | # Compute predictions, flatten 201 | predictions, target = compute_predictions(model, X_test, y_target, ind2label[i]) 202 | # Generate confusion matrices 203 | save_confusion_matrix(target, predictions, list(ind2label[i].values()), "{0}_task_{1}_confusion_matrix_test".format(filepath,str(i+1))) 204 | 205 | 206 | # Validation dataset 207 | if validation: 208 | print() 209 | print("Validation dataset") 210 | print("======================") 211 | # Compute classification report 212 | for i, y_target in enumerate(y_valid): 213 | # Compute predictions, flatten 214 | predictions, target = compute_predictions(model, X_valid, y_target, ind2label[i], nbrTask=i) 215 | 216 | # Only for multi-task 217 | if len(y_train) > 1: 218 | print("For task "+str(i+1)+"\n") 219 | print("====================================================================================") 220 | 221 | print("") 222 | if printPadding: 223 | print("With padding into account") 224 | print(metrics.flat_classification_report([target], [predictions], digits=4)) 225 | print("") 226 | 
print('----------------------------------------------') 227 | print("") 228 | print("Without the padding:") 229 | print(metrics.flat_classification_report([target], [predictions], digits=4, labels=list(ind2label[i].values()))) 230 | 231 | # Generate confusion matrices 232 | save_confusion_matrix(target, predictions, list(ind2label[i].values()), "{0}_task_{1}_confusion_matrix_validation".format(filepath,str(i+1))) 233 | 234 | 235 | # Close file 236 | closePrintToFile(file, stdout_original) 237 | print(end_string) 238 | 239 | return best_results 240 | 241 | 242 | 243 | 244 | def character_embedding_layer(layer_type, character_input, maxChar, nbr_chars, char_embedding_size, 245 | cnn_kernel_size=2, cnn_filters=30, lstm_units=50): 246 | """ 247 | Return layer for computing the character-level representations of words. 248 | 249 | There is two type of architectures: 250 | 251 | Architecture CNN: 252 | - Character Embeddings 253 | - Flatten 254 | - Convolution 255 | - MaxPool 256 | 257 | Architecture BILSTM: 258 | - Character Embeddings 259 | - Flatten 260 | - Bidirectional LSTM 261 | 262 | :param layer_type: Model architecture to use "CNN" or "BILSTM" 263 | :param character_input: Keras Input layer, size of the input 264 | :param maxChar: The maximum numbers of characters in a word. If set to 0, the model will not use character-level representations of the words 265 | :param nbr_chars: Numbers of unique characters present in the data 266 | :param char_embedding_size: size of the character-level word representations 267 | :param cnn_kernel_size: For the CNN architecture, size of the kernel in the Convolution layer 268 | :param cnn_filters: For the CNN architecture, number of filters in the Convolution layer 269 | :param lstm_units: For the BILSTM architecture, dimensionality of the output LSTM space (half of the Bidirectinal LSTM output space) 270 | 271 | :return: Character-level representation layers 272 | """ 273 | 274 | embed_char_out = TimeDistributed(Embedding(nbr_chars, char_embedding_size), name='char_embedding')(character_input) 275 | embed_char = TimeDistributed(Flatten())(embed_char_out) 276 | 277 | if layer_type == "CNN": 278 | conv1d_out = TimeDistributed(Convolution1D(kernel_size=cnn_kernel_size, filters=cnn_filters, padding='same'))(embed_char) 279 | char_emb = TimeDistributed(MaxPooling1D(maxChar))(conv1d_out) 280 | 281 | if layer_type == "BILSTM": 282 | char_emb = Bidirectional(LSTM(lstm_units,return_sequences=True))(embed_char) 283 | 284 | return char_emb 285 | 286 | 287 | 288 | 289 | def get_optimizer(type, learning_rate=0.001, decay=0.0): 290 | """ 291 | Return the optimizer needeed to compile Keras models. 292 | 293 | :param type: Type of optimizer. Two types supported: 'ADAM' and 'RMSprop' 294 | :param learning_rate: float >= 0. Learning rate. 295 | :pram decay:float >= 0. Learning rate decay over each update 296 | 297 | :return: The optimizer to use directly into keras model compiling function. 
298 | """ 299 | 300 | if type == "adam": 301 | return Adam(lr=learning_rate, decay=decay) 302 | 303 | if type == "rmsprop": 304 | return RMSprop(lr=learning_rate, decay=decay) 305 | 306 | -------------------------------------------------------------------------------- /keras/code/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """ 4 | Support functions for dealing with data and building models 5 | """ 6 | 7 | import random 8 | import numpy as np 9 | import tensorflow 10 | random.seed(42) 11 | np.random.seed(42) 12 | tensorflow.set_random_seed(42) 13 | 14 | import sys 15 | import csv 16 | import itertools 17 | 18 | from keras.callbacks import Callback 19 | from keras.preprocessing.sequence import pad_sequences 20 | from keras_contrib.utils import save_load_utils 21 | from sklearn.metrics import precision_recall_fscore_support, confusion_matrix 22 | from sklearn_crfsuite import metrics 23 | 24 | # Plot 25 | import matplotlib 26 | matplotlib.use('agg') 27 | import matplotlib.pyplot as plt 28 | 29 | 30 | 31 | def load_data(filepath): 32 | """ 33 | Load and return the data stored in the given path. 34 | The data is structured as follows: 35 | Each line contains four columns separated by a single space. 36 | Each word has been put on a separate line and there is an empty line after each sentence. 37 | The first item on each line is a word, the second, third and fourth are tags related to the word. 38 | Example: 39 | The sentence "L. Antonielli, Iprefetti dell' Italia napoleonica, Bologna 1983." is represented in the dataset as: 40 | L author b-secondary b-r 41 | . author i-secondary i-r 42 | Antonielli author i-secondary i-r 43 | , author i-secondary i-r 44 | Iprefetti title i-secondary i-r 45 | dell title i-secondary i-r 46 | ’ title i-secondary i-r 47 | Italia title i-secondary i-r 48 | napoleonica title i-secondary i-r 49 | , title i-secondary i-r 50 | Bologna publicationplace i-secondary i-r 51 | 1983 year e-secondary i-r 52 | . year e-secondary e-r 53 | 54 | :param filepath: Path to the data 55 | :return: Four arrays: The first one contains sentences (one array of words per sentence) and the other threes are arrays of tags. 
56 | 57 | """ 58 | 59 | # Arrays to return 60 | words = [] 61 | tags_1 = [] 62 | tags_2 = [] 63 | tags_3 = [] 64 | 65 | word = tags1 = tags2 = tags3 = [] 66 | with open (filepath, "r") as file: 67 | for line in file: 68 | if 'DOCSTART' not in line: #Do not take the first line into consideration 69 | # Check if empty line 70 | if line in ['\n', '\r\n']: 71 | # Append line 72 | words.append(word) 73 | tags_1.append(tags1) 74 | tags_2.append(tags2) 75 | tags_3.append(tags3) 76 | 77 | # Reset 78 | word = [] 79 | tags1 = [] 80 | tags2 = [] 81 | tags3 = [] 82 | 83 | else: 84 | # Split the line into words, tag #1, tag #2, tag #3 85 | w = line[:-1].split(" ") 86 | word.append(w[0]) 87 | tags1.append(w[1]) 88 | tags2.append(w[2]) 89 | tags3.append(w[3]) 90 | 91 | return words,tags_1,tags_2,tags_3 92 | 93 | 94 | 95 | 96 | def setPrintToFile(filename): 97 | """ 98 | Redirect all prints into a file 99 | 100 | :param filename: File to redirect all prints 101 | :return: the file and the original print "direction" 102 | """ 103 | 104 | # Retrieve current print direction 105 | stdout_original = sys.stdout 106 | # Create file 107 | f = open(filename, 'w') 108 | # Set the new print redirection 109 | sys.stdout = f 110 | return f,stdout_original 111 | 112 | 113 | def closePrintToFile(f, stdout_original): 114 | """ 115 | Change the print direction and closes a file. 116 | 117 | :param filename: File to close 118 | :param stdout_original: Print direction 119 | """ 120 | sys.stdout = stdout_original 121 | f.close() 122 | 123 | 124 | 125 | 126 | def mergeDigits(datas, digits_word): 127 | """ 128 | All digits in the given data will be mapped to the same word 129 | 130 | :param datas: The data to transform 131 | :param digits_word: Word to map digits to 132 | :return: The data transformed data 133 | """ 134 | return [[[digits_word if x.isdigit() else x for x in w ] for w in data] for data in datas] 135 | 136 | 137 | 138 | def indexData_x(x, ukn_words): 139 | """ 140 | Map each word in the given data to a unique integer. A special index will be kept for "out-of-vocabulary" words. 141 | 142 | :param x: The data 143 | :return: Two dictionaries: one where words are keys and indexes values, another one "reversed" (keys->index, values->words) 144 | """ 145 | 146 | # Retrieve all words used in the data (with duplicates) 147 | all_text = [w for e in x for w in e] 148 | # Compute the unique words (remove duplicates) 149 | words = list(set(all_text)) 150 | print("Number of entries: ",len(all_text)) 151 | print("Individual entries: ",len(words)) 152 | 153 | # Assign an integer index for each individual word 154 | word2ind = {word: index for index, word in enumerate(words, 2)} 155 | ind2word = {index: word for index, word in enumerate(words, 2)} 156 | 157 | # To deal with out-of-vocabulary words 158 | word2ind.update({ukn_words:1}) 159 | ind2word.update({1:ukn_words}) 160 | 161 | # The index '0' is kept free in both dictionaries 162 | 163 | return word2ind, ind2word 164 | 165 | 166 | def indexData_y(y): 167 | """ 168 | Map each word in the given data to a unique integer. 
169 | 170 | :param y: The data 171 | :return: Two dictionaries: one where words are keys and indexes values, another one "reversed" (keys->index, values->words) 172 | """ 173 | 174 | # Unique attributes in the data, sort alphabetically 175 | labels_t1 = list(set([w for e in y for w in e])) 176 | labels_t1 = sorted(labels_t1, key=str.lower) 177 | print("Number of labels: ", len(labels_t1)) 178 | 179 | # Assign an integer index for each individual label 180 | label2ind = {label: index for index, label in enumerate(labels_t1, 1)} 181 | ind2label = {index: label for index, label in enumerate(labels_t1, 1)} 182 | 183 | # The index '0' is kept free in both dictionaries 184 | 185 | return label2ind, ind2label 186 | 187 | 188 | def encodePadData_x(x, word2ind, maxlen, ukn_words, padding_style): 189 | """ 190 | Transform a data of words in a data of integers, where each entrie as the same length. 191 | 192 | :param x: The data to transform 193 | :param word2ind: Dictionary to retrieve the integer for each word in the data 194 | :param maxlen: The length of each entry in the returned data 195 | :param ukn_words: Key, in the dictionary words-index, to use for words not present in the dictionary 196 | :param padding_style: Padding style to use for having each entry in the data with the same length 197 | :return: The tranformed data 198 | """ 199 | print ('Maximum sequence length - general :', maxlen) 200 | print ('Maximum sequence length - data :', max([len(xx) for xx in x])) 201 | 202 | # Encode: Map each words to the corresponding integer 203 | X_enc = [[word2ind[c] if c in word2ind.keys() else word2ind[ukn_words] for c in xx ] for xx in x] 204 | 205 | # Pad: Each entry in the data must have the same length 206 | X_encode = pad_sequences(X_enc, maxlen=maxlen, padding=padding_style) 207 | 208 | return X_encode 209 | 210 | 211 | def encodePadData_y(y, label2ind, maxlen, padding_style): 212 | """ 213 | Apply one-hot-encoding to each label in the dataset. 
Each entry will have the same length. 214 | 215 | Example: 216 | Input: label2ind={Label_A:1, Label_B:2, Label_C:3}, maxlen=4 217 | y=[ [Label_A, Label_C] , [Label_A, Label_B, Label_C] ] 218 | Output: [ [[1,0,0,0], [1,0,0,0], [0,1,0,0], [0,0,0,1]] , [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] ] (padding is added at the front, and one-hot position 0 marks padding) 219 | 220 | :param y: The data to encode 221 | :param label2ind: Dictionary where each value in the data is mapped to a unique integer 222 | :param maxlen: The length of each entry in the returned data 223 | :param padding_style: Padding style to use for having each entry in the data with the same length 224 | :return: The transformed data 225 | """ 226 | 227 | print ('Maximum sequence length - labels :', maxlen) 228 | 229 | # Encode y (with pad) 230 | def encode(x, n): 231 | """ 232 | Return an array of zeros, except for an entry set to 1 (one-hot-encode) 233 | :param x: Index entry to set to 1 234 | :param n: Length of the array to return 235 | :return: The created array 236 | """ 237 | result = np.zeros(n) 238 | result[x] = 1 239 | return result 240 | 241 | # Transform each label into its index in the data 242 | y_pad = [[0] * (maxlen - len(ey)) + [label2ind[c] for c in ey] for ey in y] 243 | # One-hot-encode label 244 | max_label = max(label2ind.values()) + 1 245 | y_enc = [[encode(c, max_label) for c in ey] for ey in y_pad] 246 | 247 | # Repad (to have numpy array) 248 | y_encode = pad_sequences(y_enc, maxlen=maxlen, padding=padding_style) 249 | 250 | return y_encode 251 | 252 | 253 | 254 | def characterLevelIndex(X, digits_word): 255 | """ 256 | Map each character present in the dataset to a unique integer. All digits are regrouped under a single entry. 257 | 258 | :param X: Data to retrieve characters from 259 | :param digits_word: Word regrouping all digits 260 | :return: A dictionary where each character is mapped to a unique integer, the maximum number of words in a sequence, and the maximum number of characters in a word 261 | """ 262 | 263 | # Create a set of all characters 264 | all_chars = list(set([c for s in X for w in s for c in w])) 265 | 266 | # Create an index for each character 267 | # The index 1 is reserved for the digits, regrouped under the word param `digits_word` 268 | char2ind = {char: index for index, char in enumerate(all_chars, 2)} 269 | ind2char = {index: char for index, char in enumerate(all_chars, 2)} 270 | 271 | # Reserve index 1 for the word regrouping all digits 272 | char2ind.update({digits_word:1}) 273 | ind2char.update({1:digits_word}) 274 | 275 | # For padding 276 | maxWords = max([len(s) for s in X]) 277 | maxChar = max([len(w) for s in X for w in s]) 278 | print("Maximum number of words in a sequence :", maxWords) 279 | print("Maximum number of characters in a word :", maxChar) 280 | 281 | return char2ind, maxWords, maxChar 282 | 283 | 284 | def characterLevelData(X, char2ind, maxWords, maxChar, digits_word, padding_style): 285 | """ 286 | For each word in the data, transform it into an array of characters. All character arrays will have the same length, and all sequences will have the same number of words. 287 | All digits will be mapped to the same character array. 288 | If a character is not present in the character-index dictionary, it is discarded. 289 | 290 | :param X: The data 291 | :param char2ind: Dictionary where each character is mapped to a unique integer 292 | :param maxWords: Maximum number of words in a sequence 293 | :param maxChar: Maximum number of characters in a word 294 | :param digits_word: Word regrouping all digits.
295 | :param padding_style: Padding style to use for having each entry in the data with the same length 296 | :return: The transformed array 297 | """ 298 | 299 | # Transform each word into an array of characters (discards those oov) 300 | X_char = [[[char2ind[c] for c in w if c in char2ind.keys()] if w!=digits_word else [1] for w in s] for s in X] 301 | 302 | # Pad words - Each words has the same number of characters 303 | X_char = pad_sequences([pad_sequences(s, maxChar, padding=padding_style) for s in X_char], maxWords, padding=padding_style) 304 | return X_char 305 | 306 | 307 | 308 | def word2VecEmbeddings(word2ind, num_features_embedding): 309 | """ 310 | Convert a file of pre-computed word embeddings into dictionary: {word -> embedding vector}. Only return words of interest. 311 | If the word isn't in the embedding, returned a zero-vector instead. 312 | 313 | :param word2ind: Dictionary {words -> index}. The keys represented the words for each embeddings will be retrieved. 314 | :param num_features_embedding: Size of the embedding vectors 315 | :return: Array of embeddings vectors. The embeddings vector at position i corresponds to the word with value i in the dictionary param `word2ind` 316 | """ 317 | 318 | # Pre-trained embeddings filepath 319 | file_path = "dataset/pretrained_vectors/vecs_{0}.txt".format(num_features_embedding) 320 | ukn_index = "$UKN$" 321 | 322 | # Read the embeddings file 323 | embeddings_all = {} 324 | with open (file_path, "r") as file: 325 | for line in file: 326 | l = line.split(' ') 327 | embeddings_all[l[0]] = l[1:] 328 | 329 | # Compute the embedding for each word in the dataset 330 | embedding_matrix = np.zeros((len(word2ind)+1, num_features_embedding)) 331 | for word, i in word2ind.items(): 332 | if word in embeddings_all: 333 | embedding_matrix[i] = embeddings_all[word] 334 | # else: 335 | # embedding_matrix[i] = embeddings_all[ukn_index] 336 | 337 | # Delete the word2vec dictionary from memory 338 | del embeddings_all 339 | 340 | return embedding_matrix 341 | 342 | 343 | 344 | class Classification_Scores(Callback): 345 | """ 346 | Add the F1 score on the testing data at the end of each epoch. 347 | In case of multi-outputs, compute the F1 score for each output layer and the mean of all F1 scores. 348 | Compute the training F1 score for each epoch. Store the results internally. 349 | Internally, the accuracy and recall scores will also be stored, both for training and testing dataset. 350 | The model's weigths for the best epoch will be save in a given folder. 351 | """ 352 | 353 | def __init__(self, train_data, ind2label, model_save_path): 354 | """ 355 | :param train_data: The data used to compute training accuracy. 
One array of two arrays => [X_train, y_train] 356 | :param ind2label: Dictionary mapping indexes to labels, used to add tag labels to the results 357 | :param model_save_path: Path to save the best model's weights 358 | """ 359 | self.train_data = train_data 360 | self.ind2label = ind2label 361 | self.model_save_path = model_save_path 362 | self.score_name = 'val_f1' 363 | 364 | 365 | 366 | def on_train_begin(self, logs={}): 367 | self.test_report = [] 368 | self.test_f1s = [] 369 | self.test_acc = [] 370 | self.test_recall = [] 371 | self.train_f1s = [] 372 | self.train_acc = [] 373 | self.train_recall = [] 374 | 375 | self.best_score = -1 376 | 377 | # Add F1-score as a metric to print at end of each epoch 378 | self.params['metrics'].append("val_f1") 379 | 380 | # In case of multiple outputs 381 | if len(self.model.output_layers) > 1: 382 | for output_layer in self.model.output_layers: 383 | self.params['metrics'].append("val_"+output_layer.name+"_f1") 384 | 385 | 386 | 387 | def compute_scores(self, pred, targ): 388 | """ 389 | Compute the precision, recall and F1 scores between the two given arrays pred and targ (targ is the ground truth) 390 | """ 391 | val_predict = np.argmax(pred, axis=-1) 392 | val_targ = np.argmax(targ, axis=-1) 393 | 394 | # Flatten arrays for sklearn 395 | predict_flat = np.ravel(val_predict) 396 | targ_flat = np.ravel(val_targ) 397 | 398 | # Compute scores (precision, recall, F1), ignoring the padding label 0 399 | return precision_recall_fscore_support(targ_flat, predict_flat, average='weighted', labels=[x for x in np.unique(targ_flat) if x!=0])[:3] 400 | 401 | 402 | def compute_epoch_training_F1(self): 403 | """ 404 | Compute and save the F1 score for the training data 405 | """ 406 | in_length = len(self.model.input_layers) 407 | out_length = len(self.model.output_layers) 408 | predictions = self.model.predict(self.train_data[0]) 409 | if len(predictions) != out_length: 410 | predictions = [predictions] 411 | 412 | vals_acc = [] 413 | vals_recall = [] 414 | vals_f1 = [] 415 | for i,pred in enumerate(predictions): 416 | _val_acc, _val_recall, _val_f1 = self.compute_scores(np.asarray(pred), self.train_data[1][i]) 417 | vals_acc.append(_val_acc) 418 | vals_recall.append(_val_recall) 419 | vals_f1.append(_val_f1) 420 | 421 | self.train_acc.append(sum(vals_acc)/len(vals_acc)) 422 | self.train_recall.append(sum(vals_recall)/len(vals_recall)) 423 | self.train_f1s.append(sum(vals_f1)/len(vals_f1)) 424 | 425 | 426 | def classification_report(self, i, pred, targ, printPadding=False): 427 | """ 428 | Compute the classification report for the given predictions. 429 | """ 430 | 431 | # Hold all classification reports 432 | reports = [] 433 | 434 | # The model predicts probabilities for each tag. Retrieve the id of the most probable tag.
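# For instance (illustrative values, not from the dataset): if pred[0] were [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
# np.argmax(pred, axis=-1)[0] would be [1, 0]: one tag index per token, mapped back to tag names below via ind2label.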
435 | pred_index = np.argmax(pred, axis=-1) 436 | # Reverse the one-hot encoding for target 437 | true_index = np.argmax(targ, axis=-1) 438 | 439 | # Index 0 in the predictions referes to padding 440 | ind2labelNew = self.ind2label[i].copy() 441 | ind2labelNew.update({0: "null"}) 442 | 443 | # Compute the labels for each prediction 444 | pred_label = [[ind2labelNew[x] for x in a] for a in pred_index] 445 | true_label = [[ind2labelNew[x] for x in b] for b in true_index] 446 | 447 | # CLASSIFICATION REPORTS 448 | reports.append("") 449 | if printPadding: 450 | reports.append("With padding into account") 451 | reports.append(metrics.flat_classification_report(true_label, pred_label, digits=4)) 452 | reports.append("") 453 | reports.append('----------------------------------------------') 454 | reports.append("") 455 | reports.append("Without the padding:") 456 | reports.append(metrics.flat_classification_report(true_label, pred_label, digits=4, labels=list(self.ind2label[i].values()))) 457 | return '\n'.join(reports) 458 | 459 | 460 | def on_epoch_end(self, epoch, logs={}): 461 | """ 462 | At the end of each epoch, compute the F1 score for the validation data. 463 | In case of multi-outputs model, compute one value per output and average all to return the overall F1 score. 464 | Same model's weights for the best epoch. 465 | """ 466 | self.compute_epoch_training_F1() 467 | in_length = len(self.model.input_layers) # X data - to predict from 468 | out_length = len(self.model.output_layers) # Number of tasks 469 | 470 | # Compute the model predictions 471 | predictions = self.model.predict(self.validation_data[:in_length]) 472 | # In case of single output 473 | if len(predictions) != out_length: 474 | predictions = [predictions] 475 | 476 | 477 | vals_acc = [] 478 | vals_recall = [] 479 | vals_f1 = [] 480 | reports = "" 481 | # Iterate over all output predictions 482 | for i,pred in enumerate(predictions): 483 | _val_acc, _val_recall, _val_f1 = self.compute_scores(np.asarray(pred), self.validation_data[in_length+i]) 484 | 485 | # Classification report 486 | reports += "For task "+str(i+1)+"\n" 487 | reports += "====================================================================================" 488 | reports += self.classification_report(i,np.asarray(pred), self.validation_data[in_length+i]) + "\n\n\n" 489 | 490 | # Add scores internally 491 | vals_acc.append(_val_acc) 492 | vals_recall.append(_val_recall) 493 | vals_f1.append(_val_f1) 494 | 495 | # Add F1 score to be log 496 | f1_name = "val_"+self.model.output_layers[i].name+"_f1" 497 | logs[f1_name] = _val_f1 498 | 499 | 500 | # Add classification reports for all the predicitions/tasks 501 | self.test_report.append(reports) 502 | 503 | # Add internally 504 | self.test_acc.append(sum(vals_acc)/len(vals_acc)) 505 | self.test_recall.append(sum(vals_recall)/len(vals_recall)) 506 | self.test_f1s.append(sum(vals_f1)/len(vals_f1)) 507 | 508 | # Add to log 509 | f1_mean = sum(vals_f1)/len(vals_f1) 510 | logs["val_f1"] = f1_mean 511 | 512 | # Save best model's weights 513 | if f1_mean > self.best_score: 514 | self.best_score = f1_mean 515 | save_load_utils.save_all_weights(self.model, self.model_save_path) 516 | 517 | 518 | 519 | def write_to_csv(filename, columns, rows): 520 | """ 521 | Create a .csv file with the data given 522 | 523 | :param filename: Path and name of the .csv file, without csv extension 524 | :param columns: Columns of the csv file (First row of the file) 525 | :param rows: Data to write into the csv file, given per row 526 | 527 
| """ 528 | with open(filename+'.csv', 'w') as csvfile: 529 | wr = csv.writer(csvfile, quoting=csv.QUOTE_ALL) 530 | wr.writerow(columns) 531 | for n in rows: 532 | wr.writerow(n) 533 | 534 | 535 | def save_model_training_scores(filename, hist, classification_scores): 536 | """ 537 | Create a .csv file containg the model training metrics for each epoch 538 | 539 | :param filename: Path and name of the .csv file without csv extension 540 | :param hist: Default model training history returned by Keras 541 | :param classification_scores: Classification_Scores instance used as callback in the model's training 542 | 543 | :return: Nothing. 544 | """ 545 | csv_values = [] 546 | 547 | csv_columns = ["Epoch", "Training Accuracy", "Training Recall", "Training F1", "Testing Accuracy", "Testing Recall", "Testing F1"] 548 | 549 | csv_values.append(hist.epoch) # Epoch column 550 | 551 | # Training metrics 552 | csv_values.append(classification_scores.train_acc) # Training Accuracy column 553 | csv_values.append(classification_scores.train_recall) # Training Recall column 554 | csv_values.append(classification_scores.train_f1s) # Training F1 column 555 | 556 | # Testing metrics 557 | csv_values.append(classification_scores.test_acc) # Testing Accuracy column 558 | csv_values.append(classification_scores.test_recall) # Testing Accuracy column 559 | csv_values.append(classification_scores.test_f1s) # Testing Accuracy column 560 | 561 | # Creste file 562 | write_to_csv(filename, csv_columns, zip(*csv_values)) 563 | return 564 | 565 | 566 | def model_best_scores(classification_scores, best_epoch): 567 | """ 568 | Return the metrics from best epoch 569 | 570 | :param classification_scores: Classification_Scores instance used as callback in the model's training 571 | :param best_epoch: Best training epoch index 572 | 573 | :return Best epoch training metrics: ["Best epoch", "Training Accuracy", "Training Recall", "Training F1", "Testing Accuracy", "Testing Recall", "Testing F1"] 574 | """ 575 | best_values = [] 576 | best_values.append(1 + best_epoch) 577 | 578 | best_values.append(classification_scores.train_acc[best_epoch]) 579 | best_values.append(classification_scores.train_recall[best_epoch]) 580 | best_values.append(classification_scores.train_f1s[best_epoch]) 581 | 582 | best_values.append(classification_scores.test_acc[best_epoch]) 583 | best_values.append(classification_scores.test_recall[best_epoch]) 584 | best_values.append(classification_scores.test_f1s[best_epoch]) 585 | 586 | return best_values 587 | 588 | 589 | 590 | def compute_predictions(model, X, y, ind2label, nbrTask=-1): 591 | """ 592 | Compute the predictions and ground truth 593 | 594 | :param model: The model making predictions 595 | :param X: Data 596 | :param y: Ground truth 597 | :param ind2label: Dictionaries of index to labels. Used to return have labels to predictions. 598 | 599 | :return: The predictions and groud truth ready to be compared, flatten (1-d array). 
600 | """ 601 | 602 | # Compute training score 603 | pred = model.predict(X) 604 | if len(model.outputs)>1: # For multi-task 605 | pred = pred[nbrTask] 606 | pred = np.asarray(pred) 607 | # Compute validation score 608 | pred_index = np.argmax(pred, axis=-1) 609 | 610 | # Reverse the one-hot encoding 611 | true_index = np.argmax(y, axis=-1) 612 | 613 | # Index 0 in the predictions referes to padding 614 | ind2labelNew = ind2label.copy() 615 | ind2labelNew.update({0: "null"}) 616 | 617 | # Compute the labels for each prediction 618 | pred_label = [[ind2labelNew[x] for x in a] for a in pred_index] 619 | true_label = [[ind2labelNew[x] for x in b] for b in true_index] 620 | 621 | # Flatten data 622 | predict_flat = np.ravel(pred_label) 623 | targ_flat = np.ravel(true_label) 624 | 625 | return predict_flat, targ_flat 626 | 627 | 628 | 629 | def save_confusion_matrix(y_target, y_predictions, labels, figure_path, figure_size=(20,20)): 630 | """ 631 | Generate two confusion matrices plots: with and without normalization. 632 | 633 | :param y_target: Tags groud truth 634 | :param y_predictions: Tags predictions 635 | :param labels: Predictions classes to use 636 | :param figure_path: Path the save figures 637 | :param figure_size: Size of the generated figures 638 | 639 | :return: Nothing 640 | """ 641 | 642 | # Compute confusion matrices 643 | cnf_matrix = confusion_matrix(y_target, y_predictions) 644 | 645 | # Confusion matrix 646 | plt.figure(figsize=figure_size) 647 | plot_confusion_matrix(cnf_matrix, classes=labels, title='Confusion matrix, without normalization') 648 | plt.savefig("{0}.png".format(figure_path)) 649 | 650 | # Confusion matrix with normalization 651 | plt.figure(figsize=figure_size) 652 | plot_confusion_matrix(cnf_matrix, classes=labels, normalize=True, title='Normalized confusion matrix') 653 | plt.savefig("{0}_normalized.png".format(figure_path)) 654 | 655 | return 656 | 657 | 658 | def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues, printToFile=False): 659 | """ 660 | FROM: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 661 | This function prints and plots the confusion matrix. 662 | Normalization can be applied by setting `normalize=True`. 663 | """ 664 | if normalize: 665 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 666 | if printToFile: print("Normalized confusion matrix") 667 | else: 668 | if printToFile: print('Confusion matrix, without normalization') 669 | 670 | if printToFile: print(cm) 671 | 672 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 673 | plt.title(title) 674 | plt.colorbar() 675 | tick_marks = np.arange(len(classes)) 676 | plt.xticks(tick_marks, classes, rotation=90) 677 | plt.yticks(tick_marks, classes) 678 | 679 | fmt = '.2f' if normalize else 'd' 680 | thresh = cm.max() / 2. 
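# (illustrative) cells whose value exceeds half of the maximum are labelled in white, the others in black,
# so the numbers printed in the next loop stay readable against the colormap.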
681 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 682 | plt.text(j, i, format(cm[i, j], fmt), 683 | horizontalalignment="center", 684 | color="white" if cm[i, j] > thresh else "black") 685 | 686 | plt.tight_layout() 687 | plt.ylabel('True label') 688 | plt.xlabel('Predicted label') 689 | -------------------------------------------------------------------------------- /keras/main_multiTaskLearning.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow 4 | random.seed(42) 5 | np.random.seed(42) 6 | tensorflow.set_random_seed(42) 7 | 8 | # Models and Utils scripts 9 | from code.models import * 10 | from code.utils import * 11 | 12 | 13 | # Load entire data 14 | X_train_w, y_train1_w, y_train2_w, y_train3_w = load_data("dataset/clean_train.txt") # Training data 15 | X_test_w, y_test1_w, y_test2_w, y_test3_w = load_data("dataset/clean_test.txt") # Testing data 16 | X_valid_w, y_valid1_w, y_valid2_w, y_valid3_w = load_data("dataset/clean_valid.txt") # Validation data 17 | 18 | 19 | # Merge digits under the same word 20 | digits_word = "$NUM$" 21 | X_train_w, X_test_w, X_valid_w = mergeDigits([X_train_w, X_test_w, X_valid_w], digits_word) 22 | 23 | # Compute indexes for words+labels in the training data 24 | ukn_words = "out-of-vocabulary" # Out-of-vocabulary words entry in the "words to index" dictionary 25 | word2ind, ind2word = indexData_x(X_train_w, ukn_words) 26 | label2ind1, ind2label1 = indexData_y(y_train1_w) 27 | label2ind2, ind2label2 = indexData_y(y_train2_w) 28 | label2ind3, ind2label3 = indexData_y(y_train3_w) 29 | 30 | print(ind2label1) 31 | print(ind2label2) 32 | print(ind2label3) 33 | 34 | 35 | 36 | # Convert data into indexes data 37 | maxlen = max([len(xx) for xx in X_train_w]) 38 | padding_style = 'pre' # 'pre' or 'post': Style of the padding, in order to have sequence of the same size 39 | X_train = encodePadData_x(X_train_w, word2ind, maxlen, ukn_words, padding_style) 40 | X_test = encodePadData_x(X_test_w, word2ind, maxlen, ukn_words, padding_style) 41 | X_valid = encodePadData_x(X_valid_w, word2ind, maxlen, ukn_words, padding_style) 42 | 43 | y_train1 = encodePadData_y(y_train1_w, label2ind1, maxlen, padding_style) 44 | y_test1 = encodePadData_y(y_test1_w, label2ind1, maxlen, padding_style) 45 | y_valid1 = encodePadData_y(y_valid1_w, label2ind1, maxlen, padding_style) 46 | 47 | y_train2 = encodePadData_y(y_train2_w, label2ind2, maxlen, padding_style) 48 | y_test2 = encodePadData_y(y_test2_w, label2ind2, maxlen, padding_style) 49 | y_valid2 = encodePadData_y(y_valid2_w, label2ind2, maxlen, padding_style) 50 | 51 | y_train3 = encodePadData_y(y_train3_w, label2ind3, maxlen, padding_style) 52 | y_test3 = encodePadData_y(y_test3_w, label2ind3, maxlen, padding_style) 53 | y_valid3 = encodePadData_y(y_valid3_w, label2ind3, maxlen, padding_style) 54 | 55 | 56 | 57 | # Create the character level data 58 | char2ind, maxWords, maxChar = characterLevelIndex(X_train_w, digits_word) 59 | X_train_char = characterLevelData(X_train_w, char2ind, maxWords, maxChar, digits_word, padding_style) 60 | X_test_char = characterLevelData(X_test_w, char2ind, maxWords, maxChar, digits_word, padding_style) 61 | X_valid_char = characterLevelData(X_valid_w, char2ind, maxWords, maxChar, digits_word, padding_style) 62 | 63 | 64 | 65 | 66 | # Model parameters 67 | epoch = 25 68 | batch = 100 69 | dropout = 0.5 70 | lstm_size = 200 71 | 72 | 73 | y_train = [y_train1, y_train2, y_train3] 74 | 
y_test = [y_test1, y_test2, y_test3] 75 | y_valid = [y_valid1, y_valid2, y_valid3] 76 | ind2label = [ind2label1, ind2label2, ind2label3] 77 | 78 | model_name = "multi_task" 79 | 80 | BiLSTM_model(model_name, True, "crf", 81 | [X_train, X_train_char], [X_test, X_test_char], word2ind, maxWords, 82 | y_train, y_test, ind2label, 83 | validation=True, X_valid=[X_valid, X_valid_char], y_valid=y_valid, 84 | pretrained_embedding=True, word_embedding_size=300, 85 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 86 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 87 | gen_confusion_matrix=True, early_stopping_patience=5 88 | ) 89 | 90 | 91 | print("FINITO") -------------------------------------------------------------------------------- /keras/main_threeTasks.py: -------------------------------------------------------------------------------- 1 | import random 2 | import numpy as np 3 | import tensorflow 4 | 5 | # Seed 6 | random.seed(42) 7 | np.random.seed(42) 8 | tensorflow.set_random_seed(42) 9 | 10 | # Models and Utils scripts 11 | from code.models import * 12 | from code.utils import * 13 | 14 | 15 | # Load entire data 16 | X_train_w, y_train1_w, y_train2_w, y_train3_w = load_data("dataset/clean_train.txt") # Training data 17 | X_test_w, y_test1_w, y_test2_w, y_test3_w = load_data("dataset/clean_test.txt") # Testing data 18 | X_valid_w, y_valid1_w, y_valid2_w, y_valid3_w = load_data("dataset/clean_valid.txt") # Validation data 19 | 20 | 21 | # Merge digits under the same word 22 | digits_word = "$NUM$" 23 | X_train_w, X_test_w, X_valid_w = mergeDigits([X_train_w, X_test_w, X_valid_w], digits_word) 24 | 25 | # Compute indexes for words+labels in the training data 26 | ukn_words = "out-of-vocabulary" # Out-of-vocabulary words entry in the "words to index" dictionary 27 | word2ind, ind2word = indexData_x(X_train_w, ukn_words) 28 | label2ind1, ind2label1 = indexData_y(y_train1_w) 29 | label2ind2, ind2label2 = indexData_y(y_train2_w) 30 | label2ind3, ind2label3 = indexData_y(y_train3_w) 31 | 32 | print(ind2label1) 33 | print(ind2label2) 34 | print(ind2label3) 35 | 36 | 37 | 38 | # Convert data into indexes data 39 | maxlen = max([len(xx) for xx in X_train_w]) 40 | padding_style = 'pre' # 'pre' or 'post': Style of the padding, in order to have sequence of the same size 41 | X_train = encodePadData_x(X_train_w, word2ind, maxlen, ukn_words, padding_style) 42 | X_test = encodePadData_x(X_test_w, word2ind, maxlen, ukn_words, padding_style) 43 | X_valid = encodePadData_x(X_valid_w, word2ind, maxlen, ukn_words, padding_style) 44 | 45 | y_train1 = encodePadData_y(y_train1_w, label2ind1, maxlen, padding_style) 46 | y_test1 = encodePadData_y(y_test1_w, label2ind1, maxlen, padding_style) 47 | y_valid1 = encodePadData_y(y_valid1_w, label2ind1, maxlen, padding_style) 48 | 49 | y_train2 = encodePadData_y(y_train2_w, label2ind2, maxlen, padding_style) 50 | y_test2 = encodePadData_y(y_test2_w, label2ind2, maxlen, padding_style) 51 | y_valid2 = encodePadData_y(y_valid2_w, label2ind2, maxlen, padding_style) 52 | 53 | y_train3 = encodePadData_y(y_train3_w, label2ind3, maxlen, padding_style) 54 | y_test3 = encodePadData_y(y_test3_w, label2ind3, maxlen, padding_style) 55 | y_valid3 = encodePadData_y(y_valid3_w, label2ind3, maxlen, padding_style) 56 | 57 | 58 | 59 | # Create the character level data 60 | char2ind, maxWords, maxChar = characterLevelIndex(X_train_w, digits_word) 61 | X_train_char = characterLevelData(X_train_w, char2ind, maxWords, 
maxChar, digits_word, padding_style) 62 | X_test_char = characterLevelData(X_test_w, char2ind, maxWords, maxChar, digits_word, padding_style) 63 | X_valid_char = characterLevelData(X_valid_w, char2ind, maxWords, maxChar, digits_word, padding_style) 64 | 65 | 66 | # Training, Testing and Validation data for the model (word emb + char features) 67 | X_training = [X_train, X_train_char] 68 | X_testing = [X_test, X_test_char] 69 | X_validation = [X_valid, X_valid_char] 70 | 71 | 72 | # Model parameters 73 | epoch = 25 74 | batch = 100 75 | dropout = 0.5 76 | lstm_size = 200 77 | 78 | 79 | 80 | model_name = "task1" 81 | BiLSTM_model(model_name, True, "crf", 82 | X_training, X_testing, word2ind, maxWords, 83 | [y_train1], [y_test1], [ind2label1], 84 | validation=True, X_valid=X_validation, y_valid=[y_valid1], 85 | pretrained_embedding=True, word_embedding_size=300, 86 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 87 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 88 | gen_confusion_matrix=True, early_stopping_patience=5 89 | ) 90 | 91 | print("=====") 92 | 93 | model_name = "task2" 94 | BiLSTM_model(model_name, True, "crf", 95 | X_training, X_testing, word2ind, maxWords, 96 | [y_train2], [y_test2], [ind2label2], 97 | validation=True, X_valid=X_validation, y_valid=[y_valid2], 98 | pretrained_embedding=True, word_embedding_size=300, 99 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 100 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 101 | gen_confusion_matrix=True, early_stopping_patience=5 102 | ) 103 | 104 | print("=====") 105 | 106 | model_name = "task3" 107 | BiLSTM_model(model_name, True, "crf", 108 | X_training, X_testing, word2ind, maxWords, 109 | [y_train3], [y_test3], [ind2label3], 110 | validation=True, X_valid=X_validation, y_valid=[y_valid3], 111 | pretrained_embedding=True, word_embedding_size=300, 112 | maxChar=maxChar, char_embedding_type="BILSTM", char2ind=char2ind, char_embedding_size=100, 113 | lstm_hidden=lstm_size, nbr_epochs=epoch, batch_size=batch, dropout=dropout, 114 | gen_confusion_matrix=True, early_stopping_patience=5 115 | ) 116 | 117 | 118 | print("Done.") 119 | -------------------------------------------------------------------------------- /tensorflow/README.md: -------------------------------------------------------------------------------- 1 | # Tensorflow implementation 2 | 3 | ## How to 4 | A model can be trained using `ref_model.py`. All parameters can be tuned there too (see comments). In order to do this, both the dataset and the pretrained vectors need to be stored in the same locations as for the Keras implementation. 5 | 6 | python ref_model.py 7 | 8 | Once a model is trained, it can be used interactively by calling `play_with.py`. Model selection can be done using `cv_model.py`, which implements a grid search. 9 | 10 | ## Contents 11 | * `README.md` this file. 12 | * `utils/` 13 | * [data utils](utils/data_utils.py) general data utility functions. 14 | * [general utils](utils/general_utils.py) other utility functions. 15 | * [reference parsing model](ref_model.py) contains the main RefModel class discussed in the paper, and can be run to train an instance (assumes the dataset and pretrained vectors are available). 16 | * [cross validation](cv_model.py) contains code to fit multiple models for model selection or fine-tuning (assumes the dataset and pretrained vectors are available).
17 | * [play with](play_with.py) contains code to load a model and use it with an interactive terminal. 18 | 19 | ## Dependencies 20 | * TensorFlow: 1.4.0 21 | * Numpy: 1.13.3 22 | * Sklearn : 0.19.1 23 | * Python 3.5 24 | 25 | ## Future work 26 | * Add a conf file, ideally shared with the implementation in Keras. 27 | * Add a multitask implementation. -------------------------------------------------------------------------------- /tensorflow/cv_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Cross validation for model selection 3 | """ 4 | 5 | import numpy as np 6 | import itertools as it 7 | from collections import OrderedDict 8 | import os 9 | from model.data_utils import build_data, load_vocab, get_processing_word,\ 10 | coNLLDataset_full 11 | from ref_model import RefModel 12 | 13 | # GLOBALS 14 | 15 | # dataset locations and basic configs 16 | filename_dev = "../dataset/clean_test.txt" 17 | filename_test = "../dataset/clean_valid.txt" 18 | filename_train = "../dataset/clean_train.txt" 19 | which_tags = -3 # -1, -2, -3: Ackerman author b-s b-secondary b-r 20 | task_dir = "cv_%d"%which_tags 21 | use_chars = True # parameter to change globally 22 | use_pretrained = True # parameter to change globally 23 | max_iter = None # if not None, max number of examples in Dataset 24 | n_epocs = 25 25 | dim_words = [100,300] # pretrained word embeddings, be they exist! 26 | 27 | # vocabs (created with build_data) 28 | filename_words = "working_dir/words.txt" 29 | filename_words_ext = "working_dir/words_ext.txt" 30 | filename_tags = "working_dir/tags.txt" 31 | filename_chars = "working_dir/chars.txt" 32 | 33 | # build data for all possible models 34 | build_data(filename_dev,filename_test,filename_train,dim_words,filename_words, 35 | filename_words_ext,filename_tags,filename_chars, 36 | filename_word="../pretrained_vectors/vecs_{}.txt", 37 | filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz", 38 | which_tags=which_tags) 39 | 40 | # load vocabs 41 | vocab_words = load_vocab(filename_words) 42 | if use_pretrained: 43 | vocab_words = load_vocab(filename_words_ext) 44 | vocab_tags = load_vocab(filename_tags) 45 | vocab_chars = load_vocab(filename_chars) 46 | nwords = len(vocab_words) 47 | nchars = len(vocab_chars) 48 | ntags = len(vocab_tags) 49 | 50 | # load data 51 | processing_word = get_processing_word(vocab_words, 52 | vocab_chars, lowercase=True, chars=use_chars) 53 | processing_tag = get_processing_word(vocab_tags, 54 | lowercase=False, allow_unk=False) 55 | X_dev, y_dev = coNLLDataset_full(filename_dev, processing_word, processing_tag, max_iter, which_tags) 56 | X_train, y_train = coNLLDataset_full(filename_train, processing_word, processing_tag, max_iter, which_tags) 57 | X_valid, y_valid = coNLLDataset_full(filename_test, processing_word, processing_tag, max_iter, which_tags) 58 | 59 | print("Size of train, test and valid sets (in number of sentences): ") 60 | print(len(X_train), " ", len(y_train), " ", len(X_dev), " ", len(y_dev), " ", len(X_valid), " ", len(y_valid)) 61 | 62 | def train_model(config,conf_id): 63 | """Train, evaluates and reports on a single model 64 | 65 | :param config: (dict) parameter configuration 66 | :param conf_id: (int) id of the configuration to fit 67 | :return: None 68 | """ 69 | 70 | # general config 71 | model_name = str(config) 72 | print("Model configuration:",model_name) 73 | dir_output = "results/%s/%s_%s_%d"%(task_dir,str(use_pretrained),str(use_chars),conf_id) 74 | print("Model 
directory:", dir_output) 75 | os.makedirs(dir_output, exist_ok=True) 76 | os.makedirs(dir_output, exist_ok=True) 77 | with open(os.path.join(dir_output, "config_%s_%s_%d.txt"%(str(use_pretrained),str(use_chars),c)), "w") as f: 78 | f.write(model_name) 79 | dir_model = os.path.join(dir_output, "model.weights") 80 | 81 | model = RefModel(processing_word=processing_word, processing_tag=processing_tag, vocab_chars=vocab_chars, 82 | vocab_words=vocab_words, vocab_tags=vocab_tags, nwords=nwords, nchars=nchars, 83 | ntags=ntags, dir_output=dir_output, dir_model=dir_model, dim_word=config["dim_word"],dim_char=config["dim_char"], 84 | use_pretrained=config["use_pretrained"],train_embeddings=config["train_embeddings"], 85 | dropout=config["dropout"],batch_size=config["batch_size"],lr_method=config["lr_method"],lr=config["lr"], 86 | lr_decay=config["lr_decay"],clip=config["clip"],nepoch_no_imprv=config["nepoch_no_imprv"],l2_reg_lambda=config["l2_reg_lambda"], 87 | hidden_size_char=config["hidden_size_char"],hidden_size_lstm=config["hidden_size_lstm"], 88 | use_crf=config["use_crf"],use_chars=config["use_chars"],use_cnn=config["use_cnn"],random_state=config["random_state"]) 89 | 90 | # fit 91 | fitted = model.fit(X_train, y_train, X_dev, y_dev, n_epocs) 92 | print("Test final f1 score: ", fitted.best_score) 93 | ev_msg = fitted.evaluate(X_valid, y_valid) 94 | 95 | # report 96 | with open(os.path.join("results/%s"%(task_dir),"cv_report.txt"),"a") as f: 97 | f.write("------------\n") 98 | f.write("Model: %s\n"%model_name) 99 | 100 | f.write("Test final f1 score: %f\n"%fitted.best_score) 101 | 102 | # evaluate 103 | f.write("Evaluation: %s\n"%str(ev_msg)) 104 | with open(os.path.join("results/%s"%(task_dir), "cv_report.csv"), "a") as f: 105 | f.write("Model_%s_%s_%d"%(str(use_pretrained),str(use_chars),c)+";"+str(fitted.best_score)+";"+str(ev_msg["f1"])+";"+str(ev_msg["acc"])+";"+str(ev_msg["p"])+";"+str(ev_msg["r"])+"\n") 106 | 107 | if __name__ == "__main__": 108 | 109 | # Param search 110 | # NB use chars or not is decided above, as is the task (which_tags). 
111 | param_distribs = OrderedDict({ 112 | "dim_word" : [100,300], 113 | "dim_char" : [100,300], 114 | "use_pretrained" : [use_pretrained], # see above 115 | "train_embeddings" : [True,False], # only used if use_pretrained is True 116 | "dropout" : [0.5], 117 | "batch_size" : [50], 118 | "lr_method" : ["adam"], 119 | "lr" : [0.001], 120 | "lr_decay" : [0.9], 121 | "clip" : [-1], 122 | "nepoch_no_imprv" : [5], 123 | "l2_reg_lambda" : [0], 124 | "hidden_size_char" : [100], 125 | "hidden_size_lstm" : [300], 126 | "use_crf" : [True,False], 127 | "use_chars" : [use_chars], # see above 128 | "use_cnn" : [True,False], 129 | "random_state" : [0] # reproducibility 130 | }) 131 | 132 | # create a list of configurations 133 | n_configs = np.prod([len(v) for v in param_distribs.values()]) 134 | print("Total number of configurations to try:",n_configs) 135 | 136 | allNames = sorted(param_distribs) 137 | combinations = it.product(*(param_distribs[Name] for Name in allNames)) 138 | combinations = list(combinations) 139 | assert len(combinations)==n_configs 140 | 141 | # initialize report csv file 142 | os.makedirs("results/%s"%(task_dir),exist_ok=True) 143 | if not os.path.isfile(os.path.join("results/%s"%(task_dir), "cv_report.csv")): 144 | with open(os.path.join("results/%s"%(task_dir), "cv_report.csv"), "w") as f: 145 | f.write("model_name;best_f1_test_score;f1_validation;accuracy_validation;precision_validation;recall_validation\n") 146 | 147 | for n,c in enumerate(combinations): 148 | config = {k:v for k,v in zip(allNames,c)} 149 | train_model(config,n) -------------------------------------------------------------------------------- /tensorflow/play_with.py: -------------------------------------------------------------------------------- 1 | """ 2 | Use a pretrained model for predictions 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import os 7 | 8 | from ref_model import RefModel 9 | from model.data_utils import load_vocab, get_processing_word, coNLLDataset_full 10 | 11 | def interactive_shell(model): 12 | """Creates interactive shell to play with model 13 | 14 | :param model: instance of RefModel 15 | """ 16 | print(""" 17 | This is an interactive mode. 18 | To exit, enter 'exit'. 
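The predicted tag for each token is printed underneath your sentence.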
19 | You can enter a sentence like 20 | input> I love Paris""") 21 | 22 | while True: 23 | sentence = input("input> ") 24 | 25 | words_raw = sentence.strip().split() 26 | 27 | if words_raw and words_raw[0] in ["exit","quit","bye","q","stop"]: 28 | break 29 | 30 | preds = model.predict(words_raw) 31 | 32 | print(" ".join(words_raw)) 33 | print(" ".join(preds)) 34 | 35 | # dataset locations and basic configs 36 | filename_dev = "../dataset/clean_test.txt" 37 | filename_test = "../dataset/clean_valid.txt" 38 | filename_train = "../dataset/clean_train.txt" 39 | which_tags = -3 # -1, -2, -3: Ackerman author b-secondary b-r 40 | use_chars = True 41 | max_iter = None # if None, max number of examples in Dataset 42 | 43 | # general config: trained model directory 44 | dir_output = "results/test_run" 45 | dir_model = os.path.join(dir_output, "model.weights") 46 | 47 | # vocabs (created with build_data) 48 | filename_words = "working_dir/words.txt" 49 | filename_words_ext = "working_dir/words_ext.txt" 50 | filename_tags = "working_dir/tags.txt" 51 | filename_chars = "working_dir/chars.txt" 52 | 53 | # load vocabs 54 | vocab_words = load_vocab(filename_words) 55 | vocab_tags = load_vocab(filename_tags) 56 | vocab_chars = load_vocab(filename_chars) 57 | nwords = len(vocab_words) 58 | nchars = len(vocab_chars) 59 | ntags = len(vocab_tags) 60 | 61 | # load data 62 | processing_word = get_processing_word(vocab_words, 63 | vocab_chars, lowercase=True, chars=use_chars) 64 | processing_tag = get_processing_word(vocab_tags, 65 | lowercase=False, allow_unk=False) 66 | model = RefModel(processing_word=processing_word,processing_tag=processing_tag,vocab_chars=vocab_chars, 67 | vocab_words=vocab_words,vocab_tags=vocab_tags,nwords=nwords,nchars=nchars, 68 | ntags=ntags,dir_output=dir_output,dir_model=dir_model,use_chars=use_chars,random_state=0, 69 | use_pretrained=True, hidden_size_char=50, batch_size=100, lr_decay=1, l2_reg_lambda=0, 70 | use_crf=True, use_cnn=False, dim_word=300, hidden_size_lstm=200, lr=0.001, 71 | train_embeddings=True, dim_char=100, lr_method="rmsprop") 72 | 73 | model.restore_session() 74 | interactive_shell(model) -------------------------------------------------------------------------------- /tensorflow/ref_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Reference parsing model 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | import os 9 | from collections import OrderedDict 10 | 11 | from sklearn.base import BaseEstimator, ClassifierMixin 12 | 13 | from model.data_utils import minibatches, pad_sequences, get_chunks, build_data, \ 14 | export_trimmed_word_vectors, load_vocab, get_processing_word, CoNLLDataset, coNLLDataset_full, conv1d 15 | from model.general_utils import Progbar 16 | 17 | class RefModel(BaseEstimator, ClassifierMixin): 18 | """Model for reference parsing""" 19 | 20 | def __init__(self,processing_word,processing_tag,vocab_chars,vocab_words,vocab_tags, 21 | nwords,nchars,ntags,dir_output,dir_model,dim_word=300,dim_char=100,use_pretrained=False,train_embeddings=False, 22 | dropout=0.5,batch_size=50,lr_method="adam",lr=0.001,lr_decay=0.9, 23 | clip=-1,nepoch_no_imprv=10,l2_reg_lambda=0.0,hidden_size_char=100,hidden_size_lstm=300, 24 | use_crf=True,use_chars=True,use_cnn=False,random_state=None): 25 | """ 26 | Initialize the RefModel by simply storing all the hyperparameters. 
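No TensorFlow graph is created at this point: build() is only called later, from fit() or restore_session(), so instances stay cheap to construct (e.g. during the grid search in cv_model.py).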
27 | 28 | :param processing_word: (function) to process words 29 | :param processing_tag: (function) to process tags 30 | :param vocab_chars: (dictionary) of characters 31 | :param vocab_words: (dictionary) of words 32 | :param vocab_tags: (dictionary) of tags 33 | :param nwords: (int) number of words 34 | :param nchars: (int) number of characters 35 | :param ntags: (int) number of tags 36 | :param dir_output: (string) output directory 37 | :param dir_model: (string) model output directory 38 | :param dim_word: (int) dimensionality of word embeddings 39 | :param dim_char: (int) dimensionality of character embeddings 40 | :param use_pretrained: (bool) if to use pretrained embeddings 41 | :param train_embeddings: (bool) if to further train embeddings 42 | :param dropout: (float between 0 and 1) propout percentage 43 | :param batch_size: (int) batch size 44 | :param lr_method: (string) learning method (adagrad, sgd, rmsprop) 45 | :param lr: (float) learning rate 46 | :param lr_decay: (float between 0 and 1) learning rate 47 | :param clip: (float) clip rate 48 | :param nepoch_no_imprv: (int) early stopping number of epochs before interrupting without improvements 49 | :param l2_reg_lambda: (float) lambda for l2 regularization 50 | :param hidden_size_char: (int) size of hidden character lstm layer 51 | :param hidden_size_lstm: (int) size of hidden lstm layer 52 | :param use_crf: (bool) if to use crf prediction 53 | :param use_chars: (bool) if to use characters 54 | :param use_cnn: (bool) if to use cnn over lstm for character embeddings 55 | :param random_state: (int) random state 56 | """ 57 | 58 | # externals 59 | self.processing_word = processing_word 60 | self.processing_tag = processing_tag 61 | self.vocab_chars = vocab_chars 62 | self.vocab_words = vocab_words 63 | self.vocab_tags = vocab_tags 64 | self.nwords = nwords 65 | self.nchars = nchars 66 | self.ntags = ntags 67 | self.dir_output = dir_output 68 | self.dir_model = dir_model 69 | 70 | # embeddings 71 | self.dim_word = dim_word 72 | self.dim_char = dim_char 73 | self.use_pretrained = use_pretrained 74 | self.idx_to_tag = {idx: tag for tag, idx in 75 | self.vocab_tags.items()} 76 | 77 | # training 78 | self.train_embeddings = train_embeddings 79 | self._dropout = dropout 80 | self.batch_size = batch_size 81 | self.lr_method = lr_method 82 | self._lr = lr 83 | self.lr_decay = lr_decay 84 | self.clip = clip # if negative, no clipping 85 | self.nepoch_no_imprv = nepoch_no_imprv 86 | self.l2_reg_lambda = l2_reg_lambda # if 0, no l2 regularization 87 | 88 | # model hyperparameters 89 | self.hidden_size_char = hidden_size_char # lstm on chars 90 | self.hidden_size_lstm = hidden_size_lstm # lstm on word embeddings 91 | 92 | # NOTE: if both chars and crf, only 1.6x slower on GPU 93 | self.use_crf = use_crf # if crf, training is 1.7x slower on CPU 94 | self.use_chars = use_chars # if char embedding, training is 3.5x slower on CPU 95 | self.use_cnn = use_cnn # if to use CNN char embeddings, if not use bi-LSTM 96 | 97 | # embedding files 98 | self._filename_emb = "../pretrained_vectors/vecs_{}.txt".format(self.dim_word) 99 | # trimmed embeddings (created with build_data.py) 100 | self._filename_trimmed = "../pretrained_vectors/vecs_{}.trimmed.npz".format(self.dim_word) 101 | self.embeddings = (export_trimmed_word_vectors(self._filename_trimmed) 102 | if self.use_pretrained else None) 103 | 104 | # extra 105 | self.random_state = random_state 106 | self._session = None 107 | 108 | 109 | def _add_placeholders(self): 110 | """Define placeholder 
entries to computational graph""" 111 | # shape = (batch size, max length of sentence in batch) 112 | self.word_ids = tf.placeholder(tf.int32, shape=[None, None], 113 | name="word_ids") 114 | 115 | # shape = (batch size) 116 | self.sequence_lengths = tf.placeholder(tf.int32, shape=[None], 117 | name="sequence_lengths") 118 | 119 | # shape = (batch size, max length of sentence, max length of word) 120 | self.char_ids = tf.placeholder(tf.int32, shape=[None, None, None], 121 | name="char_ids") 122 | 123 | # shape = (batch_size, max_length of sentence) 124 | self.word_lengths = tf.placeholder(tf.int32, shape=[None, None], 125 | name="word_lengths") 126 | 127 | # shape = (batch size, max length of sentence in batch) 128 | self.labels = tf.placeholder(tf.int32, shape=[None, None], 129 | name="labels") 130 | 131 | # hyper parameters 132 | self.dropout = tf.placeholder(dtype=tf.float32, shape=[], 133 | name="dropout") 134 | self.lr = tf.placeholder(dtype=tf.float32, shape=[], 135 | name="lr") 136 | 137 | # l2 regularization 138 | self.l2_loss = tf.constant(0.0, name="l2_loss") 139 | 140 | 141 | def _get_feed_dict(self, words, labels=None, lr=None, dropout=None): 142 | """ 143 | Given some data, pad it and build a feed dictionary 144 | 145 | :param words: (list) of sentences. A sentence is a list of ids of a list of words. A word is a list of ids 146 | :param labels: (list) of ids 147 | :param lr: (float) learning rate 148 | :param dropout: (float) keep prob 149 | :return: dict {placeholder: value} 150 | """ 151 | 152 | # perform padding of the given data 153 | if self.use_chars: 154 | words = [zip(*w) for w in words] 155 | char_ids,word_ids = zip(*words) 156 | word_ids, sequence_lengths = pad_sequences(word_ids, 0) 157 | char_ids, word_lengths = pad_sequences(char_ids, pad_tok=0, 158 | nlevels=2) 159 | else: 160 | word_ids, sequence_lengths = pad_sequences(words, 0) 161 | 162 | # build feed dictionary 163 | feed = { 164 | self.word_ids: word_ids, 165 | self.sequence_lengths: sequence_lengths 166 | } 167 | 168 | if self.use_chars: 169 | feed[self.char_ids] = char_ids 170 | feed[self.word_lengths] = word_lengths 171 | 172 | if labels is not None: 173 | labels, _ = pad_sequences(labels, 0) 174 | feed[self.labels] = labels 175 | 176 | if lr is not None: 177 | feed[self.lr] = lr 178 | 179 | if dropout is not None: 180 | feed[self.dropout] = dropout 181 | 182 | return feed, sequence_lengths 183 | 184 | 185 | def _add_word_embeddings_op(self): 186 | """Defines self.word_embeddings 187 | 188 | If self.embeddings is not None and is a np array initialized 189 | with pre-trained word vectors, the word embeddings is just a look-up 190 | and we train the vectors if config train_embeddings is True. 191 | Otherwise, a random matrix with the correct shape is initialized. 192 | 193 | Note: add a DropoutWrapper to have dropout within cells. 
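For example (illustrative sizes): with dim_word=300 and hidden_size_char=100, the bi-LSTM character option appends a 2*100 = 200-dimensional character vector to each 300-dimensional word vector, so self.word_embeddings carries 500-dimensional token representations (before dropout).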
194 | """ 195 | 196 | with tf.variable_scope("words"): 197 | if self.embeddings is None: 198 | _word_embeddings = tf.get_variable( 199 | name="_word_embeddings", 200 | dtype=tf.float32, 201 | shape=[self.nwords, self.dim_word]) 202 | else: 203 | _word_embeddings = tf.Variable( 204 | self.embeddings, 205 | name="_word_embeddings", 206 | dtype=tf.float32, 207 | trainable=self.train_embeddings) 208 | 209 | word_embeddings = tf.nn.embedding_lookup(_word_embeddings, 210 | self.word_ids, name="word_embeddings") 211 | 212 | with tf.variable_scope("chars"): 213 | if self.use_chars: 214 | # get char embeddings matrix 215 | _char_embeddings = tf.get_variable( 216 | name="_char_embeddings", 217 | dtype=tf.float32, 218 | shape=[self.nchars, self.dim_char]) 219 | char_embeddings = tf.nn.embedding_lookup(_char_embeddings, 220 | self.char_ids, name="char_embeddings") 221 | 222 | # put the time dimension on axis=1 223 | s = tf.shape(char_embeddings) 224 | # now becomes batch size * max sentence length, char in word, dim_char 225 | char_embeddings = tf.reshape(char_embeddings, 226 | shape=[s[0] * s[1], s[-2], self.dim_char]) 227 | 228 | if self.use_cnn: 229 | widths = [2,3,5] 230 | strides = [1] 231 | outputs = list() 232 | for w in widths: 233 | for st in strides: 234 | with tf.name_scope("conv-maxpool-%d-%d" % (w,st)): 235 | output = conv1d(char_embeddings, self.hidden_size_char, width=w, stride=st) 236 | output = tf.reduce_max(tf.nn.relu(output), 1) # activation and max pooling to have 1 feature vector per word 237 | outputs.append(output) 238 | 239 | # concat output 240 | output = tf.concat(outputs, axis=-1) 241 | 242 | # shape = (batch size, max sentence length, len(widths)*len(strides) * char hidden size) 243 | output = tf.reshape(output, 244 | shape=[s[0], s[1], len(widths)*len(strides) * self.hidden_size_char]) 245 | output = tf.nn.dropout(output, self.dropout) 246 | 247 | else: 248 | # bi-LSTM to learn character embeddings 249 | # reshape word lengths 250 | word_lengths = tf.reshape(self.word_lengths, shape=[s[0]*s[1]]) 251 | 252 | # bi lstm on chars 253 | cell_fw = tf.contrib.rnn.LSTMCell(self.hidden_size_char, 254 | state_is_tuple=True) 255 | cell_bw = tf.contrib.rnn.LSTMCell(self.hidden_size_char, 256 | state_is_tuple=True) 257 | _output = tf.nn.bidirectional_dynamic_rnn( 258 | cell_fw, cell_bw, char_embeddings, 259 | sequence_length=word_lengths, dtype=tf.float32) 260 | 261 | # read and concat output 262 | _, ((_, output_fw), (_, output_bw)) = _output 263 | output = tf.concat([output_fw, output_bw], axis=-1) 264 | 265 | # shape = (batch size, max sentence length, 2*char hidden size) 266 | output = tf.reshape(output, 267 | shape=[s[0], s[1], 2*self.hidden_size_char]) 268 | output = tf.nn.dropout(output, self.dropout) 269 | 270 | word_embeddings = tf.concat([word_embeddings, output], axis=-1) 271 | 272 | self.word_embeddings = tf.nn.dropout(word_embeddings, self.dropout) 273 | 274 | 275 | def _add_logits_op(self): 276 | """Defines self.logits 277 | 278 | For each word in each sentence of the batch, it corresponds to a vector 279 | of scores, of dimension equal to the number of tags. 280 | 281 | Note: add a DropoutWrapper to have dropout within cells. 
282 | """ 283 | 284 | with tf.variable_scope("bi-lstm"): 285 | cell_fw = tf.contrib.rnn.LSTMCell(self.hidden_size_lstm) 286 | cell_bw = tf.contrib.rnn.LSTMCell(self.hidden_size_lstm) 287 | (output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn( 288 | cell_fw, cell_bw, self.word_embeddings, 289 | sequence_length=self.sequence_lengths, dtype=tf.float32) 290 | output = tf.concat([output_fw, output_bw], axis=-1) 291 | output = tf.nn.dropout(output, self.dropout) 292 | 293 | # act here to expand to multiple outputs and to add attention 294 | with tf.variable_scope("pred"): 295 | W = tf.get_variable("W", dtype=tf.float32, 296 | shape=[2*self.hidden_size_lstm, self.ntags]) 297 | 298 | b = tf.get_variable("b", shape=[self.ntags], 299 | dtype=tf.float32, initializer=tf.zeros_initializer()) 300 | # l2 regularization 301 | self.l2_loss += tf.nn.l2_loss(W) 302 | self.l2_loss += tf.nn.l2_loss(b) 303 | 304 | nsteps = tf.shape(output)[1] 305 | output = tf.reshape(output, [-1, 2*self.hidden_size_lstm]) 306 | pred = tf.matmul(output, W) + b 307 | self.logits = tf.reshape(pred, [-1, nsteps, self.ntags]) 308 | 309 | 310 | def _add_pred_op(self): 311 | """Defines self.labels_pred 312 | 313 | This op is defined only in the case where we don't use a CRF since in 314 | that case we can make the prediction "in the graph" (thanks to tf 315 | functions in other words). With CRF, as the inference is coded 316 | in python and not in pure tensorflow, we have to make the prediction 317 | outside the graph. 318 | 319 | Note: this is no longer the case, see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf. 320 | """ 321 | 322 | if not self.use_crf: 323 | self.labels_pred = tf.cast(tf.argmax(self.logits, axis=-1), tf.int32) 324 | 325 | 326 | def _add_loss_op(self): 327 | """Defines the loss""" 328 | 329 | if self.use_crf: 330 | log_likelihood, trans_params = tf.contrib.crf.crf_log_likelihood( 331 | self.logits, self.labels, self.sequence_lengths) 332 | self.trans_params = trans_params # need to evaluate it for decoding 333 | self.loss = tf.reduce_mean(-log_likelihood) + self.l2_reg_lambda * self.l2_loss 334 | else: 335 | losses = tf.nn.sparse_softmax_cross_entropy_with_logits( 336 | logits=self.logits, labels=self.labels) 337 | mask = tf.sequence_mask(self.sequence_lengths) 338 | losses = tf.boolean_mask(losses, mask) 339 | self.loss = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss 340 | 341 | 342 | def _add_train_op(self, lr_method, lr, loss, clip=-1): 343 | """ 344 | Defines self.train_op that performs an update on a batch 345 | 346 | :param lr_method: (string) sgd method, for example "adam" 347 | :param lr: (tf.placeholder) tf.float32, learning rate 348 | :param loss: (tensor) tf.float32 loss to minimize 349 | :param clip: (python float) clipping of gradient. 
If < 0, no clipping 350 | :return: None 351 | """ 352 | 353 | _lr_m = lr_method.lower() # lower to make sure 354 | 355 | with tf.variable_scope("train_step"): 356 | if _lr_m == 'adam': # sgd method 357 | optimizer = tf.train.AdamOptimizer(lr) 358 | elif _lr_m == 'adagrad': 359 | optimizer = tf.train.AdagradOptimizer(lr) 360 | elif _lr_m == 'sgd': 361 | optimizer = tf.train.GradientDescentOptimizer(lr) 362 | elif _lr_m == 'rmsprop': 363 | optimizer = tf.train.RMSPropOptimizer(lr) 364 | else: 365 | raise NotImplementedError("Unknown method {}".format(_lr_m)) 366 | 367 | if clip > 0: # gradient clipping if clip is positive 368 | grads, vs = zip(*optimizer.compute_gradients(loss)) 369 | grads, gnorm = tf.clip_by_global_norm(grads, clip) 370 | self.train_op = optimizer.apply_gradients(zip(grads, vs)) 371 | else: 372 | self.train_op = optimizer.minimize(loss) 373 | 374 | 375 | def _predict_batch(self, words): 376 | """ 377 | Predict for a batch of data 378 | 379 | :param words: (list) of sentences 380 | :return: (list) of labels for each sentence 381 | sequence_length 382 | """ 383 | 384 | fd, sequence_lengths = self._get_feed_dict(words, dropout=1.0) 385 | 386 | if self.use_crf: 387 | # get tag scores and transition params of CRF 388 | viterbi_sequences = [] 389 | logits, trans_params = self._session.run( 390 | [self.logits, self.trans_params], feed_dict=fd) 391 | 392 | # iterate over the sentences because no batching in viterbi_decode 393 | for logit, sequence_length in zip(logits, sequence_lengths): 394 | logit = logit[:sequence_length] # keep only the valid steps 395 | viterbi_seq, viterbi_score = tf.contrib.crf.viterbi_decode( 396 | logit, trans_params) 397 | viterbi_sequences += [viterbi_seq] 398 | 399 | return viterbi_sequences, sequence_lengths 400 | 401 | else: 402 | labels_pred = self._session.run(self.labels_pred, feed_dict=fd) 403 | 404 | return labels_pred, sequence_lengths 405 | 406 | 407 | def _run_epoch(self, X_train, y_train, X_dev, y_dev, epoch): 408 | """ 409 | Performs one complete pass over the train set and evaluate on dev 410 | 411 | :param X_train: (list) with training data 412 | :param y_train: (list) with training labels 413 | :param X_dev: (list) with testing data 414 | :param y_dev: (list) with testing labels 415 | :param epoch: (int) which epoch it is 416 | :return: (python float) score to select model on, higher is better 417 | """ 418 | 419 | # progbar stuff for logging 420 | batch_size = self.batch_size 421 | nbatches = (len(X_train) + batch_size - 1) // batch_size 422 | prog = Progbar(target=nbatches) 423 | 424 | rnd_idx = np.random.permutation(len(X_train)) 425 | for i, rnd_indices in enumerate(np.array_split(rnd_idx, len(X_train) // batch_size)): 426 | words, labels = [X_train[x] for x in list(rnd_indices)], [y_train[y] for y in list(rnd_indices)] 427 | fd, _ = self._get_feed_dict(words, labels, self._lr, self._dropout) 428 | 429 | _, train_loss = self._session.run( 430 | [self.train_op, self.loss], feed_dict=fd) 431 | 432 | prog.update(i + 1, [("train loss", train_loss)]) 433 | 434 | # tensorboard 435 | if i % 10 == 0: 436 | # loss 437 | loss_summary = self._loss_summary.eval(feed_dict=fd) 438 | self._file_writer.add_summary(loss_summary, epoch * nbatches + i) 439 | # train eval 440 | metrics = self._run_evaluate(words, labels) 441 | summary = tf.Summary() 442 | summary.value.add(tag='precision_train', simple_value=metrics["p"]) 443 | summary.value.add(tag='recall_train', simple_value=metrics["r"]) 444 | summary.value.add(tag='f1_train', 
simple_value=metrics["f1"]) 445 | summary.value.add(tag='accuracy_train', simple_value=metrics["acc"]) 446 | self._file_writer.add_summary(summary, epoch * nbatches + i) 447 | # test eval 448 | metrics = self._run_evaluate(X_dev, y_dev) 449 | summary = tf.Summary() 450 | summary.value.add(tag='precision_test', simple_value=metrics["p"]) 451 | summary.value.add(tag='recall_test', simple_value=metrics["r"]) 452 | summary.value.add(tag='f1_test', simple_value=metrics["f1"]) 453 | summary.value.add(tag='accuracy_test', simple_value=metrics["acc"]) 454 | self._file_writer.add_summary(summary, epoch) 455 | 456 | # final epoch test eval 457 | metrics = self._run_evaluate(X_dev, y_dev) 458 | msg = " - ".join(["{} {:04.2f}".format(k, v) 459 | for k, v in metrics.items()]) 460 | print(msg) 461 | 462 | return metrics["f1"] 463 | 464 | 465 | def _run_evaluate(self, X_dev, y_dev): 466 | """ 467 | Evaluates performance on test set 468 | 469 | :param X_dev:(list) with dev data 470 | :param y_dev: (list) with dev labels 471 | :return: (dict) metrics["acc"] = 98.4, ... 472 | """ 473 | 474 | accs = [] 475 | correct_preds, total_correct, total_preds = 0., 0., 0. 476 | 477 | rnd_idx = np.random.permutation(len(X_dev)) 478 | for rnd_indices in np.array_split(rnd_idx, len(X_dev) // self.batch_size): 479 | words, labels = [X_dev[x] for x in list(rnd_indices)], [y_dev[y] for y in list(rnd_indices)] 480 | labels_pred, sequence_lengths = self._predict_batch(words) 481 | 482 | for lab, lab_pred, length in zip(labels, labels_pred, 483 | sequence_lengths): 484 | lab = lab[:length] 485 | lab_pred = lab_pred[:length] 486 | accs += [a==b for (a, b) in zip(lab, lab_pred)] 487 | 488 | lab_chunks = set(get_chunks(lab, self.vocab_tags)) 489 | lab_pred_chunks = set(get_chunks(lab_pred, 490 | self.vocab_tags)) 491 | 492 | correct_preds += len(lab_chunks & lab_pred_chunks) 493 | total_preds += len(lab_pred_chunks) 494 | total_correct += len(lab_chunks) 495 | 496 | p = correct_preds / total_preds if correct_preds > 0 else 0 497 | r = correct_preds / total_correct if correct_preds > 0 else 0 498 | f1 = 2 * p * r / (p + r) if correct_preds > 0 else 0 499 | acc = np.mean(accs) 500 | 501 | return OrderedDict({"acc": 100*acc, "f1": 100*f1, "p": p, "r": r}) 502 | 503 | 504 | def _reinitialize_weights(self, scope_name): 505 | """Reinitializes the weights of a given layer 506 | 507 | :param scope_name: (string) scope of variables to reinitialize 508 | """ 509 | 510 | variables = tf.contrib.framework.get_variables(scope_name) 511 | init = tf.variables_initializer(variables) 512 | self._session.run(init) 513 | 514 | 515 | def _initialize(self): 516 | """Initialize the variables""" 517 | 518 | print("Initializing tf session") 519 | self._init = tf.global_variables_initializer() 520 | self._saver = tf.train.Saver() 521 | 522 | 523 | def restore_session(self): 524 | """Reload weights into session""" 525 | self._graph = tf.Graph() 526 | with self._graph.as_default(): 527 | self.build() 528 | self._session = tf.Session(graph=self._graph) 529 | self._saver.restore(self._session, self.dir_model) 530 | 531 | 532 | def save_session(self): 533 | """Saves session = weights""" 534 | if not os.path.exists(self.dir_model): 535 | os.makedirs(self.dir_model,exist_ok=True) 536 | self._saver.save(self._session, self.dir_model) 537 | 538 | 539 | def add_summary(self): 540 | """Defines variables for Tensorboard""" 541 | self._loss_summary = tf.summary.scalar('loss', self.loss) 542 | self._file_writer = tf.summary.FileWriter(self.dir_output, 543 | 
self._session.graph) 544 | 545 | 546 | def close_session(self): 547 | """Closes the session""" 548 | if self._session: 549 | self._session.close() 550 | 551 | 552 | def _get_model_params(self): 553 | """From: https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb 554 | Get all variable values (used for early stopping, faster than saving to disk)""" 555 | 556 | with self._graph.as_default(): 557 | gvars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES) 558 | return {gvar.op.name: value for gvar, value in zip(gvars, self._session.run(gvars))} 559 | 560 | 561 | def _restore_model_params(self, model_params): 562 | """From: https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb 563 | Set all variables to the given values (for early stopping, faster than loading from disk) 564 | 565 | :param model_params: (dict) parameters of the model to restore 566 | """ 567 | 568 | gvar_names = list(model_params.keys()) 569 | assign_ops = {gvar_name: self._graph.get_operation_by_name(gvar_name + "/Assign") 570 | for gvar_name in gvar_names} 571 | init_values = {gvar_name: assign_op.inputs[1] for gvar_name, assign_op in assign_ops.items()} 572 | fd = {init_values[gvar_name]: model_params[gvar_name] for gvar_name in gvar_names} 573 | self._session.run(assign_ops, feed_dict=fd) 574 | 575 | 576 | def build(self): 577 | """Builds the computational graph""" 578 | 579 | if self.random_state is not None: 580 | tf.set_random_seed(self.random_state) 581 | np.random.seed(self.random_state) 582 | 583 | # specific functions 584 | self._add_placeholders() 585 | self._add_word_embeddings_op() 586 | self._add_logits_op() 587 | self._add_pred_op() 588 | self._add_loss_op() 589 | 590 | # generic functions that add training op and initialize vars 591 | self._add_train_op(self.lr_method, self.lr, self.loss, self.clip) 592 | self._initialize() # initialize vars and saver, session is still not there 593 | 594 | 595 | def fit(self, X, y, X_valid=None, y_valid=None, nepochs=100): 596 | """ 597 | Performs training with early stopping and lr exponential decay 598 | 599 | :param X: (list) data 600 | :param y: (list) labels 601 | :param X_valid: (list) validation data 602 | :param y_valid: (list) validation data 603 | :param nepochs: (int) number of epochs to run for 604 | :return: self (model, instance of RefModel) 605 | """ 606 | 607 | self.close_session() 608 | self._graph = tf.Graph() 609 | with self._graph.as_default(): 610 | self.build() 611 | 612 | self.best_score = 0 613 | nepoch_no_imprv = 5 # for early stopping, this should be passed as a parameter 614 | best_params = None 615 | 616 | self._session = tf.Session(graph=self._graph) 617 | with self._session.as_default(): 618 | self._init.run() 619 | self.add_summary() # tensorboard 620 | for epoch in range(nepochs): 621 | print("Epoch {:} out of {:}".format(epoch + 1, nepochs)) 622 | 623 | score = self._run_epoch(X, y, X_valid, y_valid, epoch) 624 | self._lr *= self.lr_decay # decay learning rate 625 | 626 | # early stopping and saving best parameters 627 | if score >= self.best_score: 628 | best_params = self._get_model_params() 629 | nepoch_no_imprv = 0 630 | self.best_score = score 631 | self.save_session() # save new params 632 | print("- new best score!") 633 | else: 634 | nepoch_no_imprv += 1 635 | if nepoch_no_imprv >= self.nepoch_no_imprv: 636 | print("- early stopping {} epochs without "\ 637 | "improvement".format(nepoch_no_imprv)) 638 | break 639 | 640 | # If we used early stopping then rollback to the best model found 641 | if 
best_params:
642 |             self._restore_model_params(best_params)
643 |         return self
644 | 
645 | 
646 |     def predict(self, words_raw):
647 |         """
648 |         Returns list of predicted tags
649 | 
650 |         :param words_raw: (list) of words (string), just one sentence (no batch)
651 |         :return preds: (list) of tags (string), one for each word in the sentence
652 |         """
653 | 
654 |         words = [[self.processing_word(w)] for w in words_raw]
655 |         pred_ids, _ = self._predict_batch(words)
656 |         preds = [self.idx_to_tag[idx[0]] for idx in pred_ids]
657 | 
658 |         return preds
659 | 
660 | 
661 |     def evaluate(self, X_dev, y_dev):
662 |         """
663 |         Evaluates the model on a held-out set
664 | 
665 |         :param X_dev: (list) dev data
666 |         :param y_dev: (list) dev labels
667 |         :return: (dict) of metrics
668 |         """
669 | 
670 |         metrics = self._run_evaluate(X_dev, y_dev)
671 |         return metrics
672 | 
673 | 
674 | if __name__ == "__main__":
675 | 
676 |     # Example of usage
677 | 
678 |     # dataset locations and basic configs
679 |     filename_dev = "../dataset/clean_test.txt"
680 |     filename_test = "../dataset/clean_valid.txt"
681 |     filename_train = "../dataset/clean_train.txt"
682 |     which_tags = -3  # -3, -2, -1 select the task 1, 2, 3 tag column (token line: Ackerman author b-secondary b-r)
683 |     use_chars = True
684 |     max_iter = None  # if None, use all examples in the dataset
685 | 
686 |     # general config: trained model directory
687 |     dir_output = "results/test_run"
688 |     dir_model = os.path.join(dir_output, "model.weights")
689 | 
690 |     # vocabs (created by the build_data call below)
691 |     filename_words = "working_dir/words.txt"
692 |     filename_words_ext = "working_dir/words_ext.txt"
693 |     filename_tags = "working_dir/tags.txt"
694 |     filename_chars = "working_dir/chars.txt"
695 | 
696 |     # build data (just to test model)
697 |     build_data(filename_dev, filename_test, filename_train, [300], filename_words,
698 |                filename_words_ext, filename_tags, filename_chars,
699 |                filename_word_vec="../pretrained_vectors/vecs_{}.txt",
700 |                filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz",
701 |                which_tags=which_tags)
702 | 
703 |     vocab_words = load_vocab(filename_words)
704 |     vocab_tags = load_vocab(filename_tags)
705 |     vocab_chars = load_vocab(filename_chars)
706 |     nwords = len(vocab_words)
707 |     nchars = len(vocab_chars)
708 |     ntags = len(vocab_tags)
709 | 
710 |     # load data
711 |     processing_word = get_processing_word(vocab_words,
712 |                                           vocab_chars, lowercase=True, chars=use_chars)
713 |     processing_tag = get_processing_word(vocab_tags,
714 |                                          lowercase=False, allow_unk=False)
715 |     X_dev, y_dev = coNLLDataset_full(filename_dev, processing_word, processing_tag, max_iter, which_tags)
716 |     X_train, y_train = coNLLDataset_full(filename_train, processing_word, processing_tag, max_iter, which_tags)
717 |     X_valid, y_valid = coNLLDataset_full(filename_test, processing_word, processing_tag, max_iter, which_tags)
718 | 
719 |     print("Size of train, test and valid sets (in number of sentences): ")
720 |     print(len(X_train), " ", len(y_train), " ", len(X_dev), " ", len(y_dev), " ", len(X_valid), " ", len(y_valid))
721 | 
722 |     model = RefModel(processing_word=processing_word, processing_tag=processing_tag, vocab_chars=vocab_chars,
723 |                      vocab_words=vocab_words, vocab_tags=vocab_tags, nwords=nwords, nchars=nchars,
724 |                      ntags=ntags, dir_output=dir_output, dir_model=dir_model, use_chars=use_chars, random_state=0,
725 |                      use_pretrained=True, hidden_size_char=50, batch_size=100, lr_decay=1, l2_reg_lambda=0,
726 |                      use_crf=True, use_cnn=False, dim_word=300, hidden_size_lstm=200, lr=0.001,
727 |                      train_embeddings=True, dim_char=100, 
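                     # note: lr_decay=1 disables the per-epoch learning-rate decay applied in fit(),
                     # and use_crf=True adds the CRF layer whose transitions are Viterbi-decoded in _predict_batch()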
lr_method="rmsprop") 728 | 729 | fitted = model.fit(X_train, y_train, X_dev, y_dev, 50) 730 | print("Final f1 score: ",fitted.best_score) 731 | print("\nValidation:") 732 | print(str(fitted.evaluate(X_valid, y_valid))) -------------------------------------------------------------------------------- /tensorflow/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dhlab-epfl/LinkedBooksDeepReferenceParsing/9411db4e918baffa361895c50ae9ce2046fafc3c/tensorflow/utils/__init__.py -------------------------------------------------------------------------------- /tensorflow/utils/data_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utilities for dealing with data 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import numpy as np 7 | import tensorflow as tf 8 | from collections import OrderedDict 9 | 10 | # shared global variables 11 | UNK = "$UNK$" 12 | NUM = "$NUM$" 13 | NONE = "o" 14 | 15 | # special error message 16 | class MyIOError(Exception): 17 | def __init__(self, filename): 18 | # custom error message 19 | message = """ 20 | ERROR: Unable to locate file {}. 21 | 22 | FIX: Check that build_data has been called before training. 23 | """.format(filename) 24 | super(MyIOError, self).__init__(message) 25 | 26 | 27 | def build_data(filename_dev,filename_test,filename_train,dim_words,filename_words, 28 | filename_words_ext,filename_tags,filename_chars, 29 | filename_word_vec="../pretrained_vectors/vecs_{}.txt", 30 | filename_word_vec_trimmed="../pretrained_vectors/vecs_{}.trimmed.npz", 31 | which_tags=-1): 32 | """ 33 | Prepares the dataset before training a model. 34 | 35 | :param filename_dev: the file with test data (dev) 36 | :param filename_test: the file with validation data 37 | :param filename_train: the file with train data 38 | :param dim_words: dimensionality of word embeddings 39 | :param filename_words: filename where to put exported word vocabulary 40 | :param filename_words_ext: filename where to put exported word vocabulary 41 | :param filename_tags: filename where to put exported tag vocabulary 42 | :param filename_chars: filename where to put exported char vocabulary 43 | :param filename_word_vec: filename of word vectors 44 | :param filename_word_vec_trimmed: filename where to put exported trimmed word vectors 45 | :param which_tags: which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 46 | :return: None 47 | """ 48 | 49 | processing_word = get_processing_word(lowercase=True) 50 | 51 | # Generators 52 | dev = CoNLLDataset(filename_dev, processing_word, which_tags=which_tags) 53 | test = CoNLLDataset(filename_test, processing_word, which_tags=which_tags) 54 | train = CoNLLDataset(filename_train, processing_word, which_tags=which_tags) 55 | 56 | # Build Word, Char and Tag vocab 57 | vocab_words, vocab_tags = get_vocabs([train, dev, test]) 58 | vocab = vocab_words 59 | vocab.add(UNK) 60 | vocab.add(NUM) 61 | vocab_chars = get_char_vocab(train) 62 | write_vocab(vocab, filename_words) 63 | write_vocab(vocab_tags, filename_tags) 64 | write_vocab(vocab_chars, filename_chars) 65 | 66 | # Export extended vocab 67 | vocab_vec = get_vec_vocab(filename_word_vec.format(dim_words[0])) # pick any, words are the same 68 | vocab = vocab & vocab_vec 69 | write_vocab(vocab, filename_words_ext) 70 | 71 | # Trim vectors 72 | vocab = load_vocab(filename_words) 73 | for dim_word in dim_words: 74 | 
export_trimmed_word_vectors(vocab, filename_word_vec.format(dim_word), 75 | filename_word_vec_trimmed.format(dim_word), dim_word) 76 | 77 | class CoNLLDataset(object): 78 | """Class that iterates over CoNLL Dataset 79 | 80 | __iter__ method yields a tuple (words, tags) 81 | words: list of raw words 82 | tags: list of raw tags 83 | 84 | If processing_word and processing_tag are not None, 85 | optional preprocessing is appplied 86 | 87 | Example: 88 | ```python 89 | data = CoNLLDataset(filename) 90 | for sentence, tags in data: 91 | pass 92 | ``` 93 | 94 | """ 95 | def __init__(self, filename, processing_word=None, processing_tag=None, 96 | max_iter=None, which_tags=-1): 97 | """ 98 | :param filename: path to the file 99 | :param processing_word: (optional) function that takes a word as input 100 | :param processing_tag: (optional) function that takes a tag as input 101 | :param max_iter: (optional) max number of sentences to yield 102 | :param which_tags: (optional) which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 103 | """ 104 | self.filename = filename 105 | self.processing_word = processing_word 106 | self.processing_tag = processing_tag 107 | self.max_iter = max_iter 108 | self.which_tags = which_tags 109 | self.length = None 110 | 111 | 112 | def __iter__(self): 113 | niter = 0 114 | with open(self.filename) as f: 115 | words, tags = [], [] 116 | for line in f: 117 | line = line.strip() 118 | if (len(line) == 0 or line.startswith("-DOCSTART-")): 119 | if len(words) != 0: 120 | niter += 1 121 | if self.max_iter is not None and niter > self.max_iter: 122 | break 123 | yield words, tags 124 | words, tags = [], [] 125 | else: 126 | ls = line.split() 127 | word, tag = ls[0],ls[self.which_tags] 128 | if self.processing_word is not None: 129 | word = self.processing_word(word) 130 | if self.processing_tag is not None: 131 | tag = self.processing_tag(tag) 132 | words += [word] 133 | tags += [tag] 134 | 135 | 136 | def __len__(self): 137 | """Iterates once over the corpus to set and store length""" 138 | if self.length is None: 139 | self.length = 0 140 | for _ in self: 141 | self.length += 1 142 | 143 | return self.length 144 | 145 | 146 | def coNLLDataset_full(filename, processing_word=None, processing_tag=None, max_iter=None, which_tags=-1): 147 | """ 148 | Same as above but simply processes all datasets and returns full lists of X and y in memory (no yield). 
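    A sketch of typical usage (illustrative only; it assumes the vocabularies have already been
    written by build_data and loaded with load_vocab, as in the __main__ block of ref_model.py):

    ```python
    processing_word = get_processing_word(vocab_words, vocab_chars, lowercase=True, chars=True)
    processing_tag = get_processing_word(vocab_tags, lowercase=False, allow_unk=False)
    # which_tags=-3 selects the task 1 tag column
    X, y = coNLLDataset_full("../dataset/clean_train.txt", processing_word,
                             processing_tag, max_iter=None, which_tags=-3)
    # X[0] is the list of processed tokens of the first sequence, y[0] its task 1 tag ids
    ```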
149 | 150 | :param filename: path to the file 151 | :param processing_word: (optional) function that takes a word as input 152 | :param processing_tag: (optional) function that takes a tag as input 153 | :param max_iter: (optional) max number of sentences to yield 154 | :param which_tags: (optional) which tagging scheme to use (-1 -2 -3 or 3 2 1 for task 1 2 3 respectively) 155 | :return X,y: lists of words and tags in sequences 156 | """ 157 | 158 | 159 | X,y = [], [] 160 | 161 | niter = 0 162 | with open(filename) as f: 163 | words, tags = [], [] 164 | for line in f: 165 | line = line.strip() 166 | if (len(line) == 0 or line.startswith("-DOCSTART-")): 167 | if len(words) != 0: 168 | niter += 1 169 | if max_iter is not None and niter > max_iter: 170 | break 171 | X.append(words) 172 | y.append(tags) 173 | words, tags = [], [] 174 | else: 175 | ls = line.split() 176 | word, tag = ls[0],ls[which_tags] 177 | if processing_word is not None: 178 | word = processing_word(word) 179 | if processing_tag is not None: 180 | tag = processing_tag(tag) 181 | words += [word] 182 | tags += [tag] 183 | 184 | return X,y 185 | 186 | 187 | def get_vocabs(datasets): 188 | """ 189 | Build vocabulary from an iterable of datasets objects 190 | 191 | :param datasets: datasets: a list of dataset objects 192 | :return: a set of all the words in the dataset 193 | """ 194 | print("Building vocab...") 195 | vocab_words = set() 196 | vocab_tags = set() 197 | for dataset in datasets: 198 | for words, tags in dataset: 199 | vocab_words.update(words) 200 | vocab_tags.update(tags) 201 | print("- done. {} tokens".format(len(vocab_words))) 202 | return vocab_words, vocab_tags 203 | 204 | 205 | def get_char_vocab(dataset): 206 | """ 207 | Build char vocabulary from an iterable of datasets objects 208 | 209 | :param dataset: dataset: a iterator yielding tuples (sentence, tags) 210 | :return: a set of all the characters in the dataset 211 | """ 212 | vocab_char = set() 213 | for words, _ in dataset: 214 | for word in words: 215 | vocab_char.update(word) 216 | 217 | return vocab_char 218 | 219 | 220 | def get_vec_vocab(filename): 221 | """ 222 | Load vocab from file 223 | 224 | :param filename: filename: path to the word vectors 225 | :return: vocab: set() of strings 226 | """ 227 | print("Building vocab...") 228 | vocab = set() 229 | with open(filename) as f: 230 | for line in f: 231 | word = line.strip().split(' ')[0] 232 | vocab.add(word) 233 | print("- done. {} tokens".format(len(vocab))) 234 | return vocab 235 | 236 | 237 | def write_vocab(vocab, filename): 238 | """ 239 | Writes a vocab to a file, one word per line. 240 | 241 | :param vocab: iterable that yields word 242 | :param filename: path to vocab file 243 | :return: None (write a word per line) 244 | """ 245 | print("Writing vocab...") 246 | with open(filename, "w") as f: 247 | for i, word in enumerate(vocab): 248 | if i != len(vocab) - 1: 249 | f.write("{}\n".format(word)) 250 | else: 251 | f.write(word) 252 | print("- done. 
{} tokens".format(len(vocab))) 253 | 254 | 255 | def load_vocab(filename): 256 | """ 257 | Loads vocab from a file 258 | 259 | :param filename: (string) the format of the file must be one word per line 260 | :return: dict[word] = index 261 | """ 262 | try: 263 | d = OrderedDict() 264 | with open(filename) as f: 265 | for idx, word in enumerate(f): 266 | word = word.strip() 267 | d[word] = idx 268 | 269 | except IOError: 270 | raise MyIOError(filename) 271 | return d 272 | 273 | 274 | def export_trimmed_word_vectors(vocab, word_filename, trimmed_filename, dim): 275 | """ 276 | Saves word vectors in numpy array 277 | 278 | :param vocab: dictionary vocab[word] = index 279 | :param word_filename: a path to a word file 280 | :param trimmed_filename: a path where to store a matrix in npy 281 | :param dim: (int) dimension of embeddings 282 | :return: None 283 | """ 284 | embeddings = np.zeros([len(vocab), dim]) 285 | with open(word_filename) as f: 286 | for line in f: 287 | line = line.strip().split(' ') 288 | word = line[0] 289 | embedding = [float(x) for x in line[1:]] 290 | if word in vocab: 291 | word_idx = vocab[word] 292 | embeddings[word_idx] = np.asarray(embedding) 293 | 294 | np.savez_compressed(trimmed_filename, embeddings=embeddings) 295 | 296 | 297 | def get_trimmed_word_vectors(filename): 298 | """ 299 | Get word vectors 300 | 301 | :param filename: path to the npz file 302 | :return: matrix of embeddings (np array) 303 | """ 304 | try: 305 | with np.load(filename) as data: 306 | return data["embeddings"] 307 | 308 | except IOError: 309 | raise MyIOError(filename) 310 | 311 | 312 | def get_processing_word(vocab_words=None, vocab_chars=None, 313 | lowercase=False, chars=False, allow_unk=True): 314 | """ 315 | Return lambda function that transform a word (string) into list, 316 | or tuple of (list, id) of int corresponding to the ids of the word and 317 | its corresponding characters. 318 | Note that only known chars from train are used (i.e. chars for which we have learned an embedding, and only known words 319 | are used. Unknown words are featured with the UNK word vector. Note that this solution prevents learning new embeddings for them, 320 | because either a word was seen at training, or it is impossible do deal with properly..). 321 | 322 | :param vocab_words: dict[word] = idx 323 | :param vocab_chars: dict[char] = idx 324 | :param lowercase: if to transform to lowercase 325 | :param chars: if to export characters too 326 | :param allow_unk: if to allow for the use of the UNK token 327 | :return: f("cat") = ([12, 4, 32], 12345) 328 | = (list of char ids, word id) 329 | """ 330 | def f(word): 331 | # 0. get chars of words 332 | if vocab_chars is not None and chars == True: 333 | char_ids = [] 334 | for char in word: 335 | # ignore chars out of vocabulary 336 | if char in vocab_chars: 337 | char_ids += [vocab_chars[char]] 338 | 339 | # 1. preprocess word 340 | if lowercase: 341 | word = word.lower() 342 | if word.isdigit(): 343 | word = NUM 344 | 345 | # 2. get id of word 346 | if vocab_words is not None: 347 | if word in vocab_words: 348 | word = vocab_words[word] 349 | else: 350 | if allow_unk: 351 | word = vocab_words[UNK] 352 | else: 353 | raise Exception("Unknow key is not allowed. Check that "\ 354 | "your vocab (tags?) is correct") 355 | 356 | # 3. 
return tuple char ids, word id 357 | if vocab_chars is not None and chars == True: 358 | return char_ids, word 359 | else: 360 | return word 361 | 362 | return f 363 | 364 | 365 | def _pad_sequences(sequences, pad_tok, max_length): 366 | """ 367 | Pads to the right, at the end of the sequence. 368 | 369 | :param sequences: a generator of list or tuple 370 | :param pad_tok: the char to pad with 371 | :param max_length: the maximum length of a sequence 372 | :return: a list of list where each sublist has same length 373 | """ 374 | sequence_padded, sequence_length = [], [] 375 | 376 | for seq in sequences: 377 | seq = list(seq) 378 | seq_ = seq[:max_length] + [pad_tok]*max(max_length - len(seq), 0) 379 | sequence_padded += [seq_] 380 | sequence_length += [min(len(seq), max_length)] 381 | 382 | return sequence_padded, sequence_length 383 | 384 | 385 | def pad_sequences(sequences, pad_tok, nlevels=1): 386 | """ 387 | Pads to the right, at the end of the sequence, at levels 1 (just words) and 2 (both words and characters) 388 | 389 | :param sequences: a generator of list or tuple 390 | :param pad_tok: the char to pad with 391 | :param nlevels: "depth" of padding, for the case where we have characters ids 392 | :return: a list of list where each sublist has same length 393 | """ 394 | if nlevels == 1: 395 | max_length = max(map(lambda x : len(x), sequences)) 396 | sequence_padded, sequence_length = _pad_sequences(sequences, 397 | pad_tok, max_length) 398 | 399 | elif nlevels == 2: 400 | max_length_word = max([max(map(lambda x: len(x), seq)) 401 | for seq in sequences]) 402 | sequence_padded, sequence_length = [], [] 403 | for seq in sequences: 404 | # all words are same length now 405 | sp, sl = _pad_sequences(seq, pad_tok, max_length_word) 406 | sequence_padded += [sp] 407 | sequence_length += [sl] 408 | 409 | max_length_sentence = max(map(lambda x : len(x), sequences)) 410 | sequence_padded, _ = _pad_sequences(sequence_padded, 411 | [pad_tok]*max_length_word, max_length_sentence) 412 | sequence_length, _ = _pad_sequences(sequence_length, 0, 413 | max_length_sentence) 414 | 415 | return sequence_padded, sequence_length 416 | 417 | 418 | def minibatches(data, minibatch_size): 419 | """ 420 | Yields data in minimatches. 421 | 422 | :param data: generator of (sentence, tags) tuples 423 | :param minibatch_size: (int) 424 | :return: list of tuples 425 | """ 426 | x_batch, y_batch = [], [] 427 | for (x, y) in data: 428 | if len(x_batch) == minibatch_size: 429 | yield x_batch, y_batch 430 | x_batch, y_batch = [], [] 431 | 432 | if type(x[0]) == tuple: 433 | x = zip(*x) 434 | x_batch += [x] 435 | y_batch += [y] 436 | 437 | if len(x_batch) != 0: 438 | yield x_batch, y_batch 439 | 440 | 441 | def get_chunk_type(tok, idx_to_tag): 442 | """ 443 | Return chunk type 444 | 445 | :param tok: id of token, ex 4 446 | :param idx_to_tag: dictionary {4: "B-PER", ...} 447 | :return: tuple: "B", "PER" 448 | """ 449 | tag_name = idx_to_tag[tok] 450 | tag_class = tag_name.split('-')[0] 451 | tag_type = tag_name.split('-')[-1] 452 | return tag_class, tag_type 453 | 454 | 455 | def get_chunks(seq, tags): 456 | """ 457 | Given a sequence of tags, group entities and their position 458 | 459 | Example: 460 | seq = [4, 5, 0, 3] 461 | tags = {"B-PER": 4, "I-PER": 5, "B-LOC": 3} 462 | result = [("PER", 0, 2), ("LOC", 3, 4)] 463 | 464 | 465 | :param seq: [4, 4, 0, 0, ...] 
sequence of labels 466 | :param tags: dict["O"] = 4 467 | :return: list of (chunk_type, chunk_start, chunk_end) 468 | """ 469 | default = tags[NONE] 470 | idx_to_tag = {idx: tag for tag, idx in tags.items()} 471 | chunks = [] 472 | chunk_type, chunk_start = None, None 473 | for i, tok in enumerate(seq): 474 | # End of a chunk 1 475 | if tok == default and chunk_type is not None: 476 | # Add a chunk. 477 | chunk = (chunk_type, chunk_start, i) 478 | chunks.append(chunk) 479 | chunk_type, chunk_start = None, None 480 | 481 | # End of a chunk + start of a chunk! 482 | elif tok != default: 483 | tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag) 484 | if chunk_type is None: 485 | chunk_type, chunk_start = tok_chunk_type, i 486 | elif tok_chunk_type != chunk_type or tok_chunk_class == "b": 487 | chunk = (chunk_type, chunk_start, i) 488 | chunks.append(chunk) 489 | chunk_type, chunk_start = tok_chunk_type, i 490 | else: 491 | pass 492 | 493 | # end condition 494 | if chunk_type is not None: 495 | chunk = (chunk_type, chunk_start, len(seq)) 496 | chunks.append(chunk) 497 | 498 | return chunks 499 | 500 | 501 | def conv1d(input_, output_size, width=3, stride=1): 502 | """ 503 | 1d convolution for texts, from: https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f 504 | 505 | :param input_: A tensor of embedded tokens with shape [batch_size,max_length,embedding_size] 506 | :param output_size: The number of feature maps we'd like to calculate 507 | :param width: The filter width 508 | :param stride: The stride 509 | :return: A tensor of the convolved input with shape [batch_size,max_length,output_size] 510 | """ 511 | inputSize = input_.get_shape()[-1] # How many channels on the input (The size of our embedding for instance) 512 | 513 | # This is where we make our text an image of height 1 514 | input_ = tf.expand_dims(input_, axis=1) # Change the shape to [batch_size,1,max_length,embedding_size] 515 | 516 | # Make sure the height of the filter is 1 517 | filter_ = tf.get_variable("conv_filter_%d_%d" % (width,stride), shape=[1, width, inputSize, output_size]) 518 | 519 | # Run the convolution as if this were an image 520 | convolved = tf.nn.conv2d(input_, filter=filter_, strides=[1, 1, stride, 1], padding="SAME") 521 | 522 | # Remove the extra dimension, i.e. 
make the shape [batch_size,max_length,output_size] 523 | result = tf.squeeze(convolved, axis=1) 524 | return result -------------------------------------------------------------------------------- /tensorflow/utils/general_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utilities for dealing with general stuff 3 | Borrows from: https://github.com/guillaumegenthial/sequence_tagging 4 | """ 5 | 6 | import time 7 | import sys 8 | import logging 9 | import numpy as np 10 | 11 | 12 | def get_logger(filename): 13 | """ 14 | Return a logger instance that writes in filename 15 | 16 | :param filename: (string) path to log.txt 17 | :return: (instance of logger) 18 | """ 19 | logger = logging.getLogger('logger') 20 | logger.setLevel(logging.DEBUG) 21 | logging.basicConfig(format='%(message)s', level=logging.DEBUG) 22 | handler = logging.FileHandler(filename) 23 | handler.setLevel(logging.DEBUG) 24 | handler.setFormatter(logging.Formatter( 25 | '%(asctime)s:%(levelname)s: %(message)s')) 26 | logging.getLogger().addHandler(handler) 27 | 28 | return logger 29 | 30 | 31 | class Progbar(object): 32 | """Progbar class copied from keras (https://github.com/fchollet/keras/) 33 | 34 | Displays a progress bar. 35 | Small edit : added strict arg to update 36 | # Arguments 37 | target: Total number of steps expected. 38 | interval: Minimum visual progress update interval (in seconds). 39 | """ 40 | 41 | def __init__(self, target, width=30, verbose=1): 42 | self.width = width 43 | self.target = target 44 | self.sum_values = {} 45 | self.unique_values = [] 46 | self.start = time.time() 47 | self.total_width = 0 48 | self.seen_so_far = 0 49 | self.verbose = verbose 50 | 51 | def update(self, current, values=[], exact=[], strict=[]): 52 | """ 53 | Updates the progress bar. 54 | # Arguments 55 | current: Index of current step. 56 | values: List of tuples (name, value_for_last_step). 57 | The progress bar will display averages for these values. 58 | exact: List of tuples (name, value_for_last_step). 59 | The progress bar will display these values directly. 60 | 61 | :param current: Index of current step. 62 | :param values: List of tuples (name, value_for_last_step). 63 | The progress bar will display averages for these values. 64 | :param exact: List of tuples (name, value_for_last_step). 65 | The progress bar will display these values directly. 
66 | :param strict: 67 | :return: None 68 | """ 69 | 70 | for k, v in values: 71 | if k not in self.sum_values: 72 | self.sum_values[k] = [v * (current - self.seen_so_far), 73 | current - self.seen_so_far] 74 | self.unique_values.append(k) 75 | else: 76 | self.sum_values[k][0] += v * (current - self.seen_so_far) 77 | self.sum_values[k][1] += (current - self.seen_so_far) 78 | for k, v in exact: 79 | if k not in self.sum_values: 80 | self.unique_values.append(k) 81 | self.sum_values[k] = [v, 1] 82 | 83 | for k, v in strict: 84 | if k not in self.sum_values: 85 | self.unique_values.append(k) 86 | self.sum_values[k] = v 87 | 88 | self.seen_so_far = current 89 | 90 | now = time.time() 91 | if self.verbose == 1: 92 | prev_total_width = self.total_width 93 | sys.stdout.write("\b" * prev_total_width) 94 | sys.stdout.write("\r") 95 | 96 | numdigits = int(np.floor(np.log10(self.target))) + 1 97 | barstr = '%%%dd/%%%dd [' % (numdigits, numdigits) 98 | bar = barstr % (current, self.target) 99 | prog = float(current)/self.target 100 | prog_width = int(self.width*prog) 101 | if prog_width > 0: 102 | bar += ('='*(prog_width-1)) 103 | if current < self.target: 104 | bar += '>' 105 | else: 106 | bar += '=' 107 | bar += ('.'*(self.width-prog_width)) 108 | bar += ']' 109 | sys.stdout.write(bar) 110 | self.total_width = len(bar) 111 | 112 | if current: 113 | time_per_unit = (now - self.start) / current 114 | else: 115 | time_per_unit = 0 116 | eta = time_per_unit*(self.target - current) 117 | info = '' 118 | if current < self.target: 119 | info += ' - ETA: %ds' % eta 120 | else: 121 | info += ' - %ds' % (now - self.start) 122 | for k in self.unique_values: 123 | if type(self.sum_values[k]) is list: 124 | info += ' - %s: %.4f' % (k, 125 | self.sum_values[k][0] / max(1, self.sum_values[k][1])) 126 | else: 127 | info += ' - %s: %s' % (k, self.sum_values[k]) 128 | 129 | self.total_width += len(info) 130 | if prev_total_width > self.total_width: 131 | info += ((prev_total_width-self.total_width) * " ") 132 | 133 | sys.stdout.write(info) 134 | sys.stdout.flush() 135 | 136 | if current >= self.target: 137 | sys.stdout.write("\n") 138 | 139 | if self.verbose == 2: 140 | if current >= self.target: 141 | info = '%ds' % (now - self.start) 142 | for k in self.unique_values: 143 | info += ' - %s: %.4f' % (k, 144 | self.sum_values[k][0] / max(1, self.sum_values[k][1])) 145 | sys.stdout.write(info + "\n") 146 | 147 | def add(self, n, values=[]): 148 | self.update(self.seen_so_far+n, values) 149 | 150 | 151 | --------------------------------------------------------------------------------