├── .gitignore ├── README ├── bilou2bio.py ├── bio2bilou.py ├── compare_nested_entities.py ├── conll2eval_nested.py ├── conlleval ├── morpho_dataset.py ├── requirements.txt ├── run_conlleval.sh ├── run_eval_nested.sh ├── tagger.py └── test_run ├── run.sh ├── test.conll └── train.conll /.gitignore: -------------------------------------------------------------------------------- 1 | /__pycache__/ 2 | logs 3 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | Source code: Neural Architectures for Nested NER through Linearization 2 | ====================================================================== 3 | Jana Straková, Milan Straka and Jan Hajič 4 | https://aclweb.org/anthology/papers/P/P19/P19-1527/ 5 | {strakova,straka,hajic}@ufal.mff.cuni.cz 6 | 7 | License 8 | ------- 9 | 10 | Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 11 | Mathematics and Physics, Charles University, Czech Republic. 12 | 13 | This Source Code Form is subject to the terms of the Mozilla Public 14 | License, v. 2.0. If a copy of the MPL was not distributed with this 15 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 16 | 17 | Please cite as: 18 | --------------- 19 | 20 | @inproceedings{strakova-etal-2019-neural, 21 | title = {{Neural Architectures for Nested {NER} through Linearization}}, 22 | author = {Jana Strakov{\'a} and Milan Straka and Jan Haji\v{c}}, 23 | booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, 24 | month = jul, 25 | year = {2019}, 26 | address = {Florence, Italy}, 27 | publisher = {Association for Computational Linguistics}, 28 | url = {https://www.aclweb.org/anthology/P19-1527}, 29 | pages = {5326--5331}, 30 | } 31 | 32 | How to run the tagger 33 | --------------------- 34 | 35 | 1. Install requirements 36 | 37 | pip install -r requirements.txt 38 | 39 | 2. Download the data 40 | 41 | ACE-2004: https://catalog.ldc.upenn.edu/LDC2005T09 42 | ACE-2005: https://catalog.ldc.upenn.edu/LDC2006T06 43 | GENIA: http://www.geniaproject.org/ 44 | 45 | 3. Create inputs 46 | 47 | The input of the tagger is in the CoNLL-2003 BILOU format. CoNLL-2003 shared 48 | task data format is described here: 49 | https://www.clips.uantwerpen.be/conll2003/ner/ . BILOU format is described 50 | here (Ratinov and Roth, 2009): https://www.aclweb.org/anthology/W09-1119 . 51 | 52 | The input format is a CoNLL format, with one token per line, sentences 53 | delimited by empty line. For each token, columns are separated by tabs. First 54 | column is the surface token, second column is lemma, third column is a POS tag 55 | and fourth column is the BILOU encoded NE label. 56 | 57 | For flat corpora (e.g. CoNLL-2003 English and German), the fourth column bears 58 | exactly one NE label, e.g. (example from CoNLL-2003 English): 59 | 60 | -DOCSTART- -docstart- NN O 61 | 62 | EU EU NNP U-ORG 63 | rejects reject VBZ O 64 | German german JJ U-MISC 65 | call call NN O 66 | to to TO O 67 | boycott boycott VB O 68 | British british JJ U-MISC 69 | lamb lamb NN O 70 | . . . O 71 | 72 | For nested NE corpora, the NE tags are linearized (flattened) according to 73 | rules described in the paper, e.g. 
(example from ACE-2004): 74 | 75 | The the DT B-GPE 76 | Chinese chinese JJ I-GPE|U-GPE 77 | government government NN L-GPE 78 | and and CC O 79 | the the DT B-GPE 80 | Australian australian JJ I-GPE|U-GPE 81 | government government NN L-GPE 82 | signed sign VBD O 83 | an an DT O 84 | agreement agreement NN O 85 | today today NN O 86 | , , , O 87 | wherein wherein WRB O 88 | the the DT B-GPE 89 | Australian australian JJ I-GPE|U-GPE 90 | party party NN L-GPE 91 | would would MD O 92 | provide provide VB O 93 | China China NNP U-GPE 94 | with with IN O 95 | a a DT O 96 | preferential preferential JJ O 97 | financial financial JJ O 98 | loan loan NN O 99 | of of IN O 100 | 150 150 CD O 101 | million million CD O 102 | Australian australian JJ U-GPE 103 | dollars dollar NNS O 104 | . . . O 105 | 106 | The lemmatization and POS tagging can be done with e.g. UDPipe 107 | (http://ufal.mff.cuni.cz/udpipe) or with MorphoDiTa 108 | (http://ufal.mff.cuni.cz/morphodita) or with any tool of your choice. If you 109 | don't have any POS tagger or lemmatizer, simply fill the respective columns 110 | with dummy (e.g. "_"). 111 | 112 | 4. Get word embeddings 113 | 114 | - word2vec, 115 | - FastText, 116 | - BERT, 117 | - ELMo, 118 | - Flair 119 | 120 | from sources described in the paper. The input formats are: 121 | 122 | - word2vec: The native word2vec text file. 123 | - FastText: The native FastText binary. 124 | - contextualized embeddings (BERT, ELMo, Flair): A text file with one token per 125 | line, first column is the token, all other columns are the vector real valued 126 | numbers; columns separated with space. The format is readable for human eyes, 127 | but quite large, sorry for the inconvenience. The per-token BERT 128 | contextualized word embeddings are created as an average of all token 129 | corresponding BERT subowords. The ELMo and Flair are generated using this 130 | code: https://github.com/zalandoresearch/flair. 131 | 132 | You can also run the tagger without pretrained word embeddings just with 133 | end-to-end word embeddings and character-level embeddings (created inside the 134 | tagger), or with a subset of the above mentioned pretrained word embeddings. 135 | 136 | 5. Run the tagger 137 | 138 | Usage example: 139 | 140 | ./tagger.py --corpus=CoNLL_en --train_data=conll_en/train_dev_bilou.conll --test_data=conll_en/test_bilou.conll --decoding=seq2seq --epochs=10:1e-3,8:1e-4 --form_wes_model=word_embeddings/conll_en_form.txt --lemma_wes_model=word_embeddings/conll_en_lemma.txt --bert_embeddings_train=bert_embeddings/conll_en_train_dev_bert_large_embeddings.txt --bert_embeddings_test=bert_embeddings/conll_en_test_bert_large_embeddings.txt --flair_train=flair_embeddings/conll_en_train_dev.txt --flair_test=flair_embeddings/conll_en_test.txt --elmo_train=elmo_embeddings/conll_en_train_dev.txt --elmo_test=elmo_embeddings/conll_en_test.txt --name=seq2seq+ELMo+BERT+Flair 141 | -------------------------------------------------------------------------------- /bilou2bio.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2018 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
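#
# Usage sketch (illustrative file names): a plain stdin/stdout filter over a
# four-column (FORM, LEMMA, POS, LABEL) CoNLL file, e.g.
#   ./bilou2bio.py < system_bilou.conll > system_bio.conll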
10 | 11 | """Converts CoNLL file from BILOU to BIO encoding.""" 12 | 13 | import sys 14 | 15 | 16 | if __name__ == "__main__": 17 | import argparse 18 | 19 | lines = [] 20 | for line in sys.stdin: 21 | line = line.rstrip("\r\n") 22 | if not line: 23 | print() 24 | else: 25 | form, lemma, tag, label = line.split("\t") 26 | if label.startswith("U-"): 27 | label = label.replace("U-", "B-") 28 | if label.startswith("L-"): 29 | label = label.replace("L-", "I-") 30 | print("\t".join([form, lemma, tag, label])) 31 | -------------------------------------------------------------------------------- /bio2bilou.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2018 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Converts CoNLL file from BIO to BILOU encoding.""" 12 | 13 | import sys 14 | 15 | def print_entity(lines): 16 | n = len(lines) 17 | if n > 0: 18 | if n == 1: 19 | lines[0][3] = lines[0][3].replace("I-","U-") 20 | lines[0][3] = lines[0][3].replace("B-","U-") 21 | else: 22 | lines[0][3] = lines[0][3].replace("I-", "B-") 23 | lines[n-1][3] = lines[n-1][3].replace("I-","L-") 24 | for i in range(n): 25 | print("\t".join(lines[i])) 26 | 27 | if __name__ == "__main__": 28 | import argparse 29 | 30 | lines = [] 31 | prev_label = "O" 32 | i = 0 33 | for line in sys.stdin: 34 | line = line.rstrip("\r\n") 35 | i += 1 36 | if not line: 37 | print_entity(lines) 38 | lines = [] 39 | prev_label = "O" 40 | print() 41 | else: 42 | if len(line.split("\t")) != 4: 43 | print("Incorrect line number " + str(i)) 44 | sys.exit(1) 45 | form, lemma, tag, label = line.split("\t") 46 | # no entity, entity may have ended on previous lines 47 | if label == "O": 48 | print_entity(lines) 49 | lines = [] 50 | print("\t".join([form, lemma, tag, label])) 51 | # new entity starts here, entity may have ended on previous lines 52 | elif label[-2:] != prev_label[-2:]: 53 | print_entity(lines) 54 | lines = [] 55 | lines.append([form, lemma, tag, label]) 56 | # other 57 | else: 58 | lines.append([form, lemma, tag, label]) 59 | prev_label = label 60 | -------------------------------------------------------------------------------- /compare_nested_entities.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Evaluates nested entity predictions. 12 | 13 | The predictions are supposed to be in the following format: 14 | 15 | One entity mention per line, two columns per line separated by table. First 16 | column are entity mention token ids separated by comma, second column is 17 | a BIO or BILOU label. Only classes are compared, the B-, I-, L- and U- 18 | prefixes are stripped. 
19 | """ 20 | 21 | 22 | import sys 23 | 24 | 25 | if __name__ == "__main__": 26 | 27 | with open(sys.argv[1], "r", encoding="utf-8") as fr: 28 | gold_entities = fr.readlines() 29 | for i in range(len(gold_entities)): 30 | gold_entities[i] = gold_entities[i].split("\t")[:2] 31 | 32 | with open(sys.argv[2], "r", encoding="utf-8") as fr: 33 | system_entities = fr.readlines() 34 | for i in range(len(system_entities)): 35 | system_entities[i] = system_entities[i].split("\t")[:2] 36 | 37 | correct_retrieved = 0 38 | for entity in system_entities: 39 | if entity in gold_entities: 40 | correct_retrieved += 1 41 | 42 | recall = correct_retrieved / len(gold_entities) if gold_entities else 0 43 | precision = correct_retrieved / len(system_entities) if system_entities else 0 44 | f1 = (2 * recall * precision) / (recall + precision) if recall+precision else 0 45 | 46 | print("Correct retrieved: {}".format(correct_retrieved)) 47 | print("Retrieved: {}".format(len(system_entities))) 48 | print("Gold: {}".format(len(gold_entities))) 49 | print("Recall: {:.2f}".format(recall*100)) 50 | print("Precision: {:.2f}".format(precision*100)) 51 | print("F1: {:.2f}".format(f1*100)) 52 | -------------------------------------------------------------------------------- /conll2eval_nested.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Creates an evaluation file with named entities. 12 | 13 | Input: CoNLL file with linearized (encoded) nested named entity labels 14 | delimited with |. 15 | 16 | Output: One entity mention per line, two columns per line separated by table. 17 | First column are entity mentino token ids separated by comma, second column is 18 | a BIO or BILOU label. 19 | 20 | The output can be then evaluated with compare_nested_entities.py. 
21 | """ 22 | 23 | import sys 24 | 25 | COL_SEP = "\t" 26 | 27 | def raw(label): 28 | return label[2:] 29 | 30 | def flush(running_ids, running_forms, running_labels): 31 | for i in range(len(running_ids)): 32 | print(running_ids[i] + COL_SEP + running_labels[i] + COL_SEP + running_forms[i]) 33 | return ([], [], []) 34 | 35 | if __name__ == "__main__": 36 | 37 | i = 0 38 | running_ids = [] 39 | running_forms = [] 40 | running_labels = [] 41 | for line in sys.stdin: 42 | line = line.rstrip("\r\n") 43 | if not line: # flush entities 44 | (running_ids, running_forms, running_labels) = flush(running_ids, running_forms, running_labels) 45 | else: 46 | form , _, _, ne = line.split("\t") 47 | if ne == "O": # flush entities 48 | (running_ids, running_forms, running_labels) = flush(running_ids, running_forms, running_labels) 49 | else: 50 | labels = ne.split("|") 51 | for j in range(len(labels)): # for each label 52 | label = labels[j] 53 | if j < len(running_ids): # running entity 54 | # previous running entity ends here, print and insert new entity instead 55 | if label.startswith("B-") or label.startswith("U-") or running_labels[j] != raw(label): 56 | print(running_ids[j] + COL_SEP + running_labels[j] + COL_SEP + running_forms[j]) 57 | running_ids[j] = str(i) 58 | running_forms[j] = form 59 | # entity continues, append ids and forms 60 | else: 61 | running_ids[j] += "," + str(i) 62 | running_forms[j] += " " + form 63 | running_labels[j] = raw(label) 64 | else: # no running entities, new entity starts here, just append 65 | running_ids.append(str(i)) 66 | running_forms.append(form) 67 | running_labels.append(raw(label)) 68 | 69 | i += 1 70 | -------------------------------------------------------------------------------- /conlleval: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | # conlleval: evaluate result of processing CoNLL-2000 shared task 3 | # usage: conlleval [-l] [-r] [-d delimiterTag] [-o oTag] < file 4 | # README: http://cnts.uia.ac.be/conll2000/chunking/output.html 5 | # options: l: generate LaTeX output for tables like in 6 | # http://cnts.uia.ac.be/conll2003/ner/example.tex 7 | # r: accept raw result tags (without B- and I- prefix; 8 | # assumes one word per chunk) 9 | # d: alternative delimiter tag (default is single space) 10 | # o: alternative outside tag (default is O) 11 | # note: the file should contain lines with items separated 12 | # by $delimiter characters (default space). The final 13 | # two items should contain the correct tag and the 14 | # guessed tag in that order. Sentences should be 15 | # separated from each other by empty lines or lines 16 | # with $boundary fields (default -X-). 17 | # url: http://lcg-www.uia.ac.be/conll2000/chunking/ 18 | # started: 1998-09-25 19 | # version: 2004-01-26 20 | # author: Erik Tjong Kim Sang 21 | 22 | use strict; 23 | 24 | my $false = 0; 25 | my $true = 42; 26 | 27 | my $boundary = "-X-"; # sentence boundary 28 | my $correct; # current corpus chunk tag (I,O,B) 29 | my $correctChunk = 0; # number of correctly identified chunks 30 | my $correctTags = 0; # number of correct chunk tags 31 | my $correctType; # type of current corpus chunk tag (NP,VP,etc.) 
32 | my $delimiter = " "; # field delimiter 33 | my $FB1 = 0.0; # FB1 score (Van Rijsbergen 1979) 34 | my $firstItem; # first feature (for sentence boundary checks) 35 | my $foundCorrect = 0; # number of chunks in corpus 36 | my $foundGuessed = 0; # number of identified chunks 37 | my $guessed; # current guessed chunk tag 38 | my $guessedType; # type of current guessed chunk tag 39 | my $i; # miscellaneous counter 40 | my $inCorrect = $false; # currently processed chunk is correct until now 41 | my $lastCorrect = "O"; # previous chunk tag in corpus 42 | my $latex = 0; # generate LaTeX formatted output 43 | my $lastCorrectType = ""; # type of previously identified chunk tag 44 | my $lastGuessed = "O"; # previously identified chunk tag 45 | my $lastGuessedType = ""; # type of previous chunk tag in corpus 46 | my $lastType; # temporary storage for detecting duplicates 47 | my $line; # line 48 | my $nbrOfFeatures = -1; # number of features per line 49 | my $precision = 0.0; # precision score 50 | my $oTag = "O"; # outside tag, default O 51 | my $raw = 0; # raw input: add B to every token 52 | my $recall = 0.0; # recall score 53 | my $tokenCounter = 0; # token counter (ignores sentence breaks) 54 | 55 | my %correctChunk = (); # number of correctly identified chunks per type 56 | my %foundCorrect = (); # number of chunks in corpus per type 57 | my %foundGuessed = (); # number of identified chunks per type 58 | 59 | my @features; # features on line 60 | my @sortedTypes; # sorted list of chunk type names 61 | 62 | # sanity check 63 | while (@ARGV and $ARGV[0] =~ /^-/) { 64 | if ($ARGV[0] eq "-l") { $latex = 1; shift(@ARGV); } 65 | elsif ($ARGV[0] eq "-r") { $raw = 1; shift(@ARGV); } 66 | elsif ($ARGV[0] eq "-d") { 67 | shift(@ARGV); 68 | if (not defined $ARGV[0]) { 69 | die "conlleval: -d requires delimiter character"; 70 | } 71 | $delimiter = shift(@ARGV); 72 | } elsif ($ARGV[0] eq "-o") { 73 | shift(@ARGV); 74 | if (not defined $ARGV[0]) { 75 | die "conlleval: -o requires delimiter character"; 76 | } 77 | $oTag = shift(@ARGV); 78 | } else { die "conlleval: unknown argument $ARGV[0]\n"; } 79 | } 80 | if (@ARGV) { die "conlleval: unexpected command line argument\n"; } 81 | # process input 82 | while () { 83 | chomp($line = $_); 84 | @features = split(/$delimiter/,$line); 85 | if ($nbrOfFeatures < 0) { $nbrOfFeatures = $#features; } 86 | elsif ($nbrOfFeatures != $#features and @features != 0) { 87 | printf STDERR "unexpected number of features: %d (%d)\n", 88 | $#features+1,$nbrOfFeatures+1; 89 | exit(1); 90 | } 91 | if (@features == 0 or 92 | $features[0] eq $boundary) { @features = ($boundary,"O","O"); } 93 | if (@features < 2) { 94 | die "conlleval: unexpected number of features in line $line\n"; 95 | } 96 | if ($raw) { 97 | if ($features[$#features] eq $oTag) { $features[$#features] = "O"; } 98 | if ($features[$#features-1] eq $oTag) { $features[$#features-1] = "O"; } 99 | if ($features[$#features] ne "O") { 100 | $features[$#features] = "B-$features[$#features]"; 101 | } 102 | if ($features[$#features-1] ne "O") { 103 | $features[$#features-1] = "B-$features[$#features-1]"; 104 | } 105 | } 106 | # 20040126 ET code which allows hyphens in the types 107 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) { 108 | $guessed = $1; 109 | $guessedType = $2; 110 | } else { 111 | $guessed = $features[$#features]; 112 | $guessedType = ""; 113 | } 114 | pop(@features); 115 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) { 116 | $correct = $1; 117 | $correctType = $2; 118 | } else { 119 | $correct = 
$features[$#features]; 120 | $correctType = ""; 121 | } 122 | pop(@features); 123 | # ($guessed,$guessedType) = split(/-/,pop(@features)); 124 | # ($correct,$correctType) = split(/-/,pop(@features)); 125 | $guessedType = $guessedType ? $guessedType : ""; 126 | $correctType = $correctType ? $correctType : ""; 127 | $firstItem = shift(@features); 128 | 129 | # 1999-06-26 sentence breaks should always be counted as out of chunk 130 | if ( $firstItem eq $boundary ) { $guessed = "O"; } 131 | 132 | if ($inCorrect) { 133 | if ( &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 134 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 135 | $lastGuessedType eq $lastCorrectType) { 136 | $inCorrect=$false; 137 | $correctChunk++; 138 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 139 | $correctChunk{$lastCorrectType}+1 : 1; 140 | } elsif ( 141 | &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) != 142 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) or 143 | $guessedType ne $correctType ) { 144 | $inCorrect=$false; 145 | } 146 | } 147 | 148 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 149 | &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 150 | $guessedType eq $correctType) { $inCorrect = $true; } 151 | 152 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) ) { 153 | $foundCorrect++; 154 | $foundCorrect{$correctType} = $foundCorrect{$correctType} ? 155 | $foundCorrect{$correctType}+1 : 1; 156 | } 157 | if ( &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) ) { 158 | $foundGuessed++; 159 | $foundGuessed{$guessedType} = $foundGuessed{$guessedType} ? 160 | $foundGuessed{$guessedType}+1 : 1; 161 | } 162 | if ( $firstItem ne $boundary ) { 163 | if ( $correct eq $guessed and $guessedType eq $correctType ) { 164 | $correctTags++; 165 | } 166 | $tokenCounter++; 167 | } 168 | 169 | $lastGuessed = $guessed; 170 | $lastCorrect = $correct; 171 | $lastGuessedType = $guessedType; 172 | $lastCorrectType = $correctType; 173 | } 174 | if ($inCorrect) { 175 | $correctChunk++; 176 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 177 | $correctChunk{$lastCorrectType}+1 : 1; 178 | } 179 | 180 | if (not $latex) { 181 | # compute overall precision, recall and FB1 (default values are 0.0) 182 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 183 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 184 | $FB1 = 2*$precision*$recall/($precision+$recall) 185 | if ($precision+$recall > 0); 186 | 187 | # print overall performance 188 | printf "processed $tokenCounter tokens with $foundCorrect phrases; "; 189 | printf "found: $foundGuessed phrases; correct: $correctChunk.\n"; 190 | if ($tokenCounter>0) { 191 | printf "accuracy: %6.2f%%; ",100*$correctTags/$tokenCounter; 192 | printf "precision: %6.2f%%; ",$precision; 193 | printf "recall: %6.2f%%; ",$recall; 194 | printf "FB1: %6.2f\n",$FB1; 195 | } 196 | } 197 | 198 | # sort chunk type names 199 | undef($lastType); 200 | @sortedTypes = (); 201 | foreach $i (sort (keys %foundCorrect,keys %foundGuessed)) { 202 | if (not($lastType) or $lastType ne $i) { 203 | push(@sortedTypes,($i)); 204 | } 205 | $lastType = $i; 206 | } 207 | # print performance per chunk type 208 | if (not $latex) { 209 | for $i (@sortedTypes) { 210 | $correctChunk{$i} = $correctChunk{$i} ? 
$correctChunk{$i} : 0; 211 | if (not($foundGuessed{$i})) { $foundGuessed{$i} = 0; $precision = 0.0; } 212 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 213 | if (not($foundCorrect{$i})) { $recall = 0.0; } 214 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 215 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 216 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 217 | printf "%17s: ",$i; 218 | printf "precision: %6.2f%%; ",$precision; 219 | printf "recall: %6.2f%%; ",$recall; 220 | printf "FB1: %6.2f %d\n",$FB1,$foundGuessed{$i}; 221 | } 222 | } else { 223 | print " & Precision & Recall & F\$_{\\beta=1} \\\\\\hline"; 224 | for $i (@sortedTypes) { 225 | $correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0; 226 | if (not($foundGuessed{$i})) { $precision = 0.0; } 227 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 228 | if (not($foundCorrect{$i})) { $recall = 0.0; } 229 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 230 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 231 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 232 | printf "\n%-7s & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\", 233 | $i,$precision,$recall,$FB1; 234 | } 235 | print "\\hline\n"; 236 | $precision = 0.0; 237 | $recall = 0; 238 | $FB1 = 0.0; 239 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 240 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 241 | $FB1 = 2*$precision*$recall/($precision+$recall) 242 | if ($precision+$recall > 0); 243 | printf "Overall & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\\\hline\n", 244 | $precision,$recall,$FB1; 245 | } 246 | 247 | exit 0; 248 | 249 | # endOfChunk: checks if a chunk ended between the previous and current word 250 | # arguments: previous and current chunk tags, previous and current types 251 | # note: this code is capable of handling other chunk representations 252 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 253 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 254 | 255 | sub endOfChunk { 256 | my $prevTag = shift(@_); 257 | my $tag = shift(@_); 258 | my $prevType = shift(@_); 259 | my $type = shift(@_); 260 | my $chunkEnd = $false; 261 | 262 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkEnd = $true; } 263 | if ( $prevTag eq "B" and $tag eq "O" ) { $chunkEnd = $true; } 264 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkEnd = $true; } 265 | if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; } 266 | 267 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkEnd = $true; } 268 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkEnd = $true; } 269 | if ( $prevTag eq "E" and $tag eq "O" ) { $chunkEnd = $true; } 270 | if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; } 271 | 272 | if ($prevTag ne "O" and $prevTag ne "." 
and $prevType ne $type) { 273 | $chunkEnd = $true; 274 | } 275 | 276 | # corrected 1998-12-22: these chunks are assumed to have length 1 277 | if ( $prevTag eq "]" ) { $chunkEnd = $true; } 278 | if ( $prevTag eq "[" ) { $chunkEnd = $true; } 279 | 280 | return($chunkEnd); 281 | } 282 | 283 | # startOfChunk: checks if a chunk started between the previous and current word 284 | # arguments: previous and current chunk tags, previous and current types 285 | # note: this code is capable of handling other chunk representations 286 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 287 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 288 | 289 | sub startOfChunk { 290 | my $prevTag = shift(@_); 291 | my $tag = shift(@_); 292 | my $prevType = shift(@_); 293 | my $type = shift(@_); 294 | my $chunkStart = $false; 295 | 296 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkStart = $true; } 297 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkStart = $true; } 298 | if ( $prevTag eq "O" and $tag eq "B" ) { $chunkStart = $true; } 299 | if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; } 300 | 301 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkStart = $true; } 302 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkStart = $true; } 303 | if ( $prevTag eq "O" and $tag eq "E" ) { $chunkStart = $true; } 304 | if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; } 305 | 306 | if ($tag ne "O" and $tag ne "." and $prevType ne $type) { 307 | $chunkStart = $true; 308 | } 309 | 310 | # corrected 1998-12-22: these chunks are assumed to have length 1 311 | if ( $tag eq "[" ) { $chunkStart = $true; } 312 | if ( $tag eq "]" ) { $chunkStart = $true; } 313 | 314 | return($chunkStart); 315 | } 316 | -------------------------------------------------------------------------------- /morpho_dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """MorphoDataset class to handle NE tagged data.""" 12 | 13 | import numpy as np 14 | 15 | 16 | class MorphoDataset: 17 | """Class capable of loading morphological datasets in vertical format. 18 | The dataset is assumed to be composed of factors (by default FORMS, LEMMAS, POS and TAGS), 19 | each an object containing the following fields: 20 | - strings: Strings of the original words. 21 | - word_ids: Word ids of the original words (uses and ). 22 | - words_map: String -> word_id map. 23 | - words: Word_id -> string list. 24 | - alphabet_map: Character -> char_id map. 25 | - alphabet: Char_id -> character list. 26 | - charseq_ids: Character_sequence ids of the original words. 27 | - charseqs_map: String -> character_sequence_id map. 28 | - charseqs: Character_sequence_id -> [characters], where character is an index 29 | to the dataset alphabet. 
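
The expected input is the vertical CoNLL format described in the README: one
token per line with tab-separated FORM, LEMMA, POS and TAG columns, e.g.
"EU<TAB>EU<TAB>NNP<TAB>U-ORG". With seq2seq=True the TAG column may hold
several "|"-delimited labels, which are split into separate tag tokens
followed by a special end-of-tags marker.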
30 | """ 31 | FORMS = 0 32 | LEMMAS = 1 33 | POS = 2 34 | TAGS = 3 35 | FACTORS = 4 36 | 37 | class _Factor: 38 | def __init__(self, train=None): 39 | self.words_map = train.words_map if train else {'': 0, '': 1, '': 2, '': 3} 40 | self.words = train.words if train else ['', '', '', ''] 41 | self.word_ids = [] 42 | self.alphabet_map = train.alphabet_map if train else {'': 0, '': 1, '': 2, '': 3} 43 | self.alphabet = train.alphabet if train else ['', '', '', ''] 44 | self.charseqs_map = {'': 0} 45 | self.charseqs = [[self.alphabet_map['']]] 46 | self.charseq_ids = [] 47 | self.strings = [] 48 | 49 | def __init__(self, filename, train=None, shuffle_batches=True, max_sentences=None, add_bow_eow=False, seq2seq=False, bert_embeddings_filename=None, flair_filename=None, elmo_filename=None): 50 | """Load dataset from file in vertical format. 51 | Arguments: 52 | add_bow_eow: Whether to add BOW/EOW characters to the word characters. 53 | seq2seq: Multiple labels may be predicted. 54 | train: If given, the words and alphabets are reused from the training data. 55 | """ 56 | 57 | # Create alphabet map 58 | self._alphabet_map = train._alphabet_map if train else {'': 0, '': 1, '': 2, '': 3} 59 | self._alphabet = train._alphabet if train else ['', '', '', ''] 60 | 61 | # Create word maps 62 | self._factors = [] 63 | for f in range(self.FACTORS): 64 | self._factors.append(self._Factor(train._factors[f] if train else None)) 65 | 66 | # Load the sentences 67 | with open(filename, "r", encoding="utf-8") as file: 68 | in_sentence = False 69 | for line in file: 70 | line = line.rstrip("\r\n") 71 | if line: 72 | columns = line.split("\t") 73 | for f in range(self.FACTORS): 74 | factor = self._factors[f] 75 | if not in_sentence: 76 | factor.word_ids.append([]) 77 | factor.charseq_ids.append([]) 78 | factor.strings.append([]) 79 | column = columns[f] if f < len(columns) else '' 80 | words = [] 81 | if f == self.TAGS and seq2seq: 82 | words = column.split("|") 83 | words.append("") 84 | else: 85 | words = [column] 86 | for word in words: 87 | factor.strings[-1].append(word) 88 | 89 | # Character-level information 90 | if word not in factor.charseqs_map: 91 | factor.charseqs_map[word] = len(factor.charseqs) 92 | factor.charseqs.append([]) 93 | if add_bow_eow: 94 | factor.charseqs[-1].append(factor.alphabet_map['']) 95 | for c in word: 96 | if c not in factor.alphabet_map: 97 | if train: 98 | c = '' 99 | else: 100 | factor.alphabet_map[c] = len(factor.alphabet) 101 | factor.alphabet.append(c) 102 | factor.charseqs[-1].append(factor.alphabet_map[c]) 103 | if add_bow_eow: 104 | factor.charseqs[-1].append(factor.alphabet_map['']) 105 | factor.charseq_ids[-1].append(factor.charseqs_map[word]) 106 | 107 | # Word-level information 108 | if word not in factor.words_map: 109 | if train: 110 | word = '' 111 | else: 112 | factor.words_map[word] = len(factor.words) 113 | factor.words.append(word) 114 | factor.word_ids[-1].append(factor.words_map[word]) 115 | in_sentence = True 116 | else: 117 | in_sentence = False 118 | if max_sentences is not None and len(self._factors[self.FORMS].word_ids) >= max_sentences: 119 | break 120 | 121 | # Compute sentence lengths 122 | sentences = len(self._factors[self.FORMS].word_ids) 123 | self._sentence_lens = np.zeros([sentences], np.int32) 124 | for i in range(len(self._factors[self.FORMS].word_ids)): 125 | self._sentence_lens[i] = len(self._factors[self.FORMS].word_ids[i]) 126 | 127 | # Compute tag lengths 128 | tags = len(self._factors[self.TAGS].word_ids) 129 | self._tag_lens = 
np.zeros([tags], np.int32) 130 | for i in range(len(self._factors[self.TAGS].word_ids)): 131 | self._tag_lens[i] = len(self._factors[self.TAGS].word_ids[i]) 132 | 133 | self._shuffle_batches = shuffle_batches 134 | self._permutation = np.random.permutation(len(self._sentence_lens)) if self._shuffle_batches else np.arange(len(self._sentence_lens)) 135 | 136 | # Load pretrained BERT embeddings 137 | self._bert_embeddings = [] # [sentences x words x bert_embeddings] 138 | if bert_embeddings_filename: 139 | with open(bert_embeddings_filename, "r", encoding="utf-8") as file: 140 | in_sentence = False 141 | for line in file: 142 | line = line.rstrip("\r\n") 143 | if line: 144 | if not in_sentence: 145 | self._bert_embeddings.append([]) 146 | self._bert_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 147 | in_sentence = True 148 | else: 149 | self._bert_embeddings[-1] = np.array(self._bert_embeddings[-1], dtype=np.float32) 150 | in_sentence = False 151 | 152 | # Load pretrained flair embeddings 153 | self._flair_embeddings = [] # [sentences x words x flair_embeddings] 154 | if flair_filename: 155 | with open(flair_filename, "r", encoding="utf-8") as file: 156 | in_sentence = False 157 | for line in file: 158 | line = line.rstrip("\r\n") 159 | if line: 160 | if not in_sentence: 161 | self._flair_embeddings.append([]) 162 | self._flair_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 163 | in_sentence = True 164 | else: 165 | self._flair_embeddings[-1] = np.array(self._flair_embeddings[-1], dtype=np.float32) 166 | in_sentence = False 167 | 168 | # Load pretrained elmo embeddings 169 | self._elmo_embeddings = [] # [sentences x words x elmo_embeddings] 170 | if elmo_filename: 171 | with open(elmo_filename, "r", encoding="utf-8") as file: 172 | in_sentence = False 173 | for line in file: 174 | line = line.rstrip("\r\n") 175 | if line: 176 | if not in_sentence: 177 | self._elmo_embeddings.append([]) 178 | self._elmo_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 179 | in_sentence = True 180 | else: 181 | self._elmo_embeddings[-1] = np.array(self._elmo_embeddings[-1], dtype=np.float32) 182 | in_sentence = False 183 | 184 | 185 | @property 186 | def bert_embeddings(self): 187 | return self._bert_embeddings 188 | 189 | @property 190 | def flair_embeddings(self): 191 | return self._flair_embeddings 192 | 193 | @property 194 | def elmo_embeddings(self): 195 | return self._elmo_embeddings 196 | 197 | @property 198 | def sentence_lens(self): 199 | return self._sentence_lens 200 | 201 | @property 202 | def tag_lens(self): 203 | return self._tag_lens 204 | 205 | @property 206 | def factors(self): 207 | """Return the factors of the dataset. 208 | The result is an array of factors, each an object containing: 209 | strings: Strings of the original words. 210 | word_ids: Word ids of the original words (uses and ). 211 | words_map: String -> word_id map. 212 | words: Word_id -> string list. 213 | alphabet_map: Character -> char_id map. 214 | alphabet: Char_id -> character list. 215 | charseq_ids: Character_sequence ids of the original words. 216 | charseqs_map: String -> character_sequence_id map. 217 | charseqs: Character_sequence_id -> [characters], where character is an index 218 | to the dataset alphabet. 219 | """ 220 | 221 | return self._factors 222 | 223 | def next_batch(self, batch_size, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs=False, seq2seq=False): 224 | """Return the next batch. 
225 | Arguments: 226 | including_charseqs: if True, also batch_charseq_ids, batch_charseqs and batch_charseq_lens are returned 227 | Returns: 228 | {sentence_lens, batch_word_ids, batch_charseq_ids, batch_charseqs, batch_pretrained_wes} 229 | sequence_lens: batch of sentence_lens 230 | batch_word_ids: for each factor, batch of words_id 231 | batch_charseq_ids: For each factor, batch of charseq_ids 232 | (the same shape as words_id, but with the ids pointing into batch_charseqs). 233 | Returned only if including_charseqs is True. 234 | batch_charseqs: For each factor, all unique charseqs in the batch, 235 | indexable by batch_charseq_ids. Contains indices of characters from self.alphabet. 236 | Returned only if including_charseqs is True. 237 | batch_charseq_lens: For each factor, length of charseqs in batch_charseqs. 238 | Returned only if including_charseqs is True. 239 | batch_pretrained_form_wes: For each FORM factor, batch of pretrained word embeddings. 240 | Returned only if form_wes_model != None. 241 | batch_pretrained_lemma_wes: For each LEMMA factor, batch of pretrained word embeddings. 242 | Returned onlyu if lemma_wes_model != None. 243 | batch_bert_embeddings: For each FORM factor, batch of pretrained BERT embeddings. 244 | Returned only if bert_embeddings_filename != None during initialiation. 245 | batch_flair_embeddings: For each FORM factor, batch of pretrained Flair embeddings. 246 | Returned only if flair_filename != None during initialiation. 247 | batch_elmo_embeddings: For each FORM factor, batch of pretrained ELMo embeddings. 248 | Returned only if elmo_filename != None during initialiation. 249 | """ 250 | 251 | batch_size = min(batch_size, len(self._permutation)) 252 | batch_perm = self._permutation[:batch_size] 253 | self._permutation = self._permutation[batch_size:] 254 | return self._next_batch(batch_perm, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs, seq2seq) 255 | 256 | def epoch_finished(self): 257 | if len(self._permutation) == 0: 258 | self._permutation = np.random.permutation(len(self._sentence_lens)) if self._shuffle_batches else np.arange(len(self._sentence_lens)) 259 | return True 260 | return False 261 | 262 | def bert_embeddings_dim(self): 263 | if self._bert_embeddings: 264 | return self._bert_embeddings[0].shape[1] 265 | else: 266 | return 0 267 | 268 | def flair_embeddings_dim(self): 269 | if self._flair_embeddings: 270 | return self._flair_embeddings[0].shape[1] 271 | else: 272 | return 0 273 | 274 | def elmo_embeddings_dim(self): 275 | if self._elmo_embeddings: 276 | return self._elmo_embeddings[0].shape[1] 277 | else: 278 | return 0 279 | 280 | def _next_batch(self, batch_perm, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs, seq2seq=False): 281 | batch_size = len(batch_perm) 282 | batch_dict = dict() 283 | 284 | # General data 285 | batch_sentence_lens = self._sentence_lens[batch_perm] 286 | max_sentence_len = np.max(batch_sentence_lens) 287 | 288 | if seq2seq: 289 | batch_tag_lens = self._tag_lens[batch_perm] 290 | max_tag_len = np.max(batch_tag_lens) 291 | 292 | # Word-level data 293 | batch_word_ids = [] 294 | batch_word_wes = [] 295 | for f in range(self.FACTORS): 296 | factor = self._factors[f] 297 | if f == self.TAGS and seq2seq: 298 | batch_word_ids.append(np.zeros([batch_size, max_tag_len], np.int32)) 299 | for i in range(batch_size): 300 | batch_word_ids[-1][i, 0:batch_tag_lens[i]] = factor.word_ids[batch_perm[i]] 301 | else: 302 | batch_word_ids.append(np.zeros([batch_size, max_sentence_len], 
np.int32)) 303 | for i in range(batch_size): 304 | batch_word_ids[-1][i, 0:batch_sentence_lens[i]] = factor.word_ids[batch_perm[i]] 305 | 306 | batch_dict["sentence_lens"] = self._sentence_lens[batch_perm] 307 | batch_dict["word_ids"] = batch_word_ids 308 | 309 | # Character-level data 310 | if including_charseqs: 311 | batch_charseq_ids, batch_charseqs, batch_charseq_lens = [], [], [] 312 | 313 | for f in range(self.FACTORS): 314 | if not (f == self.TAGS and seq2seq): 315 | factor = self._factors[f] 316 | batch_charseq_ids.append(np.zeros([batch_size, max_sentence_len], np.int32)) 317 | charseqs_map = {} 318 | charseqs = [] 319 | charseq_lens = [] 320 | for i in range(batch_size): 321 | for j, charseq_id in enumerate(factor.charseq_ids[batch_perm[i]]): 322 | if charseq_id not in charseqs_map: 323 | charseqs_map[charseq_id] = len(charseqs) 324 | charseqs.append(factor.charseqs[charseq_id]) 325 | batch_charseq_ids[-1][i, j] = charseqs_map[charseq_id] 326 | 327 | batch_charseq_lens.append(np.array([len(charseq) for charseq in charseqs], np.int32)) 328 | batch_charseqs.append(np.zeros([len(charseqs), np.max(batch_charseq_lens[-1])], np.int32)) 329 | for i in range(len(charseqs)): 330 | batch_charseqs[-1][i, 0:len(charseqs[i])] = charseqs[i] 331 | batch_dict["batch_charseq_ids"] = batch_charseq_ids 332 | batch_dict["batch_charseqs"] = batch_charseqs 333 | batch_dict["batch_charseq_lens"] = batch_charseq_lens 334 | 335 | # Pretrained word embeddings for forms 336 | if form_wes_model: 337 | we_size = form_wes_model.vectors.shape[1] # get pretrained WEs dimension 338 | pretrained_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 339 | for i in range(batch_size): 340 | for j, word in enumerate(self._factors[self.FORMS].strings[batch_perm[i]]): 341 | if word in form_wes_model: 342 | pretrained_wes[i, j] = form_wes_model[word] 343 | elif word.lower() in form_wes_model: 344 | pretrained_wes[i, j] = form_wes_model[word.lower()] 345 | batch_dict["batch_form_pretrained_wes"] = pretrained_wes 346 | 347 | # Fasttext word embeddings for forms 348 | if fasttext_model: 349 | we_size = fasttext_model.get_dimension() # get pretrained WEs dimension 350 | fasttext_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 351 | for i in range(batch_size): 352 | for j, word in enumerate(self._factors[self.FORMS].strings[batch_perm[i]]): 353 | fasttext_wes[i, j] = fasttext_model.get_word_vector(word) 354 | batch_dict["batch_form_fasttext_wes"] = fasttext_wes 355 | 356 | # Pretrained BERT embeddings for forms 357 | if self._bert_embeddings: 358 | we_size = self.bert_embeddings_dim() 359 | batch_bert_embeddings = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 360 | for i in range(batch_size): 361 | batch_bert_embeddings[i, :self._bert_embeddings[batch_perm[i]].shape[0]] = self._bert_embeddings[batch_perm[i]] 362 | batch_dict["batch_bert_wes"] = batch_bert_embeddings 363 | 364 | # Pretrained flair embeddings for forms 365 | if self._flair_embeddings: 366 | we_size = self.flair_embeddings_dim() 367 | batch_flair_embeddings = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 368 | for i in range(batch_size): 369 | batch_flair_embeddings[i, :self._flair_embeddings[batch_perm[i]].shape[0]] = self._flair_embeddings[batch_perm[i]] 370 | batch_dict["batch_flair_wes"] = batch_flair_embeddings 371 | 372 | # Pretrained elmo embeddings for forms 373 | if self._elmo_embeddings: 374 | we_size = self.elmo_embeddings_dim() 375 | batch_elmo_embeddings = np.zeros([batch_size, 
max_sentence_len, we_size], np.float32) 376 | for i in range(batch_size): 377 | batch_elmo_embeddings[i, :self._elmo_embeddings[batch_perm[i]].shape[0]] = self._elmo_embeddings[batch_perm[i]] 378 | batch_dict["batch_elmo_wes"] = batch_elmo_embeddings 379 | 380 | # Pretrained word embeddings for lemmas 381 | if lemma_wes_model: 382 | we_size = lemma_wes_model.vectors.shape[1] # get pretrained WEs dimension 383 | pretrained_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 384 | for i in range(batch_size): 385 | for j, word in enumerate(self._factors[self.LEMMAS].strings[batch_perm[i]]): 386 | if word in lemma_wes_model: 387 | pretrained_wes[i, j] = lemma_wes_model[word] 388 | batch_dict["batch_lemma_pretrained_wes"] = pretrained_wes 389 | 390 | return batch_dict 391 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | fastText 2 | tensorflow<2.0 3 | git+https://github.com/danielfrg/word2vec 4 | -------------------------------------------------------------------------------- /run_conlleval.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 4 | # Mathematics and Physics, Charles University, Czech Republic. 5 | # 6 | # This Source Code Form is subject to the terms of the Mozilla Public 7 | # License, v. 2.0. If a copy of the MPL was not distributed with this 8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 9 | 10 | # This script evaluates the TensorFlow output, both during training and 11 | # prediction phase, for flat corpora (CoNLL-2003 and CoNLL-2002), using the 12 | # official distributed evaluation script conlleval. 13 | 14 | set -e 15 | 16 | name="$1" 17 | gold="$2" 18 | system="$3" 19 | 20 | if [ $name == "dev" ]; then 21 | $(dirname $0)/bilou2bio.py < ${system} > ${name}_system_bio.conll 22 | $(dirname $0)/bilou2bio.py < $(dirname $0)/${gold} > ${name}_gold_bio.conll 23 | paste ${name}_gold_bio.conll ${name}_system_bio.conll | cut -f1,2,3,4,8 > ${name}_conlleval_input.conll 24 | elif [ $name == "test" ]; then 25 | $(dirname $0)/bilou2bio.py < ${system} > ${name}_system_bio.conll 26 | paste $(dirname $0)/${gold} ${name}_system_bio.conll | cut -f1,2,3,4,8 > ${name}_conlleval_input.conll 27 | else 28 | echo "./run_conlleval.sh: Unknown file name \"$name\"." 29 | exit 1 30 | fi 31 | 32 | $(dirname $0)/conlleval -d "\t" < ${name}_conlleval_input.conll > $name.eval 33 | -------------------------------------------------------------------------------- /run_eval_nested.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 4 | # Mathematics and Physics, Charles University, Czech Republic. 5 | # 6 | # This Source Code Form is subject to the terms of the Mozilla Public 7 | # License, v. 2.0. If a copy of the MPL was not distributed with this 8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 9 | 10 | # This script evaluates the TensorFlow output, both during training and 11 | # prediction phase, for nested corpora, using the evaluation script 12 | # compare_nested_entities.py. 
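#
# Usage sketch (arguments are illustrative):
#   ./run_eval_nested.sh test path/to/gold_dir
# expects test_system_predictions.conll in the current directory, converts it
# to entity mentions with conll2eval_nested.py, compares them with
# compare_nested_entities.py against test_gold_entities.txt found under
# path/to/gold_dir (resolved relative to this script's location), and writes
# the result to test.eval.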
13 | 14 | set -e 15 | 16 | name="$1" 17 | gold_dir="$2" 18 | 19 | cat ${name}_system_predictions.conll | $(dirname $0)/conll2eval_nested.py > ${name}_system_entities.txt 20 | $(dirname $0)/compare_nested_entities.py $(dirname $0)/${gold_dir}/${name}_gold_entities.txt ${name}_system_entities.txt > ${name}.eval 21 | -------------------------------------------------------------------------------- /tagger.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Nested NER training and evaluation in TensorFlow.""" 12 | 13 | import json 14 | import os 15 | import sys 16 | 17 | import fasttext 18 | import numpy as np 19 | import tensorflow as tf 20 | import word2vec 21 | 22 | import morpho_dataset 23 | 24 | 25 | class Network: 26 | def __init__(self, threads, seed=42): 27 | # Create an empty graph and a session 28 | graph = tf.Graph() 29 | graph.seed = seed 30 | self.session = tf.Session(graph = graph, config=tf.ConfigProto(inter_op_parallelism_threads=threads, 31 | intra_op_parallelism_threads=threads)) 32 | 33 | def construct(self, args, num_forms, num_form_chars, num_lemmas, num_lemma_chars, num_pos, 34 | pretrained_form_we_dim, pretrained_lemma_we_dim, pretrained_fasttext_dim, 35 | num_tags, tag_bos, tag_eow, pretrained_bert_dim, pretrained_flair_dim, pretrained_elmo_dim, 36 | predict_only): 37 | with self.session.graph.as_default(): 38 | 39 | # Inputs 40 | self.sentence_lens = tf.placeholder(tf.int32, [None], name="sentence_lens") 41 | self.form_ids = tf.placeholder(tf.int32, [None, None], name="form_ids") 42 | self.lemma_ids = tf.placeholder(tf.int32, [None, None], name="lemma_ids") 43 | self.pos_ids = tf.placeholder(tf.int32, [None, None], name="pos_ids") 44 | self.pretrained_form_wes = tf.placeholder(tf.float32, [None, None, pretrained_form_we_dim], name="pretrained_form_wes") 45 | self.pretrained_lemma_wes = tf.placeholder(tf.float32, [None, None, pretrained_lemma_we_dim], name="pretrained_lemma_wes") 46 | self.pretrained_fasttext_wes = tf.placeholder(tf.float32, [None, None, pretrained_fasttext_dim], name="fasttext_wes") 47 | self.pretrained_bert_wes = tf.placeholder(tf.float32, [None, None, pretrained_bert_dim], name="bert_wes") 48 | self.pretrained_flair_wes = tf.placeholder(tf.float32, [None, None, pretrained_flair_dim], name="flair_wes") 49 | self.pretrained_elmo_wes = tf.placeholder(tf.float32, [None, None, pretrained_elmo_dim], name="elmo_wes") 50 | self.tags = tf.placeholder(tf.int32, [None, None], name="tags") 51 | self.is_training = tf.placeholder(tf.bool, []) 52 | self.learning_rate = tf.placeholder(tf.float32, []) 53 | 54 | if args.including_charseqs: 55 | self.form_charseqs = tf.placeholder(tf.int32, [None, None], name="form_charseqs") 56 | self.form_charseq_lens = tf.placeholder(tf.int32, [None], name="form_charseq_lens") 57 | self.form_charseq_ids = tf.placeholder(tf.int32, [None,None], name="form_charseq_ids") 58 | 59 | self.lemma_charseqs = tf.placeholder(tf.int32, [None, None], name="lemma_charseqs") 60 | self.lemma_charseq_lens = tf.placeholder(tf.int32, [None], name="lemma_charseq_lens") 61 | self.lemma_charseq_ids = 
tf.placeholder(tf.int32, [None,None], name="lemma_charseq_ids") 62 | 63 | # RNN Cell 64 | if args.rnn_cell == "LSTM": 65 | rnn_cell = tf.nn.rnn_cell.BasicLSTMCell 66 | elif args.rnn_cell == "GRU": 67 | rnn_cell = tf.nn.rnn_cell.GRUCell 68 | else: 69 | raise ValueError("Unknown rnn_cell {}".format(args.rnn_cell)) 70 | 71 | inputs = [] 72 | 73 | # Trainable embeddings for forms 74 | form_embeddings = tf.get_variable("form_embeddings", shape=[num_forms, args.we_dim], dtype=tf.float32) 75 | inputs.append(tf.nn.embedding_lookup(form_embeddings, self.form_ids)) 76 | 77 | # Trainable embeddings for lemmas 78 | lemma_embeddings = tf.get_variable("lemma_embeddings", shape=[num_lemmas, args.we_dim], dtype=tf.float32) 79 | inputs.append(tf.nn.embedding_lookup(lemma_embeddings, self.lemma_ids)) 80 | 81 | # POS encoded as one-hot vectors 82 | inputs.append(tf.one_hot(self.pos_ids, num_pos)) 83 | 84 | # Pretrained embeddings for forms 85 | if args.form_wes_model: 86 | inputs.append(self.pretrained_form_wes) 87 | 88 | # Pretrained embeddings for lemmas 89 | if args.lemma_wes_model: 90 | inputs.append(self.pretrained_lemma_wes) 91 | 92 | # Fasttext form embeddings 93 | if args.fasttext_model: 94 | inputs.append(self.pretrained_fasttext_wes) 95 | 96 | # BERT form embeddings 97 | if pretrained_bert_dim: 98 | inputs.append(self.pretrained_bert_wes) 99 | 100 | # Flair form embeddings 101 | if pretrained_flair_dim: 102 | inputs.append(self.pretrained_flair_wes) 103 | 104 | # ELMo form embeddings 105 | if pretrained_elmo_dim: 106 | inputs.append(self.pretrained_elmo_wes) 107 | 108 | # Character-level form embeddings 109 | if args.including_charseqs: 110 | 111 | # Generate character embeddings for num_form_chars of dimensionality args.cle_dim. 112 | character_embeddings = tf.get_variable("form_character_embeddings", 113 | shape=[num_form_chars, args.cle_dim], 114 | dtype=tf.float32) 115 | 116 | # Embed self.form_charseqs (list of unique form in the batch) using the character embeddings. 117 | characters_embedded = tf.nn.embedding_lookup(character_embeddings, self.form_charseqs) 118 | 119 | # Use tf.nn.bidirectional.rnn to process embedded self.form_charseqs 120 | # using a GRU cell of dimensionality args.cle_dim. 121 | _, (state_fwd, state_bwd) = tf.nn.bidirectional_dynamic_rnn( 122 | tf.nn.rnn_cell.GRUCell(args.cle_dim), tf.nn.rnn_cell.GRUCell(args.cle_dim), 123 | characters_embedded, sequence_length=self.form_charseq_lens, dtype=tf.float32, scope="form_cle") 124 | 125 | # Sum the resulting fwd and bwd state to generate character-level form embedding (CLE) 126 | # of unique forms in the batch. 127 | cle = tf.concat([state_fwd, state_bwd], axis=1) 128 | 129 | # Generate CLEs of all form in the batch by indexing the just computed embeddings 130 | # by self.form_charseq_ids (using tf.nn.embedding_lookup). 131 | cle_embedded = tf.nn.embedding_lookup(cle, self.form_charseq_ids) 132 | 133 | # Concatenate the form embeddings (computed above in inputs) and the CLE (in this order). 
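# (In this implementation the CLE vectors are appended to the `inputs` list
# here; the actual concatenation of all input embeddings happens below via
# tf.concat(inputs, axis=2).)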
134 | inputs.append(cle_embedded) 135 | 136 | # Character-level lemma embeddings 137 | if args.including_charseqs: 138 | 139 | character_embeddings = tf.get_variable("lemma_character_embeddings", 140 | shape=[num_lemma_chars, args.cle_dim], 141 | dtype=tf.float32) 142 | characters_embedded = tf.nn.embedding_lookup(character_embeddings, self.lemma_charseqs) 143 | _, (state_fwd, state_bwd) = tf.nn.bidirectional_dynamic_rnn( 144 | tf.nn.rnn_cell.GRUCell(args.cle_dim), tf.nn.rnn_cell.GRUCell(args.cle_dim), 145 | characters_embedded, sequence_length=self.lemma_charseq_lens, dtype=tf.float32, scope="lemma_cle") 146 | cle = tf.concat([state_fwd, state_bwd], axis=1) 147 | cle_embedded = tf.nn.embedding_lookup(cle, self.lemma_charseq_ids) 148 | inputs.append(cle_embedded) 149 | 150 | # Concatenate inputs 151 | inputs = tf.concat(inputs, axis=2) 152 | 153 | # Dropout 154 | inputs_dropout = tf.layers.dropout(inputs, rate=args.dropout, training=self.is_training) 155 | 156 | # Computation 157 | hidden_layer_dropout = inputs_dropout # first layer is input 158 | for i in range(args.rnn_layers): 159 | (hidden_layer_fwd, hidden_layer_bwd), _ = tf.nn.bidirectional_dynamic_rnn( 160 | rnn_cell(args.rnn_cell_dim), rnn_cell(args.rnn_cell_dim), 161 | hidden_layer_dropout, sequence_length=self.sentence_lens, dtype=tf.float32, 162 | scope="RNN-{}".format(i)) 163 | hidden_layer = tf.concat([hidden_layer_fwd, hidden_layer_bwd], axis=2) 164 | if i == 0: hidden_layer_dropout = 0 165 | hidden_layer_dropout += tf.layers.dropout(hidden_layer, rate=args.dropout, training=self.is_training) 166 | 167 | # Decoders 168 | if args.decoding == "CRF": # conditional random fields 169 | output_layer = tf.layers.dense(hidden_layer_dropout, num_tags) 170 | weights = tf.sequence_mask(self.sentence_lens, dtype=tf.float32) 171 | log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood( 172 | output_layer, self.tags, self.sentence_lens) 173 | loss = tf.reduce_mean(-log_likelihood) 174 | self.predictions, viterbi_score = tf.contrib.crf.crf_decode( 175 | output_layer, transition_params, self.sentence_lens) 176 | self.predictions_training = self.predictions 177 | elif args.decoding == "ME": # vanilla maximum entropy 178 | output_layer = tf.layers.dense(hidden_layer_dropout, num_tags) 179 | weights = tf.sequence_mask(self.sentence_lens, dtype=tf.float32) 180 | if args.label_smoothing: 181 | gold_labels = tf.one_hot(self.tags, num_tags) * (1 - args.label_smoothing) + args.label_smoothing / num_tags 182 | loss = tf.losses.softmax_cross_entropy(gold_labels, output_layer, weights=weights) 183 | else: 184 | loss = tf.losses.sparse_softmax_cross_entropy(self.tags, output_layer, weights=weights) 185 | self.predictions = tf.argmax(output_layer, axis=2) 186 | self.predictions_training = self.predictions 187 | elif args.decoding in ["LSTM", "seq2seq"]: # Decoder 188 | # Generate target embeddings for target chars, of shape [target_chars, args.char_dim]. 189 | tag_embeddings = tf.get_variable("tag_embeddings", shape=[num_tags, args.we_dim], dtype=tf.float32) 190 | 191 | # Embed the target_seqs using the target embeddings. 192 | tags_embedded = tf.nn.embedding_lookup(tag_embeddings, self.tags) 193 | 194 | decoder_rnn_cell = rnn_cell(args.rnn_cell_dim) 195 | 196 | # Create a `decoder_layer` -- a fully connected layer with 197 | # target_chars neurons used in the decoder to classify into target characters. 
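# (Here the decoder targets are NE tags rather than characters, hence the
# layer has num_tags output units.)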
198 | decoder_layer = tf.layers.Dense(num_tags) 199 | 200 | sentence_lens = self.sentence_lens 201 | max_sentence_len = tf.reduce_max(sentence_lens) 202 | tags = self.tags 203 | # The DecoderTraining will be used during training. It will output logits for each 204 | # target character. 205 | class DecoderTraining(tf.contrib.seq2seq.Decoder): 206 | @property 207 | def batch_size(self): return tf.shape(hidden_layer_dropout)[0] 208 | @property 209 | def output_dtype(self): return tf.float32 # Type for logits of target characters 210 | @property 211 | def output_size(self): return num_tags # Length of logits for every output 212 | @property 213 | def tag_eow(self): return tag_eow 214 | 215 | def initialize(self, name=None): 216 | states = decoder_rnn_cell.zero_state(self.batch_size, tf.float32) 217 | inputs = [tf.nn.embedding_lookup(tag_embeddings, tf.fill([self.batch_size], tag_bos)), hidden_layer_dropout[:,0]] 218 | inputs = tf.concat(inputs, axis=1) 219 | if args.decoding == "seq2seq": 220 | predicted_eows = tf.zeros([self.batch_size], dtype=tf.int32) 221 | inputs = (inputs, predicted_eows) 222 | finished = sentence_lens <= 0 223 | return finished, inputs, states 224 | 225 | def step(self, time, inputs, states, name=None): 226 | if args.decoding == "seq2seq": 227 | inputs, predicted_eows = inputs 228 | outputs, states = decoder_rnn_cell(inputs, states) 229 | outputs = decoder_layer(outputs) 230 | next_input = [tf.nn.embedding_lookup(tag_embeddings, tags[:,time])] 231 | if args.decoding == "seq2seq": 232 | predicted_eows += tf.to_int32(tf.equal(tags[:, time], self.tag_eow)) 233 | indices = tf.where(tf.one_hot(tf.minimum(predicted_eows, max_sentence_len - 1), tf.reduce_max(predicted_eows) + 1)) 234 | next_input.append(tf.gather_nd(hidden_layer_dropout, indices)) 235 | else: 236 | next_input.append(hidden_layer_dropout[:,tf.minimum(time + 1, max_sentence_len - 1)]) 237 | next_input = tf.concat(next_input, axis=1) 238 | if args.decoding == "seq2seq": 239 | next_input = (next_input, predicted_eows) 240 | finished = sentence_lens <= predicted_eows 241 | else: 242 | finished = sentence_lens <= time + 1 243 | return outputs, states, next_input, finished 244 | output_layer, _, prediction_training_lens = tf.contrib.seq2seq.dynamic_decode(DecoderTraining()) 245 | self.predictions_training = tf.argmax(output_layer, axis=2, output_type=tf.int32) 246 | weights = tf.sequence_mask(prediction_training_lens, dtype=tf.float32) 247 | if args.label_smoothing: 248 | gold_labels = tf.one_hot(self.tags, num_tags) * (1 - args.label_smoothing) + args.label_smoothing / num_tags 249 | loss = tf.losses.softmax_cross_entropy(gold_labels, output_layer, weights=weights) 250 | else: 251 | loss = tf.losses.sparse_softmax_cross_entropy(self.tags, output_layer, weights=weights) 252 | 253 | # The DecoderPrediction will be used during prediction. It will 254 | # directly output the predicted target characters. 
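# (i.e. the predicted tag ids; with --decoding=seq2seq, decoding of a sentence
# finishes once an end-of-word tag has been produced for every token, and is
# additionally capped by the maximum_iterations limit below.)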
255 |                 class DecoderPrediction(tf.contrib.seq2seq.Decoder):
256 |                     @property
257 |                     def batch_size(self): return tf.shape(hidden_layer_dropout)[0]
258 |                     @property
259 |                     def output_dtype(self): return tf.int32 # Type for predicted tags
260 |                     @property
261 |                     def output_size(self): return 1 # Will return just one output
262 |                     @property
263 |                     def tag_eow(self): return tag_eow
264 | 
265 |                     def initialize(self, name=None):
266 |                         states = decoder_rnn_cell.zero_state(self.batch_size, tf.float32)
267 |                         inputs = [tf.nn.embedding_lookup(tag_embeddings, tf.fill([self.batch_size], tag_bos)), hidden_layer_dropout[:,0]]
268 |                         inputs = tf.concat(inputs, axis=1)
269 |                         if args.decoding == "seq2seq":
270 |                             predicted_eows = tf.zeros([self.batch_size], dtype=tf.int32)
271 |                             inputs = (inputs, predicted_eows)
272 |                         finished = sentence_lens <= 0
273 |                         return finished, inputs, states
274 | 
275 |                     def step(self, time, inputs, states, name=None):
276 |                         if args.decoding == "seq2seq":
277 |                             inputs, predicted_eows = inputs
278 |                         outputs, states = decoder_rnn_cell(inputs, states)
279 |                         outputs = decoder_layer(outputs)
280 |                         outputs = tf.argmax(outputs, axis=1, output_type=self.output_dtype)
281 |                         next_input = [tf.nn.embedding_lookup(tag_embeddings, outputs)]
282 |                         if args.decoding == "seq2seq":
283 |                             predicted_eows += tf.to_int32(tf.equal(outputs, self.tag_eow))
284 |                             indices = tf.where(tf.one_hot(tf.minimum(predicted_eows, max_sentence_len - 1), tf.reduce_max(predicted_eows) + 1))
285 |                             next_input.append(tf.gather_nd(hidden_layer_dropout, indices))
286 |                         else:
287 |                             next_input.append(hidden_layer_dropout[:,tf.minimum(time + 1, max_sentence_len - 1)])
288 |                         next_input = tf.concat(next_input, axis=1)
289 |                         if args.decoding == "seq2seq":
290 |                             next_input = (next_input, predicted_eows)
291 |                             finished = sentence_lens <= predicted_eows
292 |                         else:
293 |                             finished = sentence_lens <= time + 1
294 |                         return outputs, states, next_input, finished
295 |                 self.predictions, _, _ = tf.contrib.seq2seq.dynamic_decode(
296 |                     DecoderPrediction(), maximum_iterations=3*tf.reduce_max(self.sentence_lens) + 10)
297 | 
298 |             # Saver
299 |             self.saver = tf.train.Saver(max_to_keep=1)
300 |             if predict_only: return
301 | 
302 |             # Training
303 |             global_step = tf.train.create_global_step()
304 |             self.training = tf.contrib.opt.LazyAdamOptimizer(learning_rate=self.learning_rate, beta2=args.beta_2).minimize(loss, global_step=global_step)
305 | 
306 |             # Summaries
307 |             self.current_accuracy, self.update_accuracy = tf.metrics.accuracy(self.tags, self.predictions_training, weights=weights)
308 |             self.current_loss, self.update_loss = tf.metrics.mean(loss, weights=tf.reduce_sum(weights))
309 |             self.reset_metrics = tf.variables_initializer(tf.get_collection(tf.GraphKeys.METRIC_VARIABLES))
310 | 
311 |             summary_writer = tf.contrib.summary.create_file_writer(args.logdir, flush_millis=10 * 1000)
312 |             self.summaries = {}
313 |             with summary_writer.as_default(), tf.contrib.summary.record_summaries_every_n_global_steps(100):
314 |                 self.summaries["train"] = [tf.contrib.summary.scalar("train/loss", self.update_loss),
315 |                                            tf.contrib.summary.scalar("train/accuracy", self.update_accuracy)]
316 |             with summary_writer.as_default(), tf.contrib.summary.always_record_summaries():
317 |                 for dataset in ["dev", "test"]:
318 |                     self.summaries[dataset] = [tf.contrib.summary.scalar(dataset + "/loss", self.current_loss),
319 |                                                tf.contrib.summary.scalar(dataset + "/accuracy", self.current_accuracy)]
320 | 
321 |             self.metrics = {}
322 |             self.metrics_summarize = {}
323 |             for metric in ["precision", "recall", "F1"]:
["precision", "recall", "F1"]: 324 | self.metrics[metric] = tf.placeholder(tf.float32, [], name=metric) 325 | self.metrics_summarize[metric] = {} 326 | with summary_writer.as_default(), tf.contrib.summary.always_record_summaries(): 327 | for dataset in ["dev", "test"]: 328 | self.metrics_summarize[metric][dataset] = tf.contrib.summary.scalar(dataset + "/" + metric, 329 | self.metrics[metric]) 330 | 331 | # Initialize variables 332 | self.session.run(tf.global_variables_initializer()) 333 | with summary_writer.as_default(): 334 | tf.contrib.summary.initialize(session=self.session, graph=self.session.graph) 335 | 336 | 337 | def train_epoch(self, train, learning_rate, args): 338 | while not train.epoch_finished(): 339 | seq2seq = args.decoding == "seq2seq" 340 | batch_dict = train.next_batch(args.batch_size, args.form_wes_model, args.lemma_wes_model, args.fasttext_model, including_charseqs=args.including_charseqs, seq2seq=seq2seq) 341 | if args.word_dropout: 342 | mask = np.random.binomial(n=1, p=args.word_dropout, size=batch_dict["word_ids"][train.FORMS].shape) 343 | batch_dict["word_ids"][train.FORMS] = (1 - mask) * batch_dict["word_ids"][train.FORMS] + mask * train.factors[train.FORMS].words_map[""] 344 | 345 | mask = np.random.binomial(n=1, p=args.word_dropout, size=batch_dict["word_ids"][train.LEMMAS].shape) 346 | batch_dict["word_ids"][train.LEMMAS] = (1 - mask) * batch_dict["word_ids"][train.LEMMAS] + mask * train.factors[train.LEMMAS].words_map[""] 347 | 348 | self.session.run(self.reset_metrics) 349 | feeds = {self.sentence_lens: batch_dict["sentence_lens"], 350 | self.form_ids: batch_dict["word_ids"][train.FORMS], 351 | self.lemma_ids: batch_dict["word_ids"][train.LEMMAS], 352 | self.pos_ids: batch_dict["word_ids"][train.POS], 353 | self.tags: batch_dict["word_ids"][train.TAGS], 354 | self.is_training: True, 355 | self.learning_rate: learning_rate} 356 | if args.form_wes_model: # pretrained form embeddings 357 | feeds[self.pretrained_form_wes] = batch_dict["batch_form_pretrained_wes"] 358 | if args.lemma_wes_model: # pretrained lemma embeddings 359 | feeds[self.pretrained_lemma_wes] = batch_dict["batch_lemma_pretrained_wes"] 360 | if args.fasttext_model: # fasttext form embeddings 361 | feeds[self.pretrained_fasttext_wes] = batch_dict["batch_form_fasttext_wes"] 362 | if args.bert_embeddings_train: # BERT embeddings 363 | feeds[self.pretrained_bert_wes] = batch_dict["batch_bert_wes"] 364 | if args.flair_train: # flair embeddings 365 | feeds[self.pretrained_flair_wes] = batch_dict["batch_flair_wes"] 366 | if args.elmo_train: # elmo embeddings 367 | feeds[self.pretrained_elmo_wes] = batch_dict["batch_elmo_wes"] 368 | 369 | if args.including_charseqs: # character-level embeddings 370 | feeds[self.form_charseqs] = batch_dict["batch_charseqs"][train.FORMS] 371 | feeds[self.form_charseq_lens] = batch_dict["batch_charseq_lens"][train.FORMS] 372 | feeds[self.form_charseq_ids] = batch_dict["batch_charseq_ids"][train.FORMS] 373 | 374 | feeds[self.lemma_charseqs] = batch_dict["batch_charseqs"][train.LEMMAS] 375 | feeds[self.lemma_charseq_lens] = batch_dict["batch_charseq_lens"][train.LEMMAS] 376 | feeds[self.lemma_charseq_ids] = batch_dict["batch_charseq_ids"][train.LEMMAS] 377 | 378 | self.session.run([self.training, self.summaries["train"]], feeds) 379 | 380 | 381 | def evaluate(self, dataset_name, dataset, args): 382 | with open("{}/{}_system_predictions.conll".format(args.logdir, dataset_name), "w", encoding="utf-8") as prediction_file: 383 | self.predict(dataset_name, dataset, args, 
384 | 
385 |         f1 = 0.0
386 |         if args.corpus in ["CoNLL_en", "CoNLL_de", "CoNLL_nl", "CoNLL_es"]:
387 |             os.system("cd {} && ../../run_conlleval.sh {} {} {}_system_predictions.conll".format(args.logdir, dataset_name, args.__dict__[dataset_name + "_data"], dataset_name))
388 | 
389 |             with open("{}/{}.eval".format(args.logdir,dataset_name), "r", encoding="utf-8") as result_file:
390 |                 for line in result_file:
391 |                     line = line.strip("\n")
392 |                     if line.startswith("accuracy:"):
393 |                         f1 = float(line.split()[-1])
394 |                         self.session.run(self.metrics_summarize["F1"][dataset_name], {self.metrics["F1"]: f1})
395 | 
396 |             return f1
397 |         elif args.corpus in [ "ACE2004", "ACE2005", "GENIA" ]: # nested named entities evaluation
398 |             os.system("cd {} && ../../run_eval_nested.sh {} {}".format(args.logdir, dataset_name, os.path.dirname(args.__dict__[dataset_name + "_data"])))
399 | 
400 |             with open("{}/{}.eval".format(args.logdir,dataset_name), "r", encoding="utf-8") as result_file:
401 |                 for line in result_file:
402 |                     line = line.strip("\n")
403 |                     if line.startswith("Recall:"):
404 |                         recall = float(line.split(" ")[1])
405 |                     if line.startswith("Precision:"):
406 |                         precision = float(line.split(" ")[1])
407 |                     if line.startswith("F1:"):
408 |                         f1 = float(line.split(" ")[1])
409 |             for metric, value in [["precision", precision], ["recall", recall], ["F1", f1]]:
410 |                 self.session.run(self.metrics_summarize[metric][dataset_name], {self.metrics[metric]: value})
411 |             return f1
412 |         else:
413 |             raise ValueError("Unknown corpus {}".format(args.corpus))
414 | 
415 | 
416 |     def predict(self, dataset_name, dataset, args, prediction_file, evaluating=False):
417 |         if evaluating:
418 |             self.session.run(self.reset_metrics)
419 |         tags = []
420 |         while not dataset.epoch_finished():
421 |             seq2seq = args.decoding == "seq2seq"
422 |             batch_dict = dataset.next_batch(args.batch_size, args.form_wes_model, args.lemma_wes_model, args.fasttext_model, args.including_charseqs, seq2seq=seq2seq)
423 |             targets = [self.predictions]
424 |             feeds = {self.sentence_lens: batch_dict["sentence_lens"],
425 |                      self.form_ids: batch_dict["word_ids"][dataset.FORMS],
426 |                      self.lemma_ids: batch_dict["word_ids"][dataset.LEMMAS],
427 |                      self.pos_ids: batch_dict["word_ids"][dataset.POS],
428 |                      self.is_training: False}
429 |             if evaluating:
430 |                 targets.extend([self.update_accuracy, self.update_loss])
431 |                 feeds[self.tags] = batch_dict["word_ids"][dataset.TAGS]
432 |             if args.form_wes_model: # pretrained form embeddings
433 |                 feeds[self.pretrained_form_wes] = batch_dict["batch_form_pretrained_wes"]
434 |             if args.lemma_wes_model: # pretrained lemma embeddings
435 |                 feeds[self.pretrained_lemma_wes] = batch_dict["batch_lemma_pretrained_wes"]
436 |             if args.fasttext_model: # fasttext form embeddings
437 |                 feeds[self.pretrained_fasttext_wes] = batch_dict["batch_form_fasttext_wes"]
438 |             if args.bert_embeddings_dev or args.bert_embeddings_test: # BERT embeddings
439 |                 feeds[self.pretrained_bert_wes] = batch_dict["batch_bert_wes"]
440 |             if args.flair_dev or args.flair_test: # flair embeddings
441 |                 feeds[self.pretrained_flair_wes] = batch_dict["batch_flair_wes"]
442 |             if args.elmo_dev or args.elmo_test: # elmo embeddings
443 |                 feeds[self.pretrained_elmo_wes] = batch_dict["batch_elmo_wes"]
444 | 
445 |             if args.including_charseqs: # character-level embeddings
446 |                 feeds[self.form_charseqs] = batch_dict["batch_charseqs"][dataset.FORMS]
447 |                 feeds[self.form_charseq_lens] = batch_dict["batch_charseq_lens"][dataset.FORMS]
448 |                 feeds[self.form_charseq_ids] = batch_dict["batch_charseq_ids"][dataset.FORMS]
449 | 
450 |                 feeds[self.lemma_charseqs] = batch_dict["batch_charseqs"][dataset.LEMMAS]
451 |                 feeds[self.lemma_charseq_lens] = batch_dict["batch_charseq_lens"][dataset.LEMMAS]
452 |                 feeds[self.lemma_charseq_ids] = batch_dict["batch_charseq_ids"][dataset.LEMMAS]
453 | 
454 |             tags.extend(self.session.run(targets, feeds)[0])
455 | 
456 |         if evaluating:
457 |             self.session.run([self.current_accuracy, self.summaries[dataset_name]])
458 | 
459 |         forms = dataset.factors[dataset.FORMS].strings
460 |         for s in range(len(forms)):
461 |             j = 0
462 |             for i in range(len(forms[s])):
463 |                 if args.decoding == "seq2seq": # collect all tags until "<eow>"
464 |                     labels = []
465 |                     while j < len(tags[s]) and dataset.factors[dataset.TAGS].words[tags[s][j]] != "<eow>":
466 |                         labels.append(dataset.factors[dataset.TAGS].words[tags[s][j]])
467 |                         j += 1
468 |                     j += 1 # skip the "<eow>"
469 |                     print("{}\t_\t_\t{}".format(forms[s][i], "|".join(labels)), file=prediction_file)
470 |                 else:
471 |                     print("{}\t_\t_\t{}".format(forms[s][i], dataset.factors[dataset.TAGS].words[tags[s][i]]), file=prediction_file)
472 |             print("", file=prediction_file)
473 | 
474 | 
475 | if __name__ == "__main__":
476 |     import argparse
477 |     import datetime
478 |     import os
479 |     import re
480 | 
481 |     # Fix random seed
482 |     np.random.seed(42)
483 | 
484 |     # Parse arguments
485 |     parser = argparse.ArgumentParser()
486 |     parser.add_argument("--batch_size", default=8, type=int, help="Batch size.")
487 |     parser.add_argument("--bert_embeddings_dev", default=None, type=str, help="Pretrained BERT embeddings for dev data.")
488 |     parser.add_argument("--bert_embeddings_test", default=None, type=str, help="Pretrained BERT embeddings for test data.")
489 |     parser.add_argument("--bert_embeddings_train", default=None, type=str, help="Pretrained BERT embeddings for train data.")
490 |     parser.add_argument("--beta_2", default=0.98, type=float, help="Beta 2.")
491 |     parser.add_argument("--corpus", default="CoNLL_en", type=str, help="CoNLL_en|CoNLL_de|CoNLL_nl|CoNLL_es|ACE2004|ACE2005|GENIA.")
492 |     parser.add_argument("--cle_dim", default=128, type=int, help="Character-level embedding dimension.")
493 |     parser.add_argument("--decoding", default="CRF", type=str, help="Decoding: [CRF|ME|LSTM|seq2seq].")
494 |     parser.add_argument("--dev_data", default=None, type=str, help="Dev data.")
495 |     parser.add_argument("--dropout", default=0.5, type=float, help="Dropout rate.")
496 |     parser.add_argument("--elmo_dev", default=None, type=str, help="ELMo dev embeddings.")
497 |     parser.add_argument("--elmo_test", default=None, type=str, help="ELMo test embeddings.")
498 |     parser.add_argument("--elmo_train", default=None, type=str, help="ELMo train embeddings.")
499 |     parser.add_argument("--epochs", default="10:1e-3", type=str, help="Epochs and learning rates.")
500 |     parser.add_argument("--fasttext_model", default=None, type=str, help="Fasttext subwords.")
501 |     parser.add_argument("--flair_dev", default=None, type=str, help="Flair dev embeddings.")
502 |     parser.add_argument("--flair_test", default=None, type=str, help="Flair test embeddings.")
503 |     parser.add_argument("--flair_train", default=None, type=str, help="Flair train embeddings.")
504 |     parser.add_argument("--form_wes_model", default=None, type=str, help="Pretrained form WEs.")
505 |     parser.add_argument("--label_smoothing", default=0, type=float, help="Label smoothing.")
506 |     parser.add_argument("--lemma_wes_model", default=None, type=str, help="Pretrained lemma WEs.")
507 |     parser.add_argument("--max_sentences", default=None, type=int, help="Number of training sentences (for debugging).")
508 |     parser.add_argument("--name", default=None, type=str, help="Experiment name.")
509 |     parser.add_argument("--predict", default=None, type=str, help="Predict using the passed model.")
510 |     parser.add_argument("--rnn_cell", default="LSTM", type=str, help="RNN cell type.")
511 |     parser.add_argument("--rnn_cell_dim", default=256, type=int, help="RNN cell dimension.")
512 |     parser.add_argument("--rnn_layers", default=1, type=int, help="Number of hidden layers.")
513 |     parser.add_argument("--test_data", default=None, type=str, help="Test data.")
514 |     parser.add_argument("--train_data", default=None, type=str, help="Training data.")
515 |     parser.add_argument("--threads", default=4, type=int, help="Maximum number of threads to use.")
516 |     parser.add_argument("--we_dim", default=256, type=int, help="Word embedding dimension.")
517 |     parser.add_argument("--word_dropout", default=0.2, type=float, help="Word dropout.")
518 |     args = parser.parse_args()
519 | 
520 |     if args.predict:
521 |         # Load saved options from the model
522 |         with open("{}/options.json".format(args.predict), mode="r") as options_file:
523 |             args = argparse.Namespace(**json.load(options_file))
524 |         parser.parse_args(namespace=args)
525 |     else:
526 |         # Create logdir name
527 |         logargs = dict(vars(args).items())
528 |         logargs["form_wes_model"] = 1 if args.form_wes_model else 0
529 |         logargs["lemma_wes_model"] = 1 if args.lemma_wes_model else 0
530 |         del logargs["bert_embeddings_dev"]
531 |         del logargs["bert_embeddings_test"]
532 |         del logargs["bert_embeddings_train"]
533 |         del logargs["beta_2"]
534 |         del logargs["cle_dim"]
535 |         del logargs["dev_data"]
536 |         del logargs["dropout"]
537 |         del logargs["elmo_dev"]
538 |         del logargs["elmo_test"]
539 |         del logargs["elmo_train"]
540 |         del logargs["flair_dev"]
541 |         del logargs["flair_test"]
542 |         del logargs["flair_train"]
543 |         del logargs["label_smoothing"]
544 |         del logargs["max_sentences"]
545 |         del logargs["rnn_cell_dim"]
546 |         del logargs["test_data"]
547 |         del logargs["threads"]
548 |         del logargs["train_data"]
549 |         del logargs["we_dim"]
550 |         del logargs["word_dropout"]
551 |         logargs["bert_embeddings"] = 1 if args.bert_embeddings_train else 0
552 |         logargs["flair_embeddings"] = 1 if args.flair_train else 0
553 |         logargs["elmo_embeddings"] = 1 if args.elmo_train else 0
554 | 
555 |         args.logdir = "logs/{}-{}-{}".format(
556 |             os.path.basename(__file__),
557 |             datetime.datetime.now().strftime("%Y-%m-%d_%H%M%S"),
558 |             ",".join(("{}={}".format(re.sub("(.)[^_]*_?", r"\1", key), re.sub("^.*/", "", value) if type(value) == str else value)
559 |                       for key, value in sorted(logargs.items())))
560 |         )
561 |         if not os.path.exists("logs"): os.mkdir("logs") # TF 1.6 will do this by itself
562 |         if not os.path.exists(args.logdir): os.mkdir(args.logdir)
563 | 
564 |         # Dump passed options to allow future prediction.
565 |         with open("{}/options.json".format(args.logdir), mode="w") as options_file:
566 |             json.dump(vars(args), options_file, sort_keys=True)
567 | 
568 |     # Postprocess args
569 |     args.epochs = [(int(epochs), float(lr)) for epochs, lr in (epochs_lr.split(":") for epochs_lr in args.epochs.split(","))]
570 | 
571 |     # Load the data
572 |     seq2seq = args.decoding == "seq2seq"
573 |     train = morpho_dataset.MorphoDataset(args.train_data, max_sentences=args.max_sentences, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_train, flair_filename=args.flair_train, elmo_filename=args.elmo_train)
574 |     if args.dev_data:
575 |         dev = morpho_dataset.MorphoDataset(args.dev_data, train=train, shuffle_batches=False, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_dev, flair_filename=args.flair_dev, elmo_filename=args.elmo_dev)
576 |     test = morpho_dataset.MorphoDataset(args.test_data, train=train, shuffle_batches=False, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_test, flair_filename=args.flair_test, elmo_filename=args.elmo_test)
577 | 
578 |     # Load pretrained form embeddings
579 |     if args.form_wes_model:
580 |         args.form_wes_model = word2vec.load(args.form_wes_model)
581 |     if args.lemma_wes_model:
582 |         args.lemma_wes_model = word2vec.load(args.lemma_wes_model)
583 | 
584 |     # Load fasttext subwords embeddings
585 |     if args.fasttext_model:
586 |         args.fasttext_model = fasttext.load_model(args.fasttext_model)
587 | 
588 |     # Character-level embeddings
589 |     args.including_charseqs = (args.cle_dim > 0)
590 | 
591 |     # Construct the network
592 |     network = Network(threads=args.threads)
593 |     network.construct(args,
594 |                       num_forms=len(train.factors[train.FORMS].words),
595 |                       num_form_chars=len(train.factors[train.FORMS].alphabet),
596 |                       num_lemmas=len(train.factors[train.LEMMAS].words),
597 |                       num_lemma_chars=len(train.factors[train.LEMMAS].alphabet),
598 |                       num_pos=len(train.factors[train.POS].words),
599 |                       pretrained_form_we_dim=args.form_wes_model.vectors.shape[1] if args.form_wes_model else 0,
600 |                       pretrained_lemma_we_dim=args.lemma_wes_model.vectors.shape[1] if args.lemma_wes_model else 0,
601 |                       pretrained_fasttext_dim=args.fasttext_model.get_dimension() if args.fasttext_model else 0,
602 |                       num_tags=len(train.factors[train.TAGS].words),
603 |                       tag_bos=train.factors[train.TAGS].words_map["<bos>"],
604 |                       tag_eow=train.factors[train.TAGS].words_map["<eow>"],
605 |                       pretrained_bert_dim=train.bert_embeddings_dim(),
606 |                       pretrained_flair_dim=train.flair_embeddings_dim(),
607 |                       pretrained_elmo_dim=train.elmo_embeddings_dim(),
608 |                       predict_only=args.predict)
609 | 
610 |     if args.predict:
611 |         network.saver.restore(network.session, "{}/model".format(args.predict.rstrip("/")))
612 |         print("Predicting test data", file=sys.stderr)
613 |         network.predict("test", test, args, sys.stdout, evaluating=False)
614 |     else:
615 |         # Train
616 |         for epochs, learning_rate in args.epochs:
617 |             for epoch in range(epochs):
618 |                 network.train_epoch(train, learning_rate, args)
619 |                 dev_score = 0
620 |                 if args.dev_data:
621 |                     dev_score = network.evaluate("dev", dev, args)
622 |                     print("{}".format(dev_score))
623 |         # Save network
624 |         network.saver.save(network.session, "{}/model".format(args.logdir), write_meta_graph=False)
625 |         # Test
626 |         test_score = network.evaluate("test", test, args)
627 |         print("{}".format(test_score))
628 | 
--------------------------------------------------------------------------------
/test_run/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #
3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of
4 | # Mathematics and Physics, Charles University, Czech Republic.
5 | #
6 | # This Source Code Form is subject to the terms of the Mozilla Public
7 | # License, v. 2.0. If a copy of the MPL was not distributed with this
8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/.
9 | 
10 | # Tagger test run with minimal parameters.
11 | 
12 | set -e
13 | 
14 | cat test.conll | ../conll2eval_nested.py > test_gold_entities.txt
15 | 
16 | # Seq2seq
17 | (cd ../ && ./tagger.py --corpus=ACE2004 --train_data=test_run/train.conll --test_data=test_run/test.conll --decoding=seq2seq --epochs=50:1e-3,8:1e-4 --name=test_run)
18 | 
19 | # LSTM-CRF
20 | (cd ../ && ./tagger.py --corpus=ACE2004 --train_data=test_run/train.conll --test_data=test_run/test.conll --decoding=CRF --epochs=50:1e-3,8:1e-4 --name=test_run)
21 | 
--------------------------------------------------------------------------------
/test_run/test.conll:
--------------------------------------------------------------------------------
1 | The the DT B-GPE
2 | Chinese chinese JJ I-GPE|U-GPE
3 | government government NN L-GPE
4 | 
5 | 
--------------------------------------------------------------------------------
/test_run/train.conll:
--------------------------------------------------------------------------------
1 | The the DT B-GPE
2 | Chinese chinese JJ I-GPE|U-GPE
3 | government government NN L-GPE
4 | and and CC O
5 | the the DT B-GPE
6 | Australian australian JJ I-GPE|U-GPE
7 | government government NN L-GPE
8 | signed sign VBD O
9 | an an DT O
10 | agreement agreement NN O
11 | today today NN O
12 | , , , O
13 | wherein wherein WRB O
14 | the the DT B-GPE
15 | Australian australian JJ I-GPE|U-GPE
16 | party party NN L-GPE
17 | would would MD O
18 | provide provide VB O
19 | China China NNP U-GPE
20 | with with IN O
21 | a a DT O
22 | preferential preferential JJ O
23 | financial financial JJ O
24 | loan loan NN O
25 | of of IN O
26 | 150 150 CD O
27 | million million CD O
28 | Australian australian JJ U-GPE
29 | dollars dollar NNS O
30 | . . . O
31 | 
32 | 
--------------------------------------------------------------------------------
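
A note on the linearized labels in test_run/*.conll: a token such as "Chinese" above carries several nested labels joined by "|" (I-GPE|U-GPE). With --decoding=seq2seq, tagger.py predicts these labels one at a time and closes each token's label sequence with an end-of-word tag (passed to Network.construct as tag_eow; the "<eow>" spelling here follows that argument and is an assumption). The snippet below is a minimal standalone sketch of this correspondence with hypothetical helper names, mirroring the label-collecting loop in Network.predict; the actual conversion is implemented in morpho_dataset.py and may differ in details.

  # Sketch only: relate one CoNLL label column to the flat tag sequence a
  # seq2seq decoder would emit, and back again (Network.predict collects
  # predicted tags until the end-of-word tag and joins them with "|").
  EOW = "<eow>"  # assumed spelling of the end-of-word tag

  def labels_to_tag_sequence(label_column):
      # "I-GPE|U-GPE" -> ["I-GPE", "U-GPE", "<eow>"]; "O" -> ["O", "<eow>"]
      return label_column.split("|") + [EOW]

  def tag_sequence_to_labels(tags):
      # Consume predicted tags until the end-of-word tag, join with "|".
      labels = []
      for tag in tags:
          if tag == EOW:
              break
          labels.append(tag)
      return "|".join(labels)

  assert labels_to_tag_sequence("I-GPE|U-GPE") == ["I-GPE", "U-GPE", "<eow>"]
  assert tag_sequence_to_labels(["I-GPE", "U-GPE", "<eow>"]) == "I-GPE|U-GPE"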