├── .gitignore ├── README ├── bilou2bio.py ├── bio2bilou.py ├── compare_nested_entities.py ├── conll2eval_nested.py ├── conlleval ├── morpho_dataset.py ├── requirements.txt ├── run_conlleval.sh ├── run_eval_nested.sh ├── tagger.py └── test_run ├── run.sh ├── test.conll └── train.conll /.gitignore: -------------------------------------------------------------------------------- 1 | /__pycache__/ 2 | logs 3 | -------------------------------------------------------------------------------- /README: -------------------------------------------------------------------------------- 1 | Source code: Neural Architectures for Nested NER through Linearization 2 | ====================================================================== 3 | Jana Straková, Milan Straka and Jan Hajič 4 | https://aclweb.org/anthology/papers/P/P19/P19-1527/ 5 | {strakova,straka,hajic}@ufal.mff.cuni.cz 6 | 7 | License 8 | ------- 9 | 10 | Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 11 | Mathematics and Physics, Charles University, Czech Republic. 12 | 13 | This Source Code Form is subject to the terms of the Mozilla Public 14 | License, v. 2.0. If a copy of the MPL was not distributed with this 15 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 16 | 17 | Please cite as: 18 | --------------- 19 | 20 | @inproceedings{strakova-etal-2019-neural, 21 | title = {{Neural Architectures for Nested {NER} through Linearization}}, 22 | author = {Jana Strakov{\'a} and Milan Straka and Jan Haji\v{c}}, 23 | booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, 24 | month = jul, 25 | year = {2019}, 26 | address = {Florence, Italy}, 27 | publisher = {Association for Computational Linguistics}, 28 | url = {https://www.aclweb.org/anthology/P19-1527}, 29 | pages = {5326--5331}, 30 | } 31 | 32 | How to run the tagger 33 | --------------------- 34 | 35 | 1. Install requirements 36 | 37 | pip install -r requirements.txt 38 | 39 | 2. Download the data 40 | 41 | ACE-2004: https://catalog.ldc.upenn.edu/LDC2005T09 42 | ACE-2005: https://catalog.ldc.upenn.edu/LDC2006T06 43 | GENIA: http://www.geniaproject.org/ 44 | 45 | 3. Create inputs 46 | 47 | The input of the tagger is in the CoNLL-2003 BILOU format. CoNLL-2003 shared 48 | task data format is described here: 49 | https://www.clips.uantwerpen.be/conll2003/ner/ . BILOU format is described 50 | here (Ratinov and Roth, 2009): https://www.aclweb.org/anthology/W09-1119 . 51 | 52 | The input format is a CoNLL format, with one token per line, sentences 53 | delimited by empty line. For each token, columns are separated by tabs. First 54 | column is the surface token, second column is lemma, third column is a POS tag 55 | and fourth column is the BILOU encoded NE label. 56 | 57 | For flat corpora (e.g. CoNLL-2003 English and German), the fourth column bears 58 | exactly one NE label, e.g. (example from CoNLL-2003 English): 59 | 60 | -DOCSTART- -docstart- NN O 61 | 62 | EU EU NNP U-ORG 63 | rejects reject VBZ O 64 | German german JJ U-MISC 65 | call call NN O 66 | to to TO O 67 | boycott boycott VB O 68 | British british JJ U-MISC 69 | lamb lamb NN O 70 | . . . O 71 | 72 | For nested NE corpora, the NE tags are linearized (flattened) according to 73 | rules described in the paper, e.g. 
(example from ACE-2004): 74 | 75 | The the DT B-GPE 76 | Chinese chinese JJ I-GPE|U-GPE 77 | government government NN L-GPE 78 | and and CC O 79 | the the DT B-GPE 80 | Australian australian JJ I-GPE|U-GPE 81 | government government NN L-GPE 82 | signed sign VBD O 83 | an an DT O 84 | agreement agreement NN O 85 | today today NN O 86 | , , , O 87 | wherein wherein WRB O 88 | the the DT B-GPE 89 | Australian australian JJ I-GPE|U-GPE 90 | party party NN L-GPE 91 | would would MD O 92 | provide provide VB O 93 | China China NNP U-GPE 94 | with with IN O 95 | a a DT O 96 | preferential preferential JJ O 97 | financial financial JJ O 98 | loan loan NN O 99 | of of IN O 100 | 150 150 CD O 101 | million million CD O 102 | Australian australian JJ U-GPE 103 | dollars dollar NNS O 104 | . . . O 105 | 106 | The lemmatization and POS tagging can be done with e.g. UDPipe 107 | (http://ufal.mff.cuni.cz/udpipe) or with MorphoDiTa 108 | (http://ufal.mff.cuni.cz/morphodita) or with any tool of your choice. If you 109 | don't have any POS tagger or lemmatizer, simply fill the respective columns 110 | with dummy (e.g. "_"). 111 | 112 | 4. Get word embeddings 113 | 114 | - word2vec, 115 | - FastText, 116 | - BERT, 117 | - ELMo, 118 | - Flair 119 | 120 | from sources described in the paper. The input formats are: 121 | 122 | - word2vec: The native word2vec text file. 123 | - FastText: The native FastText binary. 124 | - contextualized embeddings (BERT, ELMo, Flair): A text file with one token per 125 | line, first column is the token, all other columns are the vector real valued 126 | numbers; columns separated with space. The format is readable for human eyes, 127 | but quite large, sorry for the inconvenience. The per-token BERT 128 | contextualized word embeddings are created as an average of all token 129 | corresponding BERT subowords. The ELMo and Flair are generated using this 130 | code: https://github.com/zalandoresearch/flair. 131 | 132 | You can also run the tagger without pretrained word embeddings just with 133 | end-to-end word embeddings and character-level embeddings (created inside the 134 | tagger), or with a subset of the above mentioned pretrained word embeddings. 135 | 136 | 5. Run the tagger 137 | 138 | Usage example: 139 | 140 | ./tagger.py --corpus=CoNLL_en --train_data=conll_en/train_dev_bilou.conll --test_data=conll_en/test_bilou.conll --decoding=seq2seq --epochs=10:1e-3,8:1e-4 --form_wes_model=word_embeddings/conll_en_form.txt --lemma_wes_model=word_embeddings/conll_en_lemma.txt --bert_embeddings_train=bert_embeddings/conll_en_train_dev_bert_large_embeddings.txt --bert_embeddings_test=bert_embeddings/conll_en_test_bert_large_embeddings.txt --flair_train=flair_embeddings/conll_en_train_dev.txt --flair_test=flair_embeddings/conll_en_test.txt --elmo_train=elmo_embeddings/conll_en_train_dev.txt --elmo_test=elmo_embeddings/conll_en_test.txt --name=seq2seq+ELMo+BERT+Flair 141 | -------------------------------------------------------------------------------- /bilou2bio.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2018 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 
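#
# Usage sketch (illustrative file names): a plain stdin/stdout filter over a
# four-column (FORM, LEMMA, POS, LABEL) CoNLL file, e.g.
#   ./bilou2bio.py < system_bilou.conll > system_bio.conll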
10 | 11 | """Converts CoNLL file from BILOU to BIO encoding.""" 12 | 13 | import sys 14 | 15 | 16 | if __name__ == "__main__": 17 | import argparse 18 | 19 | lines = [] 20 | for line in sys.stdin: 21 | line = line.rstrip("\r\n") 22 | if not line: 23 | print() 24 | else: 25 | form, lemma, tag, label = line.split("\t") 26 | if label.startswith("U-"): 27 | label = label.replace("U-", "B-") 28 | if label.startswith("L-"): 29 | label = label.replace("L-", "I-") 30 | print("\t".join([form, lemma, tag, label])) 31 | -------------------------------------------------------------------------------- /bio2bilou.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2018 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Converts CoNLL file from BIO to BILOU encoding.""" 12 | 13 | import sys 14 | 15 | def print_entity(lines): 16 | n = len(lines) 17 | if n > 0: 18 | if n == 1: 19 | lines[0][3] = lines[0][3].replace("I-","U-") 20 | lines[0][3] = lines[0][3].replace("B-","U-") 21 | else: 22 | lines[0][3] = lines[0][3].replace("I-", "B-") 23 | lines[n-1][3] = lines[n-1][3].replace("I-","L-") 24 | for i in range(n): 25 | print("\t".join(lines[i])) 26 | 27 | if __name__ == "__main__": 28 | import argparse 29 | 30 | lines = [] 31 | prev_label = "O" 32 | i = 0 33 | for line in sys.stdin: 34 | line = line.rstrip("\r\n") 35 | i += 1 36 | if not line: 37 | print_entity(lines) 38 | lines = [] 39 | prev_label = "O" 40 | print() 41 | else: 42 | if len(line.split("\t")) != 4: 43 | print("Incorrect line number " + str(i)) 44 | sys.exit(1) 45 | form, lemma, tag, label = line.split("\t") 46 | # no entity, entity may have ended on previous lines 47 | if label == "O": 48 | print_entity(lines) 49 | lines = [] 50 | print("\t".join([form, lemma, tag, label])) 51 | # new entity starts here, entity may have ended on previous lines 52 | elif label[-2:] != prev_label[-2:]: 53 | print_entity(lines) 54 | lines = [] 55 | lines.append([form, lemma, tag, label]) 56 | # other 57 | else: 58 | lines.append([form, lemma, tag, label]) 59 | prev_label = label 60 | -------------------------------------------------------------------------------- /compare_nested_entities.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Evaluates nested entity predictions. 12 | 13 | The predictions are supposed to be in the following format: 14 | 15 | One entity mention per line, two columns per line separated by table. First 16 | column are entity mention token ids separated by comma, second column is 17 | a BIO or BILOU label. Only classes are compared, the B-, I-, L- and U- 18 | prefixes are stripped. 
19 | """ 20 | 21 | 22 | import sys 23 | 24 | 25 | if __name__ == "__main__": 26 | 27 | with open(sys.argv[1], "r", encoding="utf-8") as fr: 28 | gold_entities = fr.readlines() 29 | for i in range(len(gold_entities)): 30 | gold_entities[i] = gold_entities[i].split("\t")[:2] 31 | 32 | with open(sys.argv[2], "r", encoding="utf-8") as fr: 33 | system_entities = fr.readlines() 34 | for i in range(len(system_entities)): 35 | system_entities[i] = system_entities[i].split("\t")[:2] 36 | 37 | correct_retrieved = 0 38 | for entity in system_entities: 39 | if entity in gold_entities: 40 | correct_retrieved += 1 41 | 42 | recall = correct_retrieved / len(gold_entities) if gold_entities else 0 43 | precision = correct_retrieved / len(system_entities) if system_entities else 0 44 | f1 = (2 * recall * precision) / (recall + precision) if recall+precision else 0 45 | 46 | print("Correct retrieved: {}".format(correct_retrieved)) 47 | print("Retrieved: {}".format(len(system_entities))) 48 | print("Gold: {}".format(len(gold_entities))) 49 | print("Recall: {:.2f}".format(recall*100)) 50 | print("Precision: {:.2f}".format(precision*100)) 51 | print("F1: {:.2f}".format(f1*100)) 52 | -------------------------------------------------------------------------------- /conll2eval_nested.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Creates an evaluation file with named entities. 12 | 13 | Input: CoNLL file with linearized (encoded) nested named entity labels 14 | delimited with |. 15 | 16 | Output: One entity mention per line, two columns per line separated by table. 17 | First column are entity mentino token ids separated by comma, second column is 18 | a BIO or BILOU label. 19 | 20 | The output can be then evaluated with compare_nested_entities.py. 
21 | """ 22 | 23 | import sys 24 | 25 | COL_SEP = "\t" 26 | 27 | def raw(label): 28 | return label[2:] 29 | 30 | def flush(running_ids, running_forms, running_labels): 31 | for i in range(len(running_ids)): 32 | print(running_ids[i] + COL_SEP + running_labels[i] + COL_SEP + running_forms[i]) 33 | return ([], [], []) 34 | 35 | if __name__ == "__main__": 36 | 37 | i = 0 38 | running_ids = [] 39 | running_forms = [] 40 | running_labels = [] 41 | for line in sys.stdin: 42 | line = line.rstrip("\r\n") 43 | if not line: # flush entities 44 | (running_ids, running_forms, running_labels) = flush(running_ids, running_forms, running_labels) 45 | else: 46 | form , _, _, ne = line.split("\t") 47 | if ne == "O": # flush entities 48 | (running_ids, running_forms, running_labels) = flush(running_ids, running_forms, running_labels) 49 | else: 50 | labels = ne.split("|") 51 | for j in range(len(labels)): # for each label 52 | label = labels[j] 53 | if j < len(running_ids): # running entity 54 | # previous running entity ends here, print and insert new entity instead 55 | if label.startswith("B-") or label.startswith("U-") or running_labels[j] != raw(label): 56 | print(running_ids[j] + COL_SEP + running_labels[j] + COL_SEP + running_forms[j]) 57 | running_ids[j] = str(i) 58 | running_forms[j] = form 59 | # entity continues, append ids and forms 60 | else: 61 | running_ids[j] += "," + str(i) 62 | running_forms[j] += " " + form 63 | running_labels[j] = raw(label) 64 | else: # no running entities, new entity starts here, just append 65 | running_ids.append(str(i)) 66 | running_forms.append(form) 67 | running_labels.append(raw(label)) 68 | 69 | i += 1 70 | -------------------------------------------------------------------------------- /conlleval: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | # conlleval: evaluate result of processing CoNLL-2000 shared task 3 | # usage: conlleval [-l] [-r] [-d delimiterTag] [-o oTag] < file 4 | # README: http://cnts.uia.ac.be/conll2000/chunking/output.html 5 | # options: l: generate LaTeX output for tables like in 6 | # http://cnts.uia.ac.be/conll2003/ner/example.tex 7 | # r: accept raw result tags (without B- and I- prefix; 8 | # assumes one word per chunk) 9 | # d: alternative delimiter tag (default is single space) 10 | # o: alternative outside tag (default is O) 11 | # note: the file should contain lines with items separated 12 | # by $delimiter characters (default space). The final 13 | # two items should contain the correct tag and the 14 | # guessed tag in that order. Sentences should be 15 | # separated from each other by empty lines or lines 16 | # with $boundary fields (default -X-). 17 | # url: http://lcg-www.uia.ac.be/conll2000/chunking/ 18 | # started: 1998-09-25 19 | # version: 2004-01-26 20 | # author: Erik Tjong Kim Sang 21 | 22 | use strict; 23 | 24 | my $false = 0; 25 | my $true = 42; 26 | 27 | my $boundary = "-X-"; # sentence boundary 28 | my $correct; # current corpus chunk tag (I,O,B) 29 | my $correctChunk = 0; # number of correctly identified chunks 30 | my $correctTags = 0; # number of correct chunk tags 31 | my $correctType; # type of current corpus chunk tag (NP,VP,etc.) 
32 | my $delimiter = " "; # field delimiter 33 | my $FB1 = 0.0; # FB1 score (Van Rijsbergen 1979) 34 | my $firstItem; # first feature (for sentence boundary checks) 35 | my $foundCorrect = 0; # number of chunks in corpus 36 | my $foundGuessed = 0; # number of identified chunks 37 | my $guessed; # current guessed chunk tag 38 | my $guessedType; # type of current guessed chunk tag 39 | my $i; # miscellaneous counter 40 | my $inCorrect = $false; # currently processed chunk is correct until now 41 | my $lastCorrect = "O"; # previous chunk tag in corpus 42 | my $latex = 0; # generate LaTeX formatted output 43 | my $lastCorrectType = ""; # type of previously identified chunk tag 44 | my $lastGuessed = "O"; # previously identified chunk tag 45 | my $lastGuessedType = ""; # type of previous chunk tag in corpus 46 | my $lastType; # temporary storage for detecting duplicates 47 | my $line; # line 48 | my $nbrOfFeatures = -1; # number of features per line 49 | my $precision = 0.0; # precision score 50 | my $oTag = "O"; # outside tag, default O 51 | my $raw = 0; # raw input: add B to every token 52 | my $recall = 0.0; # recall score 53 | my $tokenCounter = 0; # token counter (ignores sentence breaks) 54 | 55 | my %correctChunk = (); # number of correctly identified chunks per type 56 | my %foundCorrect = (); # number of chunks in corpus per type 57 | my %foundGuessed = (); # number of identified chunks per type 58 | 59 | my @features; # features on line 60 | my @sortedTypes; # sorted list of chunk type names 61 | 62 | # sanity check 63 | while (@ARGV and $ARGV[0] =~ /^-/) { 64 | if ($ARGV[0] eq "-l") { $latex = 1; shift(@ARGV); } 65 | elsif ($ARGV[0] eq "-r") { $raw = 1; shift(@ARGV); } 66 | elsif ($ARGV[0] eq "-d") { 67 | shift(@ARGV); 68 | if (not defined $ARGV[0]) { 69 | die "conlleval: -d requires delimiter character"; 70 | } 71 | $delimiter = shift(@ARGV); 72 | } elsif ($ARGV[0] eq "-o") { 73 | shift(@ARGV); 74 | if (not defined $ARGV[0]) { 75 | die "conlleval: -o requires delimiter character"; 76 | } 77 | $oTag = shift(@ARGV); 78 | } else { die "conlleval: unknown argument $ARGV[0]\n"; } 79 | } 80 | if (@ARGV) { die "conlleval: unexpected command line argument\n"; } 81 | # process input 82 | while () { 83 | chomp($line = $_); 84 | @features = split(/$delimiter/,$line); 85 | if ($nbrOfFeatures < 0) { $nbrOfFeatures = $#features; } 86 | elsif ($nbrOfFeatures != $#features and @features != 0) { 87 | printf STDERR "unexpected number of features: %d (%d)\n", 88 | $#features+1,$nbrOfFeatures+1; 89 | exit(1); 90 | } 91 | if (@features == 0 or 92 | $features[0] eq $boundary) { @features = ($boundary,"O","O"); } 93 | if (@features < 2) { 94 | die "conlleval: unexpected number of features in line $line\n"; 95 | } 96 | if ($raw) { 97 | if ($features[$#features] eq $oTag) { $features[$#features] = "O"; } 98 | if ($features[$#features-1] eq $oTag) { $features[$#features-1] = "O"; } 99 | if ($features[$#features] ne "O") { 100 | $features[$#features] = "B-$features[$#features]"; 101 | } 102 | if ($features[$#features-1] ne "O") { 103 | $features[$#features-1] = "B-$features[$#features-1]"; 104 | } 105 | } 106 | # 20040126 ET code which allows hyphens in the types 107 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) { 108 | $guessed = $1; 109 | $guessedType = $2; 110 | } else { 111 | $guessed = $features[$#features]; 112 | $guessedType = ""; 113 | } 114 | pop(@features); 115 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) { 116 | $correct = $1; 117 | $correctType = $2; 118 | } else { 119 | $correct = 
$features[$#features]; 120 | $correctType = ""; 121 | } 122 | pop(@features); 123 | # ($guessed,$guessedType) = split(/-/,pop(@features)); 124 | # ($correct,$correctType) = split(/-/,pop(@features)); 125 | $guessedType = $guessedType ? $guessedType : ""; 126 | $correctType = $correctType ? $correctType : ""; 127 | $firstItem = shift(@features); 128 | 129 | # 1999-06-26 sentence breaks should always be counted as out of chunk 130 | if ( $firstItem eq $boundary ) { $guessed = "O"; } 131 | 132 | if ($inCorrect) { 133 | if ( &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 134 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 135 | $lastGuessedType eq $lastCorrectType) { 136 | $inCorrect=$false; 137 | $correctChunk++; 138 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 139 | $correctChunk{$lastCorrectType}+1 : 1; 140 | } elsif ( 141 | &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) != 142 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) or 143 | $guessedType ne $correctType ) { 144 | $inCorrect=$false; 145 | } 146 | } 147 | 148 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 149 | &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 150 | $guessedType eq $correctType) { $inCorrect = $true; } 151 | 152 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) ) { 153 | $foundCorrect++; 154 | $foundCorrect{$correctType} = $foundCorrect{$correctType} ? 155 | $foundCorrect{$correctType}+1 : 1; 156 | } 157 | if ( &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) ) { 158 | $foundGuessed++; 159 | $foundGuessed{$guessedType} = $foundGuessed{$guessedType} ? 160 | $foundGuessed{$guessedType}+1 : 1; 161 | } 162 | if ( $firstItem ne $boundary ) { 163 | if ( $correct eq $guessed and $guessedType eq $correctType ) { 164 | $correctTags++; 165 | } 166 | $tokenCounter++; 167 | } 168 | 169 | $lastGuessed = $guessed; 170 | $lastCorrect = $correct; 171 | $lastGuessedType = $guessedType; 172 | $lastCorrectType = $correctType; 173 | } 174 | if ($inCorrect) { 175 | $correctChunk++; 176 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 177 | $correctChunk{$lastCorrectType}+1 : 1; 178 | } 179 | 180 | if (not $latex) { 181 | # compute overall precision, recall and FB1 (default values are 0.0) 182 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 183 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 184 | $FB1 = 2*$precision*$recall/($precision+$recall) 185 | if ($precision+$recall > 0); 186 | 187 | # print overall performance 188 | printf "processed $tokenCounter tokens with $foundCorrect phrases; "; 189 | printf "found: $foundGuessed phrases; correct: $correctChunk.\n"; 190 | if ($tokenCounter>0) { 191 | printf "accuracy: %6.2f%%; ",100*$correctTags/$tokenCounter; 192 | printf "precision: %6.2f%%; ",$precision; 193 | printf "recall: %6.2f%%; ",$recall; 194 | printf "FB1: %6.2f\n",$FB1; 195 | } 196 | } 197 | 198 | # sort chunk type names 199 | undef($lastType); 200 | @sortedTypes = (); 201 | foreach $i (sort (keys %foundCorrect,keys %foundGuessed)) { 202 | if (not($lastType) or $lastType ne $i) { 203 | push(@sortedTypes,($i)); 204 | } 205 | $lastType = $i; 206 | } 207 | # print performance per chunk type 208 | if (not $latex) { 209 | for $i (@sortedTypes) { 210 | $correctChunk{$i} = $correctChunk{$i} ? 
$correctChunk{$i} : 0; 211 | if (not($foundGuessed{$i})) { $foundGuessed{$i} = 0; $precision = 0.0; } 212 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 213 | if (not($foundCorrect{$i})) { $recall = 0.0; } 214 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 215 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 216 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 217 | printf "%17s: ",$i; 218 | printf "precision: %6.2f%%; ",$precision; 219 | printf "recall: %6.2f%%; ",$recall; 220 | printf "FB1: %6.2f %d\n",$FB1,$foundGuessed{$i}; 221 | } 222 | } else { 223 | print " & Precision & Recall & F\$_{\\beta=1} \\\\\\hline"; 224 | for $i (@sortedTypes) { 225 | $correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0; 226 | if (not($foundGuessed{$i})) { $precision = 0.0; } 227 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 228 | if (not($foundCorrect{$i})) { $recall = 0.0; } 229 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 230 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 231 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 232 | printf "\n%-7s & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\", 233 | $i,$precision,$recall,$FB1; 234 | } 235 | print "\\hline\n"; 236 | $precision = 0.0; 237 | $recall = 0; 238 | $FB1 = 0.0; 239 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 240 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 241 | $FB1 = 2*$precision*$recall/($precision+$recall) 242 | if ($precision+$recall > 0); 243 | printf "Overall & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\\\hline\n", 244 | $precision,$recall,$FB1; 245 | } 246 | 247 | exit 0; 248 | 249 | # endOfChunk: checks if a chunk ended between the previous and current word 250 | # arguments: previous and current chunk tags, previous and current types 251 | # note: this code is capable of handling other chunk representations 252 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 253 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 254 | 255 | sub endOfChunk { 256 | my $prevTag = shift(@_); 257 | my $tag = shift(@_); 258 | my $prevType = shift(@_); 259 | my $type = shift(@_); 260 | my $chunkEnd = $false; 261 | 262 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkEnd = $true; } 263 | if ( $prevTag eq "B" and $tag eq "O" ) { $chunkEnd = $true; } 264 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkEnd = $true; } 265 | if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; } 266 | 267 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkEnd = $true; } 268 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkEnd = $true; } 269 | if ( $prevTag eq "E" and $tag eq "O" ) { $chunkEnd = $true; } 270 | if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; } 271 | 272 | if ($prevTag ne "O" and $prevTag ne "." 
and $prevType ne $type) { 273 | $chunkEnd = $true; 274 | } 275 | 276 | # corrected 1998-12-22: these chunks are assumed to have length 1 277 | if ( $prevTag eq "]" ) { $chunkEnd = $true; } 278 | if ( $prevTag eq "[" ) { $chunkEnd = $true; } 279 | 280 | return($chunkEnd); 281 | } 282 | 283 | # startOfChunk: checks if a chunk started between the previous and current word 284 | # arguments: previous and current chunk tags, previous and current types 285 | # note: this code is capable of handling other chunk representations 286 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 287 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 288 | 289 | sub startOfChunk { 290 | my $prevTag = shift(@_); 291 | my $tag = shift(@_); 292 | my $prevType = shift(@_); 293 | my $type = shift(@_); 294 | my $chunkStart = $false; 295 | 296 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkStart = $true; } 297 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkStart = $true; } 298 | if ( $prevTag eq "O" and $tag eq "B" ) { $chunkStart = $true; } 299 | if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; } 300 | 301 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkStart = $true; } 302 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkStart = $true; } 303 | if ( $prevTag eq "O" and $tag eq "E" ) { $chunkStart = $true; } 304 | if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; } 305 | 306 | if ($tag ne "O" and $tag ne "." and $prevType ne $type) { 307 | $chunkStart = $true; 308 | } 309 | 310 | # corrected 1998-12-22: these chunks are assumed to have length 1 311 | if ( $tag eq "[" ) { $chunkStart = $true; } 312 | if ( $tag eq "]" ) { $chunkStart = $true; } 313 | 314 | return($chunkStart); 315 | } 316 | -------------------------------------------------------------------------------- /morpho_dataset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """MorphoDataset class to handle NE tagged data.""" 12 | 13 | import numpy as np 14 | 15 | 16 | class MorphoDataset: 17 | """Class capable of loading morphological datasets in vertical format. 18 | The dataset is assumed to be composed of factors (by default FORMS, LEMMAS, POS and TAGS), 19 | each an object containing the following fields: 20 | - strings: Strings of the original words. 21 | - word_ids: Word ids of the original words (uses and ). 22 | - words_map: String -> word_id map. 23 | - words: Word_id -> string list. 24 | - alphabet_map: Character -> char_id map. 25 | - alphabet: Char_id -> character list. 26 | - charseq_ids: Character_sequence ids of the original words. 27 | - charseqs_map: String -> character_sequence_id map. 28 | - charseqs: Character_sequence_id -> [characters], where character is an index 29 | to the dataset alphabet. 
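
The expected input is the vertical CoNLL format described in the README: one
token per line with tab-separated FORM, LEMMA, POS and TAG columns, e.g.
"EU<TAB>EU<TAB>NNP<TAB>U-ORG". With seq2seq=True the TAG column may hold
several "|"-delimited labels, which are split into separate tag tokens
followed by a special end-of-tags marker.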
30 | """ 31 | FORMS = 0 32 | LEMMAS = 1 33 | POS = 2 34 | TAGS = 3 35 | FACTORS = 4 36 | 37 | class _Factor: 38 | def __init__(self, train=None): 39 | self.words_map = train.words_map if train else {'': 0, '': 1, '': 2, '': 3} 40 | self.words = train.words if train else ['', '', '', ''] 41 | self.word_ids = [] 42 | self.alphabet_map = train.alphabet_map if train else {'': 0, '': 1, '': 2, '': 3} 43 | self.alphabet = train.alphabet if train else ['', '', '', ''] 44 | self.charseqs_map = {'': 0} 45 | self.charseqs = [[self.alphabet_map['']]] 46 | self.charseq_ids = [] 47 | self.strings = [] 48 | 49 | def __init__(self, filename, train=None, shuffle_batches=True, max_sentences=None, add_bow_eow=False, seq2seq=False, bert_embeddings_filename=None, flair_filename=None, elmo_filename=None): 50 | """Load dataset from file in vertical format. 51 | Arguments: 52 | add_bow_eow: Whether to add BOW/EOW characters to the word characters. 53 | seq2seq: Multiple labels may be predicted. 54 | train: If given, the words and alphabets are reused from the training data. 55 | """ 56 | 57 | # Create alphabet map 58 | self._alphabet_map = train._alphabet_map if train else {'': 0, '': 1, '': 2, '': 3} 59 | self._alphabet = train._alphabet if train else ['', '', '', ''] 60 | 61 | # Create word maps 62 | self._factors = [] 63 | for f in range(self.FACTORS): 64 | self._factors.append(self._Factor(train._factors[f] if train else None)) 65 | 66 | # Load the sentences 67 | with open(filename, "r", encoding="utf-8") as file: 68 | in_sentence = False 69 | for line in file: 70 | line = line.rstrip("\r\n") 71 | if line: 72 | columns = line.split("\t") 73 | for f in range(self.FACTORS): 74 | factor = self._factors[f] 75 | if not in_sentence: 76 | factor.word_ids.append([]) 77 | factor.charseq_ids.append([]) 78 | factor.strings.append([]) 79 | column = columns[f] if f < len(columns) else '' 80 | words = [] 81 | if f == self.TAGS and seq2seq: 82 | words = column.split("|") 83 | words.append("") 84 | else: 85 | words = [column] 86 | for word in words: 87 | factor.strings[-1].append(word) 88 | 89 | # Character-level information 90 | if word not in factor.charseqs_map: 91 | factor.charseqs_map[word] = len(factor.charseqs) 92 | factor.charseqs.append([]) 93 | if add_bow_eow: 94 | factor.charseqs[-1].append(factor.alphabet_map['']) 95 | for c in word: 96 | if c not in factor.alphabet_map: 97 | if train: 98 | c = '' 99 | else: 100 | factor.alphabet_map[c] = len(factor.alphabet) 101 | factor.alphabet.append(c) 102 | factor.charseqs[-1].append(factor.alphabet_map[c]) 103 | if add_bow_eow: 104 | factor.charseqs[-1].append(factor.alphabet_map['']) 105 | factor.charseq_ids[-1].append(factor.charseqs_map[word]) 106 | 107 | # Word-level information 108 | if word not in factor.words_map: 109 | if train: 110 | word = '' 111 | else: 112 | factor.words_map[word] = len(factor.words) 113 | factor.words.append(word) 114 | factor.word_ids[-1].append(factor.words_map[word]) 115 | in_sentence = True 116 | else: 117 | in_sentence = False 118 | if max_sentences is not None and len(self._factors[self.FORMS].word_ids) >= max_sentences: 119 | break 120 | 121 | # Compute sentence lengths 122 | sentences = len(self._factors[self.FORMS].word_ids) 123 | self._sentence_lens = np.zeros([sentences], np.int32) 124 | for i in range(len(self._factors[self.FORMS].word_ids)): 125 | self._sentence_lens[i] = len(self._factors[self.FORMS].word_ids[i]) 126 | 127 | # Compute tag lengths 128 | tags = len(self._factors[self.TAGS].word_ids) 129 | self._tag_lens = 
np.zeros([tags], np.int32) 130 | for i in range(len(self._factors[self.TAGS].word_ids)): 131 | self._tag_lens[i] = len(self._factors[self.TAGS].word_ids[i]) 132 | 133 | self._shuffle_batches = shuffle_batches 134 | self._permutation = np.random.permutation(len(self._sentence_lens)) if self._shuffle_batches else np.arange(len(self._sentence_lens)) 135 | 136 | # Load pretrained BERT embeddings 137 | self._bert_embeddings = [] # [sentences x words x bert_embeddings] 138 | if bert_embeddings_filename: 139 | with open(bert_embeddings_filename, "r", encoding="utf-8") as file: 140 | in_sentence = False 141 | for line in file: 142 | line = line.rstrip("\r\n") 143 | if line: 144 | if not in_sentence: 145 | self._bert_embeddings.append([]) 146 | self._bert_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 147 | in_sentence = True 148 | else: 149 | self._bert_embeddings[-1] = np.array(self._bert_embeddings[-1], dtype=np.float32) 150 | in_sentence = False 151 | 152 | # Load pretrained flair embeddings 153 | self._flair_embeddings = [] # [sentences x words x flair_embeddings] 154 | if flair_filename: 155 | with open(flair_filename, "r", encoding="utf-8") as file: 156 | in_sentence = False 157 | for line in file: 158 | line = line.rstrip("\r\n") 159 | if line: 160 | if not in_sentence: 161 | self._flair_embeddings.append([]) 162 | self._flair_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 163 | in_sentence = True 164 | else: 165 | self._flair_embeddings[-1] = np.array(self._flair_embeddings[-1], dtype=np.float32) 166 | in_sentence = False 167 | 168 | # Load pretrained elmo embeddings 169 | self._elmo_embeddings = [] # [sentences x words x elmo_embeddings] 170 | if elmo_filename: 171 | with open(elmo_filename, "r", encoding="utf-8") as file: 172 | in_sentence = False 173 | for line in file: 174 | line = line.rstrip("\r\n") 175 | if line: 176 | if not in_sentence: 177 | self._elmo_embeddings.append([]) 178 | self._elmo_embeddings[-1].append(list(map(float, line.split(" ")[1:]))) 179 | in_sentence = True 180 | else: 181 | self._elmo_embeddings[-1] = np.array(self._elmo_embeddings[-1], dtype=np.float32) 182 | in_sentence = False 183 | 184 | 185 | @property 186 | def bert_embeddings(self): 187 | return self._bert_embeddings 188 | 189 | @property 190 | def flair_embeddings(self): 191 | return self._flair_embeddings 192 | 193 | @property 194 | def elmo_embeddings(self): 195 | return self._elmo_embeddings 196 | 197 | @property 198 | def sentence_lens(self): 199 | return self._sentence_lens 200 | 201 | @property 202 | def tag_lens(self): 203 | return self._tag_lens 204 | 205 | @property 206 | def factors(self): 207 | """Return the factors of the dataset. 208 | The result is an array of factors, each an object containing: 209 | strings: Strings of the original words. 210 | word_ids: Word ids of the original words (uses and ). 211 | words_map: String -> word_id map. 212 | words: Word_id -> string list. 213 | alphabet_map: Character -> char_id map. 214 | alphabet: Char_id -> character list. 215 | charseq_ids: Character_sequence ids of the original words. 216 | charseqs_map: String -> character_sequence_id map. 217 | charseqs: Character_sequence_id -> [characters], where character is an index 218 | to the dataset alphabet. 219 | """ 220 | 221 | return self._factors 222 | 223 | def next_batch(self, batch_size, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs=False, seq2seq=False): 224 | """Return the next batch. 
225 | Arguments: 226 | including_charseqs: if True, also batch_charseq_ids, batch_charseqs and batch_charseq_lens are returned 227 | Returns: 228 | {sentence_lens, batch_word_ids, batch_charseq_ids, batch_charseqs, batch_pretrained_wes} 229 | sequence_lens: batch of sentence_lens 230 | batch_word_ids: for each factor, batch of words_id 231 | batch_charseq_ids: For each factor, batch of charseq_ids 232 | (the same shape as words_id, but with the ids pointing into batch_charseqs). 233 | Returned only if including_charseqs is True. 234 | batch_charseqs: For each factor, all unique charseqs in the batch, 235 | indexable by batch_charseq_ids. Contains indices of characters from self.alphabet. 236 | Returned only if including_charseqs is True. 237 | batch_charseq_lens: For each factor, length of charseqs in batch_charseqs. 238 | Returned only if including_charseqs is True. 239 | batch_pretrained_form_wes: For each FORM factor, batch of pretrained word embeddings. 240 | Returned only if form_wes_model != None. 241 | batch_pretrained_lemma_wes: For each LEMMA factor, batch of pretrained word embeddings. 242 | Returned onlyu if lemma_wes_model != None. 243 | batch_bert_embeddings: For each FORM factor, batch of pretrained BERT embeddings. 244 | Returned only if bert_embeddings_filename != None during initialiation. 245 | batch_flair_embeddings: For each FORM factor, batch of pretrained Flair embeddings. 246 | Returned only if flair_filename != None during initialiation. 247 | batch_elmo_embeddings: For each FORM factor, batch of pretrained ELMo embeddings. 248 | Returned only if elmo_filename != None during initialiation. 249 | """ 250 | 251 | batch_size = min(batch_size, len(self._permutation)) 252 | batch_perm = self._permutation[:batch_size] 253 | self._permutation = self._permutation[batch_size:] 254 | return self._next_batch(batch_perm, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs, seq2seq) 255 | 256 | def epoch_finished(self): 257 | if len(self._permutation) == 0: 258 | self._permutation = np.random.permutation(len(self._sentence_lens)) if self._shuffle_batches else np.arange(len(self._sentence_lens)) 259 | return True 260 | return False 261 | 262 | def bert_embeddings_dim(self): 263 | if self._bert_embeddings: 264 | return self._bert_embeddings[0].shape[1] 265 | else: 266 | return 0 267 | 268 | def flair_embeddings_dim(self): 269 | if self._flair_embeddings: 270 | return self._flair_embeddings[0].shape[1] 271 | else: 272 | return 0 273 | 274 | def elmo_embeddings_dim(self): 275 | if self._elmo_embeddings: 276 | return self._elmo_embeddings[0].shape[1] 277 | else: 278 | return 0 279 | 280 | def _next_batch(self, batch_perm, form_wes_model, lemma_wes_model, fasttext_model, including_charseqs, seq2seq=False): 281 | batch_size = len(batch_perm) 282 | batch_dict = dict() 283 | 284 | # General data 285 | batch_sentence_lens = self._sentence_lens[batch_perm] 286 | max_sentence_len = np.max(batch_sentence_lens) 287 | 288 | if seq2seq: 289 | batch_tag_lens = self._tag_lens[batch_perm] 290 | max_tag_len = np.max(batch_tag_lens) 291 | 292 | # Word-level data 293 | batch_word_ids = [] 294 | batch_word_wes = [] 295 | for f in range(self.FACTORS): 296 | factor = self._factors[f] 297 | if f == self.TAGS and seq2seq: 298 | batch_word_ids.append(np.zeros([batch_size, max_tag_len], np.int32)) 299 | for i in range(batch_size): 300 | batch_word_ids[-1][i, 0:batch_tag_lens[i]] = factor.word_ids[batch_perm[i]] 301 | else: 302 | batch_word_ids.append(np.zeros([batch_size, max_sentence_len], 
np.int32)) 303 | for i in range(batch_size): 304 | batch_word_ids[-1][i, 0:batch_sentence_lens[i]] = factor.word_ids[batch_perm[i]] 305 | 306 | batch_dict["sentence_lens"] = self._sentence_lens[batch_perm] 307 | batch_dict["word_ids"] = batch_word_ids 308 | 309 | # Character-level data 310 | if including_charseqs: 311 | batch_charseq_ids, batch_charseqs, batch_charseq_lens = [], [], [] 312 | 313 | for f in range(self.FACTORS): 314 | if not (f == self.TAGS and seq2seq): 315 | factor = self._factors[f] 316 | batch_charseq_ids.append(np.zeros([batch_size, max_sentence_len], np.int32)) 317 | charseqs_map = {} 318 | charseqs = [] 319 | charseq_lens = [] 320 | for i in range(batch_size): 321 | for j, charseq_id in enumerate(factor.charseq_ids[batch_perm[i]]): 322 | if charseq_id not in charseqs_map: 323 | charseqs_map[charseq_id] = len(charseqs) 324 | charseqs.append(factor.charseqs[charseq_id]) 325 | batch_charseq_ids[-1][i, j] = charseqs_map[charseq_id] 326 | 327 | batch_charseq_lens.append(np.array([len(charseq) for charseq in charseqs], np.int32)) 328 | batch_charseqs.append(np.zeros([len(charseqs), np.max(batch_charseq_lens[-1])], np.int32)) 329 | for i in range(len(charseqs)): 330 | batch_charseqs[-1][i, 0:len(charseqs[i])] = charseqs[i] 331 | batch_dict["batch_charseq_ids"] = batch_charseq_ids 332 | batch_dict["batch_charseqs"] = batch_charseqs 333 | batch_dict["batch_charseq_lens"] = batch_charseq_lens 334 | 335 | # Pretrained word embeddings for forms 336 | if form_wes_model: 337 | we_size = form_wes_model.vectors.shape[1] # get pretrained WEs dimension 338 | pretrained_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 339 | for i in range(batch_size): 340 | for j, word in enumerate(self._factors[self.FORMS].strings[batch_perm[i]]): 341 | if word in form_wes_model: 342 | pretrained_wes[i, j] = form_wes_model[word] 343 | elif word.lower() in form_wes_model: 344 | pretrained_wes[i, j] = form_wes_model[word.lower()] 345 | batch_dict["batch_form_pretrained_wes"] = pretrained_wes 346 | 347 | # Fasttext word embeddings for forms 348 | if fasttext_model: 349 | we_size = fasttext_model.get_dimension() # get pretrained WEs dimension 350 | fasttext_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 351 | for i in range(batch_size): 352 | for j, word in enumerate(self._factors[self.FORMS].strings[batch_perm[i]]): 353 | fasttext_wes[i, j] = fasttext_model.get_word_vector(word) 354 | batch_dict["batch_form_fasttext_wes"] = fasttext_wes 355 | 356 | # Pretrained BERT embeddings for forms 357 | if self._bert_embeddings: 358 | we_size = self.bert_embeddings_dim() 359 | batch_bert_embeddings = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 360 | for i in range(batch_size): 361 | batch_bert_embeddings[i, :self._bert_embeddings[batch_perm[i]].shape[0]] = self._bert_embeddings[batch_perm[i]] 362 | batch_dict["batch_bert_wes"] = batch_bert_embeddings 363 | 364 | # Pretrained flair embeddings for forms 365 | if self._flair_embeddings: 366 | we_size = self.flair_embeddings_dim() 367 | batch_flair_embeddings = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 368 | for i in range(batch_size): 369 | batch_flair_embeddings[i, :self._flair_embeddings[batch_perm[i]].shape[0]] = self._flair_embeddings[batch_perm[i]] 370 | batch_dict["batch_flair_wes"] = batch_flair_embeddings 371 | 372 | # Pretrained elmo embeddings for forms 373 | if self._elmo_embeddings: 374 | we_size = self.elmo_embeddings_dim() 375 | batch_elmo_embeddings = np.zeros([batch_size, 
max_sentence_len, we_size], np.float32) 376 | for i in range(batch_size): 377 | batch_elmo_embeddings[i, :self._elmo_embeddings[batch_perm[i]].shape[0]] = self._elmo_embeddings[batch_perm[i]] 378 | batch_dict["batch_elmo_wes"] = batch_elmo_embeddings 379 | 380 | # Pretrained word embeddings for lemmas 381 | if lemma_wes_model: 382 | we_size = lemma_wes_model.vectors.shape[1] # get pretrained WEs dimension 383 | pretrained_wes = np.zeros([batch_size, max_sentence_len, we_size], np.float32) 384 | for i in range(batch_size): 385 | for j, word in enumerate(self._factors[self.LEMMAS].strings[batch_perm[i]]): 386 | if word in lemma_wes_model: 387 | pretrained_wes[i, j] = lemma_wes_model[word] 388 | batch_dict["batch_lemma_pretrained_wes"] = pretrained_wes 389 | 390 | return batch_dict 391 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | fastText 2 | tensorflow<2.0 3 | git+https://github.com/danielfrg/word2vec 4 | -------------------------------------------------------------------------------- /run_conlleval.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 4 | # Mathematics and Physics, Charles University, Czech Republic. 5 | # 6 | # This Source Code Form is subject to the terms of the Mozilla Public 7 | # License, v. 2.0. If a copy of the MPL was not distributed with this 8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 9 | 10 | # This script evaluates the TensorFlow output, both during training and 11 | # prediction phase, for flat corpora (CoNLL-2003 and CoNLL-2002), using the 12 | # official distributed evaluation script conlleval. 13 | 14 | set -e 15 | 16 | name="$1" 17 | gold="$2" 18 | system="$3" 19 | 20 | if [ $name == "dev" ]; then 21 | $(dirname $0)/bilou2bio.py < ${system} > ${name}_system_bio.conll 22 | $(dirname $0)/bilou2bio.py < $(dirname $0)/${gold} > ${name}_gold_bio.conll 23 | paste ${name}_gold_bio.conll ${name}_system_bio.conll | cut -f1,2,3,4,8 > ${name}_conlleval_input.conll 24 | elif [ $name == "test" ]; then 25 | $(dirname $0)/bilou2bio.py < ${system} > ${name}_system_bio.conll 26 | paste $(dirname $0)/${gold} ${name}_system_bio.conll | cut -f1,2,3,4,8 > ${name}_conlleval_input.conll 27 | else 28 | echo "./run_conlleval.sh: Unknown file name \"$name\"." 29 | exit 1 30 | fi 31 | 32 | $(dirname $0)/conlleval -d "\t" < ${name}_conlleval_input.conll > $name.eval 33 | -------------------------------------------------------------------------------- /run_eval_nested.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 4 | # Mathematics and Physics, Charles University, Czech Republic. 5 | # 6 | # This Source Code Form is subject to the terms of the Mozilla Public 7 | # License, v. 2.0. If a copy of the MPL was not distributed with this 8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 9 | 10 | # This script evaluates the TensorFlow output, both during training and 11 | # prediction phase, for nested corpora, using the evaluation script 12 | # compare_nested_entities.py. 
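#
# Usage sketch (arguments are illustrative):
#   ./run_eval_nested.sh test path/to/gold_dir
# expects test_system_predictions.conll in the current directory, converts it
# to entity mentions with conll2eval_nested.py, compares them with
# compare_nested_entities.py against test_gold_entities.txt found under
# path/to/gold_dir (resolved relative to this script's location), and writes
# the result to test.eval.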
13 | 14 | set -e 15 | 16 | name="$1" 17 | gold_dir="$2" 18 | 19 | cat ${name}_system_predictions.conll | $(dirname $0)/conll2eval_nested.py > ${name}_system_entities.txt 20 | $(dirname $0)/compare_nested_entities.py $(dirname $0)/${gold_dir}/${name}_gold_entities.txt ${name}_system_entities.txt > ${name}.eval 21 | -------------------------------------------------------------------------------- /tagger.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # coding=utf-8 3 | # 4 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of 5 | # Mathematics and Physics, Charles University, Czech Republic. 6 | # 7 | # This Source Code Form is subject to the terms of the Mozilla Public 8 | # License, v. 2.0. If a copy of the MPL was not distributed with this 9 | # file, You can obtain one at http://mozilla.org/MPL/2.0/. 10 | 11 | """Nested NER training and evaluation in TensorFlow.""" 12 | 13 | import json 14 | import os 15 | import sys 16 | 17 | import fasttext 18 | import numpy as np 19 | import tensorflow as tf 20 | import word2vec 21 | 22 | import morpho_dataset 23 | 24 | 25 | class Network: 26 | def __init__(self, threads, seed=42): 27 | # Create an empty graph and a session 28 | graph = tf.Graph() 29 | graph.seed = seed 30 | self.session = tf.Session(graph = graph, config=tf.ConfigProto(inter_op_parallelism_threads=threads, 31 | intra_op_parallelism_threads=threads)) 32 | 33 | def construct(self, args, num_forms, num_form_chars, num_lemmas, num_lemma_chars, num_pos, 34 | pretrained_form_we_dim, pretrained_lemma_we_dim, pretrained_fasttext_dim, 35 | num_tags, tag_bos, tag_eow, pretrained_bert_dim, pretrained_flair_dim, pretrained_elmo_dim, 36 | predict_only): 37 | with self.session.graph.as_default(): 38 | 39 | # Inputs 40 | self.sentence_lens = tf.placeholder(tf.int32, [None], name="sentence_lens") 41 | self.form_ids = tf.placeholder(tf.int32, [None, None], name="form_ids") 42 | self.lemma_ids = tf.placeholder(tf.int32, [None, None], name="lemma_ids") 43 | self.pos_ids = tf.placeholder(tf.int32, [None, None], name="pos_ids") 44 | self.pretrained_form_wes = tf.placeholder(tf.float32, [None, None, pretrained_form_we_dim], name="pretrained_form_wes") 45 | self.pretrained_lemma_wes = tf.placeholder(tf.float32, [None, None, pretrained_lemma_we_dim], name="pretrained_lemma_wes") 46 | self.pretrained_fasttext_wes = tf.placeholder(tf.float32, [None, None, pretrained_fasttext_dim], name="fasttext_wes") 47 | self.pretrained_bert_wes = tf.placeholder(tf.float32, [None, None, pretrained_bert_dim], name="bert_wes") 48 | self.pretrained_flair_wes = tf.placeholder(tf.float32, [None, None, pretrained_flair_dim], name="flair_wes") 49 | self.pretrained_elmo_wes = tf.placeholder(tf.float32, [None, None, pretrained_elmo_dim], name="elmo_wes") 50 | self.tags = tf.placeholder(tf.int32, [None, None], name="tags") 51 | self.is_training = tf.placeholder(tf.bool, []) 52 | self.learning_rate = tf.placeholder(tf.float32, []) 53 | 54 | if args.including_charseqs: 55 | self.form_charseqs = tf.placeholder(tf.int32, [None, None], name="form_charseqs") 56 | self.form_charseq_lens = tf.placeholder(tf.int32, [None], name="form_charseq_lens") 57 | self.form_charseq_ids = tf.placeholder(tf.int32, [None,None], name="form_charseq_ids") 58 | 59 | self.lemma_charseqs = tf.placeholder(tf.int32, [None, None], name="lemma_charseqs") 60 | self.lemma_charseq_lens = tf.placeholder(tf.int32, [None], name="lemma_charseq_lens") 61 | self.lemma_charseq_ids = 
tf.placeholder(tf.int32, [None,None], name="lemma_charseq_ids") 62 | 63 | # RNN Cell 64 | if args.rnn_cell == "LSTM": 65 | rnn_cell = tf.nn.rnn_cell.BasicLSTMCell 66 | elif args.rnn_cell == "GRU": 67 | rnn_cell = tf.nn.rnn_cell.GRUCell 68 | else: 69 | raise ValueError("Unknown rnn_cell {}".format(args.rnn_cell)) 70 | 71 | inputs = [] 72 | 73 | # Trainable embeddings for forms 74 | form_embeddings = tf.get_variable("form_embeddings", shape=[num_forms, args.we_dim], dtype=tf.float32) 75 | inputs.append(tf.nn.embedding_lookup(form_embeddings, self.form_ids)) 76 | 77 | # Trainable embeddings for lemmas 78 | lemma_embeddings = tf.get_variable("lemma_embeddings", shape=[num_lemmas, args.we_dim], dtype=tf.float32) 79 | inputs.append(tf.nn.embedding_lookup(lemma_embeddings, self.lemma_ids)) 80 | 81 | # POS encoded as one-hot vectors 82 | inputs.append(tf.one_hot(self.pos_ids, num_pos)) 83 | 84 | # Pretrained embeddings for forms 85 | if args.form_wes_model: 86 | inputs.append(self.pretrained_form_wes) 87 | 88 | # Pretrained embeddings for lemmas 89 | if args.lemma_wes_model: 90 | inputs.append(self.pretrained_lemma_wes) 91 | 92 | # Fasttext form embeddings 93 | if args.fasttext_model: 94 | inputs.append(self.pretrained_fasttext_wes) 95 | 96 | # BERT form embeddings 97 | if pretrained_bert_dim: 98 | inputs.append(self.pretrained_bert_wes) 99 | 100 | # Flair form embeddings 101 | if pretrained_flair_dim: 102 | inputs.append(self.pretrained_flair_wes) 103 | 104 | # ELMo form embeddings 105 | if pretrained_elmo_dim: 106 | inputs.append(self.pretrained_elmo_wes) 107 | 108 | # Character-level form embeddings 109 | if args.including_charseqs: 110 | 111 | # Generate character embeddings for num_form_chars of dimensionality args.cle_dim. 112 | character_embeddings = tf.get_variable("form_character_embeddings", 113 | shape=[num_form_chars, args.cle_dim], 114 | dtype=tf.float32) 115 | 116 | # Embed self.form_charseqs (list of unique form in the batch) using the character embeddings. 117 | characters_embedded = tf.nn.embedding_lookup(character_embeddings, self.form_charseqs) 118 | 119 | # Use tf.nn.bidirectional.rnn to process embedded self.form_charseqs 120 | # using a GRU cell of dimensionality args.cle_dim. 121 | _, (state_fwd, state_bwd) = tf.nn.bidirectional_dynamic_rnn( 122 | tf.nn.rnn_cell.GRUCell(args.cle_dim), tf.nn.rnn_cell.GRUCell(args.cle_dim), 123 | characters_embedded, sequence_length=self.form_charseq_lens, dtype=tf.float32, scope="form_cle") 124 | 125 | # Sum the resulting fwd and bwd state to generate character-level form embedding (CLE) 126 | # of unique forms in the batch. 127 | cle = tf.concat([state_fwd, state_bwd], axis=1) 128 | 129 | # Generate CLEs of all form in the batch by indexing the just computed embeddings 130 | # by self.form_charseq_ids (using tf.nn.embedding_lookup). 131 | cle_embedded = tf.nn.embedding_lookup(cle, self.form_charseq_ids) 132 | 133 | # Concatenate the form embeddings (computed above in inputs) and the CLE (in this order). 
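# (In this implementation the CLE vectors are appended to the `inputs` list
# here; the actual concatenation of all input embeddings happens below via
# tf.concat(inputs, axis=2).)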
134 | inputs.append(cle_embedded) 135 | 136 | # Character-level lemma embeddings 137 | if args.including_charseqs: 138 | 139 | character_embeddings = tf.get_variable("lemma_character_embeddings", 140 | shape=[num_lemma_chars, args.cle_dim], 141 | dtype=tf.float32) 142 | characters_embedded = tf.nn.embedding_lookup(character_embeddings, self.lemma_charseqs) 143 | _, (state_fwd, state_bwd) = tf.nn.bidirectional_dynamic_rnn( 144 | tf.nn.rnn_cell.GRUCell(args.cle_dim), tf.nn.rnn_cell.GRUCell(args.cle_dim), 145 | characters_embedded, sequence_length=self.lemma_charseq_lens, dtype=tf.float32, scope="lemma_cle") 146 | cle = tf.concat([state_fwd, state_bwd], axis=1) 147 | cle_embedded = tf.nn.embedding_lookup(cle, self.lemma_charseq_ids) 148 | inputs.append(cle_embedded) 149 | 150 | # Concatenate inputs 151 | inputs = tf.concat(inputs, axis=2) 152 | 153 | # Dropout 154 | inputs_dropout = tf.layers.dropout(inputs, rate=args.dropout, training=self.is_training) 155 | 156 | # Computation 157 | hidden_layer_dropout = inputs_dropout # first layer is input 158 | for i in range(args.rnn_layers): 159 | (hidden_layer_fwd, hidden_layer_bwd), _ = tf.nn.bidirectional_dynamic_rnn( 160 | rnn_cell(args.rnn_cell_dim), rnn_cell(args.rnn_cell_dim), 161 | hidden_layer_dropout, sequence_length=self.sentence_lens, dtype=tf.float32, 162 | scope="RNN-{}".format(i)) 163 | hidden_layer = tf.concat([hidden_layer_fwd, hidden_layer_bwd], axis=2) 164 | if i == 0: hidden_layer_dropout = 0 165 | hidden_layer_dropout += tf.layers.dropout(hidden_layer, rate=args.dropout, training=self.is_training) 166 | 167 | # Decoders 168 | if args.decoding == "CRF": # conditional random fields 169 | output_layer = tf.layers.dense(hidden_layer_dropout, num_tags) 170 | weights = tf.sequence_mask(self.sentence_lens, dtype=tf.float32) 171 | log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood( 172 | output_layer, self.tags, self.sentence_lens) 173 | loss = tf.reduce_mean(-log_likelihood) 174 | self.predictions, viterbi_score = tf.contrib.crf.crf_decode( 175 | output_layer, transition_params, self.sentence_lens) 176 | self.predictions_training = self.predictions 177 | elif args.decoding == "ME": # vanilla maximum entropy 178 | output_layer = tf.layers.dense(hidden_layer_dropout, num_tags) 179 | weights = tf.sequence_mask(self.sentence_lens, dtype=tf.float32) 180 | if args.label_smoothing: 181 | gold_labels = tf.one_hot(self.tags, num_tags) * (1 - args.label_smoothing) + args.label_smoothing / num_tags 182 | loss = tf.losses.softmax_cross_entropy(gold_labels, output_layer, weights=weights) 183 | else: 184 | loss = tf.losses.sparse_softmax_cross_entropy(self.tags, output_layer, weights=weights) 185 | self.predictions = tf.argmax(output_layer, axis=2) 186 | self.predictions_training = self.predictions 187 | elif args.decoding in ["LSTM", "seq2seq"]: # Decoder 188 | # Generate target embeddings for target chars, of shape [target_chars, args.char_dim]. 189 | tag_embeddings = tf.get_variable("tag_embeddings", shape=[num_tags, args.we_dim], dtype=tf.float32) 190 | 191 | # Embed the target_seqs using the target embeddings. 192 | tags_embedded = tf.nn.embedding_lookup(tag_embeddings, self.tags) 193 | 194 | decoder_rnn_cell = rnn_cell(args.rnn_cell_dim) 195 | 196 | # Create a `decoder_layer` -- a fully connected layer with 197 | # target_chars neurons used in the decoder to classify into target characters. 
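# (Here the decoder targets are NE tags rather than characters, hence the
# layer has num_tags output units.)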
198 | decoder_layer = tf.layers.Dense(num_tags) 199 | 200 | sentence_lens = self.sentence_lens 201 | max_sentence_len = tf.reduce_max(sentence_lens) 202 | tags = self.tags 203 | # The DecoderTraining will be used during training. It will output logits for each 204 | # target character. 205 | class DecoderTraining(tf.contrib.seq2seq.Decoder): 206 | @property 207 | def batch_size(self): return tf.shape(hidden_layer_dropout)[0] 208 | @property 209 | def output_dtype(self): return tf.float32 # Type for logits of target characters 210 | @property 211 | def output_size(self): return num_tags # Length of logits for every output 212 | @property 213 | def tag_eow(self): return tag_eow 214 | 215 | def initialize(self, name=None): 216 | states = decoder_rnn_cell.zero_state(self.batch_size, tf.float32) 217 | inputs = [tf.nn.embedding_lookup(tag_embeddings, tf.fill([self.batch_size], tag_bos)), hidden_layer_dropout[:,0]] 218 | inputs = tf.concat(inputs, axis=1) 219 | if args.decoding == "seq2seq": 220 | predicted_eows = tf.zeros([self.batch_size], dtype=tf.int32) 221 | inputs = (inputs, predicted_eows) 222 | finished = sentence_lens <= 0 223 | return finished, inputs, states 224 | 225 | def step(self, time, inputs, states, name=None): 226 | if args.decoding == "seq2seq": 227 | inputs, predicted_eows = inputs 228 | outputs, states = decoder_rnn_cell(inputs, states) 229 | outputs = decoder_layer(outputs) 230 | next_input = [tf.nn.embedding_lookup(tag_embeddings, tags[:,time])] 231 | if args.decoding == "seq2seq": 232 | predicted_eows += tf.to_int32(tf.equal(tags[:, time], self.tag_eow)) 233 | indices = tf.where(tf.one_hot(tf.minimum(predicted_eows, max_sentence_len - 1), tf.reduce_max(predicted_eows) + 1)) 234 | next_input.append(tf.gather_nd(hidden_layer_dropout, indices)) 235 | else: 236 | next_input.append(hidden_layer_dropout[:,tf.minimum(time + 1, max_sentence_len - 1)]) 237 | next_input = tf.concat(next_input, axis=1) 238 | if args.decoding == "seq2seq": 239 | next_input = (next_input, predicted_eows) 240 | finished = sentence_lens <= predicted_eows 241 | else: 242 | finished = sentence_lens <= time + 1 243 | return outputs, states, next_input, finished 244 | output_layer, _, prediction_training_lens = tf.contrib.seq2seq.dynamic_decode(DecoderTraining()) 245 | self.predictions_training = tf.argmax(output_layer, axis=2, output_type=tf.int32) 246 | weights = tf.sequence_mask(prediction_training_lens, dtype=tf.float32) 247 | if args.label_smoothing: 248 | gold_labels = tf.one_hot(self.tags, num_tags) * (1 - args.label_smoothing) + args.label_smoothing / num_tags 249 | loss = tf.losses.softmax_cross_entropy(gold_labels, output_layer, weights=weights) 250 | else: 251 | loss = tf.losses.sparse_softmax_cross_entropy(self.tags, output_layer, weights=weights) 252 | 253 | # The DecoderPrediction will be used during prediction. It will 254 | # directly output the predicted target characters. 
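# (i.e. the predicted tag ids; with --decoding=seq2seq, decoding of a sentence
# finishes once an end-of-word tag has been produced for every token, and is
# additionally capped by the maximum_iterations limit below.)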
255 |                 class DecoderPrediction(tf.contrib.seq2seq.Decoder):
256 |                     @property
257 |                     def batch_size(self): return tf.shape(hidden_layer_dropout)[0]
258 |                     @property
259 |                     def output_dtype(self): return tf.int32 # Type for predicted tags
260 |                     @property
261 |                     def output_size(self): return 1 # Will return just one output
262 |                     @property
263 |                     def tag_eow(self): return tag_eow
264 | 
265 |                     def initialize(self, name=None):
266 |                         states = decoder_rnn_cell.zero_state(self.batch_size, tf.float32)
267 |                         inputs = [tf.nn.embedding_lookup(tag_embeddings, tf.fill([self.batch_size], tag_bos)), hidden_layer_dropout[:,0]]
268 |                         inputs = tf.concat(inputs, axis=1)
269 |                         if args.decoding == "seq2seq":
270 |                             predicted_eows = tf.zeros([self.batch_size], dtype=tf.int32)
271 |                             inputs = (inputs, predicted_eows)
272 |                         finished = sentence_lens <= 0
273 |                         return finished, inputs, states
274 | 
275 |                     def step(self, time, inputs, states, name=None):
276 |                         if args.decoding == "seq2seq":
277 |                             inputs, predicted_eows = inputs
278 |                         outputs, states = decoder_rnn_cell(inputs, states)
279 |                         outputs = decoder_layer(outputs)
280 |                         outputs = tf.argmax(outputs, axis=1, output_type=self.output_dtype)
281 |                         next_input = [tf.nn.embedding_lookup(tag_embeddings, outputs)]
282 |                         if args.decoding == "seq2seq":
283 |                             predicted_eows += tf.to_int32(tf.equal(outputs, self.tag_eow))
284 |                             indices = tf.where(tf.one_hot(tf.minimum(predicted_eows, max_sentence_len - 1), tf.reduce_max(predicted_eows) + 1))
285 |                             next_input.append(tf.gather_nd(hidden_layer_dropout, indices))
286 |                         else:
287 |                             next_input.append(hidden_layer_dropout[:,tf.minimum(time + 1, max_sentence_len - 1)])
288 |                         next_input = tf.concat(next_input, axis=1)
289 |                         if args.decoding == "seq2seq":
290 |                             next_input = (next_input, predicted_eows)
291 |                             finished = sentence_lens <= predicted_eows
292 |                         else:
293 |                             finished = sentence_lens <= time + 1
294 |                         return outputs, states, next_input, finished
295 |                 self.predictions, _, _ = tf.contrib.seq2seq.dynamic_decode(
296 |                     DecoderPrediction(), maximum_iterations=3*tf.reduce_max(self.sentence_lens) + 10)
297 | 
298 |             # Saver
299 |             self.saver = tf.train.Saver(max_to_keep=1)
300 |             if predict_only: return
301 | 
302 |             # Training
303 |             global_step = tf.train.create_global_step()
304 |             self.training = tf.contrib.opt.LazyAdamOptimizer(learning_rate=self.learning_rate, beta2=args.beta_2).minimize(loss, global_step=global_step)
305 | 
306 |             # Summaries
307 |             self.current_accuracy, self.update_accuracy = tf.metrics.accuracy(self.tags, self.predictions_training, weights=weights)
308 |             self.current_loss, self.update_loss = tf.metrics.mean(loss, weights=tf.reduce_sum(weights))
309 |             self.reset_metrics = tf.variables_initializer(tf.get_collection(tf.GraphKeys.METRIC_VARIABLES))
310 | 
311 |             summary_writer = tf.contrib.summary.create_file_writer(args.logdir, flush_millis=10 * 1000)
312 |             self.summaries = {}
313 |             with summary_writer.as_default(), tf.contrib.summary.record_summaries_every_n_global_steps(100):
314 |                 self.summaries["train"] = [tf.contrib.summary.scalar("train/loss", self.update_loss),
315 |                                            tf.contrib.summary.scalar("train/accuracy", self.update_accuracy)]
316 |             with summary_writer.as_default(), tf.contrib.summary.always_record_summaries():
317 |                 for dataset in ["dev", "test"]:
318 |                     self.summaries[dataset] = [tf.contrib.summary.scalar(dataset + "/loss", self.current_loss),
319 |                                                tf.contrib.summary.scalar(dataset + "/accuracy", self.current_accuracy)]
320 | 
321 |             self.metrics = {}
322 |             self.metrics_summarize = {}
323 |             for metric in ["precision", "recall", "F1"]:
["precision", "recall", "F1"]: 324 | self.metrics[metric] = tf.placeholder(tf.float32, [], name=metric) 325 | self.metrics_summarize[metric] = {} 326 | with summary_writer.as_default(), tf.contrib.summary.always_record_summaries(): 327 | for dataset in ["dev", "test"]: 328 | self.metrics_summarize[metric][dataset] = tf.contrib.summary.scalar(dataset + "/" + metric, 329 | self.metrics[metric]) 330 | 331 | # Initialize variables 332 | self.session.run(tf.global_variables_initializer()) 333 | with summary_writer.as_default(): 334 | tf.contrib.summary.initialize(session=self.session, graph=self.session.graph) 335 | 336 | 337 | def train_epoch(self, train, learning_rate, args): 338 | while not train.epoch_finished(): 339 | seq2seq = args.decoding == "seq2seq" 340 | batch_dict = train.next_batch(args.batch_size, args.form_wes_model, args.lemma_wes_model, args.fasttext_model, including_charseqs=args.including_charseqs, seq2seq=seq2seq) 341 | if args.word_dropout: 342 | mask = np.random.binomial(n=1, p=args.word_dropout, size=batch_dict["word_ids"][train.FORMS].shape) 343 | batch_dict["word_ids"][train.FORMS] = (1 - mask) * batch_dict["word_ids"][train.FORMS] + mask * train.factors[train.FORMS].words_map[""] 344 | 345 | mask = np.random.binomial(n=1, p=args.word_dropout, size=batch_dict["word_ids"][train.LEMMAS].shape) 346 | batch_dict["word_ids"][train.LEMMAS] = (1 - mask) * batch_dict["word_ids"][train.LEMMAS] + mask * train.factors[train.LEMMAS].words_map[""] 347 | 348 | self.session.run(self.reset_metrics) 349 | feeds = {self.sentence_lens: batch_dict["sentence_lens"], 350 | self.form_ids: batch_dict["word_ids"][train.FORMS], 351 | self.lemma_ids: batch_dict["word_ids"][train.LEMMAS], 352 | self.pos_ids: batch_dict["word_ids"][train.POS], 353 | self.tags: batch_dict["word_ids"][train.TAGS], 354 | self.is_training: True, 355 | self.learning_rate: learning_rate} 356 | if args.form_wes_model: # pretrained form embeddings 357 | feeds[self.pretrained_form_wes] = batch_dict["batch_form_pretrained_wes"] 358 | if args.lemma_wes_model: # pretrained lemma embeddings 359 | feeds[self.pretrained_lemma_wes] = batch_dict["batch_lemma_pretrained_wes"] 360 | if args.fasttext_model: # fasttext form embeddings 361 | feeds[self.pretrained_fasttext_wes] = batch_dict["batch_form_fasttext_wes"] 362 | if args.bert_embeddings_train: # BERT embeddings 363 | feeds[self.pretrained_bert_wes] = batch_dict["batch_bert_wes"] 364 | if args.flair_train: # flair embeddings 365 | feeds[self.pretrained_flair_wes] = batch_dict["batch_flair_wes"] 366 | if args.elmo_train: # elmo embeddings 367 | feeds[self.pretrained_elmo_wes] = batch_dict["batch_elmo_wes"] 368 | 369 | if args.including_charseqs: # character-level embeddings 370 | feeds[self.form_charseqs] = batch_dict["batch_charseqs"][train.FORMS] 371 | feeds[self.form_charseq_lens] = batch_dict["batch_charseq_lens"][train.FORMS] 372 | feeds[self.form_charseq_ids] = batch_dict["batch_charseq_ids"][train.FORMS] 373 | 374 | feeds[self.lemma_charseqs] = batch_dict["batch_charseqs"][train.LEMMAS] 375 | feeds[self.lemma_charseq_lens] = batch_dict["batch_charseq_lens"][train.LEMMAS] 376 | feeds[self.lemma_charseq_ids] = batch_dict["batch_charseq_ids"][train.LEMMAS] 377 | 378 | self.session.run([self.training, self.summaries["train"]], feeds) 379 | 380 | 381 | def evaluate(self, dataset_name, dataset, args): 382 | with open("{}/{}_system_predictions.conll".format(args.logdir, dataset_name), "w", encoding="utf-8") as prediction_file: 383 | self.predict(dataset_name, dataset, args, 
384 | 
385 |         f1 = 0.0
386 |         if args.corpus in ["CoNLL_en", "CoNLL_de", "CoNLL_nl", "CoNLL_es"]:
387 |             os.system("cd {} && ../../run_conlleval.sh {} {} {}_system_predictions.conll".format(args.logdir, dataset_name, args.__dict__[dataset_name + "_data"], dataset_name))
388 | 
389 |             with open("{}/{}.eval".format(args.logdir,dataset_name), "r", encoding="utf-8") as result_file:
390 |                 for line in result_file:
391 |                     line = line.strip("\n")
392 |                     if line.startswith("accuracy:"):
393 |                         f1 = float(line.split()[-1])
394 |                         self.session.run(self.metrics_summarize["F1"][dataset_name], {self.metrics["F1"]: f1})
395 | 
396 |             return f1
397 |         elif args.corpus in [ "ACE2004", "ACE2005", "GENIA" ]: # nested named entities evaluation
398 |             os.system("cd {} && ../../run_eval_nested.sh {} {}".format(args.logdir, dataset_name, os.path.dirname(args.__dict__[dataset_name + "_data"])))
399 | 
400 |             with open("{}/{}.eval".format(args.logdir,dataset_name), "r", encoding="utf-8") as result_file:
401 |                 for line in result_file:
402 |                     line = line.strip("\n")
403 |                     if line.startswith("Recall:"):
404 |                         recall = float(line.split(" ")[1])
405 |                     if line.startswith("Precision:"):
406 |                         precision = float(line.split(" ")[1])
407 |                     if line.startswith("F1:"):
408 |                         f1 = float(line.split(" ")[1])
409 |             for metric, value in [["precision", precision], ["recall", recall], ["F1", f1]]:
410 |                 self.session.run(self.metrics_summarize[metric][dataset_name], {self.metrics[metric]: value})
411 |             return f1
412 |         else:
413 |             raise ValueError("Unknown corpus {}".format(args.corpus))
414 | 
415 | 
416 |     def predict(self, dataset_name, dataset, args, prediction_file, evaluating=False):
417 |         if evaluating:
418 |             self.session.run(self.reset_metrics)
419 |         tags = []
420 |         while not dataset.epoch_finished():
421 |             seq2seq = args.decoding == "seq2seq"
422 |             batch_dict = dataset.next_batch(args.batch_size, args.form_wes_model, args.lemma_wes_model, args.fasttext_model, args.including_charseqs, seq2seq=seq2seq)
423 |             targets = [self.predictions]
424 |             feeds = {self.sentence_lens: batch_dict["sentence_lens"],
425 |                      self.form_ids: batch_dict["word_ids"][dataset.FORMS],
426 |                      self.lemma_ids: batch_dict["word_ids"][dataset.LEMMAS],
427 |                      self.pos_ids: batch_dict["word_ids"][dataset.POS],
428 |                      self.is_training: False}
429 |             if evaluating:
430 |                 targets.extend([self.update_accuracy, self.update_loss])
431 |                 feeds[self.tags] = batch_dict["word_ids"][dataset.TAGS]
432 |             if args.form_wes_model: # pretrained form embeddings
433 |                 feeds[self.pretrained_form_wes] = batch_dict["batch_form_pretrained_wes"]
434 |             if args.lemma_wes_model: # pretrained lemma embeddings
435 |                 feeds[self.pretrained_lemma_wes] = batch_dict["batch_lemma_pretrained_wes"]
436 |             if args.fasttext_model: # fasttext form embeddings
437 |                 feeds[self.pretrained_fasttext_wes] = batch_dict["batch_form_fasttext_wes"]
438 |             if args.bert_embeddings_dev or args.bert_embeddings_test: # BERT embeddings
439 |                 feeds[self.pretrained_bert_wes] = batch_dict["batch_bert_wes"]
440 |             if args.flair_dev or args.flair_test: # flair embeddings
441 |                 feeds[self.pretrained_flair_wes] = batch_dict["batch_flair_wes"]
442 |             if args.elmo_dev or args.elmo_test: # elmo embeddings
443 |                 feeds[self.pretrained_elmo_wes] = batch_dict["batch_elmo_wes"]
444 | 
445 |             if args.including_charseqs: # character-level embeddings
446 |                 feeds[self.form_charseqs] = batch_dict["batch_charseqs"][dataset.FORMS]
447 |                 feeds[self.form_charseq_lens] = batch_dict["batch_charseq_lens"][dataset.FORMS]
448 |                 feeds[self.form_charseq_ids] = batch_dict["batch_charseq_ids"][dataset.FORMS]
449 | 
450 |                 feeds[self.lemma_charseqs] = batch_dict["batch_charseqs"][dataset.LEMMAS]
451 |                 feeds[self.lemma_charseq_lens] = batch_dict["batch_charseq_lens"][dataset.LEMMAS]
452 |                 feeds[self.lemma_charseq_ids] = batch_dict["batch_charseq_ids"][dataset.LEMMAS]
453 | 
454 |             tags.extend(self.session.run(targets, feeds)[0])
455 | 
456 |         if evaluating:
457 |             self.session.run([self.current_accuracy, self.summaries[dataset_name]])
458 | 
459 |         forms = dataset.factors[dataset.FORMS].strings
460 |         for s in range(len(forms)):
461 |             j = 0
462 |             for i in range(len(forms[s])):
463 |                 if args.decoding == "seq2seq": # collect all tags until "<eow>"
464 |                     labels = []
465 |                     while j < len(tags[s]) and dataset.factors[dataset.TAGS].words[tags[s][j]] != "<eow>":
466 |                         labels.append(dataset.factors[dataset.TAGS].words[tags[s][j]])
467 |                         j += 1
468 |                     j += 1 # skip the "<eow>"
469 |                     print("{}\t_\t_\t{}".format(forms[s][i], "|".join(labels)), file=prediction_file)
470 |                 else:
471 |                     print("{}\t_\t_\t{}".format(forms[s][i], dataset.factors[dataset.TAGS].words[tags[s][i]]), file=prediction_file)
472 |             print("", file=prediction_file)
473 | 
474 | 
475 | if __name__ == "__main__":
476 |     import argparse
477 |     import datetime
478 |     import os
479 |     import re
480 | 
481 |     # Fix random seed
482 |     np.random.seed(42)
483 | 
484 |     # Parse arguments
485 |     parser = argparse.ArgumentParser()
486 |     parser.add_argument("--batch_size", default=8, type=int, help="Batch size.")
487 |     parser.add_argument("--bert_embeddings_dev", default=None, type=str, help="Pretrained BERT embeddings for dev data.")
488 |     parser.add_argument("--bert_embeddings_test", default=None, type=str, help="Pretrained BERT embeddings for test data.")
489 |     parser.add_argument("--bert_embeddings_train", default=None, type=str, help="Pretrained BERT embeddings for train data.")
490 |     parser.add_argument("--beta_2", default=0.98, type=float, help="Beta 2.")
491 |     parser.add_argument("--corpus", default="CoNLL_en", type=str, help="CoNLL_en|CoNLL_de|CoNLL_nl|CoNLL_es|ACE2004|ACE2005|GENIA.")
492 |     parser.add_argument("--cle_dim", default=128, type=int, help="Character-level embedding dimension.")
493 |     parser.add_argument("--decoding", default="CRF", type=str, help="Decoding: [CRF|ME|LSTM|seq2seq].")
494 |     parser.add_argument("--dev_data", default=None, type=str, help="Dev data.")
495 |     parser.add_argument("--dropout", default=0.5, type=float, help="Dropout rate.")
496 |     parser.add_argument("--elmo_dev", default=None, type=str, help="ELMo dev embeddings.")
497 |     parser.add_argument("--elmo_test", default=None, type=str, help="ELMo test embeddings.")
498 |     parser.add_argument("--elmo_train", default=None, type=str, help="ELMo train embeddings.")
499 |     parser.add_argument("--epochs", default="10:1e-3", type=str, help="Epochs and learning rates.")
500 |     parser.add_argument("--fasttext_model", default=None, type=str, help="Fasttext subwords.")
501 |     parser.add_argument("--flair_dev", default=None, type=str, help="Flair dev embeddings.")
502 |     parser.add_argument("--flair_test", default=None, type=str, help="Flair test embeddings.")
503 |     parser.add_argument("--flair_train", default=None, type=str, help="Flair train embeddings.")
504 |     parser.add_argument("--form_wes_model", default=None, type=str, help="Pretrained form WEs.")
505 |     parser.add_argument("--label_smoothing", default=0, type=float, help="Label smoothing.")
506 |     parser.add_argument("--lemma_wes_model", default=None, type=str, help="Pretrained lemma WEs.")
507 |     parser.add_argument("--max_sentences", default=None, type=int, help="Number of training sentences (for debugging).")
508 |     parser.add_argument("--name", default=None, type=str, help="Experiment name.")
509 |     parser.add_argument("--predict", default=None, type=str, help="Predict using the passed model.")
510 |     parser.add_argument("--rnn_cell", default="LSTM", type=str, help="RNN cell type.")
511 |     parser.add_argument("--rnn_cell_dim", default=256, type=int, help="RNN cell dimension.")
512 |     parser.add_argument("--rnn_layers", default=1, type=int, help="Number of hidden layers.")
513 |     parser.add_argument("--test_data", default=None, type=str, help="Test data.")
514 |     parser.add_argument("--train_data", default=None, type=str, help="Training data.")
515 |     parser.add_argument("--threads", default=4, type=int, help="Maximum number of threads to use.")
516 |     parser.add_argument("--we_dim", default=256, type=int, help="Word embedding dimension.")
517 |     parser.add_argument("--word_dropout", default=0.2, type=float, help="Word dropout.")
518 |     args = parser.parse_args()
519 | 
520 |     if args.predict:
521 |         # Load saved options from the model
522 |         with open("{}/options.json".format(args.predict), mode="r") as options_file:
523 |             args = argparse.Namespace(**json.load(options_file))
524 |         parser.parse_args(namespace=args)
525 |     else:
526 |         # Create logdir name
527 |         logargs = dict(vars(args).items())
528 |         logargs["form_wes_model"] = 1 if args.form_wes_model else 0
529 |         logargs["lemma_wes_model"] = 1 if args.lemma_wes_model else 0
530 |         del logargs["bert_embeddings_dev"]
531 |         del logargs["bert_embeddings_test"]
532 |         del logargs["bert_embeddings_train"]
533 |         del logargs["beta_2"]
534 |         del logargs["cle_dim"]
535 |         del logargs["dev_data"]
536 |         del logargs["dropout"]
537 |         del logargs["elmo_dev"]
538 |         del logargs["elmo_test"]
539 |         del logargs["elmo_train"]
540 |         del logargs["flair_dev"]
541 |         del logargs["flair_test"]
542 |         del logargs["flair_train"]
543 |         del logargs["label_smoothing"]
544 |         del logargs["max_sentences"]
545 |         del logargs["rnn_cell_dim"]
546 |         del logargs["test_data"]
547 |         del logargs["threads"]
548 |         del logargs["train_data"]
549 |         del logargs["we_dim"]
550 |         del logargs["word_dropout"]
551 |         logargs["bert_embeddings"] = 1 if args.bert_embeddings_train else 0
552 |         logargs["flair_embeddings"] = 1 if args.flair_train else 0
553 |         logargs["elmo_embeddings"] = 1 if args.elmo_train else 0
554 | 
555 |         args.logdir = "logs/{}-{}-{}".format(
556 |             os.path.basename(__file__),
557 |             datetime.datetime.now().strftime("%Y-%m-%d_%H%M%S"),
558 |             ",".join(("{}={}".format(re.sub("(.)[^_]*_?", r"\1", key), re.sub("^.*/", "", value) if type(value) == str else value)
559 |                       for key, value in sorted(logargs.items())))
560 |         )
561 |         if not os.path.exists("logs"): os.mkdir("logs") # TF 1.6 will do this by itself
562 |         if not os.path.exists(args.logdir): os.mkdir(args.logdir)
563 | 
564 |         # Dump passed options to allow future prediction.
565 |         with open("{}/options.json".format(args.logdir), mode="w") as options_file:
566 |             json.dump(vars(args), options_file, sort_keys=True)
567 | 
568 |     # Postprocess args
569 |     args.epochs = [(int(epochs), float(lr)) for epochs, lr in (epochs_lr.split(":") for epochs_lr in args.epochs.split(","))]
570 | 
571 |     # Load the data
572 |     seq2seq = args.decoding == "seq2seq"
573 |     train = morpho_dataset.MorphoDataset(args.train_data, max_sentences=args.max_sentences, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_train, flair_filename=args.flair_train, elmo_filename=args.elmo_train)
574 |     if args.dev_data:
575 |         dev = morpho_dataset.MorphoDataset(args.dev_data, train=train, shuffle_batches=False, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_dev, flair_filename=args.flair_dev, elmo_filename=args.elmo_dev)
576 |     test = morpho_dataset.MorphoDataset(args.test_data, train=train, shuffle_batches=False, seq2seq=seq2seq, bert_embeddings_filename=args.bert_embeddings_test, flair_filename=args.flair_test, elmo_filename=args.elmo_test)
577 | 
578 |     # Load pretrained form embeddings
579 |     if args.form_wes_model:
580 |         args.form_wes_model = word2vec.load(args.form_wes_model)
581 |     if args.lemma_wes_model:
582 |         args.lemma_wes_model = word2vec.load(args.lemma_wes_model)
583 | 
584 |     # Load fasttext subwords embeddings
585 |     if args.fasttext_model:
586 |         args.fasttext_model = fasttext.load_model(args.fasttext_model)
587 | 
588 |     # Character-level embeddings
589 |     args.including_charseqs = (args.cle_dim > 0)
590 | 
591 |     # Construct the network
592 |     network = Network(threads=args.threads)
593 |     network.construct(args,
594 |                       num_forms=len(train.factors[train.FORMS].words),
595 |                       num_form_chars=len(train.factors[train.FORMS].alphabet),
596 |                       num_lemmas=len(train.factors[train.LEMMAS].words),
597 |                       num_lemma_chars=len(train.factors[train.LEMMAS].alphabet),
598 |                       num_pos=len(train.factors[train.POS].words),
599 |                       pretrained_form_we_dim=args.form_wes_model.vectors.shape[1] if args.form_wes_model else 0,
600 |                       pretrained_lemma_we_dim=args.lemma_wes_model.vectors.shape[1] if args.lemma_wes_model else 0,
601 |                       pretrained_fasttext_dim=args.fasttext_model.get_dimension() if args.fasttext_model else 0,
602 |                       num_tags=len(train.factors[train.TAGS].words),
603 |                       tag_bos=train.factors[train.TAGS].words_map["<bos>"],
604 |                       tag_eow=train.factors[train.TAGS].words_map["<eow>"],
605 |                       pretrained_bert_dim=train.bert_embeddings_dim(),
606 |                       pretrained_flair_dim=train.flair_embeddings_dim(),
607 |                       pretrained_elmo_dim=train.elmo_embeddings_dim(),
608 |                       predict_only=args.predict)
609 | 
610 |     if args.predict:
611 |         network.saver.restore(network.session, "{}/model".format(args.predict.rstrip("/")))
612 |         print("Predicting test data", file=sys.stderr)
613 |         network.predict("test", test, args, sys.stdout, evaluating=False)
614 |     else:
615 |         # Train
616 |         for epochs, learning_rate in args.epochs:
617 |             for epoch in range(epochs):
618 |                 network.train_epoch(train, learning_rate, args)
619 |                 dev_score = 0
620 |                 if args.dev_data:
621 |                     dev_score = network.evaluate("dev", dev, args)
622 |                     print("{}".format(dev_score))
623 |         # Save network
624 |         network.saver.save(network.session, "{}/model".format(args.logdir), write_meta_graph=False)
625 |         # Test
626 |         test_score = network.evaluate("test", test, args)
627 |         print("{}".format(test_score))
628 | 
--------------------------------------------------------------------------------
/test_run/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #
3 | # Copyright 2019 Institute of Formal and Applied Linguistics, Faculty of
4 | # Mathematics and Physics, Charles University, Czech Republic.
5 | #
6 | # This Source Code Form is subject to the terms of the Mozilla Public
7 | # License, v. 2.0. If a copy of the MPL was not distributed with this
8 | # file, You can obtain one at http://mozilla.org/MPL/2.0/.
9 | 
10 | # Tagger test run with minimal parameters.
11 | 
12 | set -e
13 | 
14 | cat test.conll | ../conll2eval_nested.py > test_gold_entities.txt
15 | 
16 | # Seq2seq
17 | (cd ../ && ./tagger.py --corpus=ACE2004 --train_data=test_run/train.conll --test_data=test_run/test.conll --decoding=seq2seq --epochs=50:1e-3,8:1e-4 --name=test_run)
18 | 
19 | # LSTM-CRF
20 | (cd ../ && ./tagger.py --corpus=ACE2004 --train_data=test_run/train.conll --test_data=test_run/test.conll --decoding=CRF --epochs=50:1e-3,8:1e-4 --name=test_run)
21 | 
--------------------------------------------------------------------------------
/test_run/test.conll:
--------------------------------------------------------------------------------
1 | The the DT B-GPE
2 | Chinese chinese JJ I-GPE|U-GPE
3 | government government NN L-GPE
4 | 
5 | 
--------------------------------------------------------------------------------
/test_run/train.conll:
--------------------------------------------------------------------------------
1 | The the DT B-GPE
2 | Chinese chinese JJ I-GPE|U-GPE
3 | government government NN L-GPE
4 | and and CC O
5 | the the DT B-GPE
6 | Australian australian JJ I-GPE|U-GPE
7 | government government NN L-GPE
8 | signed sign VBD O
9 | an an DT O
10 | agreement agreement NN O
11 | today today NN O
12 | , , , O
13 | wherein wherein WRB O
14 | the the DT B-GPE
15 | Australian australian JJ I-GPE|U-GPE
16 | party party NN L-GPE
17 | would would MD O
18 | provide provide VB O
19 | China China NNP U-GPE
20 | with with IN O
21 | a a DT O
22 | preferential preferential JJ O
23 | financial financial JJ O
24 | loan loan NN O
25 | of of IN O
26 | 150 150 CD O
27 | million million CD O
28 | Australian australian JJ U-GPE
29 | dollars dollar NNS O
30 | . . . O
31 | 
32 | 
--------------------------------------------------------------------------------
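
A note on the linearized labels in test_run/*.conll: a token such as "Chinese" above carries several nested labels joined by "|" (I-GPE|U-GPE). With --decoding=seq2seq, tagger.py predicts these labels one at a time and closes each token's label sequence with an end-of-word tag (passed to Network.construct as tag_eow; the "<eow>" spelling here follows that argument and is an assumption). The snippet below is a minimal standalone sketch of this correspondence with hypothetical helper names, mirroring the label-collecting loop in Network.predict; the actual conversion is implemented in morpho_dataset.py and may differ in details.

  # Sketch only: relate one CoNLL label column to the flat tag sequence a
  # seq2seq decoder would emit, and back again (Network.predict collects
  # predicted tags until the end-of-word tag and joins them with "|").
  EOW = "<eow>"  # assumed spelling of the end-of-word tag

  def labels_to_tag_sequence(label_column):
      # "I-GPE|U-GPE" -> ["I-GPE", "U-GPE", "<eow>"]; "O" -> ["O", "<eow>"]
      return label_column.split("|") + [EOW]

  def tag_sequence_to_labels(tags):
      # Consume predicted tags until the end-of-word tag, join with "|".
      labels = []
      for tag in tags:
          if tag == EOW:
              break
          labels.append(tag)
      return "|".join(labels)

  assert labels_to_tag_sequence("I-GPE|U-GPE") == ["I-GPE", "U-GPE", "<eow>"]
  assert tag_sequence_to_labels(["I-GPE", "U-GPE", "<eow>"]) == "I-GPE|U-GPE"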