├── LICENSE
├── README.md
├── data
│   ├── bert_large_cased_vocab.txt
│   └── labels.txt
├── preprocess
│   └── generate_dataset.py
├── requirements.txt
├── run_ner.py
└── utils_ner.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2019 Pierre-Yves Vandenbussche
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # transformers-ner
 2 | [![Documentation Status](https://img.shields.io/badge/Blog-link_to_the_post-brightgreen.svg)](http://pyvandenbussche.info/2019/named-entity-recognition-with-pytorch-transformers/)
 3 | 
 4 | Experiments on the NER task using Hugging Face's state-of-the-art natural language models.
 5 | 
 6 | ## Installation
 7 | 
 8 | ### Prerequisites
 9 | 
10 | * Python ≥ 3.6
11 | 
12 | ### Provision a Virtual Environment
13 | 
14 | Create and activate a virtual environment (conda)
15 | 
16 | ```
17 | conda create --name py36_transformers-ner python=3.6
18 | source activate py36_transformers-ner
19 | ```
20 | 
21 | If `pip` is configured in your conda environment,
22 | install the dependencies from within the project root directory:
23 | ```
24 | pip install -r requirements.txt
25 | ``` 
26 | 
27 | ### Data Pre-processing
28 | 
29 | #### Download the data
30 | The current `BC5CDR` dataset is available in IOB format. Small modifications must be applied
31 | to the files so they can be processed by BERT NER (space-separated elements, etc.).
32 | We will first download the files and then transform them.
33 | 
34 | Download the files at:
35 | ```bash
36 | mkdir data-input
37 | curl -o data-input/devel.tsv https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/devel.tsv
38 | curl -o data-input/train.tsv https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/train.tsv
39 | curl -o data-input/test.tsv https://raw.githubusercontent.com/cambridgeltl/MTL-Bioinformatics-2016/master/data/BC5CDR-IOB/test.tsv
40 | 
41 | ```
42 | 
43 | To transform the data into a BERT NER compatible format, execute the following command:
44 | ```bash
45 | python ./preprocess/generate_dataset.py --input_train_data data-input/train.tsv --input_dev_data data-input/devel.tsv --input_test_data data-input/test.tsv --output_dir data-input/
46 | ```
47 | 
48 | The script outputs two files, `train.txt` (the training and development sets merged) and `test.txt`, that will be the input of the NER pipeline.
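
The per-line transformation is small. As a rough sketch (not the script's exact code), it can be pictured as below. `convert_line` is a hypothetical helper name, and the sample tokens are illustrative only.

```python
def convert_line(line, keep_only_tag=None):
    """Convert one tab-separated IOB line to the space-separated form.

    `convert_line` is a hypothetical helper used for illustration only.
    """
    # Blank lines separate sentences and are passed through unchanged.
    if not line.strip():
        return "\n"
    parts = line.split()
    # Lines without a label column are left as they are.
    if len(parts) < 2:
        return line
    token, label = " ".join(parts[:-1]), parts[-1]
    # Optionally keep a single entity type and map every other tag to "O".
    if keep_only_tag is not None and label != "O" and label[2:] != keep_only_tag:
        label = "O"
    return "{} {}\n".format(token, label)


print(convert_line("Naloxone\tB-Chemical\n"), end="")
print(convert_line("hypertension\tB-Disease\n", keep_only_tag="Chemical"), end="")
```

Blank lines are kept because they mark sentence boundaries in the output files.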
49 | 
50 | ### Download pre-trained model and run the NER task
51 | #### BERT
52 | Pre-trained models of BERT are automatically fetched by HuggingFace's transformers library.
53 | To execute the NER pipeline, run the following command:
54 | ```bash
55 | python ./run_ner.py --data_dir ./data --model_type bert --model_name_or_path bert-base-cased --output_dir ./output --labels ./data/labels.txt --do_train --do_predict --max_seq_length 256 --overwrite_output_dir --overwrite_cache
56 | ```
57 | The script will output the results and predictions in the output directory.
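
The results include scores for each entity type in addition to the overall precision, recall and F1. A minimal sketch of the idea, loosely following the `evaluate` function in `run_ner.py`: tags of every other entity type are mapped to `O` before scoring. The helper name `collapse_labels` and the sample sequence are illustrative assumptions.

```python
def collapse_labels(sequence, entity_type):
    """Map tags of all other entity types to "O", keeping `entity_type`."""
    collapsed = []
    for tag in sequence:
        # Tags look like "B-Chemical" or "I-Disease"; "O" has no type suffix.
        if tag != "O" and tag[2:] != entity_type:
            collapsed.append("O")
        else:
            collapsed.append(tag)
    return collapsed


gold = ["B-Chemical", "O", "B-Disease", "I-Disease"]
print(collapse_labels(gold, "Disease"))
# ['O', 'O', 'B-Disease', 'I-Disease']
```

The collapsed gold and predicted sequences can then be scored with `seqeval`'s `precision_score`, `recall_score` and `f1_score`, as the script does.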
58 | 
59 | #### SciBERT
60 | Download and unzip the model, vocab and its config. Rename the config file to `config.json` as expected by the script.
61 | ```bash
62 | curl -Ol https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_cased.tar
63 | tar -xvf scibert_scivocab_cased.tar -C scibert_scivocab_cased
64 | cd scibert_scivocab_cased/
65 | tar -zxvf weights.tar.gz
66 | mv bert_config.json config.json
67 | rm weights.tar.gz
68 | ```
69 | To execute the NER pipeline, run the following command:
70 | ```bash
71 | python ./run_ner.py --data_dir ./data --model_type bert --model_name_or_path scibert_scivocab_cased --output_dir ./output --labels ./data/labels.txt --do_train --do_predict --max_seq_length 256 --overwrite_output_dir --overwrite_cache
72 | ```
73 | The script will output the results and predictions in the output directory.
74 | 
75 | #### SpanBERT
76 | Download and unzip the model, vocab and its config. Rename the config file to `config.json` as expected by the script.
77 | Note that SpanBERT does not come with its own `vocab.txt` file. Instead, it reuses the vocabulary of the BERT-large-cased model.
78 | ```bash
79 | curl -Ol https://dl.fbaipublicfiles.com/fairseq/models/spanbert_hf_base.tar.gz
80 | mkdir spanbert_hf_base
81 | tar -zxvf spanbert_hf_base.tar.gz -C spanbert_hf_base
82 | cd spanbert_hf_base
83 | curl -Ol https://raw.githubusercontent.com/pyvandenbussche/transformers-ner/master/data/bert_large_cased_vocab.txt
84 | mv bert_large_cased_vocab.txt vocab.txt
85 | ```
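
Before launching training on a local model directory, it can help to check that the expected files are present. The sketch below assumes the usual layout read by `transformers` for a BERT-style model (`config.json`, `vocab.txt` and `pytorch_model.bin`). `missing_model_files` is a hypothetical helper, not part of this repository.

```python
import os

# File names assumed for a local BERT-style model directory.
EXPECTED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the expected files that are absent from `model_dir`."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


# Run from the project root; an empty list means nothing is missing.
print(missing_model_files("spanbert_hf_base"))
```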
86 | To execute the NER pipeline, run the following command:
87 | ```bash
88 | python ./run_ner.py --data_dir ./data --model_type bert --model_name_or_path spanbert_hf_base --output_dir ./output --labels ./data/labels.txt --do_train --do_predict --max_seq_length 256 --overwrite_output_dir --overwrite_cache
89 | ```
90 | The script will output the results and predictions in the output directory.


--------------------------------------------------------------------------------
/data/labels.txt:
--------------------------------------------------------------------------------
1 | O
2 | B-Chemical
3 | I-Chemical
4 | B-Disease
5 | I-Disease


--------------------------------------------------------------------------------
/preprocess/generate_dataset.py:
--------------------------------------------------------------------------------
 1 | # coding=utf-8
 2 | """ Generate the dataset from stanford format """
 3 | 
 4 | import argparse
 5 | import logging
 6 | import os
 7 | 
 8 | logger = logging.getLogger(__name__)
 9 | 
10 | 
11 | def write_to_file(file_dir, filename, snippets):
12 |     if os.path.isfile(os.path.join(file_dir, filename)):
13 |         os.remove(os.path.join(file_dir, filename))
14 |     with open(os.path.join(file_dir, filename), "w", encoding='utf-8') as output_file:
15 |         for snippet in snippets:
16 |             output_file.write("\n".join(sentence for sentence in snippet))
17 |             output_file.write("\n\n")
18 | 
19 | def main():
20 |     parser = argparse.ArgumentParser()
21 | 
22 |     ## Required parameters
23 |     parser.add_argument("--input_train_data", default=None, type=str, required=True,
24 |                         help="The input data file path e.g. ../data/train.tsv")
25 |     parser.add_argument("--input_dev_data", default=None, type=str, required=True,
26 |                         help="The input data file path e.g. ../data/devel.tsv")
27 |     parser.add_argument("--input_test_data", default=None, type=str, required=True,
28 |                         help="The input data file path e.g. ../data/test.tsv")
29 |     parser.add_argument("--output_dir", default=None, type=str, required=True,
30 |                         help="The output directory e.g. ../data/")
31 |     parser.add_argument("--keep_only_tag", default=None, type=str,
32 |                         help="Keep only annotations with this tag label (e.g. 'indications' will keep tags "
33 |                              "['O', 'B-indications', 'I-indications'])")
34 | 
35 |     args = parser.parse_args()
36 | 
37 | 
38 |     if not all(os.path.isfile(p) for p in (args.input_train_data, args.input_dev_data, args.input_test_data)):
39 |         raise ValueError("One of the input data file paths ({}, {} or {}) does not exist.".format(args.input_train_data, args.input_dev_data, args.input_test_data))
40 | 
41 |     nb_train_sent = 0
42 |     nb_test_sent = 0
43 |     fo = open(os.path.join(args.output_dir, "train.txt"), "w")
44 |     for tempfile in [args.input_train_data, args.input_dev_data]:
45 |         with open(tempfile, 'r') as fi:
46 |             for line in fi:
47 |                 # count the number of sentences
48 |                 if len(line.strip()) == 0:
49 |                     nb_train_sent+=1
50 |                 # change from tab separated to space separated as expected from BERT NER script
51 |                 line = line.replace("\t", " ")
52 |                 # filter out some tags in case we are performing singular tag NER
53 |                 if args.keep_only_tag is not None:
54 |                     splits = line.split()
55 |                     if len(splits) > 1:
56 |                         label = splits[-1]
57 |                         if len(label)>1 and label[2:] != args.keep_only_tag:
58 |                             line = "{} O\n".format(" ".join(splits[:-1]))
59 |                 fo.write(line)
60 |             # add new line at the end of the file to break the sentence
61 |             fo.write("\n")
62 | 
63 |     fo = open(os.path.join(args.output_dir, "test.txt"), "w")
64 |     with open(args.input_test_data, 'r') as fi:
65 |         for line in fi:
66 |             # count the number of sentences
67 |             if len(line.strip()) == 0:
68 |                 nb_test_sent += 1
69 |             # change from tab separated to space separated as expected from BERT NER script
70 |             line = line.replace("\t", " ")
71 |             # filter out some tags in case we are performing singular tag NER
72 |             if args.keep_only_tag is not None:
73 |                 splits = line.split()
74 |                 if len(splits) > 1:
75 |                     label = splits[-1]
76 |                     if len(label) > 1 and label[2:] != args.keep_only_tag:
77 |                         line = "{} O\n".format(" ".join(splits[:-1]))
78 |             fo.write(line)
79 |         # add new line at the end of the file to break the sentence
80 |         fo.write("\n")
81 |     print("Number of training sentences: {}".format(nb_train_sent))
82 |     print("Number of testing sentences: {}".format(nb_test_sent))
83 | 
84 | if __name__ == "__main__":
85 |     main()
86 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | boto3==1.9.243
 2 | numpy==1.15.4
 3 | pandas==0.24.1
 4 | protobuf==3.10.0
 5 | requests==2.21.0
 6 | regex==2019.8.19
 7 | sentencepiece==0.1.83
 8 | setuptools_scm==3.3.3
 9 | seqeval==0.0.12
10 | scikit-learn==0.21.3
11 | spacy==2.2.1
12 | tensorboardX==1.8
13 | torch==1.2.0
14 | torchvision==0.4.0
15 | tqdm==4.36.1
16 | transformers==2.1.0


--------------------------------------------------------------------------------
/run_ner.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | """ Fine-tuning the library models for named entity recognition """
  3 | from __future__ import absolute_import, division, print_function
  4 | 
  5 | import argparse
  6 | import glob
  7 | import logging
  8 | import os
  9 | import random
 10 | 
 11 | import numpy as np
 12 | import torch
 13 | from seqeval.metrics import precision_score, recall_score, f1_score
 14 | from tensorboardX import SummaryWriter
 15 | from torch.nn import CrossEntropyLoss
 16 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
 17 | from torch.utils.data.distributed import DistributedSampler
 18 | from tqdm import tqdm, trange
 19 | from utils_ner import convert_examples_to_features, get_labels, read_examples_from_file
 20 | 
 21 | from transformers import AdamW, WarmupLinearSchedule
 22 | from transformers import WEIGHTS_NAME, BertConfig, BertForTokenClassification, BertTokenizer
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | ALL_MODELS = sum(
 27 |     (tuple(conf.pretrained_config_archive_map.keys()) for conf in (BertConfig, )),
 28 |     ())
 29 | 
 30 | MODEL_CLASSES = {
 31 |     "bert": (BertConfig, BertForTokenClassification, BertTokenizer),
 32 | }
 33 | 
 34 | 
 35 | def set_seed(args):
 36 |     random.seed(args.seed)
 37 |     np.random.seed(args.seed)
 38 |     torch.manual_seed(args.seed)
 39 |     if args.n_gpu > 0:
 40 |         torch.cuda.manual_seed_all(args.seed)
 41 | 
 42 | 
 43 | def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
 44 |     """ Train the model """
 45 |     if args.local_rank in [-1, 0]:
 46 |         tb_writer = SummaryWriter()
 47 | 
 48 |     args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
 49 |     train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
 50 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
 51 | 
 52 |     if args.max_steps > 0:
 53 |         t_total = args.max_steps
 54 |         args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
 55 |     else:
 56 |         t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
 57 | 
 58 |     # Prepare optimizer and schedule (linear warmup and decay)
 59 |     no_decay = ["bias", "LayerNorm.weight"]
 60 |     optimizer_grouped_parameters = [
 61 |         {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
 62 |          "weight_decay": args.weight_decay},
 63 |         {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}
 64 |     ]
 65 |     optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
 66 |     scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=t_total)
 67 |     if args.fp16:
 68 |         try:
 69 |             from apex import amp
 70 |         except ImportError:
 71 |             raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
 72 |         model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
 73 | 
 74 |     # multi-gpu training (should be after apex fp16 initialization)
 75 |     if args.n_gpu > 1:
 76 |         model = torch.nn.DataParallel(model)
 77 | 
 78 |     # Distributed training (should be after apex fp16 initialization)
 79 |     if args.local_rank != -1:
 80 |         model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
 81 |                                                           output_device=args.local_rank,
 82 |                                                           find_unused_parameters=True)
 83 | 
 84 |     # Train!
 85 |     logger.info("***** Running training *****")
 86 |     logger.info("  Num examples = %d", len(train_dataset))
 87 |     logger.info("  Num Epochs = %d", args.num_train_epochs)
 88 |     logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
 89 |     logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
 90 |                 args.train_batch_size * args.gradient_accumulation_steps * (
 91 |                     torch.distributed.get_world_size() if args.local_rank != -1 else 1))
 92 |     logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
 93 |     logger.info("  Total optimization steps = %d", t_total)
 94 | 
 95 |     global_step = 0
 96 |     tr_loss, logging_loss = 0.0, 0.0
 97 |     model.zero_grad()
 98 |     train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
 99 |     set_seed(args)  # Added here for reproducibility (even between python 2 and 3)
100 |     for _ in train_iterator:
101 |         epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
102 |         for step, batch in enumerate(epoch_iterator):
103 |             model.train()
104 |             batch = tuple(t.to(args.device) for t in batch)
105 |             inputs = {"input_ids": batch[0],
106 |                       "attention_mask": batch[1],
107 |                       "token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
108 |                       # XLM and RoBERTa don't use segment_ids
109 |                       "labels": batch[3]}
110 |             outputs = model(**inputs)
111 |             loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)
112 | 
113 |             if args.n_gpu > 1:
114 |                 loss = loss.mean()  # mean() to average on multi-gpu parallel training
115 |             if args.gradient_accumulation_steps > 1:
116 |                 loss = loss / args.gradient_accumulation_steps
117 | 
118 |             if args.fp16:
119 |                 with amp.scale_loss(loss, optimizer) as scaled_loss:
120 |                     scaled_loss.backward()
121 |                 torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
122 |             else:
123 |                 loss.backward()
124 |                 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
125 | 
126 |             tr_loss += loss.item()
127 |             if (step + 1) % args.gradient_accumulation_steps == 0:
128 |                 optimizer.step()
129 |                 scheduler.step()  # Update learning rate schedule (after the optimizer step)
130 |                 model.zero_grad()
131 |                 global_step += 1
132 | 
133 |                 if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
134 |                     # Log metrics
135 |                     if args.local_rank == -1 and args.evaluate_during_training:  # Only evaluate when single GPU otherwise metrics may not average well
136 |                         results, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev")
137 |                         for key, value in results.items():
138 |                             tb_writer.add_scalar("eval_{}".format(key), value, global_step)
139 |                     tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
140 |                     tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
141 |                     logging_loss = tr_loss
142 | 
143 |                 if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
144 |                     # Save model checkpoint
145 |                     output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
146 |                     if not os.path.exists(output_dir):
147 |                         os.makedirs(output_dir)
148 |                     model_to_save = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
149 |                     model_to_save.save_pretrained(output_dir)
150 |                     torch.save(args, os.path.join(output_dir, "training_args.bin"))
151 |                     logger.info("Saving model checkpoint to %s", output_dir)
152 | 
153 |             if args.max_steps > 0 and global_step > args.max_steps:
154 |                 epoch_iterator.close()
155 |                 break
156 |         if args.max_steps > 0 and global_step > args.max_steps:
157 |             train_iterator.close()
158 |             break
159 | 
160 |     if args.local_rank in [-1, 0]:
161 |         tb_writer.close()
162 | 
163 |     return global_step, tr_loss / global_step
164 | 
165 | 
166 | # Use CoNLL evaluation method
167 | def evaluate(args, model, tokenizer, labels, pad_token_label_id, mode, prefix=""):
168 |     eval_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode=mode)
169 | 
170 |     args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
171 |     # Note that DistributedSampler samples randomly
172 |     eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
173 |     eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
174 | 
175 |     # Eval!
176 |     logger.info("***** Running evaluation %s *****", prefix)
177 |     logger.info("  Num examples = %d", len(eval_dataset))
178 |     logger.info("  Batch size = %d", args.eval_batch_size)
179 |     eval_loss = 0.0
180 |     nb_eval_steps = 0
181 |     preds = None
182 |     out_label_ids = None
183 |     model.eval()
184 |     for batch in tqdm(eval_dataloader, desc="Evaluating"):
185 |         batch = tuple(t.to(args.device) for t in batch)
186 | 
187 |         with torch.no_grad():
188 |             inputs = {"input_ids": batch[0],
189 |                       "attention_mask": batch[1],
190 |                       "token_type_ids": batch[2] if args.model_type in ["bert", "xlnet"] else None,
191 |                       # XLM and RoBERTa don't use segment_ids
192 |                       "labels": batch[3]}
193 |             outputs = model(**inputs)
194 |             tmp_eval_loss, logits = outputs[:2]
195 | 
196 |             eval_loss += tmp_eval_loss.item()
197 |         nb_eval_steps += 1
198 |         if preds is None:
199 |             preds = logits.detach().cpu().numpy()
200 |             out_label_ids = inputs["labels"].detach().cpu().numpy()
201 |         else:
202 |             preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
203 |             out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
204 | 
205 |     eval_loss = eval_loss / nb_eval_steps
206 |     preds = np.argmax(preds, axis=2)
207 | 
208 |     label_map = {i: label for i, label in enumerate(labels)}
209 | 
210 |     results = {}
211 | 
212 |     # get result per label
213 |     inv_label_map = {v: k for k, v in label_map.items()}
214 |     for lbl in set([lbl[2:].strip() for lbl in labels if len(lbl[2:].strip()) > 0]):
215 |         lbl_map = label_map.copy()
216 |         for k, v in inv_label_map.items():
217 |             if len(k[2:].strip()) > 0 and k[2:].strip() != lbl:
218 |                 lbl_map[v] = "O"
219 | 
220 |         out_label_list = [[] for _ in range(out_label_ids.shape[0])]
221 |         preds_list = [[] for _ in range(out_label_ids.shape[0])]
222 | 
223 |         for i in range(out_label_ids.shape[0]):
224 |             for j in range(out_label_ids.shape[1]):
225 |                 if out_label_ids[i, j] != pad_token_label_id:
226 |                     out_label_list[i].append(lbl_map[out_label_ids[i][j]])
227 |                     preds_list[i].append(lbl_map[preds[i][j]])
228 |         results["{}-precision".format(lbl)] = precision_score(out_label_list, preds_list)
229 |         results["{}-recall".format(lbl)] = recall_score(out_label_list, preds_list)
230 |         results["{}-f1".format(lbl)] = f1_score(out_label_list, preds_list)
231 | 
232 |     out_label_list = [[] for _ in range(out_label_ids.shape[0])]
233 |     preds_list = [[] for _ in range(out_label_ids.shape[0])]
234 | 
235 |     for i in range(out_label_ids.shape[0]):
236 |         for j in range(out_label_ids.shape[1]):
237 |             if out_label_ids[i, j] != pad_token_label_id:
238 |                 out_label_list[i].append(label_map[out_label_ids[i][j]])
239 |                 preds_list[i].append(label_map[preds[i][j]])
240 | 
241 |     results["loss"] = eval_loss
242 |     results["precision"] = precision_score(out_label_list, preds_list)
243 |     results["recall"] = recall_score(out_label_list, preds_list)
244 |     results["f1"] = f1_score(out_label_list, preds_list)
245 | 
246 | 
247 |     logger.info("***** Eval results %s *****", prefix)
248 |     for key in sorted(results.keys()):
249 |         logger.info("  %s = %s", key, str(results[key]))
250 | 
251 |     return results, preds_list
252 | 
253 | 
254 | def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode):
255 |     if args.local_rank not in [-1, 0] and mode == "train":
256 |         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
257 | 
258 |     # Load data features from cache or dataset file
259 |     cached_features_file = os.path.join(args.data_dir, "cached_{}_{}_{}".format(mode,
260 |         list(filter(None, args.model_name_or_path.split("/"))).pop(),
261 |         str(args.max_seq_length)))
262 |     if os.path.exists(cached_features_file) and not args.overwrite_cache:
263 |         logger.info("Loading features from cached file %s", cached_features_file)
264 |         features = torch.load(cached_features_file)
265 |     else:
266 |         logger.info("Creating features from dataset file at %s", args.data_dir)
267 |         examples = read_examples_from_file(args.data_dir, mode)
268 |         features = convert_examples_to_features(examples, labels, args.max_seq_length, tokenizer,
269 |                                                 cls_token_at_end=bool(args.model_type in ["xlnet"]),
270 |                                                 # xlnet has a cls token at the end
271 |                                                 cls_token=tokenizer.cls_token,
272 |                                                 cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
273 |                                                 sep_token=tokenizer.sep_token,
274 |                                                 sep_token_extra=bool(args.model_type in ["roberta"]),
275 |                                                 # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
276 |                                                 pad_on_left=bool(args.model_type in ["xlnet"]),
277 |                                                 # pad on the left for xlnet
278 |                                                 pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
279 |                                                 pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
280 |                                                 pad_token_label_id=pad_token_label_id
281 |                                                 )
282 |         if args.local_rank in [-1, 0]:
283 |             logger.info("Saving features into cached file %s", cached_features_file)
284 |             torch.save(features, cached_features_file)
285 | 
286 |     if args.local_rank == 0 and mode == "train":
287 |         torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache
288 | 
289 |     # Convert to Tensors and build dataset
290 |     all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
291 |     all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
292 |     all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
293 |     all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)
294 | 
295 |     dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
296 |     return dataset
297 | 
298 | 
299 | def main():
300 |     parser = argparse.ArgumentParser()
301 | 
302 |     ## Required parameters
303 |     parser.add_argument("--data_dir", default=None, type=str, required=True,
304 |                         help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.")
305 |     parser.add_argument("--model_type", default=None, type=str, required=True,
306 |                         help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
307 |     parser.add_argument("--model_name_or_path", default=None, type=str, required=True,
308 |                         help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
309 |     parser.add_argument("--output_dir", default=None, type=str, required=True,
310 |                         help="The output directory where the model predictions and checkpoints will be written.")
311 | 
312 |     ## Other parameters
313 |     parser.add_argument("--labels", default="", type=str,
314 |                         help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.")
315 |     parser.add_argument("--config_name", default="", type=str,
316 |                         help="Pretrained config name or path if not the same as model_name")
317 |     parser.add_argument("--tokenizer_name", default="", type=str,
318 |                         help="Pretrained tokenizer name or path if not the same as model_name")
319 |     parser.add_argument("--cache_dir", default="", type=str,
320 |                         help="Where do you want to store the pre-trained models downloaded from s3")
321 |     parser.add_argument("--max_seq_length", default=128, type=int,
322 |                         help="The maximum total input sequence length after tokenization. Sequences longer "
323 |                              "than this will be truncated, sequences shorter will be padded.")
324 |     parser.add_argument("--do_train", action="store_true",
325 |                         help="Whether to run training.")
326 |     parser.add_argument("--do_eval", action="store_true",
327 |                         help="Whether to run eval on the dev set.")
328 |     parser.add_argument("--do_predict", action="store_true",
329 |                         help="Whether to run predictions on the test set.")
330 |     parser.add_argument("--evaluate_during_training", action="store_true",
331 |                         help="Whether to run evaluation during training at each logging step.")
332 |     parser.add_argument("--do_lower_case", action="store_true",
333 |                         help="Set this flag if you are using an uncased model.")
334 | 
335 |     parser.add_argument("--per_gpu_train_batch_size", default=8, type=int,
336 |                         help="Batch size per GPU/CPU for training.")
337 |     parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
338 |                         help="Batch size per GPU/CPU for evaluation.")
339 |     parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
340 |                         help="Number of update steps to accumulate before performing a backward/update pass.")
341 |     parser.add_argument("--learning_rate", default=5e-5, type=float,
342 |                         help="The initial learning rate for Adam.")
343 |     parser.add_argument("--weight_decay", default=0.0, type=float,
344 |                         help="Weight decay if we apply some.")
345 |     parser.add_argument("--adam_epsilon", default=1e-8, type=float,
346 |                         help="Epsilon for Adam optimizer.")
347 |     parser.add_argument("--max_grad_norm", default=1.0, type=float,
348 |                         help="Max gradient norm.")
349 |     parser.add_argument("--num_train_epochs", default=3.0, type=float,
350 |                         help="Total number of training epochs to perform.")
351 |     parser.add_argument("--max_steps", default=-1, type=int,
352 |                         help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
353 |     parser.add_argument("--warmup_steps", default=0, type=int,
354 |                         help="Linear warmup over warmup_steps.")
355 | 
356 |     parser.add_argument("--logging_steps", type=int, default=50,
357 |                         help="Log every X update steps.")
358 |     parser.add_argument("--save_steps", type=int, default=50,
359 |                         help="Save checkpoint every X update steps.")
360 |     parser.add_argument("--eval_all_checkpoints", action="store_true",
361 |                         help="Evaluate all checkpoints starting with the same prefix as model_name and ending with step number")
362 |     parser.add_argument("--no_cuda", action="store_true",
363 |                         help="Avoid using CUDA when available")
364 |     parser.add_argument("--overwrite_output_dir", action="store_true",
365 |                         help="Overwrite the content of the output directory")
366 |     parser.add_argument("--overwrite_cache", action="store_true",
367 |                         help="Overwrite the cached training and evaluation sets")
368 |     parser.add_argument("--seed", type=int, default=42,
369 |                         help="random seed for initialization")
370 | 
371 |     parser.add_argument("--fp16", action="store_true",
372 |                         help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
373 |     parser.add_argument("--fp16_opt_level", type=str, default="O1",
374 |                         help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. "
375 |                              "See details at https://nvidia.github.io/apex/amp.html")
376 |     parser.add_argument("--local_rank", type=int, default=-1,
377 |                         help="For distributed training: local_rank")
378 |     parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
379 |     parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
380 |     args = parser.parse_args()
381 | 
382 |     if os.path.exists(args.output_dir) and os.listdir(
383 |             args.output_dir) and args.do_train and not args.overwrite_output_dir:
384 |         raise ValueError(
385 |             "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
386 |                 args.output_dir))
387 | 
388 |     # Setup distant debugging if needed
389 |     if args.server_ip and args.server_port:
390 |         # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
391 |         import ptvsd
392 |         print("Waiting for debugger attach")
393 |         ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
394 |         ptvsd.wait_for_attach()
395 | 
396 |     # Setup CUDA, GPU & distributed training
397 |     if args.local_rank == -1 or args.no_cuda:
398 |         device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
399 |         args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count()  # Report no GPUs when CUDA is disabled
400 |     else:  # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
401 |         torch.cuda.set_device(args.local_rank)
402 |         device = torch.device("cuda", args.local_rank)
403 |         torch.distributed.init_process_group(backend="nccl")
404 |         args.n_gpu = 1
405 |     args.device = device
406 | 
407 |     # Setup logging
408 |     logging.basicConfig(format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
409 |                         datefmt="%m/%d/%Y %H:%M:%S",
410 |                         level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)
411 |     logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
412 |                    args.local_rank, device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
413 | 
414 |     # Set seed
415 |     set_seed(args)
416 | 
417 |     # Prepare NER task
418 |     labels = get_labels(args.labels)
419 |     num_labels = len(labels)
420 |     # Use cross entropy ignore index as padding label id so that only real label ids contribute to the loss later
421 |     pad_token_label_id = CrossEntropyLoss().ignore_index
422 | 
423 |     # Load pretrained model and tokenizer
424 |     if args.local_rank not in [-1, 0]:
425 |         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
426 | 
427 |     args.model_type = args.model_type.lower()
428 |     config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
429 |     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path,
430 |                                           num_labels=num_labels)
431 |     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
432 |                                                 do_lower_case=args.do_lower_case)
433 |     model = model_class.from_pretrained(args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path),
434 |                                         config=config)
435 | 
436 |     if args.local_rank == 0:
437 |         torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
438 | 
439 |     model.to(args.device)
440 | 
441 |     logger.info("Training/evaluation parameters %s", args)
442 | 
443 |     # Training
444 |     if args.do_train:
445 |         train_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode="train")
446 |         global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
447 |         logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
448 | 
449 |     # Saving best practices: if you use the default names for the model, you can reload it using from_pretrained()
450 |     if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
451 |         # Create output directory if needed
452 |         if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
453 |             os.makedirs(args.output_dir)
454 | 
455 |         logger.info("Saving model checkpoint to %s", args.output_dir)
456 |         # Save a trained model, configuration and tokenizer using `save_pretrained()`.
457 |         # They can then be reloaded using `from_pretrained()`
458 |         model_to_save = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
459 |         model_to_save.save_pretrained(args.output_dir)
460 |         tokenizer.save_pretrained(args.output_dir)
461 | 
462 |         # Good practice: save your training arguments together with the trained model
463 |         torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
464 | 
465 |     # Evaluation
466 |     results = {}
467 |     if args.do_eval and args.local_rank in [-1, 0]:
468 |         tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
469 |         checkpoints = [args.output_dir]
470 |         if args.eval_all_checkpoints:
471 |             checkpoints = list(os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)))
472 |             logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
473 |         logger.info("Evaluate the following checkpoints: %s", checkpoints)
474 |         for checkpoint in checkpoints:
475 |             global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
476 |             model = model_class.from_pretrained(checkpoint)
477 |             model.to(args.device)
478 |             result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
479 |             if global_step:
480 |                 result = {"{}_{}".format(global_step, k): v for k, v in result.items()}
481 |             results.update(result)
482 |         output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
483 |         with open(output_eval_file, "w") as writer:
484 |             for key in sorted(results.keys()):
485 |                 writer.write("{} = {}\n".format(key, str(results[key])))
486 | 
487 |     if args.do_predict and args.local_rank in [-1, 0]:
488 |         tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
489 |         model = model_class.from_pretrained(args.output_dir)
490 |         model.to(args.device)
491 |         result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test")
492 |         # Save results
493 |         output_test_results_file = os.path.join(args.output_dir, "test_results.txt")
494 |         with open(output_test_results_file, "w") as writer:
495 |             for key in sorted(result.keys()):
496 |                 writer.write("{} = {}\n".format(key, str(result[key])))
497 |         # Save predictions
498 |         output_test_predictions_file = os.path.join(args.output_dir, "test_predictions.txt")
499 |         with open(output_test_predictions_file, "w", encoding='utf-8') as writer:
500 |             with open(os.path.join(args.data_dir, "test.txt"), "r", encoding='utf-8') as f:
501 |                 example_id = 0
502 |                 for line in f:
503 | 
504 |                     if line.startswith("-DOCSTART-") or line == "" or line == "\n":
505 |                         writer.write(line)
506 |                         if not predictions[example_id]:
507 |                             example_id += 1
508 |                     elif predictions[example_id]:
509 |                         output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
510 |                         writer.write(output_line)
511 |                     else:
512 |                         logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])
513 | 
514 |     return results
515 | 
516 | 
517 | if __name__ == "__main__":
518 |     main()
519 | 
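
The prediction-writing loop at the end of `main()` can be hard to follow because it consumes `predictions` in place. Below is a minimal, in-memory sketch of the same alignment logic, assuming the input is a list of CoNLL-style lines and `predictions` holds one list of labels per sentence. The function name `merge_predictions` and the sample data are hypothetical and not part of the repository.

```python
def merge_predictions(lines, predictions):
    """Pair each token line with its predicted label, sentence by sentence.

    Returns the output lines and the tokens that received no prediction
    (for example because the sentence was truncated to max_seq_length).
    """
    remaining = [list(p) for p in predictions]  # copy so the caller's lists are untouched
    output, skipped = [], []
    example_id = 0
    for line in lines:
        if line.startswith("-DOCSTART-") or line == "" or line == "\n":
            output.append(line)
            # Move to the next sentence only once the current one is fully consumed
            if not remaining[example_id]:
                example_id += 1
        elif remaining[example_id]:
            output.append(line.split()[0] + " " + remaining[example_id].pop(0) + "\n")
        else:
            skipped.append(line.split()[0])
    return output, skipped


# Hypothetical sample: the second sentence has one prediction fewer than tokens
lines = ["EU B-ORG\n", "rejects O\n", "\n", "Berlin B-LOC\n", "calls O\n"]
predictions = [["B-ORG", "O"], ["B-LOC"]]
output, skipped = merge_predictions(lines, predictions)
```

In this sample, `calls` ends up in `skipped`, which corresponds to the "Maximum sequence length exceeded" warning in the script.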


--------------------------------------------------------------------------------
/utils_ner.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
  3 | # Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
  4 | #
  5 | # Licensed under the Apache License, Version 2.0 (the "License");
  6 | # you may not use this file except in compliance with the License.
  7 | # You may obtain a copy of the License at
  8 | #
  9 | #     http://www.apache.org/licenses/LICENSE-2.0
 10 | #
 11 | # Unless required by applicable law or agreed to in writing, software
 12 | # distributed under the License is distributed on an "AS IS" BASIS,
 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 14 | # See the License for the specific language governing permissions and
 15 | # limitations under the License.
 16 | """ Named entity recognition fine-tuning: utilities to work with CoNLL-2003 task. """
 17 | 
 18 | from __future__ import absolute_import, division, print_function
 19 | 
 20 | import logging
 21 | import os
 22 | from io import open
 23 | 
 24 | logger = logging.getLogger(__name__)
 25 | 
 26 | 
 27 | class InputExample(object):
 28 |     """A single training/test example for token classification."""
 29 | 
 30 |     def __init__(self, guid, words, labels):
 31 |         """Constructs an InputExample.
 32 |         Args:
 33 |             guid: Unique id for the example.
 34 |             words: list. The words of the sequence.
 35 |             labels: (Optional) list. The labels for each word of the sequence. This should be
 36 |             specified for train and dev examples, but not for test examples.
 37 |         """
 38 |         self.guid = guid
 39 |         self.words = words
 40 |         self.labels = labels
 41 | 
 42 | 
 43 | class InputFeatures(object):
 44 |     """A single set of features of data."""
 45 | 
 46 |     def __init__(self, input_ids, input_mask, segment_ids, label_ids):
 47 |         self.input_ids = input_ids
 48 |         self.input_mask = input_mask
 49 |         self.segment_ids = segment_ids
 50 |         self.label_ids = label_ids
 51 | 
 52 | 
 53 | def read_examples_from_file(data_dir, mode):
 54 |     file_path = os.path.join(data_dir, "{}.txt".format(mode))
 55 |     guid_index = 1
 56 |     examples = []
 57 |     with open(file_path, encoding="utf-8") as f:
 58 |         words = []
 59 |         labels = []
 60 |         for line in f:
 61 |             if line.startswith("-DOCSTART-") or line == "" or line == "\n":
 62 |                 if words:
 63 |                     assert len(words) == len(labels)
 64 |                     examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
 65 |                                                  words=words,
 66 |                                                  labels=labels))
 67 |                     guid_index += 1
 68 |                     words = []
 69 |                     labels = []
 70 |             else:
 71 |                 splits = line.split(" ")
 72 |                 words.append(splits[0])
 73 |                 if len(splits) > 1:
 74 |                     labels.append(splits[-1].replace("\n", ""))
 75 |                 else:
 76 |                     # Examples could have no label for mode = "test"
 77 |                     labels.append("O")
 78 |         if words:
 79 |             assert len(words) == len(labels)
 80 |             examples.append(InputExample(guid="{}-{}".format(mode, guid_index),
 81 |                                          words=words,
 82 |                                          labels=labels))
 83 |     return examples
 84 | 
 85 | 
 86 | def convert_examples_to_features(examples,
 87 |                                  label_list,
 88 |                                  max_seq_length,
 89 |                                  tokenizer,
 90 |                                  cls_token_at_end=False,
 91 |                                  cls_token="[CLS]",
 92 |                                  cls_token_segment_id=1,
 93 |                                  sep_token="[SEP]",
 94 |                                  sep_token_extra=False,
 95 |                                  pad_on_left=False,
 96 |                                  pad_token=0,
 97 |                                  pad_token_segment_id=0,
 98 |                                  pad_token_label_id=-1,
 99 |                                  sequence_a_segment_id=0,
100 |                                  mask_padding_with_zero=True):
101 |     """ Converts a list of `InputExample`s into a list of `InputFeatures`
102 |         `cls_token_at_end` defines the location of the CLS token:
103 |             - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
104 |             - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
105 |         `cls_token_segment_id` defines the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
106 |     """
107 | 
108 |     label_map = {label: i for i, label in enumerate(label_list)}
109 | 
110 |     features = []
111 |     for (ex_index, example) in enumerate(examples):
112 |         if ex_index % 10000 == 0:
113 |             logger.info("Writing example %d of %d", ex_index, len(examples))
114 | 
115 |         tokens = []
116 |         label_ids = []
117 |         for word, label in zip(example.words, example.labels):
118 |             word_tokens = tokenizer.tokenize(word)
119 |             tokens.extend(word_tokens)
120 |             # Use the real label id for the first token of the word, and padding ids for the remaining tokens
121 |             label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
122 | 
123 |         labels_len = len(label_ids)
124 |         tokens_len = len(tokens)
125 |         # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
126 |         special_tokens_count = 3 if sep_token_extra else 2
127 |         if len(tokens) > max_seq_length - special_tokens_count:
128 |             tokens = tokens[:(max_seq_length - special_tokens_count)]
129 |             label_ids = label_ids[:(max_seq_length - special_tokens_count)]
130 | 
131 |         # The convention in BERT is:
132 |         # (a) For sequence pairs:
133 |         #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
134 |         #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
135 |         # (b) For single sequences:
136 |         #  tokens:   [CLS] the dog is hairy . [SEP]
137 |         #  type_ids:   0   0   0   0  0     0   0
138 |         #
139 |         # Where "type_ids" are used to indicate whether this is the first
140 |         # sequence or the second sequence. The embedding vectors for `type=0` and
141 |         # `type=1` were learned during pre-training and are added to the wordpiece
142 |         # embedding vector (and position vector). This is not *strictly* necessary
143 |         # since the [SEP] token unambiguously separates the sequences, but it makes
144 |         # it easier for the model to learn the concept of sequences.
145 |         #
146 |         # For classification tasks, the first vector (corresponding to [CLS]) is
147 |         # used as the "sentence vector". Note that this only makes sense because
148 |         # the entire model is fine-tuned.
149 |         tokens += [sep_token]
150 |         label_ids += [pad_token_label_id]
151 |         if sep_token_extra:
152 |             # roberta uses an extra separator b/w pairs of sentences
153 |             tokens += [sep_token]
154 |             label_ids += [pad_token_label_id]
155 |         segment_ids = [sequence_a_segment_id] * len(tokens)
156 | 
157 |         if cls_token_at_end:
158 |             tokens += [cls_token]
159 |             label_ids += [pad_token_label_id]
160 |             segment_ids += [cls_token_segment_id]
161 |         else:
162 |             tokens = [cls_token] + tokens
163 |             label_ids = [pad_token_label_id] + label_ids
164 |             segment_ids = [cls_token_segment_id] + segment_ids
165 | 
166 |         input_ids = tokenizer.convert_tokens_to_ids(tokens)
167 |         input_len = len(input_ids)
168 | 
169 |         # The mask has 1 for real tokens and 0 for padding tokens. Only real
170 |         # tokens are attended to.
171 |         input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
172 | 
173 |         # Zero-pad up to the sequence length.
174 |         padding_length = max_seq_length - len(input_ids)
175 |         if pad_on_left:
176 |             input_ids = ([pad_token] * padding_length) + input_ids
177 |             input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
178 |             segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
179 |             label_ids = ([pad_token_label_id] * padding_length) + label_ids
180 |         else:
181 |             input_ids += ([pad_token] * padding_length)
182 |             input_mask += ([0 if mask_padding_with_zero else 1] * padding_length)
183 |             segment_ids += ([pad_token_segment_id] * padding_length)
184 |             label_ids += ([pad_token_label_id] * padding_length)
185 | 
186 |         if len(label_ids) != max_seq_length:
187 |             logger.info("*** Error ***")
188 |             logger.info("guid: %s", example.guid)
189 |             logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
190 |             logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
191 |             logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
192 |             logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
193 |             logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
194 |             logger.info("labels_len: %s", str(labels_len))
195 |             logger.info("tokens_len: %s", str(tokens_len))
196 |             logger.info("input_len: %s", str(input_len))
197 | 
198 |         assert len(input_ids) == max_seq_length
199 |         assert len(input_mask) == max_seq_length
200 |         assert len(segment_ids) == max_seq_length
201 |         assert len(label_ids) == max_seq_length
202 | 
203 |         # if ex_index < 1:
204 |         #     logger.info("*** Example ***")
205 |         #     logger.info("guid: %s", example.guid)
206 |         #     logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
207 |         #     logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
208 |         #     logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
209 |         #     logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
210 |         #     logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))
211 |         #     logger.info("labels_len: %s", str(labels_len))
212 |         #     logger.info("tokens_len: %s", str(tokens_len))
213 |         #     logger.info("input_len: %s", str(input_len))
214 | 
215 |         features.append(
216 |                 InputFeatures(input_ids=input_ids,
217 |                               input_mask=input_mask,
218 |                               segment_ids=segment_ids,
219 |                               label_ids=label_ids))
220 |     return features
221 | 
222 | 
223 | # If no labels file is given, fall back to the CoNLL-2003 label set
224 | def get_labels(path):
225 |     if path:
226 |         with open(path, "r") as f:
227 |             labels = f.read().splitlines()
228 |         if "O" not in labels:
229 |             labels = ["O"] + labels
230 |         return labels
231 |     else:
232 |         return ["O", "B-MISC", "I-MISC",  "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
233 | 
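
The word-to-sub-token label alignment in `convert_examples_to_features` can be illustrated in isolation. The sketch below assumes a toy tokenizer that splits words into 3-character pieces as a stand-in for a real WordPiece tokenizer. `toy_tokenize` and `align_labels` are hypothetical names, and `-100` is used as the padding label id because that is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which `run_ner.py` passes in.

```python
def toy_tokenize(word):
    # Hypothetical stand-in for tokenizer.tokenize(): 3-character pieces
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def align_labels(words, labels, label_map, pad_token_label_id=-100):
    """Give the real label id to the first sub-token of each word and the
    padding label id to the remaining sub-tokens, so only one position per
    word contributes to the loss."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = toy_tokenize(word)
        tokens.extend(word_tokens)
        label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
    return tokens, label_ids


label_map = {"O": 0, "B-LOC": 1}
tokens, label_ids = align_labels(["in", "Jacksonville"], ["O", "B-LOC"], label_map)
# tokens    -> ["in", "Jac", "kso", "nvi", "lle"]
# label_ids -> [0, 1, -100, -100, -100]
```

Only the first piece of `Jacksonville` carries the `B-LOC` id. The other pieces are ignored by the loss, which is why evaluation can later map predictions back to one label per word.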


--------------------------------------------------------------------------------