├── .gitignore ├── Dockerfile ├── README.md ├── athnlp ├── __init__.py ├── create_squad_data.py ├── experiments │ ├── fever.json │ ├── nmt_multi30k.jsonnet │ └── qa_bert.jsonnet ├── models │ ├── __init__.py │ ├── fever_text_classification.py │ ├── nmt_seq2seq.py │ ├── qa_bert.py │ └── rnn_language_model.py ├── nlm.py ├── nmt.py ├── qa.py └── readers │ ├── __init__.py │ ├── bert_squad_reader.py │ ├── brown_pos_corpus.py │ ├── en-brown.map │ ├── fever_predictor.py │ ├── fever_reader.py │ ├── label_dictionary.py │ ├── lm_corpus.py │ ├── multi30k_reader.py │ ├── sequence.py │ ├── sequence_dictionary.py │ └── token_indexers │ ├── __init__.py │ └── bert_squad_indexer.py ├── data ├── fever │ ├── test.jsonl │ ├── train.jsonl │ └── validation.jsonl ├── lm │ ├── test.txt │ ├── train.txt │ └── valid.txt ├── multi30k │ ├── val.lc.norm.tok.head-250.en │ ├── val.lc.norm.tok.head-250.fr │ ├── val.lc.norm.tok.head-5.en │ ├── val.lc.norm.tok.head-5.en.jsonl │ ├── val.lc.norm.tok.head-5.fr │ ├── val.lc.norm.tok.head-750.en │ └── val.lc.norm.tok.head-750.fr ├── run_fever.png └── squad │ ├── dev-v2.0-small.json │ ├── test.json │ └── train.json ├── labs-exercises ├── AV_struct_perceptron.html ├── multiclass_perceptron.png ├── neural-encoding-fever.md ├── neural-language-model.md ├── neural-machine-translation.md ├── pos-tagging-perceptron.md ├── pos-tagging-structured-perceptron.md └── question-answering.md ├── requirements.txt ├── setup_dependencies.sh ├── setup_dependencies_Docker.sh └── slides ├── AthensNLP-MT-23Sept2019-ABisazza.pdf ├── Carreras_morning_2.pdf ├── DialogueSystem_VivianChen.pdf ├── MORNING_LECTURE_SLIDES_HERE ├── McDonald_classification.pdf ├── Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf └── athNLP-Lec3-BPlank.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.swo 3 | .idea/ 4 | __pycache__/ 5 | .DS_Store 6 | 7 | models/* 8 | #external_resources/* 9 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6.9 2 | MAINTAINER Andreas Vlachos 3 | 4 | RUN apt-get update -y 5 | RUN apt-get install -y git 6 | 7 | RUN git clone https://github.com/athnlp/athnlp-labs.git 8 | WORKDIR /athnlp-labs 9 | 10 | RUN sh setup_dependencies_Docker.sh 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ΑθNLP 2019 2 | 3 | 4 | ## *Important*: This repository is now archived! 5 | 6 | Exercises for the lab sessions of ΑθNLP 2019. 7 | The labs will cover the following: 8 | 9 | 1. [Part-of-Speech Tagging with the Perceptron algorithm](labs-exercises/pos-tagging-perceptron.md) 10 | 2. [Part-of-Speech Tagging with the Structured Perceptron algorithm](labs-exercises/pos-tagging-structured-perceptron.md) 11 | 3. [Neural Encoding for Text Classification](labs-exercises/neural-encoding-fever.md) 12 | 4. [Neural Language Modeling](labs-exercises/neural-language-model.md) 13 | 5. [Neural Machine Translation](labs-exercises/neural-machine-translation.md) 14 | 6. [Question Answering](labs-exercises/question-answering.md) 15 | 16 | ## Setup 17 | 18 | You will need to have Python 3 installed on your machine; we recommend using [Anaconda](https://www.anaconda.com/), 19 | which is available for the most common OS distributions. 
20 | 21 | For the first two labs we will be using vanilla Python (along with the standard scientific libraries, e.g., NumPy, SciPy, 22 | etc.), while for the rest we will additionally be using [PyTorch](https://pytorch.org/) and 23 | [AllenNLP](https://allennlp.org/). 24 | 25 | Use the Anaconda command-line tools to create a new virtual environment with Python 3.6: 26 | ``` 27 | conda create --name athnlp python=3.6 28 | ``` 29 | After the installation is complete, you should have a new virtual environment called `athnlp` in your Anaconda installation 30 | that you can *activate* using the following command: `conda activate athnlp`. Remember to execute this command before 31 | running the scripts in this repository. 32 | 33 | Next, clone the repository to your computer: 34 | ``` 35 | git clone https://github.com/athnlp/athnlp-labs 36 | ``` 37 | 38 | Finally, install all required dependencies. 39 | We provide a script that will help you set up your environment. Run the command `sh setup_dependencies.sh` and 40 | it will automatically install the project dependencies for you. The script downloads several data dependencies, which might 41 | take some time to install. 42 | 43 | 44 | **Note**: Installing AllenNLP on Mac OS can be tricky; check [here](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave) 45 | for a possible solution. 46 | 47 | ## Docker 48 | 49 | If you prefer (or you are on Windows), you can install Docker and create a Docker image with the following commands: 50 | - build it by running `docker build -t athnlp - < Dockerfile` 51 | - get an interactive terminal on the image with `docker run -i -t athnlp bash` 52 | - run commands as you normally would (remember this is a very minimal Linux installation) 53 | If you want to run the image with a new version of the code, add the `--no-cache` option to the build. 54 | You will need to run the `wget` commands from `setup_dependencies.sh` yourself.
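For reference, the experiment configs in this repository expect those downloads under `resources/` (e.g. `resources/glove.6B.50d.txt.gz` for the FEVER lab and `resources/bert-base-uncased/` for the QA lab); the exact URLs are listed in `setup_dependencies.sh`.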
Make sure you give Docker enough disk space and memory 55 | 56 | -------------------------------------------------------------------------------- /athnlp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/__init__.py -------------------------------------------------------------------------------- /athnlp/create_squad_data.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import json 3 | import os 4 | from argparse import ArgumentParser 5 | from operator import itemgetter 6 | 7 | import numpy as np 8 | 9 | parser = ArgumentParser() 10 | 11 | parser.add_argument("-dataset_file", default="data/squad/dev-v2.0.json", 12 | help="Path to the Devset of SQuAD 2.0", type=str) 13 | parser.add_argument("-output_dir", default="data/squad", 14 | help="Output folder were the files train.json and test.json will be saved") 15 | parser.add_argument("--percentage_train", 16 | type=float, 17 | help="Percentage of questions associated to a given paragraph to retain for training", default=70.0) 18 | parser.add_argument("--wiki_title", help="Wikipedia page title used as reference to create the dataset", 19 | default="Normans") 20 | parser.add_argument("--remove_impossible", action='store_true') 21 | 22 | 23 | def create_dataset_splits(dataset, percentage_train, remove_impossible=True): 24 | train = {'version': dataset['version'], 'data': []} 25 | test = {'version': dataset['version'], 'data': []} 26 | 27 | for example_set in dataset["data"]: 28 | curr_train = {"title": example_set["title"], "paragraphs": []} 29 | curr_test = {"title": example_set["title"], "paragraphs": []} 30 | for paragraph in example_set["paragraphs"]: 31 | num_questions = len(paragraph["qas"]) 32 | 33 | question_ids = np.arange(num_questions) 34 | 35 | np.random.shuffle(question_ids) 36 | 37 | ref_index = int(percentage_train * num_questions) 38 | train_indexes = question_ids[:ref_index] 39 | test_indexes = question_ids[ref_index:] 40 | 41 | train_paragraph = copy.copy(paragraph) 42 | train_qas = itemgetter(*train_indexes)(paragraph["qas"]) 43 | if isinstance(train_qas, dict): 44 | train_qas = [train_qas] 45 | if remove_impossible: 46 | train_qas = [x for x in train_qas if not x['is_impossible']] 47 | train_paragraph["qas"] = train_qas 48 | test_paragraph = copy.copy(paragraph) 49 | test_qas = itemgetter(*test_indexes)(paragraph["qas"]) 50 | if isinstance(test_qas, dict): 51 | test_qas = [test_qas] 52 | if remove_impossible: 53 | test_qas = [x for x in test_qas if not x['is_impossible']] 54 | test_paragraph["qas"] = test_qas 55 | 56 | curr_train["paragraphs"].append(train_paragraph) 57 | curr_test["paragraphs"].append(test_paragraph) 58 | 59 | train["data"].append(curr_train) 60 | test["data"].append(curr_test) 61 | 62 | return train, test 63 | 64 | 65 | def main(args): 66 | with open(args.dataset_file) as in_file: 67 | dataset = json.load(in_file) 68 | 69 | # We extract only data associated to the Wikipedia page of the Normans 70 | filtered_dataset = {'version': dataset['version'], 'data': []} 71 | 72 | for example in dataset["data"]: 73 | if example["title"] == args.wiki_title: 74 | filtered_dataset["data"].append(example) 75 | 76 | total_num_paragraphs = 0 77 | total_num_questions = 0 78 | 79 | for example_set in filtered_dataset["data"]: 80 | total_num_paragraphs += len(example_set["paragraphs"]) 81 | 82 | for paragraph in 
example_set["paragraphs"]: 83 | total_num_questions += len(paragraph["qas"]) 84 | 85 | print("Wikipedia page title: {}".format(args.wiki_title)) 86 | print("Total number of paragraphs: {}".format(total_num_paragraphs)) 87 | print("Total number of questions: {}".format(total_num_questions)) 88 | 89 | train, test = create_dataset_splits(filtered_dataset, args.percentage_train / 100, args.remove_impossible) 90 | 91 | print("-- Saving training and test files to directory: {}".format(args.output_dir)) 92 | with open(os.path.join(args.output_dir, "train.json"), mode="w") as out_file: 93 | json.dump(train, out_file) 94 | 95 | with open(os.path.join(args.output_dir, "test.json"), mode="w") as out_file: 96 | json.dump(test, out_file) 97 | 98 | 99 | if __name__ == "__main__": 100 | args = parser.parse_args() 101 | 102 | main(args) 103 | -------------------------------------------------------------------------------- /athnlp/experiments/fever.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "type": "feverlite", 4 | "token_indexers": { 5 | "tokens": { 6 | "type": "single_id", 7 | "lowercase_tokens": true 8 | } 9 | }, 10 | "wiki_tokenizer": { 11 | "type":"word", 12 | "word_splitter": { 13 | "type": "just_spaces" 14 | } 15 | }, 16 | "claim_tokenizer": { 17 | "type":"word", 18 | "word_splitter": { 19 | "type": "simple" 20 | } 21 | } 22 | }, 23 | "train_data_path": "data/fever/train.jsonl", 24 | "validation_data_path": "data/fever/validation.jsonl", 25 | "model": { 26 | "type": "fever", 27 | "text_field_embedder": { 28 | "tokens": { 29 | "type": "embedding", 30 | "pretrained_file": "resources/glove.6B.50d.txt.gz", 31 | "embedding_dim": 50, 32 | "trainable": false 33 | } 34 | }, 35 | "final_feedforward": { 36 | "input_dim": 100, 37 | "num_layers": 3, 38 | "hidden_dims": [100, 100, 2], 39 | "activations": ["relu","relu","linear"], 40 | "dropout": 0.0 41 | }, 42 | "initializer": [ 43 | [".*linear_layers.*weight", {"type": "xavier_normal"}] 44 | ] 45 | }, 46 | "iterator": { 47 | "type": "bucket", 48 | "sorting_keys": [["claim", "num_tokens"], ["evidence", "num_tokens"]], 49 | "batch_size": 32, 50 | "instances_per_epoch": 16000 51 | }, 52 | "trainer": { 53 | "num_epochs": 20, 54 | "cuda_device": -1, 55 | "validation_metric": "+accuracy", 56 | "optimizer": { 57 | "type": "sgd", 58 | "lr": 0.01 59 | 60 | } 61 | } 62 | } 63 | -------------------------------------------------------------------------------- /athnlp/experiments/nmt_multi30k.jsonnet: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "type": "multi30k", 4 | "language_pairs": { 5 | "source": "en", 6 | "target": "fr" 7 | }, 8 | "source_token_indexers": { 9 | "source_tokens": { 10 | "type": "single_id", 11 | "namespace": "source_tokens" 12 | } 13 | }, 14 | "target_token_indexers": { 15 | "target_tokens": { 16 | "type": "single_id", 17 | "namespace": "target_tokens" 18 | } 19 | } 20 | }, 21 | "train_data_path": "data/multi30k/val.lc.norm.tok.head-750", 22 | "validation_data_path": "data/multi30k/val.lc.norm.tok.head-250", 23 | "model": { 24 | "type": "nmt_seq2seq", 25 | "source_embedder": { 26 | "token_embedders": { 27 | "source_tokens": { 28 | "type": "embedding", 29 | "embedding_dim": 50, 30 | "trainable": true, 31 | "vocab_namespace": "source_tokens" 32 | } 33 | } 34 | }, 35 | "target_namespace": "target_tokens", 36 | // "attention" : { 37 | // "type" : "dot_product" 38 | // }, 39 | "encoder": { 40 | "type": "lstm", 
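// Note: with "bidirectional": true below, the encoder output dimension is
// 2 * hidden_size = 400, matching the decoder's hidden_size, since the decoder
// hidden state is initialised from the final encoder state.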
41 | "input_size": 50, 42 | "hidden_size": 200, 43 | "num_layers": 1, 44 | "dropout": 0.3, 45 | "bidirectional": true 46 | }, 47 | "decoder": { 48 | "type": "lstm", 49 | "input_size": 50, 50 | "hidden_size": 400 51 | }, 52 | "max_decoding_steps": 15, 53 | "beam_size": 1 54 | }, 55 | "iterator": { 56 | "type": "bucket", 57 | "sorting_keys": [ 58 | [ 59 | "source_tokens", 60 | "num_tokens" 61 | ] 62 | ], 63 | "batch_size": 1 64 | }, 65 | "trainer": { 66 | "optimizer": "adam", 67 | "num_epochs": 100, 68 | "patience": 10, 69 | "validation_metric": "-loss", 70 | "cuda_device": -1 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /athnlp/experiments/qa_bert.jsonnet: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "lazy": false, 4 | "type": "bert_squad", 5 | "tokenizer": { 6 | "word_splitter": { 7 | "type": "bert-basic-wordpiece", 8 | "pretrained_model": "resources/bert-base-uncased/vocab.txt" 9 | } 10 | }, 11 | "token_indexers": { 12 | "bert": { 13 | "type": "bert-squad-indexer", 14 | "pretrained_model": "resources/bert-base-uncased/vocab.txt" 15 | } 16 | }, 17 | "version_2": true, 18 | "max_sequence_length": 384, 19 | "question_length_limit": 64, 20 | "doc_stride": 128 21 | }, 22 | "train_data_path": "data/squad/train.json", 23 | "validation_data_path": "data/squad/test.json", 24 | "model": { 25 | "type": "qa_bert", 26 | "bert_model": "resources/bert-base-uncased/", 27 | "dropout": 0.1 28 | }, 29 | "iterator": { 30 | "type": "bucket", 31 | "sorting_keys": [["tokens", "num_tokens"]], 32 | "batch_size": 2 33 | }, 34 | "trainer": { 35 | "optimizer": { 36 | "type": "adam", 37 | "lr": 0.0001 38 | }, 39 | "validation_metric": "+f1", 40 | "num_serialized_models_to_keep": 1, 41 | "num_epochs": 3, 42 | "grad_norm": 1.0, 43 | "patience": 5, 44 | "cuda_device": -1 45 | } 46 | } -------------------------------------------------------------------------------- /athnlp/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/models/__init__.py -------------------------------------------------------------------------------- /athnlp/models/fever_text_classification.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Dict, List, Any 2 | 3 | import allennlp 4 | import torch 5 | from allennlp.nn.util import get_text_field_mask 6 | from torch import nn 7 | from torch.nn import functional as F 8 | from allennlp.data import Vocabulary 9 | from allennlp.models import Model 10 | from allennlp.modules import TextFieldEmbedder, FeedForward 11 | from allennlp.nn import InitializerApplicator, RegularizerApplicator 12 | from allennlp.training.metrics import CategoricalAccuracy 13 | 14 | 15 | @Model.register("fever") 16 | class FEVERTextClassificationModel(Model): 17 | 18 | def __init__(self, 19 | vocab: Vocabulary, 20 | text_field_embedder: TextFieldEmbedder, 21 | final_feedforward: FeedForward, 22 | initializer: InitializerApplicator = InitializerApplicator(), 23 | regularizer: Optional[RegularizerApplicator] = None, 24 | ) -> None: 25 | 26 | super().__init__(vocab,regularizer) 27 | 28 | # Model components 29 | self._embedder = text_field_embedder 30 | self._feed_forward = final_feedforward 31 | 32 | # For accuracy and loss for training/evaluation of model 33 | self._accuracy = CategoricalAccuracy() 
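# Note: nn.CrossEntropyLoss applies log-softmax internally, so it expects
# the raw, unnormalised label logits computed in forward() below.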
34 | self._loss = nn.CrossEntropyLoss() 35 | 36 | # Initialize weights 37 | initializer(self) 38 | 39 | 40 | def forward(self, 41 | claim: Dict[str, torch.LongTensor], 42 | evidence: Dict[str, torch.LongTensor], 43 | label: torch.IntTensor = None, 44 | metadata: List[Dict[str, Any]] = None) -> Dict[str, torch.Tensor]: 45 | # pylint: disable=arguments-differ 46 | """ 47 | Parameters 48 | ---------- 49 | claim : Dict[str, torch.LongTensor] 50 | From a ``TextField`` 51 | The LongTensor Shape is typically ``(batch_size, sent_length)` 52 | evidence : Dict[str, torch.LongTensor] 53 | From a ``TextField`` 54 | The LongTensor Shape is typically ``(batch_size, sent_length)` 55 | label : torch.IntTensor, optional, (default = None) 56 | From a ``LabelField`` 57 | metadata : ``List[Dict[str, Any]]``, optional, (default = None) 58 | Metadata containing the original tokenization of the claim and 59 | evidence sentences with 'claim_tokens' and 'premise_tokens' keys respectively. 60 | Returns 61 | ------- 62 | An output dictionary consisting of: 63 | 64 | label_logits : torch.FloatTensor 65 | A tensor of shape ``(batch_size, num_labels)`` representing unnormalised log 66 | probabilities of the entailment label. 67 | label_probs : torch.FloatTensor 68 | A tensor of shape ``(batch_size, num_labels)`` representing probabilities of the 69 | entailment label. 70 | loss : torch.FloatTensor, optional 71 | A scalar loss to be optimised. 72 | """ 73 | 74 | 75 | # TODO - Delete this line when you start working on your solution 76 | raise NotImplementedError("Compute label logits (for supported and refuted) for the given Claim and Evidence input") 77 | 78 | # TODO - Uncomment the code below 79 | 80 | #label_logits = # TODO compute label logits for input 81 | #label_probs = F.softmax(label_logits, dim=-1) 82 | 83 | #output_dict = {"label_logits": label_logits, 84 | # "label_probs": label_probs} 85 | 86 | #if label is not None: 87 | # loss = self._loss(label_logits, label.long().view(-1)) 88 | # self._accuracy(label_logits, label) 89 | # output_dict["loss"] = loss 90 | 91 | #if metadata is not None: 92 | # output_dict["claim_tokens"] = [x["claim_tokens"] for x in metadata] 93 | # output_dict["evidence_tokens"] = [x["evidence_tokens"] for x in metadata] 94 | 95 | #return output_dict 96 | 97 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 98 | return { 99 | 'accuracy': self._accuracy.get_metric(reset), 100 | } 101 | -------------------------------------------------------------------------------- /athnlp/models/nmt_seq2seq.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Tuple 2 | 3 | import numpy 4 | from overrides import overrides 5 | import torch 6 | import torch.nn.functional as F 7 | from torch.nn.modules.linear import Linear 8 | from torch.nn.modules.rnn import LSTMCell 9 | from torch.nn.modules.rnn import GRUCell 10 | from allennlp.common.checks import ConfigurationError 11 | from allennlp.common.util import START_SYMBOL, END_SYMBOL 12 | from allennlp.data.vocabulary import Vocabulary 13 | from allennlp.modules import TextFieldEmbedder, Seq2SeqEncoder 14 | from allennlp.models.model import Model 15 | from allennlp.modules.token_embedders import Embedding 16 | from allennlp.nn import util 17 | from allennlp.nn.beam_search import BeamSearch 18 | from allennlp.training.metrics import BLEU 19 | from allennlp.nn.util import masked_softmax 20 | 21 | @Model.register("nmt_seq2seq") 22 | class NmtSeq2Seq(Model): 23 | """ 24 | This 
``NmtSeq2Seq`` class is an adaptation from the SimpleSeq2Seq :class:`Model` from the AllenNLP toolkit, 25 | which takes a sequence, encodes it, and then uses the encoded representations to decode another sequence. 26 | We have removed some functionality . 27 | 28 | Parameters 29 | ---------- 30 | vocab : ``Vocabulary``, required 31 | Vocabulary containing source and target vocabularies. They may be under the same namespace 32 | (`tokens`) or the target tokens can have a different namespace, in which case it needs to 33 | be specified as `target_namespace`. 34 | source_embedder : ``TextFieldEmbedder``, required 35 | Embedder for source side sequences 36 | target_namespace : ``str``, 37 | If the target side vocabulary is different from the source side's, you need to specify the 38 | target's namespace here. If not, we'll assume it is "tokens", which is also the default 39 | choice for the source side, and this might cause them to share vocabularies. 40 | target_embedding_dim : ``int``, optional (default = source_embedding_dim) 41 | You can specify an embedding dimensionality for the target side. If not, we'll use the same 42 | value as the source embedder's. 43 | encoder : ``Seq2SeqEncoder``, required 44 | The encoder of the "encoder/decoder" model 45 | decoder : ``Dict``, required 46 | The parameters for the decoder RNN cell of the "encoder/decoder" model 47 | max_decoding_steps : ``int`` 48 | Maximum length of decoded sequences. 49 | You can specify an embedding dimensionality for the target side. If not, we'll use the same 50 | value as the source embedder's. 51 | attention : ``Dict``, optional (default = None) 52 | If you want to use attention to get a dynamic summary of the encoder outputs at each step 53 | of decoding, this is the Dict that holds parameters for the appropriate attention function to compute similarity 54 | between the decoder hidden state and encoder outputs. 55 | beam_size : ``int``, optional (default = None) 56 | Width of the beam for beam search. If not specified, greedy decoding is used. 57 | scheduled_sampling_ratio : ``float``, optional (default = 0.) 58 | At each timestep during training, we sample a random number between 0 and 1, and if it is 59 | not less than this value, we use the ground truth labels for the whole batch. Else, we use 60 | the predictions from the previous time step for the whole batch. If this value is 0.0 61 | (default), this corresponds to teacher forcing, and if it is 1.0, it corresponds to not 62 | using target side ground truth labels. See the following paper for more information: 63 | `Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Bengio et al., 64 | 2015 `_. 65 | use_bleu : ``bool``, optional (default = True) 66 | If True, the BLEU metric will be calculated during validation. 
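    visualize_attention : ``bool``, optional (default = False)
        If True, validation runs the greedy ``_forward_loop`` instead of beam search,
        e.g. so that the computed attention weights can be inspected.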
67 | """ 68 | 69 | def __init__(self, 70 | vocab: Vocabulary, 71 | source_embedder: TextFieldEmbedder, 72 | target_namespace: str, 73 | encoder: Seq2SeqEncoder, 74 | decoder: Dict, 75 | max_decoding_steps: int, 76 | target_embedding_dim: int = None, 77 | attention: Dict = None, 78 | beam_size: int = None, 79 | scheduled_sampling_ratio: float = 0., 80 | use_bleu: bool = True, 81 | visualize_attention: bool = False) -> None: 82 | super(NmtSeq2Seq, self).__init__(vocab) 83 | 84 | self._scheduled_sampling_ratio = scheduled_sampling_ratio 85 | self._target_namespace = target_namespace 86 | # We need the start symbol to provide as the input at the first timestep of decoding, and 87 | # end symbol as a way to indicate the end of the decoded sequence. 88 | self._start_index = self.vocab.get_token_index(START_SYMBOL, self._target_namespace) 89 | self._end_index = self.vocab.get_token_index(END_SYMBOL, self._target_namespace) 90 | 91 | if use_bleu: 92 | pad_index = self.vocab.get_token_index(self.vocab._padding_token, self._target_namespace) # pylint: disable=protected-access 93 | self._bleu = BLEU(exclude_indices={pad_index, self._end_index, self._start_index}) 94 | else: 95 | self._bleu = None 96 | 97 | # At prediction time, we use a beam search to find the most likely sequence of target tokens. 98 | beam_size = beam_size or 1 99 | self._max_decoding_steps = max_decoding_steps 100 | self._beam_search = BeamSearch(self._end_index, max_steps=max_decoding_steps, beam_size=beam_size) 101 | 102 | # Dense embedding of source vocab tokens. 103 | self._source_embedder = source_embedder 104 | 105 | # Encodes the sequence of source embeddings into a sequence of hidden states. 106 | self._encoder = encoder 107 | 108 | num_classes = self.vocab.get_vocab_size(self._target_namespace) 109 | 110 | # Attention mechanism params applied to the encoder output for each step. 111 | self._attention = attention 112 | 113 | self._visualize_attention = visualize_attention 114 | 115 | # Dense embedding of vocab words in the target space. 116 | target_embedding_dim = target_embedding_dim or source_embedder.get_output_dim() 117 | self._target_embedder = Embedding(num_classes, target_embedding_dim) 118 | 119 | # Decoder output dim needs to be the same as the encoder output dim since we initialize the 120 | # hidden state of the decoder with the final hidden state of the encoder. 121 | self._encoder_output_dim = self._encoder.get_output_dim() 122 | # self._decoder_output_dim = self._encoder_output_dim 123 | 124 | self._decoder_input_dim = decoder["input_size"] 125 | # If using attention make sure the .jsonnet params reflect this architecture: 126 | # input_to_decoder_rnn = [prev_word + attended_context_vector] 127 | self._decoder_output_dim = decoder['hidden_size'] 128 | 129 | # We'll use an RNN cell as the recurrent cell that produces a hidden state 130 | # for the decoder at each time step. 131 | decoder_cell_type = decoder["type"] 132 | 133 | if decoder_cell_type == "gru": 134 | self._decoder_cell = GRUCell(self._decoder_input_dim, self._decoder_output_dim) 135 | elif decoder_cell_type == "lstm": 136 | self._decoder_cell = LSTMCell(self._decoder_input_dim, self._decoder_output_dim) 137 | else: 138 | raise ValueError("Dialogue encoder of type {} not supported yet!".format(decoder_cell_type)) 139 | 140 | # We project the hidden state from the decoder into the output vocabulary space 141 | # in order to get log probabilities of each target token, at each time step. 
142 | self._output_projection_layer = Linear(self._decoder_output_dim, num_classes) 143 | 144 | def take_step(self, 145 | last_predictions: torch.Tensor, 146 | state: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]: 147 | """ 148 | Take a decoding step. This is called by the beam search class. 149 | 150 | Parameters 151 | ---------- 152 | last_predictions : ``torch.Tensor`` 153 | A tensor of shape ``(group_size,)``, which gives the indices of the predictions 154 | during the last time step. 155 | state : ``Dict[str, torch.Tensor]`` 156 | A dictionary of tensors that contain the current state information 157 | needed to predict the next step, which includes the encoder outputs, 158 | the source mask, and the decoder hidden state and context. Each of these 159 | tensors has shape ``(group_size, *)``, where ``*`` can be any other number 160 | of dimensions. 161 | 162 | Returns 163 | ------- 164 | Tuple[torch.Tensor, Dict[str, torch.Tensor]] 165 | A tuple of ``(log_probabilities, updated_state)``, where ``log_probabilities`` 166 | is a tensor of shape ``(group_size, num_classes)`` containing the predicted 167 | log probability of each class for the next step, for each item in the group, 168 | while ``updated_state`` is a dictionary of tensors containing the encoder outputs, 169 | source mask, and updated decoder hidden state and context. 170 | 171 | Notes 172 | ----- 173 | We treat the inputs as a batch, even though ``group_size`` is not necessarily 174 | equal to ``batch_size``, since the group may contain multiple states 175 | for each source sentence in the batch. 176 | """ 177 | # shape: (group_size, num_classes) 178 | output_projections, state = self._prepare_output_projections(last_predictions, state) 179 | 180 | # shape: (group_size, num_classes) 181 | class_log_probabilities = F.log_softmax(output_projections, dim=-1) 182 | 183 | return class_log_probabilities, state 184 | 185 | @overrides 186 | def forward(self, # type: ignore 187 | source_tokens: Dict[str, torch.LongTensor], 188 | target_tokens: Dict[str, torch.LongTensor] = None) -> Dict[str, torch.Tensor]: 189 | # pylint: disable=arguments-differ 190 | """ 191 | Make foward pass with decoder logic for producing the entire target sequence. 192 | 193 | Parameters 194 | ---------- 195 | source_tokens : ``Dict[str, torch.LongTensor]`` 196 | The output of `TextField.as_array()` applied on the source `TextField`. This will be 197 | passed through a `TextFieldEmbedder` and then through an encoder. 198 | target_tokens : ``Dict[str, torch.LongTensor]``, optional (default = None) 199 | Output of `Textfield.as_array()` applied on target `TextField`. We assume that the 200 | target tokens are also represented as a `TextField`. 201 | 202 | Returns 203 | ------- 204 | Dict[str, torch.Tensor] 205 | """ 206 | state = self._encode(source_tokens) 207 | 208 | if target_tokens: 209 | state = self._init_decoder_state(state) 210 | # The `_forward_loop` decodes the input sequence and computes the loss during training 211 | # and validation. 
212 | output_dict = self._forward_loop(state, target_tokens) 213 | else: 214 | output_dict = {} 215 | 216 | if not self.training: 217 | state = self._init_decoder_state(state) 218 | if self._visualize_attention: 219 | output_dict = self._forward_loop(state, target_tokens) 220 | else: 221 | predictions = self._forward_beam_search(state) 222 | output_dict.update(predictions) 223 | if target_tokens and self._bleu: 224 | # shape: (batch_size, beam_size, max_sequence_length) 225 | top_k_predictions = output_dict["predictions"] 226 | # shape: (batch_size, max_predicted_sequence_length) 227 | best_predictions = top_k_predictions[:, 0, :] 228 | self._bleu(best_predictions, target_tokens[self._target_namespace]) 229 | 230 | return output_dict 231 | 232 | @overrides 233 | def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 234 | """ 235 | Finalize predictions. 236 | 237 | This method overrides ``Model.decode``, which gets called after ``Model.forward``, at test 238 | time, to finalize predictions. The logic for the decoder part of the encoder-decoder lives 239 | within the ``forward`` method. 240 | 241 | This method trims the output predictions to the first end symbol, replaces indices with 242 | corresponding tokens, and adds a field called ``predicted_tokens`` to the ``output_dict``. 243 | """ 244 | predicted_indices = output_dict["predictions"] 245 | if not isinstance(predicted_indices, numpy.ndarray): 246 | predicted_indices = predicted_indices.detach().cpu().numpy() 247 | all_predicted_tokens = [] 248 | for indices in predicted_indices: 249 | # Beam search gives us the top k results for each source sentence in the batch 250 | # but we just want the single best. 251 | if len(indices.shape) > 1: 252 | indices = indices[0] 253 | indices = list(indices) 254 | # Collect indices till the first end_symbol 255 | if self._end_index in indices: 256 | indices = indices[:indices.index(self._end_index)] 257 | predicted_tokens = [self.vocab.get_token_from_index(x, namespace=self._target_namespace) 258 | for x in indices] 259 | all_predicted_tokens.append(predicted_tokens) 260 | output_dict["predicted_tokens"] = all_predicted_tokens 261 | return output_dict 262 | 263 | def _encode(self, source_tokens: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 264 | # shape: (batch_size, max_input_sequence_length, encoder_input_dim) 265 | embedded_input = self._source_embedder(source_tokens) 266 | # shape: (batch_size, max_input_sequence_length) 267 | source_mask = util.get_text_field_mask(source_tokens) 268 | # shape: (batch_size, max_input_sequence_length, encoder_output_dim) 269 | encoder_outputs = self._encoder(embedded_input, source_mask) 270 | return { 271 | "source_mask": source_mask, 272 | "encoder_outputs": encoder_outputs, 273 | } 274 | 275 | def _init_decoder_state(self, state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 276 | batch_size = state["source_mask"].size(0) 277 | # shape: (batch_size, encoder_output_dim) 278 | final_encoder_output = util.get_final_encoder_states( 279 | state["encoder_outputs"], 280 | state["source_mask"], 281 | self._encoder.is_bidirectional()) 282 | # Initialize the decoder hidden state with the final output of the encoder. 
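# (For a bidirectional encoder, ``get_final_encoder_states`` concatenates the final
# forward state with the first-timestep backward state, so this vector spans the
# full encoder output dim; that is why decoder_output_dim must match it.)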
283 | # shape: (batch_size, decoder_output_dim) 284 | state["decoder_hidden"] = final_encoder_output 285 | # shape: (batch_size, decoder_output_dim) 286 | state["decoder_context"] = state["encoder_outputs"].new_zeros(batch_size, self._decoder_output_dim) 287 | return state 288 | 289 | def _forward_loop(self, 290 | state: Dict[str, torch.Tensor], 291 | target_tokens: Dict[str, torch.LongTensor] = None) -> Dict[str, torch.Tensor]: 292 | """ 293 | Make forward pass during training or do greedy search during prediction. 294 | 295 | Notes 296 | ----- 297 | We really only use the predictions from the method to test that beam search 298 | with a beam size of 1 gives the same results. 299 | """ 300 | # shape: (batch_size, max_input_sequence_length) 301 | source_mask = state["source_mask"] 302 | 303 | batch_size = source_mask.size()[0] 304 | 305 | if target_tokens: 306 | # shape: (batch_size, max_target_sequence_length) 307 | targets = target_tokens[self._target_namespace] 308 | 309 | _, target_sequence_length = targets.size() 310 | 311 | # The last input from the target is either padding or the end symbol. 312 | # Either way, we don't have to process it. 313 | num_decoding_steps = target_sequence_length - 1 314 | else: 315 | num_decoding_steps = self._max_decoding_steps 316 | 317 | # Initialize target predictions with the start index. 318 | # shape: (batch_size,) 319 | last_predictions = source_mask.new_full((batch_size,), fill_value=self._start_index) 320 | 321 | step_logits: List[torch.Tensor] = [] 322 | step_predictions: List[torch.Tensor] = [] 323 | 324 | for timestep in range(num_decoding_steps): 325 | if self.training and torch.rand(1).item() < self._scheduled_sampling_ratio: 326 | # Use gold tokens at test time and at a rate of 1 - _scheduled_sampling_ratio 327 | # during training. 328 | # shape: (batch_size,) 329 | input_choices = last_predictions 330 | elif not target_tokens: 331 | # shape: (batch_size,) 332 | input_choices = last_predictions 333 | else: 334 | # shape: (batch_size,) 335 | input_choices = targets[:, timestep] 336 | 337 | # shape: (batch_size, num_classes) 338 | output_projections, state = self._prepare_output_projections(input_choices, state) 339 | 340 | # list of tensors, shape: (batch_size, 1, num_classes) 341 | step_logits.append(output_projections.unsqueeze(1)) 342 | 343 | # shape: (batch_size, num_classes) 344 | class_probabilities = F.softmax(output_projections, dim=-1) 345 | 346 | # shape (predicted_classes): (batch_size,) 347 | _, predicted_classes = torch.max(class_probabilities, 1) 348 | 349 | # shape (predicted_classes): (batch_size,) 350 | last_predictions = predicted_classes 351 | 352 | step_predictions.append(last_predictions.unsqueeze(1)) 353 | 354 | # shape: (batch_size, num_decoding_steps) 355 | predictions = torch.cat(step_predictions, 1) 356 | 357 | output_dict = {"predictions": predictions} 358 | 359 | if target_tokens: 360 | # shape: (batch_size, num_decoding_steps, num_classes) 361 | logits = torch.cat(step_logits, 1) 362 | 363 | # Compute loss. 
364 | target_mask = util.get_text_field_mask(target_tokens) 365 | loss = self._get_loss(logits, targets, target_mask) 366 | output_dict["loss"] = loss 367 | 368 | return output_dict 369 | 370 | def _forward_beam_search(self, state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 371 | """Make forward pass during prediction using a beam search.""" 372 | batch_size = state["source_mask"].size()[0] 373 | start_predictions = state["source_mask"].new_full((batch_size,), fill_value=self._start_index) 374 | 375 | # shape (all_top_k_predictions): (batch_size, beam_size, num_decoding_steps) 376 | # shape (log_probabilities): (batch_size, beam_size) 377 | all_top_k_predictions, log_probabilities = self._beam_search.search( 378 | start_predictions, state, self.take_step) 379 | 380 | output_dict = { 381 | "class_log_probabilities": log_probabilities, 382 | "predictions": all_top_k_predictions, 383 | } 384 | return output_dict 385 | 386 | def _prepare_output_projections(self, 387 | last_predictions: torch.Tensor, 388 | state: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]: # pylint: disable=line-too-long 389 | """ 390 | Decode current state and last prediction to produce projections 391 | into the target space, which can then be used to get probabilities of 392 | each target token for the next step. 393 | 394 | Inputs are the same as for `take_step()`. 395 | """ 396 | # shape: (group_size, max_input_sequence_length, encoder_output_dim) 397 | encoder_outputs = state["encoder_outputs"] 398 | 399 | # shape: (group_size, max_input_sequence_length) 400 | source_mask = state["source_mask"] 401 | 402 | # shape: (group_size, decoder_output_dim) 403 | decoder_hidden = state["decoder_hidden"] 404 | 405 | # shape: (group_size, decoder_output_dim) 406 | decoder_context = state["decoder_context"] 407 | 408 | # shape: (group_size, target_embedding_dim) 409 | embedded_input = self._target_embedder(last_predictions) 410 | 411 | # TODO: Compute attention right about here... 412 | decoder_input = embedded_input 413 | 414 | # shape (decoder_hidden): (batch_size, decoder_output_dim) 415 | # shape (decoder_context): (batch_size, decoder_output_dim) 416 | decoder_hidden, decoder_context = self._decoder_cell( 417 | decoder_input, 418 | (decoder_hidden, decoder_context)) 419 | 420 | state["decoder_hidden"] = decoder_hidden 421 | state["decoder_context"] = decoder_context 422 | 423 | # shape: (group_size, num_classes) 424 | output_projections = self._output_projection_layer(decoder_hidden) 425 | 426 | return output_projections, state 427 | 428 | # TODO: Implement attention mechanisms here 429 | def _compute_attention(self, 430 | decoder_hidden_state: torch.LongTensor = None, 431 | encoder_outputs: torch.LongTensor = None, 432 | encoder_outputs_mask: torch.LongTensor = None) -> torch.Tensor: 433 | """Apply attention over encoder outputs and decoder state. 434 | Parameters 435 | ---------- 436 | decoder_hidden_state : ``torch.LongTensor`` 437 | A tensor of shape ``(batch_size, decoder_output_dim)``, which contains the current decoder hidden state to be used 438 | as the 'query' to the attention computation 439 | during the last time step. 
440 | encoder_outputs : ``torch.LongTensor`` 441 | A tensor of shape ``(batch_size, max_input_sequence_length, encoder_output_dim)``, which contains all the 442 | encoder hidden states of the source tokens, i.e., the 'keys' to the attention computation 443 | encoder_mask : ``torch.LongTensor`` 444 | A tensor of shape (batch_size, max_input_sequence_length), which contains the mask of the encoded input. 445 | We want to avoid computing an attention score for positions of the source with zero-values (remember not all 446 | input sentences have the same length) 447 | 448 | Returns 449 | ------- 450 | torch.Tensor 451 | A tensor of shape (batch_size, encoder_output_dim) that contains the attended encoder outputs (aka context vector), 452 | i.e., we have ``applied`` the attention scores on the encoder hidden states. 453 | 454 | Notes 455 | ----- 456 | Don't forget to apply the final softmax over the **masked** encoder outputs! 457 | """ 458 | 459 | # Ensure mask is also a FloatTensor. Or else the multiplication within 460 | # attention will complain. 461 | # shape: (batch_size, max_input_sequence_length) 462 | encoder_outputs_mask = encoder_outputs_mask.float() 463 | 464 | # Main body of attention weights computation here 465 | 466 | return None 467 | 468 | @staticmethod 469 | def _get_loss(logits: torch.LongTensor, 470 | targets: torch.LongTensor, 471 | target_mask: torch.LongTensor) -> torch.Tensor: 472 | """ 473 | Compute loss. 474 | 475 | Takes logits (unnormalized outputs from the decoder) of size (batch_size, 476 | num_decoding_steps, num_classes), target indices of size (batch_size, num_decoding_steps+1) 477 | and corresponding masks of size (batch_size, num_decoding_steps+1) steps and computes cross 478 | entropy loss while taking the mask into account. 479 | 480 | The length of ``targets`` is expected to be greater than that of ``logits`` because the 481 | decoder does not need to compute the output corresponding to the last timestep of 482 | ``targets``. This method aligns the inputs appropriately to compute the loss. 483 | 484 | During training, we want the logit corresponding to timestep i to be similar to the target 485 | token from timestep i + 1. That is, the targets should be shifted by one timestep for 486 | appropriate comparison. Consider a single example where the target has 3 words, and 487 | padding is to 7 tokens. 488 | The complete sequence would correspond to w1 w2 w3
<E> <P> <P> 489 | and the mask would be 1 1 1 1 1 0 0 490 | and let the logits be l1 l2 l3 l4 l5 l6 491 | We actually need to compare: 492 | the sequence w1 w2 w3 <E> <P> <P> 493 | with masks 1 1 1 1 0 0 494 | against l1 l2 l3 l4 l5 l6 495 | (where the input was) <S> w1 w2 w3 <E> <P>
496 | """ 497 | # shape: (batch_size, num_decoding_steps) 498 | relevant_targets = targets[:, 1:].contiguous() 499 | 500 | # shape: (batch_size, num_decoding_steps) 501 | relevant_mask = target_mask[:, 1:].contiguous() 502 | 503 | return util.sequence_cross_entropy_with_logits(logits, relevant_targets, relevant_mask) 504 | 505 | @overrides 506 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 507 | all_metrics: Dict[str, float] = {} 508 | if self._bleu and not self.training: 509 | all_metrics.update(self._bleu.get_metric(reset=reset)) 510 | return all_metrics 511 | -------------------------------------------------------------------------------- /athnlp/models/qa_bert.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, Optional 2 | 3 | import torch 4 | from allennlp.data.vocabulary import Vocabulary 5 | from allennlp.models.model import Model 6 | from allennlp.nn import RegularizerApplicator 7 | from allennlp.nn.initializers import InitializerApplicator 8 | from overrides import overrides 9 | from pytorch_transformers.modeling_bert import BertModel 10 | 11 | 12 | @Model.register("qa_bert") 13 | class BertQuestionAnswering(Model): 14 | """ 15 | A QA model for SQuAD based on the AllenNLP Model ``BertForClassification`` that runs pretrained BERT, 16 | takes the pooled output, adds a Linear layer on top, and predicts two numbers: start and end span. 17 | 18 | Note that this is a somewhat non-AllenNLP-ish model architecture, 19 | in that it essentially requires you to use the "bert-pretrained" 20 | token indexer, rather than configuring whatever indexing scheme you like. 21 | See `allennlp/tests/fixtures/bert/bert_for_classification.jsonnet` 22 | for an example of what your config might look like. 23 | Parameters 24 | ---------- 25 | vocab : ``Vocabulary`` 26 | bert_model : ``Union[str, BertModel]`` 27 | The BERT model to be wrapped. If a string is provided, we will call 28 | ``BertModel.from_pretrained(bert_model)`` and use the result. 29 | num_labels : ``int``, optional (default: None) 30 | How many output classes to predict. If not provided, we'll use the 31 | vocab_size for the ``label_namespace``. 32 | index : ``str``, optional (default: "bert") 33 | The index of the token indexer that generates the BERT indices. 34 | label_namespace : ``str``, optional (default : "labels") 35 | Used to determine the number of classes if ``num_labels`` is not supplied. 36 | trainable : ``bool``, optional (default : True) 37 | If True, the weights of the pretrained BERT model will be updated during training. 38 | Otherwise, they will be frozen and only the final linear layer will be trained. 39 | initializer : ``InitializerApplicator``, optional 40 | If provided, will be used to initialize the final linear layer *only*. 41 | regularizer : ``RegularizerApplicator``, optional (default=``None``) 42 | If provided, will be used to calculate the regularization penalty during training. 
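    dropout : ``float``, optional (default = 0.0)
        Dropout probability intended for the additional layer(s) added on top of BERT;
        the provided skeleton accepts it but leaves wiring it up as part of the exercise.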
43 | """ 44 | 45 | def __init__(self, 46 | vocab: Vocabulary, 47 | bert_model: BertModel, 48 | dropout: float = 0.0, 49 | index: str = "bert", 50 | trainable: bool = True, 51 | initializer: InitializerApplicator = InitializerApplicator(), 52 | regularizer: Optional[RegularizerApplicator] = None, ) -> None: 53 | super().__init__(vocab, regularizer) 54 | 55 | self._index = index 56 | self.bert_model = PretrainedBertModel.load(bert_model) 57 | hidden_size = self.bert_model.config.hidden_size 58 | 59 | for param in self.bert_model.parameters(): 60 | param.requires_grad = trainable 61 | 62 | # 1. Instantiate any additional parts of your network 63 | 64 | # 2. DON'T FORGET TO INITIALIZE the additional parts of your network. 65 | 66 | # 3. Instantiate your metrics 67 | 68 | def forward(self, # type: ignore 69 | metadata: Dict, 70 | tokens: Dict[str, torch.LongTensor], 71 | span_start: torch.IntTensor = None, 72 | span_end: torch.IntTensor = None 73 | ) -> Dict[str, torch.Tensor]: 74 | # pylint: disable=arguments-differ 75 | """ 76 | Parameters 77 | ---------- 78 | tokens : Dict[str, torch.LongTensor] 79 | From a ``TextField`` (that has a bert-pretrained token indexer) 80 | span_start : torch.IntTensor, optional (default = None) 81 | A tensor of shape (batch_size, 1) which contains the start_position of the answer 82 | in the passage, or 0 if impossible. This is an `inclusive` token index. 83 | If this is given, we will compute a loss that gets included in the output dictionary. 84 | span_end : torch.IntTensor, optional (default = None) 85 | A tensor of shape (batch_size, 1) which contains the end_position of the answer 86 | in the passage, or 0 if impossible. This is an `inclusive` token index. 87 | If this is given, we will compute a loss that gets included in the output dictionary. 88 | Returns 89 | ------- 90 | An output dictionary consisting of: 91 | logits : torch.FloatTensor 92 | A tensor of shape ``(batch_size, num_labels)`` representing 93 | unnormalized log probabilities of the label. 94 | start_probs: torch.FloatTensor 95 | A tensor of shape ``(batch_size, num_labels)`` representing 96 | probabilities of the label. 97 | end_probs : torch.FloatTensor 98 | A tensor of shape ``(batch_size, num_labels)`` representing 99 | probabilities of the label. 100 | best_span: 101 | loss : torch.FloatTensor, optional 102 | A scalar loss to be optimised. 103 | """ 104 | input_ids = tokens[self._index] 105 | token_type_ids = tokens[f"{self._index}-type-ids"] 106 | input_mask = (input_ids != 0).long() 107 | 108 | # 1. Build model here 109 | 110 | # 2. Compute start_position and end_position and then get the best span 111 | # using allennlp.models.reading_comprehension.util.get_best_span() 112 | 113 | output_dict = {} 114 | 115 | # 4. Compute loss and accuracies. You should compute at least: 116 | # span_start accuracy, span_end accuracy and full span accuracy. 117 | 118 | # UNCOMMENT THIS LINE 119 | # output_dict["loss"] = 120 | 121 | # 5. Optionally you can compute the official squad metrics (exact match, f1). 
122 | # Instantiate the metric object in __init__ using allennlp.training.metrics.SquadEmAndF1() 123 | # When you call it, you need to give it the word tokens of the span (implement and call decode() below) 124 | # and the gold tokens found in metadata[i]['answer_texts'] 125 | 126 | return output_dict 127 | 128 | @overrides 129 | def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 130 | """ 131 | Does a simple argmax over the probabilities, converts index to string label, and 132 | add ``"label"`` key to the dictionary with the result. 133 | """ 134 | pass 135 | 136 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 137 | # UNCOMMENT if you want to report official SQuAD metrics 138 | # exact_match, f1_score = self._squad_metrics.get_metric(reset) 139 | 140 | metrics = { 141 | 'start_acc': self._span_start_accuracy.get_metric(reset), 142 | 'end_acc': self._span_end_accuracy.get_metric(reset), 143 | 'span_acc': self._span_accuracy.get_metric(reset), 144 | # 'em': exact_match, 145 | # 'f1': f1_score, 146 | } 147 | return metrics 148 | 149 | 150 | class PretrainedBertModel: 151 | """ 152 | In some instances you may want to load the same BERT model twice 153 | (e.g. to use as a token embedder and also as a pooling layer). 154 | This factory provides a cache so that you don't actually have to load the model twice. 155 | """ 156 | _cache: Dict[str, BertModel] = {} 157 | 158 | @classmethod 159 | def load(cls, model_name: str, cache_model: bool = True) -> BertModel: 160 | if model_name in cls._cache: 161 | return PretrainedBertModel._cache[model_name] 162 | 163 | model = BertModel.from_pretrained(model_name) 164 | if cache_model: 165 | cls._cache[model_name] = model 166 | 167 | return model 168 | -------------------------------------------------------------------------------- /athnlp/models/rnn_language_model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | 4 | class RNNModel(nn.Module): 5 | """Container module with an encoder, a recurrent module, and a decoder.""" 6 | 7 | def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5): 8 | """ 9 | Initialises the parameters of the RNN Language Model 10 | 11 | :param rnn_type: type of RNN cell 12 | :param ntoken: number of tokens in the vocabulary 13 | :param ninp: Dimensionality of the input vector 14 | :param nhid: Hidden size of the RNN cell 15 | :param nlayers: Number of layers of the RNN cell 16 | :param dropout: Dropout value applied to the RNN cell connections 17 | """ 18 | super(RNNModel, self).__init__() 19 | self.drop = nn.Dropout(dropout) 20 | self.encoder = nn.Embedding(ntoken, ninp) 21 | if rnn_type in ['LSTM', 'GRU']: 22 | self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout) 23 | else: 24 | try: 25 | nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type] 26 | except KeyError: 27 | raise ValueError("""An invalid option for `--model` was supplied, 28 | options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""") 29 | self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout) 30 | self.decoder = nn.Linear(nhid, ntoken) 31 | 32 | self.init_weights() 33 | 34 | self.rnn_type = rnn_type 35 | self.nhid = nhid 36 | self.nlayers = nlayers 37 | 38 | def init_weights(self): 39 | """ 40 | Initialises the parameters of the RNN model. 41 | 42 | N.B. 
This is optional because you may want to use the default PyTorch weight initialisation 43 | """ 44 | initrange = 0.1 45 | self.encoder.weight.data.uniform_(-initrange, initrange) 46 | self.decoder.bias.data.zero_() 47 | self.decoder.weight.data.uniform_(-initrange, initrange) 48 | 49 | def forward(self, input, hidden): 50 | """ 51 | Forward pass of the RNN language model. Useful information about how to use 52 | an RNNCell can be found in the PyTorch documentation: 53 | https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM 54 | https://pytorch.org/docs/stable/nn.html#torch.nn.GRU 55 | 56 | :param input: input features 57 | :param hidden: previous hidden state of the RNN language model 58 | :return: (output, updated_hidden_state) 59 | """ 60 | pass 61 | 62 | def init_hidden(self, bsz): 63 | """ 64 | Returns the initial hidden state of the RNN language model. It is a function that should be called before 65 | unrolling the RNN decoder. 66 | 67 | :param bsz: batch size 68 | :return: first hidden state of the RNN language model 69 | """ 70 | weight = next(self.parameters()) 71 | if self.rnn_type == 'LSTM': 72 | return (weight.new_zeros(self.nlayers, bsz, self.nhid), 73 | weight.new_zeros(self.nlayers, bsz, self.nhid)) 74 | else: 75 | return weight.new_zeros(self.nlayers, bsz, self.nhid) 76 | -------------------------------------------------------------------------------- /athnlp/nlm.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import math 3 | import time 4 | 5 | import torch 6 | 7 | from athnlp.readers.lm_corpus import Corpus 8 | 9 | parser = argparse.ArgumentParser(description='RNN/LSTM Language Model') 10 | parser.add_argument('--data', type=str, default='data/lm', 11 | help='location of the data corpus') 12 | parser.add_argument('--model_type', type=str, default='LSTM', 13 | help='type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)') 14 | parser.add_argument("--model_path", type=str, default='models/lm/default.pt', 15 | help='Path where to store the trained language model.') 16 | parser.add_argument('--emsize', type=int, default=50, 17 | help='size of word embeddings') 18 | parser.add_argument('--nhid', type=int, default=100, 19 | help='number of hidden units per layer') 20 | parser.add_argument('--nlayers', type=int, default=1, 21 | help='number of layers') 22 | parser.add_argument('--lr', type=float, default=20, 23 | help='initial learning rate') 24 | parser.add_argument('--clip', type=float, default=0.25, 25 | help='gradient clipping') 26 | parser.add_argument('--epochs', type=int, default=100, 27 | help='upper epoch limit') 28 | parser.add_argument('--batch_size', type=int, default=1, metavar='N', 29 | help='batch size') 30 | parser.add_argument('--bptt', type=int, default=10, 31 | help='sequence length') 32 | parser.add_argument('--dropout', type=float, default=0.2, 33 | help='dropout applied to layers (0 = no dropout)') 34 | parser.add_argument('--seed', type=int, default=1111, 35 | help='random seed') 36 | parser.add_argument('--cuda', action='store_true', 37 | help='use CUDA') 38 | parser.add_argument('--log-interval', type=int, default=200, metavar='N', 39 | help='report interval') 40 | parser.add_argument("--sentence_compl", action='store_true') 41 | 42 | 43 | # Starting from sequential data, batchify arranges the dataset into columns. 
44 | # For instance, with the alphabet as the sequence and batch size 4, we'd get 45 | # ┌ a g m s ┐ 46 | # │ b h n t │ 47 | # │ c i o u │ 48 | # │ d j p v │ 49 | # │ e k q w │ 50 | # └ f l r x ┘. 51 | # These columns are treated as independent by the model, which means that the 52 | # dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient 53 | # batch processing. 54 | def batchify(data, batch_size, device): 55 | # Work out how cleanly we can divide the dataset into bsz parts. 56 | num_batches = data.size(0) // batch_size 57 | # Trim off any extra elements that wouldn't cleanly fit (remainders). 58 | data = data.narrow(0, 0, num_batches * batch_size) 59 | # Evenly divide the data across the batch_size batches. 60 | data = data.view(batch_size, -1).t().contiguous() 61 | return data.to(device) 62 | 63 | 64 | ############################################################################### 65 | # Training code 66 | ############################################################################### 67 | 68 | def repackage_hidden(h): 69 | """Wraps hidden states in new Tensors, to detach them from their history.""" 70 | 71 | if isinstance(h, torch.Tensor): 72 | return h.detach() 73 | else: 74 | return tuple(repackage_hidden(v) for v in h) 75 | 76 | 77 | # get_batch subdivides the source data into chunks of length args.bptt. 78 | # If source is equal to the example output of the batchify function, with 79 | # a bptt-limit of 2, we'd get the following two Variables for i = 0: 80 | # ┌ a g m s ┐ ┌ b h n t ┐ 81 | # └ b h n t ┘ └ c i o u ┘ 82 | # Note that despite the name of the function, the subdivison of data is not 83 | # done along the batch dimension (i.e. dimension 1), since that was handled 84 | # by the batchify function. The chunks are along dimension 0, corresponding 85 | # to the seq_len dimension in the LSTM. 86 | 87 | def get_batch(source, i, bptt): 88 | seq_len = min(bptt, len(source) - 1 - i) 89 | data = source[i:i + seq_len] 90 | target = source[i + 1:i + 1 + seq_len].view(-1) 91 | return data, target 92 | 93 | 94 | def evaluate(model, criterion, eval_batch_size, corpus, data_source): 95 | """ 96 | Evaluates the performance of the model according to the specified criterion on the provided data source 97 | 98 | :param model: RNN language model 99 | :param criterion: criterion to be evaluated 100 | :param eval_batch_size: batch size (you can assume 1 for simplicity) 101 | :param corpus: instance of the reference corpus 102 | :param data_source: reference data for evaluation 103 | :return: the average score evaluated using the specified criterion 104 | """ 105 | # Turn on evaluation mode which disables dropout. 106 | model.eval() 107 | total_loss = 0. 
108 | ntokens = len(corpus.dictionary) 109 | hidden = model.init_hidden(eval_batch_size) 110 | with torch.no_grad(): 111 | for i in range(0, data_source.size(0) - 1, args.bptt): 112 | data, targets = get_batch(data_source, i, args.bptt) 113 | output, hidden = model(data, hidden) 114 | hidden = repackage_hidden(hidden) 115 | output_flat = output.view(-1, ntokens) 116 | # We multiply by the number of examples in the batch in order 117 | # to get the total loss and not the average (which is what 118 | # by default PyTorch Cross-entropy loss is doing behind 119 | # the scenes for us) 120 | total_loss += len(data) * criterion(output_flat, targets).item() 121 | return total_loss / (len(data_source) - 1) 122 | 123 | 124 | def train(model, criterion, corpus, train_data, lr, bptt, epoch): 125 | """ 126 | Trains the specified language model by minimising the provided criterion using as the training data. It trains the 127 | model for a given number of epoch with a fixed learning rate. 128 | 129 | :param model: RNN language model 130 | :param criterion: LM loss function 131 | :param corpus: Reference corpus 132 | :param train_data: training data for the LM task 133 | :param lr: SGD learning rate 134 | :param bptt: Sequence length 135 | :param epoch: Number of training epochs 136 | :return: Average training loss 137 | """ 138 | # Turn on training mode which enables dropout. 139 | model.train() 140 | total_loss = 0. 141 | start_time = time.time() 142 | ntokens = len(corpus.dictionary) 143 | hidden = model.init_hidden(args.batch_size) 144 | for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)): 145 | data, targets = get_batch(train_data, i, bptt) 146 | # Starting each batch, we detach the hidden state from how it was previously produced. 147 | # If we didn't, the model would try backpropagating all the way to start of the dataset. 148 | model.zero_grad() 149 | hidden = repackage_hidden(hidden) 150 | # TODO: run model forward pass obtaining '(output, hidden)' 151 | output, hidden = None, None 152 | # TODO: compute loss using the defined criterion obtaining 'loss'. 153 | loss = None 154 | # TODO: compute backpropagation calling the backward pass 155 | # N.B.: Here you should also update your model's weights 156 | 157 | # TODO (optional): implement gradient clipping to prevent 158 | # the exploding gradient problem in RNNs / LSTMs 159 | # check the PyTorch function `clip_grad_norm` 160 | 161 | total_loss += loss.item() 162 | 163 | if batch % args.log_interval == 0 and batch > 0: 164 | cur_loss = total_loss / args.log_interval 165 | elapsed = time.time() - start_time 166 | print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | ' 167 | 'loss {:5.2f} | ppl {:8.2f}'.format( 168 | epoch, batch, len(train_data) // args.bptt, lr, 169 | elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))) 170 | total_loss = 0 171 | start_time = time.time() 172 | 173 | 174 | def main(args): 175 | # Set the random seed manually for reproducibility. 
176 | torch.manual_seed(args.seed) 177 | if torch.cuda.is_available(): 178 | if not args.cuda: 179 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 180 | 181 | device = torch.device("cuda" if args.cuda else "cpu") 182 | 183 | ############################################################################### 184 | # Load data 185 | ############################################################################### 186 | 187 | corpus = Corpus(args.data) 188 | 189 | # training mode selected 190 | # Trains the model and then runs the evaluation on the test set 191 | if not args.sentence_compl: 192 | eval_batch_size = 1 193 | ############################################################################### 194 | # Load your train, valid and test data 195 | ############################################################################### 196 | train_data = batchify(corpus.train, args.batch_size, device) 197 | val_data = batchify(corpus.valid, eval_batch_size, device) 198 | test_data = batchify(corpus.test, eval_batch_size, device) 199 | 200 | ############################################################################### 201 | # Build the model 202 | ############################################################################### 203 | # TODO: model definition and loss definition 204 | model = None 205 | criterion = None 206 | # In order to optimise your model weights you have two options: 207 | # 1. Write the SGD update rule that uses the computed gradients to update the model's weights 208 | # 2. Use a PyTorch optimiser that computes the update step (https://pytorch.org/docs/stable/optim.html) 209 | 210 | # Loop over epochs. 211 | lr = args.lr 212 | best_val_loss = None 213 | 214 | # At any point you can hit Ctrl + C to break out of training early. 215 | try: 216 | for epoch in range(1, args.epochs + 1): 217 | epoch_start_time = time.time() 218 | train(model, criterion, corpus, train_data, lr, args.bptt, epoch) 219 | val_loss = evaluate(model, criterion, eval_batch_size, corpus, val_data) 220 | print('-' * 89) 221 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | ' 222 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time), 223 | val_loss, math.exp(val_loss))) 224 | print('-' * 89) 225 | # Save the model if the validation loss is the best we've seen so far. 226 | if not best_val_loss or val_loss < best_val_loss: 227 | with open(args.model_path, 'wb') as f: 228 | torch.save(model, f) 229 | best_val_loss = val_loss 230 | 231 | # HINT: when the loss is not decreasing anymore on the validation set, can you think of any method 232 | # to prevent the model from overfitting? 233 | except KeyboardInterrupt: 234 | print('-' * 89) 235 | print('Exiting from training early') 236 | 237 | # Load the best saved model. 238 | with open(args.model_path, 'rb') as f: 239 | model = torch.load(f) 240 | # After loading, the RNN params are not a contiguous chunk of memory; 241 | # this makes them contiguous, which speeds up the forward pass. 242 | model.rnn.flatten_parameters() 243 | 244 | # Run on test data. 245 | test_loss = evaluate(model, criterion, eval_batch_size, corpus, test_data) 246 | print('=' * 89) 247 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format( 248 | test_loss, math.exp(test_loss))) 249 | print('=' * 89) 250 | else: 251 | # we enabled the sentence completion mode 252 | 253 | # we first load the model 254 | # Load the best saved model.
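# Editor's sketch for the sentence-completion TODO further below (greedy,
# one next word; `prefix_ids` is a hypothetical 1-D tensor of word ids for
# the sentence prefix):
#     hidden = model.init_hidden(1)
#     for w in prefix_ids:
#         output, hidden = model(w.view(1, 1), hidden)
#     next_id = output.view(-1).argmax().item()
#     print(corpus.dictionary.idx2word[next_id])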
255 | with open(args.model_path, 'rb') as f: 256 | model = torch.load(f) 257 | # After loading, the RNN params are not a contiguous chunk of memory; 258 | # this makes them contiguous, which speeds up the forward pass. 259 | model.rnn.flatten_parameters() 260 | 261 | ############################################################################### 262 | # Use the pretrained LM at inference time 263 | ############################################################################### 264 | # TODO: Sentence completion solution 265 | 266 | 267 | if __name__ == "__main__": 268 | args = parser.parse_args() 269 | 270 | main(args) 271 | -------------------------------------------------------------------------------- /athnlp/nmt.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use,invalid-name 2 | from argparse import ArgumentParser 3 | import json 4 | import shutil 5 | import sys 6 | 7 | from allennlp.commands import main 8 | 9 | if __name__ == "__main__": 10 | argparse = ArgumentParser() 11 | argparse.add_argument('-c', "--config_file", type=str, default='athnlp/experiments/nmt_multi30k.jsonnet') 12 | argparse.add_argument('-m', "--model_path", default="/tmp/debugger_train") 13 | argparse.add_argument('-i', "--input_file", default="data/multi30k/val.lc.norm.tok.head-5.en.jsonl") 14 | argparse.add_argument("--predict", action='store_true') 15 | 16 | args = argparse.parse_args() 17 | config_file = args.config_file 18 | serialization_dir = args.model_path 19 | 20 | if args.predict: 21 | overrides = json.dumps({"model": {"visualize_attention": "false"}}) 22 | 23 | sys.argv = [ 24 | "allennlp", # command name, not used by main 25 | "predict", 26 | "--predictor", "seq2seq", 27 | "--include-package", "athnlp", 28 | "-o", overrides, 29 | serialization_dir, 30 | args.input_file, 31 | ] 32 | else: 33 | # Training will fail if the serialization directory already 34 | # has stuff in it. If you are running the same training loop 35 | # over and over again for debugging purposes, it will. 36 | # Hence we wipe it out in advance. 37 | # BE VERY CAREFUL NOT TO DO THIS FOR ACTUAL TRAINING! 38 | shutil.rmtree(serialization_dir, ignore_errors=True) 39 | 40 | # Use overrides to train on CPU.
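# (Editor's note: the "-o"/"--overrides" flag takes a JSON string that is
# merged on top of the jsonnet config, so passing e.g.
# '{"trainer": {"cuda_device": 0}}' here would request GPU 0 instead.)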
41 | overrides = json.dumps({"trainer": {"cuda_device": -1}}) 42 | 43 | # Assemble the command into sys.argv 44 | sys.argv = [ 45 | "allennlp", # command name, not used by main 46 | "train", 47 | config_file, 48 | "-s", serialization_dir, 49 | "--include-package", "athnlp", 50 | "-o", overrides, 51 | ] 52 | 53 | main() 54 | -------------------------------------------------------------------------------- /athnlp/qa.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use,invalid-name 2 | from argparse import ArgumentParser 3 | import json 4 | import shutil 5 | import sys 6 | 7 | from allennlp.commands import main 8 | 9 | if __name__ == "__main__": 10 | argparse = ArgumentParser() 11 | argparse.add_argument('-c', "--config_file", type=str, default='athnlp/experiments/qa_bert.jsonnet') 12 | argparse.add_argument('-m', "--model_path", default="/tmp/debugger_train") 13 | argparse.add_argument('-i', "--input_file", default="data/squad/dev-v2.0-small.json") 14 | argparse.add_argument("--predict", action='store_true') 15 | 16 | args = argparse.parse_args() 17 | config_file = args.config_file 18 | serialization_dir = args.model_path 19 | 20 | if args.predict: 21 | overrides = json.dumps({"model": {"visualize_attention": "true"}}) 22 | 23 | sys.argv = [ 24 | "allennlp", # command name, not used by main 25 | "predict", 26 | "--predictor", "qa_bert", 27 | "--include-package", "athnlp", 28 | "-o", overrides, 29 | serialization_dir, 30 | args.input_file, 31 | ] 32 | else: 33 | # Training will fail if the serialization directory already 34 | # has stuff in it. If you are running the same training loop 35 | # over and over again for debugging purposes, it will. 36 | # Hence we wipe it out in advance. 37 | # BE VERY CAREFUL NOT TO DO THIS FOR ACTUAL TRAINING! 38 | shutil.rmtree(serialization_dir, ignore_errors=True) 39 | 40 | # Use overrides to train on CPU. 
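# (Editor's note: hypothetical invocations from the repository root —
# `python athnlp/qa.py` to train with the default config, or
# `python athnlp/qa.py --predict` to predict with a previously saved model.)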
41 | overrides = json.dumps({"trainer": {"cuda_device": -1}}) 42 | 43 | # Assemble the command into sys.argv 44 | sys.argv = [ 45 | "allennlp", # command name, not used by main 46 | "train", 47 | config_file, 48 | "-s", serialization_dir, 49 | "--include-package", "athnlp", 50 | "-o", overrides, 51 | ] 52 | 53 | main() -------------------------------------------------------------------------------- /athnlp/readers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/readers/__init__.py -------------------------------------------------------------------------------- /athnlp/readers/bert_squad_reader.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | from collections import namedtuple 4 | from typing import Dict, List, Tuple 5 | from typing import Optional 6 | 7 | from allennlp.data.dataset_readers.dataset_reader import DatasetReader 8 | from allennlp.data.fields import Field, TextField, IndexField, MetadataField 9 | from allennlp.data.instance import Instance 10 | from allennlp.data.token_indexers import TokenIndexer 11 | from allennlp.data.tokenizers import Token 12 | from allennlp.data.tokenizers import WordTokenizer 13 | from allennlp.data.tokenizers.word_splitter import WordSplitter 14 | from overrides import overrides 15 | from pytorch_transformers.tokenization_bert import whitespace_tokenize, BertTokenizer 16 | 17 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name 18 | cls_token='[CLS]' 19 | sep_token='[SEP]' 20 | sequence_a_segment_id=0 21 | sequence_b_segment_id=1 22 | cls_token_segment_id=0 23 | 24 | 25 | @DatasetReader.register("bert_squad") 26 | class BertSquadReader(DatasetReader): 27 | 28 | def __init__(self, 29 | max_sequence_length: int, 30 | doc_stride: int, 31 | question_length_limit: int, 32 | lazy: bool = False, 33 | version_2: bool = False, 34 | token_indexers: Dict[str, TokenIndexer] = None, 35 | tokenizer: WordTokenizer = None) -> None: 36 | super().__init__(lazy) 37 | self._token_indexers = token_indexers or {} 38 | self._tokenizer = tokenizer or WordTokenizer() 39 | self._version_2 = version_2 40 | self.max_sequence_length = max_sequence_length 41 | self.doc_stride = doc_stride 42 | self.question_length_limit = question_length_limit 43 | 44 | def _read(self, file_path: str): 45 | """Reads a SQuAD json file and yields one AllenNLP Instance per question/document-span pair.""" 46 | with open(file_path, "r", encoding='utf-8') as reader: 47 | input_data = json.load(reader)["data"] 48 | 49 | def is_whitespace(c): 50 | if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: 51 | return True 52 | return False 53 | 54 | for entry in input_data: 55 | for paragraph in entry["paragraphs"]: 56 | paragraph_text = paragraph["context"] 57 | doc_tokens = [] 58 | char_to_word_offset = [] 59 | prev_is_whitespace = True 60 | for c in paragraph_text: 61 | if is_whitespace(c): 62 | prev_is_whitespace = True 63 | else: 64 | if prev_is_whitespace: 65 | doc_tokens.append(c) 66 | else: 67 | doc_tokens[-1] += c 68 | prev_is_whitespace = False 69 | char_to_word_offset.append(len(doc_tokens) - 1) 70 | 71 | for qa in paragraph["qas"]: 72 | qas_id = qa["id"] 73 | question_text = qa["question"] 74 | start_position = None 75 | end_position = None 76 | orig_answer_text = None 77 | is_impossible = False 78 | if self._version_2: 79 | is_impossible = qa["is_impossible"] 80 | # if
(len(qa["answers"]) != 1) and (not is_impossible): 81 | # raise ValueError( 82 | # "For training, each question should have exactly 1 answer.") 83 | if not is_impossible: 84 | answer = qa["answers"][0] 85 | orig_answer_text = answer["text"] 86 | answer_offset = answer["answer_start"] 87 | answer_length = len(orig_answer_text) 88 | start_position = char_to_word_offset[answer_offset] 89 | end_position = char_to_word_offset[answer_offset + answer_length - 1] 90 | # Only add answers where the text can be exactly recovered from the 91 | # document. If this CAN'T happen it's likely due to weird Unicode 92 | # stuff so we will just skip the example. 93 | # 94 | # Note that this means for training mode, every example is NOT 95 | # guaranteed to be preserved. 96 | actual_text = " ".join(doc_tokens[start_position:(end_position + 1)]) 97 | cleaned_answer_text = " ".join( 98 | whitespace_tokenize(orig_answer_text)) 99 | if actual_text.find(cleaned_answer_text) == -1: 100 | logger.warning("Could not find answer: '%s' vs. '%s'", 101 | actual_text, cleaned_answer_text) 102 | continue 103 | else: 104 | start_position = -1 105 | end_position = -1 106 | orig_answer_text = "" 107 | 108 | query_tokens = self._tokenizer.tokenize(question_text) 109 | 110 | if self.question_length_limit is not None and len(query_tokens) > self.question_length_limit: 111 | query_tokens = query_tokens[0:self.question_length_limit] 112 | 113 | tok_to_orig_index = [] 114 | orig_to_tok_index = [] 115 | all_doc_tokens = [] 116 | for (i, token) in enumerate(doc_tokens): 117 | orig_to_tok_index.append(len(all_doc_tokens)) 118 | sub_tokens = self._tokenizer.tokenize(token) 119 | for sub_token in sub_tokens: 120 | tok_to_orig_index.append(i) 121 | all_doc_tokens.append(sub_token) 122 | 123 | tok_start_position = None 124 | tok_end_position = None 125 | if is_impossible: 126 | tok_start_position = -1 127 | tok_end_position = -1 128 | else: 129 | tok_start_position = orig_to_tok_index[start_position] 130 | if end_position < len(doc_tokens) - 1: 131 | tok_end_position = orig_to_tok_index[end_position + 1] - 1 132 | else: 133 | tok_end_position = len(all_doc_tokens) - 1 134 | (tok_start_position, tok_end_position) = _improve_answer_span( 135 | all_doc_tokens, tok_start_position, tok_end_position, self._tokenizer, 136 | orig_answer_text) 137 | 138 | # The -3 accounts for [CLS], [SEP] and [SEP] 139 | max_tokens_for_doc = self.max_sequence_length - len(query_tokens) - 3 140 | 141 | # We can have documents that are longer than the maximum sequence length. 142 | # To deal with this we do a sliding window approach, where we take chunks 143 | # of up to our max length with a stride of `doc_stride`. 144 | _DocSpan = namedtuple( # pylint: disable=invalid-name 145 | "DocSpan", ["start", "length"]) 146 | doc_spans = [] 147 | start_offset = 0 148 | while start_offset < len(all_doc_tokens): 149 | length = len(all_doc_tokens) - start_offset 150 | if length > max_tokens_for_doc: 151 | length = max_tokens_for_doc 152 | doc_spans.append(_DocSpan(start=start_offset, length=length)) 153 | if start_offset + length == len(all_doc_tokens): 154 | break 155 | start_offset += min(length, self.doc_stride) 156 | 157 | for (doc_span_index, doc_span) in enumerate(doc_spans): 158 | tokens = [] 159 | token_to_orig_map = {} 160 | token_is_max_context = {} 161 | 162 | # p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer) 163 | # Original TF implem also keeps the classification token (set to 0) (not sure why...)
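# (Editor's note: the p_mask bookkeeping from the original implementation is
# kept below only as commented-out reference; it is unused in this port.)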
164 | # p_mask = [] 165 | 166 | # CLS token at the beginning 167 | # tokens.append(Token(cls_token)) 168 | # p_mask.append(0) 169 | cls_index = 0 170 | 171 | # Query 172 | for token in query_tokens: 173 | tokens.append(token) 174 | # p_mask.append(1) 175 | 176 | # SEP token 177 | tokens.append(Token(sep_token)) 178 | # p_mask.append(1) 179 | 180 | # Paragraph 181 | paragraph_start_id = len(tokens) 182 | for i in range(doc_span.length): 183 | split_token_index = doc_span.start + i 184 | token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index] 185 | 186 | is_max_context = _check_is_max_context(doc_spans, doc_span_index, 187 | split_token_index) 188 | token_is_max_context[len(tokens)] = is_max_context 189 | tokens.append(all_doc_tokens[split_token_index]) 190 | # p_mask.append(0) 191 | paragraph_len = doc_span.length 192 | 193 | # SEP token 194 | tokens.append(Token(sep_token)) 195 | # p_mask.append(1) 196 | 197 | span_is_impossible = is_impossible 198 | start_position = None 199 | end_position = None 200 | if not span_is_impossible: 201 | # For training, if our document chunk does not contain an annotation 202 | # we throw it out, since there is nothing to predict. 203 | doc_start = doc_span.start 204 | doc_end = doc_span.start + doc_span.length - 1 205 | out_of_span = False 206 | if not (tok_start_position >= doc_start and 207 | tok_end_position <= doc_end): 208 | out_of_span = True 209 | if out_of_span: 210 | span_is_impossible = True 211 | else: 212 | # we offset by 2 to account for the [CLS] and [SEP] tokens (before/after the question) 213 | # at the beginning of the sequence. NOTE: we don't add the [CLS] token, as it will get 214 | # added later in the indexer process, therefore start_position will be off by 1. 215 | doc_offset = len(query_tokens) + 2 216 | start_position = tok_start_position - doc_start + doc_offset 217 | end_position = tok_end_position - doc_start + doc_offset 218 | 219 | if span_is_impossible: 220 | start_position = cls_index 221 | end_position = cls_index 222 | 223 | passage_offsets = [] 224 | token_idx = 0 225 | 226 | for token in doc_tokens: 227 | passage_offsets.append((token_idx, token_idx + len(token))) 228 | token_idx += len(token) 229 | 230 | instance = self.text_to_instance( 231 | qas_id=qas_id, 232 | question_text=question_text, 233 | passage_tokens=tokens[paragraph_start_id: paragraph_start_id + paragraph_len], 234 | bert_tokens=tokens, 235 | orig_answer_text=orig_answer_text, 236 | start_position=start_position, 237 | end_position=end_position, 238 | answer_texts=[answer["text"] for answer in qa["answers"]], 239 | passage_offsets=passage_offsets, 240 | passage_text=paragraph["context"] 241 | ) 242 | 243 | yield instance 244 | 245 | @overrides 246 | def text_to_instance(self, # type: ignore 247 | qas_id: str, 248 | question_text: str, 249 | bert_tokens: List[Token], 250 | passage_tokens: List[Token], 251 | orig_answer_text: str, 252 | start_position: int, 253 | end_position: int, 254 | answer_texts: List[str], 255 | passage_offsets: List[Tuple[int, int]], 256 | passage_text: str) -> Optional[Instance]: 257 | fields: Dict[str, Field] = {} 258 | tokens_field = TextField(bert_tokens, self._token_indexers) 259 | fields['tokens'] = tokens_field 260 | 261 | fields['span_start'] = IndexField(start_position, tokens_field) 262 | fields['span_end'] = IndexField(end_position, tokens_field) 263 | metadata = { 264 | 'question_text': question_text, 265 | 'qas_id': qas_id, 266 | 'token_offsets': passage_offsets, 267 | 'original_passage': passage_text 268 | } 269 | 270 | if
answer_texts: 271 | metadata['answer_texts'] = answer_texts 272 | 273 | fields['metadata'] = MetadataField(metadata) 274 | 275 | return Instance(fields) 276 | 277 | 278 | def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, 279 | orig_answer_text): 280 | """Returns tokenized answer spans that better match the annotated answer.""" 281 | 282 | # The SQuAD annotations are character based. We first project them to 283 | # whitespace-tokenized words. But then after WordPiece tokenization, we can 284 | # often find a "better match". For example: 285 | # 286 | # Question: What year was John Smith born? 287 | # Context: The leader was John Smith (1895-1943). 288 | # Answer: 1895 289 | # 290 | # The original whitespace-tokenized answer will be "(1895-1943).". However 291 | # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match 292 | # the exact answer, 1895. 293 | # 294 | # However, this is not always possible. Consider the following: 295 | # 296 | # Question: What country is the top exporter of electronics? 297 | # Context: The Japanese electronics industry is the largest in the world. 298 | # Answer: Japan 299 | # 300 | # In this case, the annotator chose "Japan" as a character sub-span of 301 | # the word "Japanese". Since our WordPiece tokenizer does not split 302 | # "Japanese", we just use "Japanese" as the annotation. This is fairly rare 303 | # in SQuAD, but does happen. 304 | tok_answer_text = " ".join(map(lambda x: x.text, tokenizer.tokenize(orig_answer_text))) 305 | 306 | for new_start in range(input_start, input_end + 1): 307 | for new_end in range(input_end, new_start - 1, -1): 308 | text_span = " ".join(map(lambda x: x.text, doc_tokens[new_start:(new_end + 1)])) 309 | if text_span == tok_answer_text: 310 | return (new_start, new_end) 311 | 312 | return (input_start, input_end) 313 | 314 | 315 | def _check_is_max_context(doc_spans, cur_span_index, position): 316 | """Check if this is the 'max context' doc span for the token.""" 317 | 318 | # Because of the sliding window approach taken to scoring documents, a single 319 | # token can appear in multiple document spans. E.g. 320 | # Doc: the man went to the store and bought a gallon of milk 321 | # Span A: the man went to the 322 | # Span B: to the store and bought 323 | # Span C: and bought a gallon of 324 | # ... 325 | # 326 | # Now the word 'bought' will have two scores from spans B and C. We only 327 | # want to consider the score with "maximum context", which we define as 328 | # the *minimum* of its left and right context (the *sum* of left and 329 | # right context will always be the same, of course). 330 | # 331 | # In the example the maximum context for 'bought' would be span C since 332 | # it has 1 left context and 3 right context, while span B has 4 left context 333 | # and 0 right context.
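# (Editor's arithmetic for the example above, using the score formula in the
# code below: both spans contain 5 tokens, so span B scores
# min(4, 0) + 0.01 * 5 = 0.05 and span C scores min(1, 3) + 0.01 * 5 = 1.05;
# span C wins as the 'max context' span for 'bought'.)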
334 | best_score = None 335 | best_span_index = None 336 | for (span_index, doc_span) in enumerate(doc_spans): 337 | end = doc_span.start + doc_span.length - 1 338 | if position < doc_span.start: 339 | continue 340 | if position > end: 341 | continue 342 | num_left_context = position - doc_span.start 343 | num_right_context = end - position 344 | score = min(num_left_context, num_right_context) + 0.01 * doc_span.length 345 | if best_score is None or score > best_score: 346 | best_score = score 347 | best_span_index = span_index 348 | 349 | return cur_span_index == best_span_index 350 | 351 | 352 | @WordSplitter.register("bert-basic-wordpiece") 353 | class BertBasicWordSplitter(WordSplitter): 354 | """ 355 | The ``BasicWordSplitter`` from the BERT implementation. 356 | This is used to split a sentence into words. 357 | Then the ``BertTokenIndexer`` converts each word into wordpieces. 358 | """ 359 | def __init__(self, 360 | pretrained_model: str, 361 | do_lower_case: bool = True, 362 | never_split: Optional[List[str]] = None) -> None: 363 | self.bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model, do_lower_case=do_lower_case) 364 | 365 | @overrides 366 | def split_words(self, sentence: str) -> List[Token]: 367 | return [Token(text) for text in self.bert_tokenizer.tokenize(sentence)] 368 | 369 | 370 | -------------------------------------------------------------------------------- /athnlp/readers/brown_pos_corpus.py: -------------------------------------------------------------------------------- 1 | from nltk.corpus import brown 2 | from athnlp.readers.sequence_dictionary import SequenceDictionary 3 | from athnlp.readers.sequence import Sequence 4 | 5 | 6 | class BrownPosTag: 7 | 8 | def __init__( 9 | self, 10 | max_sent_len: int = 15, 11 | num_train_sents=10000, 12 | num_dev_sents=1000, 13 | num_test_sents=1000, 14 | mapping_file="athnlp/readers/en-brown.map"): 15 | 16 | self.train = [] 17 | self.dev = [] 18 | self.test = [] 19 | self.dictionary = SequenceDictionary() 20 | 21 | # Build mapping of POS tags 22 | self.mapping = {} 23 | if mapping_file is not None: 24 | for line in open(mapping_file): 25 | coarse, fine = line.strip().split("\t") 26 | self.mapping[coarse.lower()] = fine.lower() 27 | 28 | # Initialize noun to be tag zero so that it is the default tag 29 | self.dictionary.y_dict.add("noun") 30 | 31 | # Preprocess dataset splits 32 | sents = brown.tagged_sents() 33 | last_id = 0 34 | self.train, last_id = self.preprocess_split(sents, last_id, num_train_sents, max_sent_len, "train_") 35 | self.dev, last_id = self.preprocess_split(sents, last_id, num_dev_sents, max_sent_len, prefix_id="dev_") 36 | self.test, _ = self.preprocess_split(sents, last_id, num_test_sents, max_sent_len, prefix_id="test_") 37 | 38 | def preprocess_split(self, input_dataset, last_id, num_sents, max_sent_len, prefix_id=""): 39 | """Add necessary pre-processing (e.g., convert to universal tagset) to sentences of the dataset.""" 40 | dataset = [] 41 | for sent in input_dataset[last_id:]: 42 | last_id += 1 43 | if type(sent) == tuple or len(sent) > max_sent_len or len(sent) <= 1: 44 | continue 45 | dataset.append(self.preprocess_sent(sent, prefix_id + str(len(dataset)))) 46 | if len(dataset) == num_sents: 47 | break 48 | 49 | return dataset, last_id 50 | 51 | def preprocess_sent(self, sent, sent_id): 52 | """Every word and tag of the sentence gets mapped to a unique id stored in a SequenceDictionary instance.""" 53 | ids_x = [] 54 | ids_y = [] 55 | for word, tag in sent: 56 | tag = tag.lower() 57 | if
tag not in self.mapping: 58 | # Add unk tags to mapping dict 59 | self.mapping[tag] = "noun" 60 | universal_tag = self.mapping[tag] 61 | word_id = self.dictionary.x_dict.add(word) 62 | tag_id = self.dictionary.y_dict.add(universal_tag) 63 | ids_x.append(word_id) 64 | ids_y.append(tag_id) 65 | return Sequence(self.dictionary, ids_x, ids_y, sent_id) 66 | 67 | 68 | if __name__ == '__main__': 69 | corpus = BrownPosTag() 70 | print("vocabulary size: ", len(corpus.dictionary.x_dict)) 71 | print("train/dev/test set length: ", len(corpus.train), len(corpus.dev), len(corpus.test)) 72 | print("First train sentence: ", corpus.train[0]) 73 | print("First dev sentence: ", corpus.dev[0]) 74 | print("First test sentence: ", corpus.test[0]) 75 | -------------------------------------------------------------------------------- /athnlp/readers/en-brown.map: -------------------------------------------------------------------------------- 1 | ' . 2 | '' . 3 | ( . 4 | (-HL . 5 | ) . 6 | )-HL . 7 | * ADV 8 | *-HL ADV 9 | *-NC ADV 10 | *-TL ADV 11 | , . 12 | ,-HL . 13 | ,-NC . 14 | ,-TL . 15 | -- . 16 | ---HL . 17 | . . 18 | .-HL . 19 | .-NC . 20 | .-TL . 21 | : . 22 | :-HL . 23 | :-TL . 24 | ABL PRT 25 | ABN PRT 26 | ABN-HL PRT 27 | ABN-NC PRT 28 | ABN-TL PRT 29 | ABX DET 30 | AP ADJ 31 | AP$ PRT 32 | AP+AP-NC ADJ 33 | AP-HL ADJ 34 | AP-NC ADJ 35 | AP-TL ADJ 36 | AT DET 37 | AT-HL DET 38 | AT-NC DET 39 | AT-TL DET 40 | AT-TL-HL DET 41 | BE VERB 42 | BE-HL VERB 43 | BE-TL VERB 44 | BED VERB 45 | BED* VERB 46 | BED-NC VERB 47 | BEDZ VERB 48 | BEDZ* VERB 49 | BEDZ-HL VERB 50 | BEDZ-NC VERB 51 | BEG VERB 52 | BEM VERB 53 | BEM* VERB 54 | BEM-NC VERB 55 | BEN VERB 56 | BEN-TL VERB 57 | BER VERB 58 | BER* VERB 59 | BER*-NC VERB 60 | BER-HL VERB 61 | BER-NC VERB 62 | BER-TL VERB 63 | BEZ VERB 64 | BEZ* VERB 65 | BEZ-HL VERB 66 | BEZ-NC VERB 67 | BEZ-TL VERB 68 | CC CONJ 69 | CC-HL CONJ 70 | CC-NC CONJ 71 | CC-TL CONJ 72 | CC-TL-HL CONJ 73 | CD NUM 74 | CD$ NOUN 75 | CD-HL NUM 76 | CD-NC NUM 77 | CD-TL NUM 78 | CD-TL-HL NUM 79 | CS ADP 80 | CS-HL ADP 81 | CS-NC ADP 82 | CS-TL ADP 83 | DO VERB 84 | DO* VERB 85 | DO*-HL VERB 86 | DO+PPSS X 87 | DO-HL VERB 88 | DO-NC VERB 89 | DO-TL VERB 90 | DOD VERB 91 | DOD* VERB 92 | DOD*-TL VERB 93 | DOD-NC VERB 94 | DOZ VERB 95 | DOZ* VERB 96 | DOZ*-TL VERB 97 | DOZ-HL VERB 98 | DOZ-TL VERB 99 | DT DET 100 | DT$ DET 101 | DT+BEZ PRT 102 | DT+BEZ-NC PRT 103 | DT+MD PRT 104 | DT-HL DET 105 | DT-NC DET 106 | DT-TL DET 107 | DTI DET 108 | DTI-HL DET 109 | DTI-TL DET 110 | DTS DET 111 | DTS+BEZ PRT 112 | DTS-HL DET 113 | DTX DET 114 | EX PRT 115 | EX+BEZ PRT 116 | EX+HVD PRT 117 | EX+HVZ PRT 118 | EX+MD PRT 119 | EX-HL PRT 120 | EX-NC PRT 121 | FW-* X 122 | FW-*-TL X 123 | FW-AT X 124 | FW-AT+NN-TL X 125 | FW-AT+NP-TL X 126 | FW-AT-HL X 127 | FW-AT-TL X 128 | FW-BE X 129 | FW-BER X 130 | FW-BEZ X 131 | FW-CC X 132 | FW-CC-TL X 133 | FW-CD X 134 | FW-CD-TL X 135 | FW-CS X 136 | FW-DT X 137 | FW-DT+BEZ X 138 | FW-DTS X 139 | FW-HV X 140 | FW-IN X 141 | FW-IN+AT X 142 | FW-IN+AT-T X 143 | FW-IN+AT-TL X 144 | FW-IN+NN X 145 | FW-IN+NN-TL X 146 | FW-IN+NP-TL X 147 | FW-IN-TL X 148 | FW-JJ X 149 | FW-JJ-NC X 150 | FW-JJ-TL X 151 | FW-JJR X 152 | FW-JJT X 153 | FW-NN X 154 | FW-NN$ X 155 | FW-NN$-TL X 156 | FW-NN-NC X 157 | FW-NN-TL X 158 | FW-NN-TL-NC X 159 | FW-NNS X 160 | FW-NNS-NC X 161 | FW-NNS-TL X 162 | FW-NP X 163 | FW-NP-TL X 164 | FW-NPS X 165 | FW-NPS-TL X 166 | FW-NR X 167 | FW-NR-TL X 168 | FW-OD-NC X 169 | FW-OD-TL X 170 | FW-PN X 171 | FW-PP$ X 172 | FW-PP$-NC X 173 | FW-PP$-TL X 
174 | FW-PPL X 175 | FW-PPL+VBZ X 176 | FW-PPO X 177 | FW-PPO+IN X 178 | FW-PPS X 179 | FW-PPSS X 180 | FW-PPSS+HV X 181 | FW-QL X 182 | FW-RB X 183 | FW-RB+CC X 184 | FW-RB-TL X 185 | FW-TO+VB X 186 | FW-UH X 187 | FW-UH-NC X 188 | FW-UH-TL X 189 | FW-VB X 190 | FW-VB-NC X 191 | FW-VB-TL X 192 | FW-VBD X 193 | FW-VBD-TL X 194 | FW-VBG X 195 | FW-VBG-TL X 196 | FW-VBN X 197 | FW-VBZ X 198 | FW-WDT X 199 | FW-WPO X 200 | FW-WPS X 201 | HV VERB 202 | HV* VERB 203 | HV+TO VERB 204 | HV-HL VERB 205 | HV-NC VERB 206 | HV-TL VERB 207 | HVD VERB 208 | HVD* VERB 209 | HVD-HL VERB 210 | HVG VERB 211 | HVG-HL VERB 212 | HVN VERB 213 | HVZ VERB 214 | HVZ* VERB 215 | HVZ-NC VERB 216 | HVZ-TL VERB 217 | IN ADP 218 | IN+IN ADP 219 | IN+PPO ADP 220 | IN-HL ADP 221 | IN-NC ADP 222 | IN-TL ADP 223 | IN-TL-HL ADP 224 | JJ ADJ 225 | JJ$-TL PRT 226 | JJ+JJ-NC ADJ 227 | JJ-HL ADJ 228 | JJ-NC ADJ 229 | JJ-TL ADJ 230 | JJ-TL-HL ADJ 231 | JJ-TL-NC ADJ 232 | JJR ADJ 233 | JJR+CS ADJ 234 | JJR-HL ADJ 235 | JJR-NC ADJ 236 | JJR-TL ADJ 237 | JJS ADJ 238 | JJS-HL ADJ 239 | JJS-TL ADJ 240 | JJT ADJ 241 | JJT-HL ADJ 242 | JJT-NC ADJ 243 | JJT-TL ADJ 244 | MD VERB 245 | MD* VERB 246 | MD*-HL VERB 247 | MD+HV VERB 248 | MD+PPSS VERB 249 | MD+TO VERB 250 | MD-HL VERB 251 | MD-NC VERB 252 | MD-TL VERB 253 | NIL X 254 | NN NOUN 255 | NN$ NOUN 256 | NN$-HL NOUN 257 | NN$-TL NOUN 258 | NN+BEZ PRT 259 | NN+BEZ-TL PRT 260 | NN+HVD-TL PRT 261 | NN+HVZ PRT 262 | NN+HVZ-TL PRT 263 | NN+IN NOUN 264 | NN+MD PRT 265 | NN+NN-NC NOUN 266 | NN-HL NOUN 267 | NN-NC NOUN 268 | NN-TL NOUN 269 | NN-TL-HL NOUN 270 | NN-TL-NC NOUN 271 | NNS NOUN 272 | NNS$ NOUN 273 | NNS$-HL NOUN 274 | NNS$-NC NOUN 275 | NNS$-TL NOUN 276 | NNS$-TL-HL NOUN 277 | NNS+MD PRT 278 | NNS-HL NOUN 279 | NNS-NC NOUN 280 | NNS-TL NOUN 281 | NNS-TL-HL NOUN 282 | NNS-TL-NC NOUN 283 | NP NOUN 284 | NP$ NOUN 285 | NP$-HL NOUN 286 | NP$-TL NOUN 287 | NP+BEZ PRT 288 | NP+BEZ-NC PRT 289 | NP+HVZ PRT 290 | NP+HVZ-NC PRT 291 | NP+MD PRT 292 | NP-HL NOUN 293 | NP-NC NOUN 294 | NP-TL NOUN 295 | NP-TL-HL NOUN 296 | NPS NOUN 297 | NPS$ NOUN 298 | NPS$-HL NOUN 299 | NPS$-TL NOUN 300 | NPS-HL NOUN 301 | NPS-NC NOUN 302 | NPS-TL NOUN 303 | NR NOUN 304 | NR$ NOUN 305 | NR$-TL NOUN 306 | NR+MD PRT 307 | NR-HL NOUN 308 | NR-NC NOUN 309 | NRS NOUN 310 | NRS-TL NOUN 311 | OD ADJ 312 | OD-HL ADJ 313 | OD-NC ADJ 314 | OD-TL ADJ 315 | PN NOUN 316 | PN$ NOUN 317 | PN+BEZ PRT 318 | PN+HVD PRT 319 | PN+HVZ PRT 320 | PN+MD PRT 321 | PN-HL NOUN 322 | PN-NC NOUN 323 | PN-TL NOUN 324 | PP$ DET 325 | PP$$ PRON 326 | PP$-HL DET 327 | PP$-NC DET 328 | PP$-TL DET 329 | PPL PRON 330 | PPL-HL PRON 331 | PPL-NC PRON 332 | PPL-TL PRON 333 | PPLS PRON 334 | PPO PRON 335 | PPO-HL PRON 336 | PPO-NC PRON 337 | PPO-TL PRON 338 | PPS PRON 339 | PPS+BEZ PRT 340 | PPS+BEZ-HL PRT 341 | PPS+BEZ-NC PRT 342 | PPS+HVD PRT 343 | PPS+HVZ PRT 344 | PPS+MD PRT 345 | PPS-HL PRON 346 | PPS-NC PRON 347 | PPS-TL PRON 348 | PPSS PRON 349 | PPSS+BEM PRT 350 | PPSS+BER PRT 351 | PPSS+BER-N PRT 352 | PPSS+BER-NC PRT 353 | PPSS+BER-TL PRT 354 | PPSS+BEZ PRT 355 | PPSS+BEZ* PRT 356 | PPSS+HV PRT 357 | PPSS+HV-TL PRT 358 | PPSS+HVD PRT 359 | PPSS+MD PRT 360 | PPSS+MD-NC PRT 361 | PPSS+VB PRT 362 | PPSS-HL PRON 363 | PPSS-NC PRON 364 | PPSS-TL PRON 365 | QL ADV 366 | QL-HL ADV 367 | QL-NC ADV 368 | QL-TL ADV 369 | QLP ADV 370 | RB ADV 371 | RB$ PRT 372 | RB+BEZ PRT 373 | RB+BEZ-HL PRT 374 | RB+BEZ-NC PRT 375 | RB+CS ADV 376 | RB-HL ADV 377 | RB-NC ADV 378 | RB-TL ADV 379 | RBR ADV 380 | RBR+CS ADV 381 | RBR-NC ADV 382 | RBT ADV 383 | RN 
ADV 384 | RP PRT 385 | RP+IN PRT 386 | RP-HL PRT 387 | RP-NC PRT 388 | RP-TL PRT 389 | TO PRT 390 | TO+VB PRT 391 | TO-HL PRT 392 | TO-NC PRT 393 | TO-TL PRT 394 | UH PRT 395 | UH-HL PRT 396 | UH-NC PRT 397 | UH-TL PRT 398 | VB VERB 399 | VB+AT VERB 400 | VB+IN VERB 401 | VB+JJ-NC VERB 402 | VB+PPO VERB 403 | VB+RP VERB 404 | VB+TO VERB 405 | VB+VB-NC VERB 406 | VB-HL VERB 407 | VB-NC VERB 408 | VB-TL VERB 409 | VBD VERB 410 | VBD-HL VERB 411 | VBD-NC VERB 412 | VBD-TL VERB 413 | VBG VERB 414 | VBG+TO VERB 415 | VBG-HL VERB 416 | VBG-NC VERB 417 | VBG-TL VERB 418 | VBN VERB 419 | VBN+TO VERB 420 | VBN-HL VERB 421 | VBN-NC VERB 422 | VBN-TL VERB 423 | VBN-TL-HL VERB 424 | VBN-TL-NC VERB 425 | VBZ VERB 426 | VBZ-HL VERB 427 | VBZ-NC VERB 428 | VBZ-TL VERB 429 | WDT DET 430 | WDT+BER PRT 431 | WDT+BER+PP X 432 | WDT+BEZ PRT 433 | WDT+BEZ-HL PRT 434 | WDT+BEZ-NC PRT 435 | WDT+BEZ-TL PRT 436 | WDT+DO+PPS X 437 | WDT+DOD PRT 438 | WDT+HVZ PRT 439 | WDT-HL DET 440 | WDT-NC DET 441 | WP$ DET 442 | WPO PRON 443 | WPO-NC PRON 444 | WPO-TL PRON 445 | WPS PRON 446 | WPS+BEZ PRT 447 | WPS+BEZ-NC PRT 448 | WPS+BEZ-TL PRT 449 | WPS+HVD PRT 450 | WPS+HVZ PRT 451 | WPS+MD PRT 452 | WPS-HL PRON 453 | WPS-NC PRON 454 | WPS-TL PRON 455 | WQL ADV 456 | WQL-TL ADV 457 | WRB ADV 458 | WRB+BER PRT 459 | WRB+BEZ PRT 460 | WRB+BEZ-TL PRT 461 | WRB+DO PRT 462 | WRB+DOD PRT 463 | WRB+DOD* PRT 464 | WRB+DOZ PRT 465 | WRB+IN PRT 466 | WRB+MD PRT 467 | WRB-HL ADV 468 | WRB-NC ADV 469 | WRB-TL ADV 470 | `` . 471 | -------------------------------------------------------------------------------- /athnlp/readers/fever_predictor.py: -------------------------------------------------------------------------------- 1 | from allennlp.common.util import JsonDict 2 | from allennlp.data import Instance 3 | from allennlp.predictors.predictor import Predictor 4 | from overrides import overrides 5 | 6 | 7 | @Predictor.register('fever') 8 | class FeverPredictor(Predictor): 9 | """ 10 | Predictor for the FEVER text classification model: builds an instance from a claim and its evidence sentences. 11 | """ 12 | 13 | @overrides 14 | def _json_to_instance(self, json_dict: JsonDict) -> Instance: 15 | """ 16 | Expects JSON that looks like ``{"claim": "...", "evidence": ["...", "..."]}``.
17 | """ 18 | claim = json_dict["claim"] 19 | evidence = json_dict["evidence"] 20 | return self._dataset_reader.text_to_instance(claim, evidence) 21 | -------------------------------------------------------------------------------- /athnlp/readers/fever_reader.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | from typing import Iterable, Dict, List 4 | 5 | from allennlp.data import DatasetReader, Instance, Tokenizer, TokenIndexer 6 | from allennlp.data.fields import MetadataField, TextField, LabelField 7 | from allennlp.data.token_indexers import SingleIdTokenIndexer 8 | from allennlp.data.tokenizers import WordTokenizer 9 | 10 | 11 | logger = logging.getLogger(__name__) 12 | 13 | 14 | @DatasetReader.register("feverlite") 15 | class FEVERLiteDatasetReader(DatasetReader): 16 | def __init__(self, 17 | wiki_tokenizer: Tokenizer = None, 18 | claim_tokenizer: Tokenizer = None, 19 | token_indexers: Dict[str, TokenIndexer] = None) -> None: 20 | super().__init__() 21 | self._wiki_tokenizer = wiki_tokenizer or WordTokenizer() 22 | self._claim_tokenizer = claim_tokenizer or WordTokenizer() 23 | self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()} 24 | 25 | def _read(self, file_path: str) -> Iterable[Instance]: 26 | logger.info("Reading FEVER instances from {}".format(file_path)) 27 | with open(file_path,"r") as file: 28 | for line in file: 29 | json_line = json.loads(line) 30 | yield self.text_to_instance(**json_line) 31 | 32 | def text_to_instance(self, claim:str, evidence:List[str], label:str=None) -> Instance: 33 | # Evidence in the dataset is a list of sentences. We can concatenate these into just one long string 34 | # Extension Exercise: Can you make a new dataset reader and model that handles them individually? 35 | evidence = " ".join(set(evidence)) 36 | 37 | # Tokenize the claim and evidence 38 | claim_tokens = self._claim_tokenizer.tokenize(claim) 39 | evidence_tokens = self._wiki_tokenizer.tokenize(evidence) 40 | 41 | instance_meta = {"claim_tokens": claim_tokens, 42 | "evidence_tokens": evidence_tokens } 43 | 44 | instance_dict = {"claim": TextField(claim_tokens, self._token_indexers), 45 | "evidence": TextField(evidence_tokens, self._token_indexers), 46 | "metadata": MetadataField(instance_meta) 47 | } 48 | 49 | if label is not None: 50 | instance_dict["label"] = LabelField(label) 51 | 52 | return Instance(instance_dict) 53 | 54 | -------------------------------------------------------------------------------- /athnlp/readers/label_dictionary.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import warnings 3 | 4 | 5 | class LabelDictionary(dict): 6 | """This class implements a dictionary of labels. 
Labels are mapped to 7 | integers, as it is more efficient to retrieve the label name from its 8 | integer representation, and vice-versa.""" 9 | 10 | def __init__(self, label_names=[]): 11 | self.names = [] 12 | for name in label_names: 13 | self.add(name) 14 | 15 | def add(self, name): 16 | if name in self: 17 | # warnings.warn('Ignoring duplicated label ' + name) 18 | label_id = self[name] 19 | else: 20 | label_id = len(self.names) 21 | self[name] = label_id 22 | self.names.append(name) 23 | return label_id 24 | 25 | def get_label_name(self, label_id): 26 | return self.names[label_id] 27 | 28 | def get_label_id(self, name): 29 | return self[name] 30 | -------------------------------------------------------------------------------- /athnlp/readers/lm_corpus.py: -------------------------------------------------------------------------------- 1 | import os 2 | from io import open 3 | 4 | import torch 5 | 6 | 7 | class Dictionary(object): 8 | def __init__(self): 9 | self.word2idx = {} 10 | self.idx2word = [] 11 | 12 | def add_word(self, word): 13 | if word not in self.word2idx: 14 | self.idx2word.append(word) 15 | self.word2idx[word] = len(self.idx2word) - 1 16 | return self.word2idx[word] 17 | 18 | def __len__(self): 19 | return len(self.idx2word) 20 | 21 | 22 | class Corpus(object): 23 | def __init__(self, path): 24 | self.dictionary = Dictionary() 25 | self.train = self.tokenize(os.path.join(path, 'train.txt')) 26 | self.valid = self.tokenize(os.path.join(path, 'valid.txt')) 27 | self.test = self.tokenize(os.path.join(path, 'test.txt')) 28 | 29 | def tokenize(self, path): 30 | """Tokenizes a text file.""" 31 | assert os.path.exists(path) 32 | # Add words to the dictionary 33 | with open(path, 'r', encoding="utf8") as f: 34 | for line in f: 35 | words = line.split() + ['<eos>'] 36 | for word in words: 37 | self.dictionary.add_word(word.lower()) 38 | 39 | # Tokenize file content 40 | with open(path, 'r', encoding="utf8") as f: 41 | idss = [] 42 | for line in f: 43 | words = line.split() + ['<eos>'] 44 | ids = [] 45 | for word in words: 46 | word = word.lower() 47 | ids.append(self.dictionary.word2idx[word]) 48 | idss.append(torch.tensor(ids).type(torch.int64)) 49 | ids = torch.cat(idss) 50 | 51 | return ids 52 | -------------------------------------------------------------------------------- /athnlp/readers/multi30k_reader.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | import logging 3 | 4 | from overrides import overrides 5 | 6 | from allennlp.common.checks import ConfigurationError 7 | from allennlp.common.file_utils import cached_path 8 | from allennlp.common.util import START_SYMBOL, END_SYMBOL 9 | from allennlp.data.dataset_readers.dataset_reader import DatasetReader 10 | from allennlp.data.fields import TextField 11 | from allennlp.data.instance import Instance 12 | from allennlp.data.tokenizers import Token, Tokenizer, WordTokenizer 13 | from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer 14 | 15 | 16 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name 17 | 18 | 19 | @DatasetReader.register("multi30k") 20 | class Multi30kReader(DatasetReader): 21 | 22 | def __init__(self, 23 | source_tokenizer: Tokenizer = None, 24 | target_tokenizer: Tokenizer = None, 25 | source_token_indexers: Dict[str, TokenIndexer] = {"tokens": SingleIdTokenIndexer()}, 26 | target_token_indexers: Dict[str, TokenIndexer] = None, 27 | source_add_start_token: bool = True, 28 | language_pairs: Dict = {"source":
"fr", "target": "en"}, 29 | lazy: bool = False) -> None: 30 | super().__init__(lazy) 31 | self._source_tokenizer = source_tokenizer or WordTokenizer() 32 | self._target_tokenizer = target_tokenizer or self._source_tokenizer 33 | self._source_token_indexers = source_token_indexers 34 | self._target_token_indexers = target_token_indexers or self._source_token_indexers 35 | self._source_add_start_token = source_add_start_token 36 | self._language_pairs = language_pairs 37 | 38 | @overrides 39 | def _read(self, file_path): 40 | 41 | with open(cached_path(("%s.%s" % (file_path, self._language_pairs["source"]))), "r", encoding="utf8") as source_file, \ 42 | open(cached_path(("%s.%s" % (file_path, self._language_pairs["target"]))), "r", encoding="utf8") as target_file: 43 | logger.info("Reading instances from lines in source/target files at: %s", file_path) 44 | for source_sequence, target_sequence in zip(source_file, target_file): 45 | yield self.text_to_instance(source_sequence, target_sequence) 46 | 47 | @overrides 48 | def text_to_instance(self, source_string: str, target_string: str = None) -> Instance: # type: ignore 49 | # pylint: disable=arguments-differ 50 | tokenized_source = self._source_tokenizer.tokenize(source_string) 51 | if self._source_add_start_token: 52 | tokenized_source.insert(0, Token(START_SYMBOL)) 53 | tokenized_source.append(Token(END_SYMBOL)) 54 | source_field = TextField(tokenized_source, self._source_token_indexers) 55 | if target_string is not None: 56 | tokenized_target = self._target_tokenizer.tokenize(target_string) 57 | tokenized_target.insert(0, Token(START_SYMBOL)) 58 | tokenized_target.append(Token(END_SYMBOL)) 59 | target_field = TextField(tokenized_target, self._target_token_indexers) 60 | return Instance({"source_tokens": source_field, "target_tokens": target_field}) 61 | else: 62 | return Instance({'source_tokens': source_field}) -------------------------------------------------------------------------------- /athnlp/readers/sequence.py: -------------------------------------------------------------------------------- 1 | from athnlp.readers.sequence_dictionary import SequenceDictionary 2 | 3 | 4 | class Sequence(object): 5 | 6 | def __init__(self, dictionary: SequenceDictionary, x, y, nr): 7 | self.x = x 8 | self.y = y 9 | self.nr = nr 10 | self.dictionary = dictionary 11 | 12 | def size(self): 13 | """Returns the size of the sequence.""" 14 | return len(self.x) 15 | 16 | def __len__(self): 17 | return len(self.x) 18 | 19 | def copy_sequence(self): 20 | """Performs a deep copy of the sequence""" 21 | s = Sequence(self.dictionary, self.x[:], self.y[:], self.nr) 22 | return s 23 | 24 | def update_from_sequence(self, new_y): 25 | """Returns a new sequence equal to the previous but with y set to newy""" 26 | s = Sequence(self.dictionary, self.x, new_y, self.nr) 27 | return s 28 | 29 | def __str__(self): 30 | rep = "" 31 | for i, xi in enumerate(self.x): 32 | yi = self.y[i] 33 | rep += "%s/%s " % (self.dictionary.x_dict.get_label_name(xi), 34 | self.dictionary.y_dict.get_label_name(yi)) 35 | return rep 36 | 37 | def __repr__(self): 38 | rep = "" 39 | for i, xi in enumerate(self.x): 40 | yi = self.y[i] 41 | rep += "%s/%s " % (self.dictionary.x_dict.get_label_name(xi), 42 | self.dictionary.y_dict.get_label_name(yi)) 43 | return rep 44 | -------------------------------------------------------------------------------- /athnlp/readers/sequence_dictionary.py: -------------------------------------------------------------------------------- 1 | from 
athnlp.readers.label_dictionary import LabelDictionary 2 | 3 | 4 | class SequenceDictionary: 5 | 6 | def __init__(self): 7 | self.x_dict = LabelDictionary() 8 | self.y_dict = LabelDictionary() 9 | -------------------------------------------------------------------------------- /athnlp/readers/token_indexers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/readers/token_indexers/__init__.py -------------------------------------------------------------------------------- /athnlp/readers/token_indexers/bert_squad_indexer.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use 2 | from typing import List, Dict 3 | from overrides import overrides 4 | 5 | from allennlp.data.tokenizers.token import Token 6 | from allennlp.data.vocabulary import Vocabulary 7 | from allennlp.data.token_indexers.token_indexer import TokenIndexer 8 | from allennlp.data.token_indexers.wordpiece_indexer import PretrainedBertIndexer, WordpieceIndexer 9 | 10 | 11 | @TokenIndexer.register("bert-squad-indexer") 12 | class BertSquadIndexer(PretrainedBertIndexer): 13 | """ 14 | TokenIndexer closely based on AllenNLP's WordpieceIndexer; the only major difference is that we assume that 15 | basic and then wordpiece tokenization have already taken place when reading the SQuAD dataset 16 | (this follows the original methodology of huggingface). The reason we do that is so that start_position and 17 | end_position are correctly offset due to the extra wordpiece tokens introduced. 18 | NOTE: We are unnecessarily checking for len(tokens) > max_pieces, as we have already split the paragraphs 19 | when reading the dataset. The corresponding code below should never be triggered. 20 | """ 21 | def __init__(self, 22 | pretrained_model: str) -> None: 23 | super().__init__(pretrained_model) 24 | 25 | @overrides 26 | def tokens_to_indices(self, 27 | tokens: List[Token], 28 | vocabulary: Vocabulary, 29 | index_name: str) -> Dict[str, List[int]]: 30 | if not self._added_to_vocabulary: 31 | self._add_encoding_to_vocabulary(vocabulary) 32 | self._added_to_vocabulary = True 33 | 34 | # This lowercases tokens if necessary 35 | text = (token.text.lower() 36 | if self._do_lowercase and token.text not in self._never_lowercase 37 | else token.text 38 | for token in tokens) 39 | 40 | # Create nested sequence of wordpieces 41 | nested_wordpiece_tokens = _get_nested_wordpiece_tokens([token for token in text]) 42 | 43 | # Obtain a nested sequence of wordpieces, each represented by a list of wordpiece ids 44 | token_wordpiece_ids = [[self.vocab[wordpiece] for wordpiece in token] 45 | for token in nested_wordpiece_tokens] 46 | 47 | # Flattened list of wordpieces. In the end, the output of the model (e.g., BERT) should 48 | # have a sequence length equal to the length of this list. However, it will first be split into 49 | # chunks of length `self.max_pieces` so that they can be fit through the model. After packing 50 | # and passing through the model, it should be unpacked to represent the wordpieces in this list. 51 | flat_wordpiece_ids = [wordpiece for token in token_wordpiece_ids for wordpiece in token] 52 | 53 | # Similarly, we want to compute the token_type_ids from the flattened wordpiece ids before 54 | # we do the windowing; otherwise [SEP] tokens would get counted multiple times.
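# (Editor's example for _get_token_type_ids, defined at the bottom of this
# file: for wordpiece ids of "[CLS] q1 q2 [SEP] p1 p2 [SEP]" it returns
# [0, 0, 0, 0, 1, 1, 1]; each segment, including its trailing [SEP],
# shares one type id.)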
55 | flat_token_type_ids = _get_token_type_ids(flat_wordpiece_ids, self._separator_ids) 56 | 57 | # The code below will (possibly) pack the wordpiece sequence into multiple sub-sequences by using a sliding 58 | # window `window_length` that overlaps with previous windows according to the `stride`. Suppose we have 59 | # the following sentence: "I went to the store to buy some milk". Then a sliding window of length 4 and 60 | # stride of length 2 will split them up into: 61 | 62 | # "[I went to the] [to the store to] [store to buy some] [buy some milk [PAD]]". 63 | 64 | # This is to ensure that the model has context of as much of the sentence as possible to get accurate 65 | # embeddings. Finally, the sequences will be padded with any start/end piece ids, e.g., 66 | 67 | # "[CLS] I went to the [SEP] [CLS] to the store to [SEP] ...". 68 | 69 | # The embedder should then be able to split this token sequence by the window length, 70 | # pass them through the model, and recombine them. 71 | 72 | # The window length is `self.max_pieces` minus any additional start/end wordpieces; the stride is half of that 73 | window_length = self.max_pieces - len(self._start_piece_ids) - len(self._end_piece_ids) 74 | stride = window_length // 2 75 | 76 | # offsets[i] will give us the index into wordpiece_ids 77 | # for the wordpiece "corresponding to" the i-th input token. 78 | offsets = [] 79 | 80 | # If we're using initial offsets, we want to start at offset = len(self._start_piece_ids) 81 | # so that the first offset is the index of the first wordpiece of tokens[0]. 82 | # Otherwise, we want to start at len(self._start_piece_ids) - 1, so that the "previous" 83 | # offset is the last wordpiece of "tokens[-1]". 84 | offset = len(self._start_piece_ids) if self.use_starting_offsets else len(self._start_piece_ids) - 1 85 | 86 | # Count the number of wordpieces accumulated 87 | pieces_accumulated = 0 88 | for token in token_wordpiece_ids: 89 | # Truncate the sequence if specified, which depends on where the offsets are 90 | next_offset = 1 if self.use_starting_offsets else 0 91 | if self._truncate_long_sequences and offset + len(token) - 1 >= window_length + next_offset: 92 | break 93 | 94 | # For initial offsets, the current value of ``offset`` is the start of 95 | # the current wordpiece, so add it to ``offsets`` and then increment it. 96 | if self.use_starting_offsets: 97 | offsets.append(offset) 98 | offset += len(token) 99 | # For final offsets, the current value of ``offset`` is the end of 100 | # the previous wordpiece, so increment it and then add it to ``offsets``.
101 | else: 102 | offset += len(token) 103 | offsets.append(offset) 104 | 105 | pieces_accumulated += len(token) 106 | 107 | if len(flat_wordpiece_ids) <= window_length: 108 | # If all the wordpieces fit, then we don't need to do anything special 109 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids)] 110 | token_type_ids = self._extend(flat_token_type_ids) 111 | elif self._truncate_long_sequences: 112 | self._warn_about_truncation(tokens) 113 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids[:pieces_accumulated])] 114 | token_type_ids = self._extend(flat_token_type_ids[:pieces_accumulated]) 115 | else: 116 | # Create a sliding window of wordpieces of length `max_pieces` that advances by `stride` steps and 117 | # add start/end wordpieces to each window 118 | # TODO: this currently does not respect word boundaries, so words may be cut in half between windows 119 | # However, this would increase complexity, as sequences would need to be padded/unpadded in the middle 120 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids[i:i + window_length]) 121 | for i in range(0, len(flat_wordpiece_ids), stride)] 122 | 123 | token_type_windows = [self._extend(flat_token_type_ids[i:i + window_length]) 124 | for i in range(0, len(flat_token_type_ids), stride)] 125 | 126 | # Check for overlap in the last window. Throw it away if it is redundant. 127 | last_window = wordpiece_windows[-1][1:] 128 | penultimate_window = wordpiece_windows[-2] 129 | if last_window == penultimate_window[-len(last_window):]: 130 | wordpiece_windows = wordpiece_windows[:-1] 131 | token_type_windows = token_type_windows[:-1] 132 | 133 | token_type_ids = [token_type for window in token_type_windows for token_type in window] 134 | 135 | # Flatten the wordpiece windows 136 | wordpiece_ids = [wordpiece for sequence in wordpiece_windows for wordpiece in sequence] 137 | 138 | 139 | # Our mask should correspond to the original tokens, 140 | # because calling util.get_text_field_mask on the 141 | # "wordpiece_id" tokens will produce the wrong shape. 142 | # However, because of the max_pieces constraint, we may 143 | # have truncated the wordpieces; accordingly, we want the mask 144 | # to correspond to the remaining tokens after truncation, which 145 | # is captured by the offsets. 
146 | mask = [1 for _ in offsets] 147 | 148 | return {index_name: wordpiece_ids, 149 | f"{index_name}-offsets": offsets, 150 | f"{index_name}-type-ids": token_type_ids, 151 | "mask": mask} 152 | 153 | 154 | def _get_token_type_ids(wordpiece_ids: List[int], 155 | separator_ids: List[int]) -> List[int]: 156 | num_wordpieces = len(wordpiece_ids) 157 | token_type_ids: List[int] = [] 158 | type_id = 0 159 | cursor = 0 160 | while cursor < num_wordpieces: 161 | # check length 162 | if num_wordpieces - cursor < len(separator_ids): 163 | token_type_ids.extend(type_id 164 | for _ in range(num_wordpieces - cursor)) 165 | cursor += num_wordpieces - cursor 166 | # check content 167 | # when it is a separator 168 | elif all(wordpiece_ids[cursor + index] == separator_id 169 | for index, separator_id in enumerate(separator_ids)): 170 | token_type_ids.extend(type_id for _ in separator_ids) 171 | type_id += 1 172 | cursor += len(separator_ids) 173 | # when it is not 174 | else: 175 | cursor += 1 176 | token_type_ids.append(type_id) 177 | return token_type_ids 178 | 179 | 180 | def _get_nested_wordpiece_tokens(flat_wordpiece_tokens: List[str]): 181 | nested_wordpiece_tokens = [] 182 | nested = [] 183 | for wordpiece in flat_wordpiece_tokens: 184 | if wordpiece.startswith("##"): 185 | nested.append(wordpiece) 186 | else: 187 | nested = [wordpiece] 188 | nested_wordpiece_tokens.append(nested) 189 | return nested_wordpiece_tokens 190 | -------------------------------------------------------------------------------- /data/lm/test.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/lm/train.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/lm/valid.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-250.en: -------------------------------------------------------------------------------- 1 | a man in his living room ponders packing for a trip . 2 | an african-american woman sits at a brown table , wearing a purple dress , pink shoes , and black sunglasses . 3 | a man slouched in a chair on a city sidewalk girl watching . 4 | two men , one man selling fruit the other inspecting the fruit and conversing with the seller . 5 | a man and a woman hug on a street . 6 | many people in a stadium dressed in white are conversing with each other . 7 | many people have gathered to look at something that is not in the photo . 8 | a couple sits on a bench talking , while a woman walks a dog in the background . 9 | two women in spotted dresses walking down a sidewalk . 10 | a group of girls are playing in a water fountain in the sun . 11 | people landscaping and gardening the areas around the walkway .
12 | a balding man in red sunglasses wearing a green shirt standing in front of a building . 13 | a barefoot boy with a blue and white striped towel is standing on the beach . 14 | a couple and two girls are looking over a clear railing . 15 | people in a produce store picking produce to buy 16 | a tattoo artist applying tattoo ink to the skin . 17 | a woman standing next to two people is pointing to the sky . 18 | a couple walks down an isles at a store selling art and history books . 19 | a man with a name tag on is sitting in a chair . 20 | a man off in the distance by a buddhist temple . 21 | a man is balancing a metal ball on his arm . 22 | a negro male in a white t-shirt and a black hat sitting on the curb texting . 23 | a boy in a black shirt carries a blue bucket while walking with men dressed in white . 24 | a young dark-haired woman with red sun visor holding an open white umbrella amidst a crowd of people 25 | a man holding a small child who is wearing a backpack . 26 | a young man sets up pool balls , on a purple felt billiard . 27 | girls sitting with their hands on their laps 28 | a girl in a black tank with cargo shorts to what appears to be dancing with several people around . 29 | a boy and girl in black jumpsuits stand facing a girl in a pink jacket , with adults in the background . 30 | a man and woman pushing strollers are walking by some people who are selling items in tents . 31 | a woman in striped tights is being guided with strings . 32 | a construction worker in an orange vest lays down cobblestones . 33 | an asian woman sitting outside an outdoor market stall . 34 | a man lounges in his room without eating for days . 35 | a trailer drives down a red brick road . 36 | a young woman with brown hair and tank top is taking a picture with a camera . 37 | a person laying on a bench in front of a water feature . 38 | a smiling young man walking on next to the beach wearing a baseball cap , blue t-shirt and jeans . 39 | a group of people waving to a person on a balcony . 40 | a uniformed man in the army is training a german shepherd using an arm guard . 41 | a kid on skating ramp practicing cool moves . 42 | some men appear to be discussing something on a boat or ship . 43 | a man wearing an aviation cap and goggles sits in the road . 44 | people are crossing a tree lined street in front of a building . 45 | a man wearing a gray shirt resting his head on a table . 46 | a man in white shirt and dark shorts is working outside . 47 | a woman in black pants is looking at her cellphone . 48 | a young woman in a pink shirt attempting to rope a calf at the rodeo . 49 | a family is standing outdoors on a cloudy day . 50 | a very young child is sitting in the sink with paint on his body and face and is playing with the kitchen faucet . 51 | kids being spun around in a glass spinner . 52 | a man and a woman are holding up signs at a protest . 53 | a woman working on her deck on the weekend . 54 | an elderly man , wearing navy blue , is sitting on a bench along the street . 55 | a woman is taking a picture with her camera . 56 | an older male in blue jeans and brown coat is resting against an orange building . 57 | a man , hand on head , regards a bank of america advertisement . 58 | two women sitting on a bench at night in front of a store 59 | a man in a white shirt is sitting on a crate . 60 | an older man with tattoos and biker regalia lingers a moment on a city street . 61 | a man sits alone fishing along the shoreline . 
62 | an older man is sitting outside on a bench in front a large banner that says , " memoria justicia sin olvido . " 63 | a man on a bike in a gray jacket carries foliage . 64 | a boy with glasses wearing a bright yellow shirt is standing in a parking lot . 65 | two men walking down a dirt path . 66 | a mother in a blue beret and blue shoes with her two sons . 67 | two dogs playing with a blue and green ball . 68 | a man wearing a bright , multi-color helmet is sitting on a motorcycle . 69 | a man crouches in front of a yellow wall . 70 | looks like a farmers market , a few tables with various items displayed . 71 | a blond boy in a blue shirt is sitting with a woman wearing glasses . 72 | a man in a sleek white shirt gazes into the woman 's eyes while holding on to the back of her black and pink dress . 73 | people play in a fountain at twilight . 74 | four boys posing while one boy sets his drink down . 75 | on a busy street a lady carries goods on her head . 76 | a street performer in an orange jumpsuit rides a tall unicycle as a crowd watches 77 | a person sitting on chair in front of a crowd . 78 | a male is waiting for the train to arrive at the platform . 79 | three white men in t-shirt jump into the air . 80 | a singer doing a stage dive into the crowd . 81 | a scuba diving class taking a picture during class time . 82 | a woman in a striped shirt folds her arms while standing in a grocery store . 83 | a smashed car with many firefighters cutting into the car . 84 | several people wait to checkout inside a store with a warehouse looking ceiling . 85 | 2 girls playing volleyball , one striking the ball . 86 | an older man is pouring something out of a bag into the water . 87 | six people are in a gymnasium working on repairing some bicycles . 88 | two young asian boys spar with each other . 89 | a woman prepares ingredients for a bowl of soup . 90 | two men wearing shorts are working on a blue bike . 91 | a man and woman taking a nap on a makeshift rip . 92 | there is a man wrapped in a blanket of some sort sliding down a hill that is covered in snow . 93 | a man and woman enjoying dinner at a party . 94 | a national guard soldier leading a group of other national guard soldiers singing the national anthem . 95 | two youths walk down an inclined street . 96 | a woman in a restaurant is drinking out of a coconut , using a straw . 97 | a woman wearing a blue uniform stands and looks down . 98 | one man in shorts is talking to another man in blue jeans in front of a sink . 99 | a man feeding a baby in a highchair . 100 | a young helmeted biker in blue takes to the air while going over small hills . 101 | a little baby in a pink hat lying naked and sleeping . 102 | two children on their stomachs lay on the ground under a pipe . 103 | two people are talking near a red phone booth while construction workers rest nearby . 104 | a native woman is working on a craft project . 105 | children chasing the ball in a soccer game . 106 | two children jumping on a screened in blue and black trampoline while outside surrounded by trees . 107 | two dogs run in a field looking at an unseen frisbee . 108 | a drummer and guitar player play a show in a dark area . 109 | a trio of people are hiking throughout a heavily snowed path . 110 | a group of people are near a small river in the middle of a city . 111 | a woman peeks into a telescope in the woods . 112 | a little blond girl with a polka dot shirt is giving a stuffed animal a " bath " in a sink . 
113 | a shaggy young male with a nose ring brushes his teeth . 114 | a snow skier flies through the air , while other skiers going up the tow rope look on . 115 | three girls are horseback riding with the focus on the youngest girl . 116 | heavyset woman blowing her hair with a hair dryer smiling all happy 117 | a chef working in a kitchen using a knife . 118 | a man moving flowers while a woman makes a gesture at him . 119 | a woman singing into a microphone while a man plays drums in the background . 120 | the man wearing the cap is handing a freshly caught fish to the boy in the purple hat . 121 | a worker clinging to a tree . 122 | several people standing around a bowl , in which one man is manipulating a brown object . 123 | a rollerblading monk with some nice sunglasses prays before doing some sick tricks . 124 | a woman wearing sunglasses and a blue shirt , selling sea shells , looks at an older man wearing a black shirt and a cap . 125 | an old man sweeps the floor as a lady walks away from the camera . 126 | a man with dreadlocks is playing with the hair of a woman who is sitting on a chair on a cobblestone street . 127 | a man is working at a construction site . 128 | a man is in the middle of hitting a red , white and blue volleyball . 129 | a man stands in a mobile food stand , looking out the half-door . 130 | a baseball player with a red helmet and white pants is tagged by the catcher while running to home base . 131 | a man in an orange robe sweeping outside . 132 | man wearing an purple shirt working in a biology lab . 133 | a man is leading two small ponies on a walk at a park . 134 | 2 females , 1 from germany and 1 from china , compete in a wrestling match on a mat . 135 | a man and woman sleeping on a bench . 136 | a man and a woman are sitting on the floor in front of luggage . 137 | a rickshaw operator waiting for his next costumer . 138 | two girls are seated at a table and working on craft projects . 139 | a man runs through the snow with the aid of snowshoes . 140 | a dancer in a red suit is jumping in the air . 141 | a man is parked while inside of a sanitation truck . 142 | a bearded man in a heavy jacket sits in a corner with a paper cup . 143 | male on a skateboard , using an empty pool as a ramp on a very pretty day . 144 | a man with a large hat in the bushes . 145 | a closeup of a child 's face eating a blue , heart shaped lollipop . 146 | a man and woman are working on replacing a bike tire tube . 147 | a young boy shows his brown and green bead necklace . 148 | a man in a gray t-shirt works the bellows to start a fire on a brick oven inside a wooden shed . 149 | a woman standing in front of trees and smiling . 150 | someone in asian costume is sitting down and holding a sword . 151 | a young football player is setting up for a field goal . 152 | a guy in a bright green hoodie is crossing a crosswalk while looking at an accident between some cars and a bike . 153 | a young man gets ready to kick a soccer ball . 154 | a competitive runner taking her first sprint in a competition . 155 | two men , one in blue and one in red , compete in a boxing match . 156 | a group of friends lay sprawled out on the floor enjoying their time together . 157 | a standing man holds a microphone in front of a man holding a guitar . 158 | roller derby girl skating with others . 159 | a woman getting a bag of ice at a store . 160 | two motorcyclists racing neck and neck around a corner . 161 | one man standing alone on a sidewalk adjusting his hat . 
162 | a man with a disability who doesn 't have legs is walking with another man who is entered into a marathon . 163 | multiple bodies collide in a soccer match . 164 | two people , one dressed as a nun and the other in a roger smith t-shirt , running in a foot race past onlookers in a wooded area . 165 | three men on horses during a race . 166 | a dark-skinned man in white shirts and a black sleeveless shirt flips his skateboard on a cement surface surrounded by tall buildings and palm trees . 167 | one man , wearing a hooded sweatshirt , sitting at a fountain watching the people of the city . 168 | swimmers stand on various levels of a large diving board complex in a room with figures from mythology painted on the wall . 169 | a young indian boy sitting down thinking about his future . 170 | a boy in a hoodie is throwing an object into a dirty swimming pool . 171 | colorful costumed men in a performance . 172 | two boxers are ready for their fight as the crowd watches with anticipation . 173 | children are playing a sport on a field . 174 | a female performer sings and plays the guitar in front of a microphone . 175 | boy 's are competing in martial arts . 176 | a group of black people performing in orange shirts in front of a fenced off park . 177 | schoolgirls in uniforms march in a parade while playing flute-like instruments . 178 | a man stuffs a fowl from ingredients in a blue bowl . 179 | the boy leaps of his bed with a karate kick . 180 | a baby in a bouncy seat and a standing boy surrounded by toys . 181 | these are people gathered around the table playing jenga . 182 | i see a bearded man and elderly lady sharing a bowl of food . 183 | a young man is skateboarding on a cement block wall . 184 | people boating on a lake with the sun through the clouds in the distance . 185 | two men wearing martial arts clothing are practicing martial arts . 186 | a boy band and no one even matches someone should have sent a memo . 187 | a group of go-cart riders are racing around a go-cart track . 188 | a person dressed in winter clothes poses with a snowman surrounded by snow covered landscape . 189 | a man is giving a presentation in front of a crowd . 190 | a middle-aged man is taping up the knee of a younger football player who is sitting on a trainers table . 191 | a tattooed man wearing overalls on a stage holding a microphone . 192 | a little girl plays with an miniature electric circuit consisting of three light bulbs and a battery . 193 | man wearing blue helmet merges into traffic on a bicycle . 194 | a bunch of young adults stare in concentration at their computer monitors as they competitively game . 195 | a young boy in green practices juggling in a parking lot . 196 | a little girl in a dotted dress looks back towards a woman in a black dress . 197 | a man adjusts the engine of a boat near the water . 198 | there is a tennis match being played at night in this stadium . 199 | a child snowboarder coming to a stop 200 | a soccer game is played as two men attempt to reach the ball before their respected opponent . 201 | young women and children in a village , with a single woman focused on the camera . 202 | toddler in a green shirt is brushing his teeth with a yellow toothbrush , while being supervised by mom . 203 | a man in a wheelchair and wearing a red jogging suit is carrying a torch . 204 | a son and his parents are taking a group picture in a church . 205 | three men competing in a hurdle race . 
206 | two men are observing another as he puts the finishing touches on wet cement . 207 | a soldier is looking at binoculars into the mountainous landscape . 208 | a group of young boys race on a snowy day . 209 | a man jumps rope while a crowd of people watch him . 210 | a young boy and girl are laughing together as the girl holds up a hand sign . 211 | two men on opposing teams race toward a soccer ball . 212 | a biker wearing a yellow shirt pulls of an incredible trick in the air . 213 | two female kickboxers , one with a purple sports bra , battle it out in an arena . 214 | two men on fast motorcycle speeding around a corner on a racetrack . 215 | two men guard the man with the basketball during a game at dusk . 216 | a man wearing riding boots and a helmet is riding a white horse , and the horse is jumping a hurdle . 217 | a group of men in costume play music . 218 | man and women look through milk crates full of records or pictures on sale . 219 | a racing catamaran is lifted onto one hull in the water . 220 | a group of marines walking down the road with american flags and other military flags . 221 | a man holding a drinking glass at the camera . 222 | one man on stage , playing a guitar with lights in the background . 223 | four young kids playing with empty canisters . 224 | a group of workers are listening to instruction from a colleague . 225 | a shriner rides a large green tractor down the road during a parade . 226 | three dogs are playing in the water . 227 | a man plays a drum and a little boy hits his own , little drum . 228 | a group of girls playing a game on horseback . 229 | a man watching as a woman fires a gun with a smile at a firing range . 230 | the bike leader pedals for his life as competing countries gain his tail . 231 | a team of soccer players is huddled and having a serious discussion . 232 | a group of people in purple shirts and tan pants all walking in the same direction . 233 | a man in a bright shirt is playing trumpet . 234 | the guy with the jean shorts is at the skate park doing tricks on his bike . 235 | mother and daughter wearing alice in wonderland customs are posing for a picture . 236 | a little boy using a drill to make a hole in a piece of wood . 237 | two bicyclists are racing each other on a dirt track . 238 | a frisbee is being thrown to the girl while the other girl appears to be asking for it . 239 | an orthodontist working on a patient , while a man holds the light . 240 | two people eat hamburgers on lawn chairs while a third drinks a can of soda . 241 | two bicyclists ride down the street past people while talking . 242 | at a bowling alley , a man in a black shirt is holding a bowling ball and looking down the lane . 243 | two men in black clothes with blue and red bowties are performing in front of a crowd . 244 | two men , one black and one white , play their guitars and sing into microphones as they stand outdoors . 245 | a kickboxer lands a flying knee into the face of his opponent . 246 | three people are running a race around a red track . 247 | football players struggle to get plays through a tough line . 248 | several men are praying while standing at the end of a table of food . 249 | two indian children in formal costume happily performing a ritual dance . 250 | three children standing near each other and next to a tall blue wooden post . 
251 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-250.fr: -------------------------------------------------------------------------------- 1 | un homme dans son salon réfléchit aux affaires qu' il va emmener en voyage . 2 | une femme afro-américaine est assise à une table marron , vêtue d' une robe violette , de chaussures roses et de lunettes de soleil noires . 3 | un homme affalé dans une chaise sur un trottoir en ville , regardant une fille . 4 | deux hommes , l' un vendant des fruits et l' autre les inspectant et parlant avec le vendeur . 5 | un homme et une femme s' étreignent dans une rue . 6 | de nombreuses personnes vêtues de blanc dans un stade sont en train de discuter entre elles . 7 | beaucoup de personnes se sont rassemblées pour regarder quelque chose qui n' est pas sur la photo . 8 | un couple est assis sur un banc en train de parler , tandis qu' une femme promène un chien en arrière-plan . 9 | deux femmes en robes à pois marchant sur un trottoir . 10 | un groupe de filles jouent dans une fontaine au soleil . 11 | des paysagistes et des jardiniers travaillant aux alentours de l' allée . 12 | un homme dégarni avec des lunettes de soleil rouges et un t-shirt vert debout devant un bâtiment . 13 | un garçon pieds nus avec une serviette rayée bleue et blanche est debout sur la plage . 14 | un couple et deux filles regardent par-dessus une balustrade transparente . 15 | des gens dans un magasin choisissant des produits à acheter 16 | un tatoueur applique un tatouage à l' encre sur la peau . 17 | une femme debout à côté de deux personnes pointe le doigt vers le ciel . 18 | un couple marche dans un rayon d' un magasin vendant des livres d' art et d' histoire . 19 | un homme avec un badge est assis dans un fauteuil . 20 | un homme au loin près d' un temple bouddhiste . 21 | un homme tient une boule métallique en équilibre sur son bras . 22 | un homme noir en t-shirt blanc et casquette noire assis sur le trottoir , envoyant un texto . 23 | un garçon en chemise noire portant un seau bleu tandis qu' il marche avec des hommes vêtus de blanc . 24 | une jeune femme aux cheveux foncés avec une visière rouge , tenant un parapluie blanc ouvert au milieu d' une foule de personnes 25 | un homme tenant un petit enfant qui porte un sac à dos . 26 | un jeune homme installe des boules de billard , sur un tapis de billard violet . 27 | des filles assises avec les mains sur les genoux 28 | une fille en débardeur noir et short cargo semble être en train de danser avec plusieurs personnes autour . 29 | un garçon et une fille en survêtements noirs sont debout face à une fille en veste rose , avec des adultes en arrière-plan . 30 | un homme et une femme avec des poussettes marchent près de gens qui vendent des articles dans des tentes . 31 | une femme en collants rayés est dirigée par des ficelles . 32 | un ouvrier du bâtiment en gilet orange pose des pavés . 33 | une femme asiatique assise devant un étal de marché extérieur . 34 | un homme paresse dans sa chambre , sans avoir mangé depuis plusieurs jours . 35 | une remorque roule sur une route pavée rouge . 36 | une jeune femme avec des cheveux bruns et un débardeur prend une photo avec un appareil photo . 37 | une personne allongée sur un banc devant de l' eau . 38 | un jeune homme souriant , marchant au bord de la plage avec une casquette , un t-shirt bleu et un jean . 39 | un groupe de personnes saluant quelqu' un sur un balcon . 
40 | un homme en uniforme militaire entraîne un berger allemand en utilisant un protège-bras . 41 | un enfant sur une rampe de skateboard , répétant des mouvements cools . 42 | des hommes semblent discuter de quelque chose sur un bateau ou un navire . 43 | un homme portant un bonnet et des lunettes d' aviateur est assis sur la route . 44 | des gens traversent une rue bordée d' arbres devant un bâtiment . 45 | un homme vêtu d' une chemise grise posant sa tête sur une table . 46 | un homme en t-shirt blanc et short noir travaille dehors . 47 | une femme en pantalon noir regarde son téléphone portable . 48 | une jeune femme en débardeur rose tentant d' attraper un veau au lasso lors d' un rodéo . 49 | une famille est debout dehors lors d' une journée nuageuse . 50 | un très jeune enfant est assis dans l' évier avec de la peinture sur son corps et son visage , et il joue avec le robinet de la cuisine . 51 | des enfants tournant dans un tourniquet en verre . 52 | un homme et une femme brandissent des pancartes lors d' une manifestation . 53 | une femme travaillant sur sa terrasse le week-end . 54 | un homme âgé , habillé en bleu marine , est assis sur un banc le long de la rue . 55 | une femme prend une photo avec son appareil . 56 | un homme âgé en jean et manteau marron se repose contre un bâtiment orange . 57 | un homme , une main sur la tête , regarde une publicité pour la bank of america . 58 | deux femmes assises sur un banc la nuit devant un magasin 59 | un homme en t-shirt blanc est assis sur une caisse . 60 | un vieil homme avec des tatouages et des insignes de motard s' attarde un moment dans une rue en ville . 61 | un homme est assis seul , pêchant le long du littoral . 62 | un vieil homme est assis dehors sur un banc devant une grande banderole où est écrit " memoria justicia sin olvido " 63 | un homme en veste grise sur un vélo transporte de la verdure . 64 | un garçon avec des lunettes vêtu d' un t-shirt jaune vif est debout sur un parking . 65 | deux hommes marchant sur un chemin en terre . 66 | une mère portant un béret et des chaussures bleus avec ses deux fils . 67 | deux chiens jouant avec un ballon bleu et vert . 68 | un homme portant un casque multicolore brillant est assis sur une moto . 69 | un homme est accroupi devant un mur jaune . 70 | on dirait un marché paysan , et quelques tables avec divers produits exposés . 71 | un garçon blond en t-shirt bleu est assis avec une femme portant des lunettes . 72 | un homme avec une chemise blanche élégante regarde fixement les yeux de la femme , tout en tenant le dos de sa robe rose et noire . 73 | des gens jouent dans une fontaine au crépuscule . 74 | quatre garçons posant tandis que l' un pose sa boisson . 75 | dans une rue très fréquentée , une femme porte des produits sur sa tête . 76 | un artiste de rue en combinaison orange est sur un grand monocycle tandis qu' une foule regarde 77 | une personne assise sur une chaise devant une foule . 78 | un homme attend que le train arrive sur le quai . 79 | trois hommes blancs en t-shirts sautent en l' air . 80 | un chanteur plongeant de la scène dans la foule . 81 | une classe de plongée prenant une photo pendant le cours . 82 | une femme en polo rayé croise les bras tout en étant debout dans un supermarché . 83 | une voiture écrasée avec de nombreux pompiers la découpant . 84 | plusieurs personnes attendent pour passer à la caisse dans un magasin avec un plafond ressemblant à celui d' un entrepôt . 85 | deux filles jouant au volleyball , l' une frappant le ballon . 
86 | un homme âgé verse quelque chose se trouvant dans un sac dans l' eau . 87 | six personnes sont dans un gymnase , en train de réparer des vélos . 88 | deux jeunes garçons asiatiques se battent l' un contre l' autre . 89 | une femme prépare des ingrédients pour un bol de soupe . 90 | deux hommes en shorts travaillant sur un vélo bleu . 91 | un homme et une femme faisant une sieste sur un radeau de fortune . 92 | il y a un homme enveloppé dans une sorte de couverture , glissant sur une pente recouverte de neige . 93 | un homme et une femme appréciant un dîner lors d' une fête . 94 | un soldat de la garde nationale menant un groupe d' autres soldats de la garde nationale qui chantent l' hymne national . 95 | deux jeunes marchent dans une rue en pente . 96 | une femme dans un restaurant , est en train de boire une noix de coco , à l' aide d' une paille . 97 | une femme portant un uniforme bleu est debout et regarde vers le bas . 98 | un homme en short parle à un autre homme en jean devant un évier . 99 | un homme nourrissant un bébé dans une chaise haute . 100 | un jeune cycliste portant un casque habillé de bleu s' envole en l' air en passant sur de petites collines . 101 | un petit bébé avec un chapeau rose allongé nue en train de dormir . 102 | deux enfants sont allongés à plat ventre par terre sous un tuyau . 103 | deux personnes parlent près d' une cabine téléphonique rouge tandis que des ouvriers du bâtiment se reposent à proximité . 104 | une femme autochtone travaille sur un projet artisanal . 105 | des enfants courant après le ballon lors d' un match de football . 106 | deux enfants sautant sur un trampoline protégé bleu et noir , situé dehors et entouré d' arbres . 107 | deux chiens courent dans un champ , regardant un frisbee invisible . 108 | un batteur et un guitariste font un concert dans un endroit sombre . 109 | trois personnes font une randonnée sur un chemin très enneigé . 110 | un groupe de gens sont près d' une petite rivière au milieu d' une ville . 111 | une femme jette un coup d " œil dans un télescope dans les bois . 112 | une petite fille blonde avec une t-shirt à pois donne un " bain " à un animal en peluche dans un évier . 113 | un jeune homme hirsute avec un piercing au nez se brosse les dents . 114 | un skieur vole dans les airs , tandis que d' autres skieurs prenant le tire-fesse le regardent . 115 | trois filles font de l' équitation , avec la photo centrée sur la plus jeune . 116 | une femme corpulente soufflant dans ses cheveux avec un sèche-cheveux , souriante et très heureuse 117 | un cuisinier travaillant dans une cuisine avec un couteau . 118 | un homme déplaçant des fleurs tandis qu' une femme lui fait un geste . 119 | une femme chantant dans un micro tandis qu' un homme joue de la batterie en arrière-plan . 120 | l' homme avec la casquette donne un poisson fraîchement pêché au garçon en chapeau violet . 121 | un ouvrier se cramponnant à un arbre . 122 | plusieurs personnes debout autour d' un récipient , dans lequel un homme manipule un objet marron . 123 | un moine faisant du roller avec de belles lunettes de soleil prie avant de faire des figures insensées . 124 | une femme avec des lunettes de soleil et un t-shirt bleu , qui vend des coquillages , regarde un vieil homme portant un t-shirt noir et une casquette . 125 | un vieil homme balaie le sol tandis qu' une femme s' éloigne de l' objectif . 126 | un homme avec des dreadlocks joue avec les cheveux d' une femme qui est assise sur une chaise dans une rue pavée . 
127 | un homme travaille sur un chantier . 128 | un homme est en train de frapper un ballon de volley rouge , blanc et bleu . 129 | un homme est debout dans un food truck , regardant par la demi-porte . 130 | un joueur de baseball avec un casque rouge et un pantalon blanc est touché par le receveur tandis qu' il court vers le marbre . 131 | un homme en toge orange balayant dehors . 132 | un homme vêtu d' un t-shirt violet travaillant dans un laboratoire de biologie . 133 | un homme promène deux petits poneys dans un parc . 134 | deux filles , une allemande et une chinoise , s' affrontent lors d' un combat de judo sur un tapis . 135 | un homme et une femme dormant sur un banc . 136 | un homme et une femme sont assis sur le sol devant des bagages . 137 | un conducteur de pousse-pousse attendant son prochain client . 138 | deux filles sont assises à une table et travaillent sur des projets artisanaux . 139 | un homme court dans la neige avec des raquettes . 140 | une danseuse en costume rouge saute en l' air . 141 | un homme est garé tout en étant à l' intérieur d' un camion d' assainissement . 142 | un homme barbu avec une veste épaisse est assis dans un coin avec un gobelet en carton . 143 | un homme en skateboard , utilisant une piscine vide comme rampe lors d' une très belle journée . 144 | un homme avec un grand chapeau dans les buissons . 145 | un gros plan du visage d' un enfant mangeant une sucette bleue en forme de cœur . 146 | un homme et une femme remplacent une chambre à air de vélo . 147 | un jeune garçon montre son collier de perles marron et vertes . 148 | un homme vêtu d' un t-shirt gris actionne le soufflet pour démarrer un feu sur un four en briques dans une cabane en bois . 149 | une femme debout devant des arbres et souriant . 150 | quelqu' un en costume asiatique est assis et tient une épée . 151 | un jeune footballeur américain se prépare pour un field goal . 152 | un mec dans un sweat à capuche vert vif traverse un passage piéton en regardant un accident entre une voiture et un vélo . 153 | un jeune homme s' apprête à tirer dans un ballon de foot . 154 | un coureur de compétition faisant son premier sprint dans une compétition . 155 | deux hommes , l' un en bleu et l' autre en rouge , combattent dans un match de boxe . 156 | un groupe d' amis gis sur le plancher de s' amusant ensemble . 157 | un homme debout tient un micro devant un homme tenant une guitare . 158 | une fille faisant du roller derby patine avec d' autres . 159 | une femme prenant un sac de glace dans un magasin . 160 | deux motards font la course au coude à coude dans un virage . 161 | un homme seul sur un trottoir réajustant son chapeau . 162 | un homme handicapé qui n' a pas de jambes marche avec un autre homme qui se lance dans un marathon . 163 | plusieurs mecs entrent en collision dans un match de football . 164 | deux personnes , l' une vêtue comme une religieuse et l' autre en t-shirt " roger smith " , engagées dans une course à pied , dépassant les spectateurs dans une zone boisée . 165 | trois jockeys pendant une course . 166 | un homme de couleur en chemise blanche et un t-shirt sans manches noir fait sauter son skateboard sur une surface cimentée entourée de hauts bâtiments et de palmiers . 167 | un homme , vêtu d' un sweat-shirt à capuche , assis à une fontaine à regarder les gens dans la ville . 168 | des nageurs sont à différents niveaux d' un grand complexe de plongées dans une pièce avec des représentations de la mythologie peintes sur le mur . 
169 | un jeune garçon indien assis à réfléchir sur son avenir . 170 | un garçon en sweat à capuche est en train de jeter un objet dans une piscine sale . 171 | des hommes en costumes colorés pendant un spectacle . 172 | deux boxeurs sont prêts à combattre pendant que le public regarde avec impatience . 173 | les enfants jouent à un sport sur un terrain . 174 | une artiste chante et joue de la guitare devant un micro . 175 | des garçons concourent à un art martial . 176 | un groupe de personnes noires en chemises oranges en face d' un parc clôturé . 177 | des écolières en uniforme défilent dans un défilé en jouant d' un instrument ressemblant à une flute . 178 | un homme fourre une volaille avec des ingrédients qui sont dans un bol bleu . 179 | le garçon saute de son lit en faisant un coup de pied de karaté . 180 | un bébé dans un siège rebondissant et un garçon entourés de jouets . 181 | ce sont des gens réunis autour de la table jouant à jenga . 182 | je vois un homme barbu et une dame âgée partageant un bol de nourriture . 183 | un jeune homme fait du skateboard sur un mur de parpaings . 184 | des gens faisant du bateau sur un lac avec le soleil traversant les nuages au loin . 185 | deux hommes portant des kimonos s' exercent aux arts martiaux . 186 | a boys band et ils ne sont même pas assortis , quelqu' un aurait pu leur dire . 187 | un groupe de pilotes de kart font une course autour d' une piste de karting . 188 | une personne vêtue de vêtements d' hiver prend la pose avec un bonhomme de neige au milieu d' un paysage enneigé . 189 | un homme est en train de donner une présentation devant un public . 190 | un homme d' âge mûr est tapote le genou d' un jeune joueur de football qui est assis sur une table d' entraînement . 191 | un homme tatoué , vêtu d' une salopette , sur une scène tenant un micro . 192 | une petite fille joue avec un circuit électrique miniature , composé de trois ampoules électriques et d' une batterie 193 | un homme à vélo portant un casque bleu s' insère dans la circulation . 194 | une bande de jeunes adultes , en pleine concentration , regardent fixement leurs écrans d' ordinateur pendant une compétition de jeu . 195 | un jeune garçon en vert jongle dans un parking . 196 | une petite fille dans une robe à poids regarde en arrière vers une femme en robe noire . 197 | un homme règle le moteur d' un bateau près de l' eau . 198 | il y a un match de tennis en train d' être joué de nuit dans ce stade . 199 | un enfant faisant du snowboard en train de s' arrêter 200 | un match de football se joue et deux hommes tentent d' atteindre le ballon avant leur adversaire respectif . 201 | de jeunes femmes et des enfants dans un village , avec une femme au centre de l' objectif . 202 | un jeune enfant en maillot verte se brosse les dents avec une brosse à dents jaune , tout en étant supervisé par maman . 203 | un homme en fauteuil roulant et portant un costume de jogging rouge porte une torche olympique . 204 | un fils et ses parents prennent une photo de groupe dans une église . 205 | trois hommes participent à une course d' obstacles . 206 | deux hommes observent un autre alors qu' il met la touche finale au ciment humide . 207 | un soldat regarde dans les jumelles vers le paysage montagneux . 208 | un groupe de jeunes garçons courant pendant un jour enneigé . 209 | un homme saute à la corde tandis qu' une foule de personnes le regardent . 210 | un jeune garçon et une jeune fille rient ensemble tandis que la fille fait un signe de main . 
211 | deux hommes de deux équipes adverses courent vers un ballon de football . 212 | un cycliste portant un maillot jaune réalise un tour incroyable dans l' air . 213 | deux kickboxers femelles , l' une avec un bustier pourpre , s' affrontent dans une arène . 214 | deux hommes sur des motos rapides accélèrent autour d' un coin sur un circuit . 215 | deux hommes défendent contre l' homme avec le basket-ball pendant un jeu au crépuscule . 216 | un homme portant des bottes d' équitation et un casque montant sur un cheval blanc qui saute un obstacle . 217 | un groupe d' hommes en costume jouent de la musique . 218 | l' homme et les femmes regardent à travers des caisses de lait pleines de disques ou d' images en vente . 219 | un catamaran de course est soulevé sur une seule coque dans l' eau . 220 | un groupe de marines marchant dans la route avec des drapeaux américains et d' autres drapeaux militaires . 221 | un homme tenant un verre d' alcool face au caméra . 222 | un homme sur scène , jouant de la guitare avec des lumières en arrière-plan . 223 | quatre jeunes enfants jouent avec des bidons vides . 224 | un groupe de travailleurs écoute les instructions d' un collègue . 225 | un mannequin monte un grand tracteur vert sur la route lors d' un défilé . 226 | trois chiens jouent dans l' eau . 227 | un homme joue du tambour et un petit garçon frappe sur son propre petit tambour . 228 | un groupe de filles jouant un jeu à cheval . 229 | un homme qui regarde pendant qu' une femme tire un fusil avec un sourire à un champ de tir . 230 | le leader de vélo pédales pour sa vie puisque les pays concurrents le rejoignent . 231 | une équipe de joueurs de football sont entassés et discutent sérieusement . 232 | un groupe de personnes en maillots violettes et pantalons de couleur marron marchant tous dans la même direction . 233 | un homme dans une chemise brillante joue de la trompette . 234 | le mec avec un short en jean est au planchodrome faisant des tours sur son vélo . 235 | mère et fille portant un costume d' alice au pays des merveilles posent pour une photo . 236 | un petit garçon utilisant une perceuse pour faire un trou dans un morceau de bois . 237 | deux cyclistes se battent sur une piste de terre . 238 | un frisbee est jeté vers la jeune fille tandis que l' autre fille semble le demander . 239 | un orthodontiste soigne un patient , tandis qu' un homme tient la lumière . 240 | deux personnes mangent des hamburgers sur des chaises de jardin tandis qu' un troisième boit une canette de soda . 241 | deux cyclistes dans la rue dépassent les gens tout en parlant . 242 | dans un bowling , un homme en maillot noire tient une boule de bowling et regarde en bas de la voie . 243 | deux hommes en tenue noire avec des nœuds papillons bleues et rouges se produisent devant une foule . 244 | deux hommes , un noir et un blanc , jouent avec leurs guitares et chantent dans les microphones pendant qu' ils se tiennent à l' extérieur . 245 | un kickboxer atterrit avec son genou battant dans le visage de son adversaire . 246 | trois personnes courent dans une course autour d' une piste rouge . 247 | les joueurs de football luttent pour obtenir des jeux à travers une ligne dure . 248 | plusieurs hommes prient en se tenant au bout d' une table de nourriture . 249 | deux enfants indiens en costume officiel accomplissant heureusement une danse rituelle . 250 | trois enfants debout les uns à côté des autres , à côté d' un grand poteau en bois bleu . 
251 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.en: -------------------------------------------------------------------------------- 1 | a group of men are loading cotton onto a truck 2 | a man sleeping in a green room on a couch . 3 | a boy wearing headphones sits on a woman 's shoulders . 4 | two men setting up a blue ice fishing hut on an iced over lake 5 | a balding man wearing a red life jacket is sitting in a small boat . 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.en.jsonl: -------------------------------------------------------------------------------- 1 | {"source": "a man in his living room ponders packing for a trip .", "target": "un homme dans son salon réfléchit aux affaires qu' il va emmener en voyage ."} 2 | {"source": "an african-american woman sits at a brown table , wearing a purple dress , pink shoes , and black sunglasses .", "target": "une femme afro-américaine est assise à une table marron , vêtue d' une robe violette , de chaussures roses et de lunettes de soleil noires ."} 3 | {"source": "a man slouched in a chair on a city sidewalk girl watching .", "target": "un homme affalé dans une chaise sur un trottoir en ville , regardant une fille ."} 4 | {"source": "two men , one man selling fruit the other inspecting the fruit and conversing with the seller .", "target": "deux hommes , l' un vendant des fruits et l' autre les inspectant et parlant avec le vendeur ."} 5 | {"source": "a man and a woman hug on a street .", "target": "un homme et une femme s' étreignent dans une rue ."} 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.fr: -------------------------------------------------------------------------------- 1 | un groupe d' hommes chargent du coton dans un camion 2 | un homme dormant dans une chambre verte sur un canapé . 3 | un garçon avec un casque est assis sur les épaules d' une femme . 4 | deux hommes installant une tente de pêche sur glace bleue sur un lac gelé 5 | un homme chauve vêtu d' un gilet de sauvetage rouge est assis dans un petit bateau . 
-------------------------------------------------------------------------------- /data/run_fever.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/data/run_fever.png -------------------------------------------------------------------------------- /data/squad/dev-v2.0-small.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "v2.0", 3 | "data": [{ 4 | "title": "Normans", 5 | "paragraphs": [{ 6 | "qas": [{ 7 | "question": "In what country is Normandy located?", 8 | "id": "56ddde6b9a695914005b9628", 9 | "answers": [{ 10 | "text": "France", 11 | "answer_start": 159 12 | }], 13 | "is_impossible": false 14 | }, { 15 | "question": "When were the Normans in Normandy?", 16 | "id": "56ddde6b9a695914005b9629", 17 | "answers": [{ 18 | "text": "10th and 11th centuries", 19 | "answer_start": 94 20 | }], 21 | "is_impossible": false 22 | }, { 23 | "question": "From which countries did the Norse originate?", 24 | "id": "56ddde6b9a695914005b962a", 25 | "answers": [{ 26 | "text": "Denmark, Iceland and Norway", 27 | "answer_start": 256 28 | }], 29 | "is_impossible": false 30 | }, { 31 | "plausible_answers": [{ 32 | "text": "Rollo", 33 | "answer_start": 308 34 | }], 35 | "question": "Who did King Charles III swear fealty to?", 36 | "id": "5ad39d53604f3c001a3fe8d3", 37 | "answers": [], 38 | "is_impossible": true 39 | }, { 40 | "plausible_answers": [{ 41 | "text": "10th century", 42 | "answer_start": 671 43 | }], 44 | "question": "When did the Frankish identity emerge?", 45 | "id": "5ad39d53604f3c001a3fe8d4", 46 | "answers": [], 47 | "is_impossible": true 48 | }], 49 | "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries." 50 | }, { 51 | "qas": [{ 52 | "question": "Who was the duke in the battle of Hastings?", 53 | "id": "56dddf4066d3e219004dad5f", 54 | "answers": [{ 55 | "text": "William the Conqueror", 56 | "answer_start": 1022 57 | }], 58 | "is_impossible": false 59 | }, { 60 | "plausible_answers": [{ 61 | "text": "Antioch", 62 | "answer_start": 1295 63 | }], 64 | "question": "What principality did William the conquerer found?", 65 | "id": "5ad3a266604f3c001a3fea2b", 66 | "answers": [], 67 | "is_impossible": true 68 | }], 69 | "context": "The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. 
The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands." 70 | }] 71 | }, { 72 | "title": "Computational_complexity_theory", 73 | "paragraphs": [{ 74 | "qas": [{ 75 | "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?", 76 | "id": "56e16182e3433e1400422e28", 77 | "answers": [{ 78 | "text": "Computational complexity theory", 79 | "answer_start": 0 80 | }], 81 | "is_impossible": false 82 | }, { 83 | "plausible_answers": [{ 84 | "text": "algorithm", 85 | "answer_start": 472 86 | }], 87 | "question": "What is a manual application of mathematical steps?", 88 | "id": "5ad5316b5b96ef001a10ab76", 89 | "answers": [], 90 | "is_impossible": true 91 | }], 92 | "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm." 
93 | }, { 94 | "qas": [{ 95 | "question": "What measure of a computational problem broadly defines the inherent difficulty of the solution?", 96 | "id": "56e16839cd28a01900c67887", 97 | "answers": [{ 98 | "text": "if its solution requires significant resources", 99 | "answer_start": 46 100 | }], 101 | "is_impossible": false 102 | }, { 103 | "question": "What method is used to intuitively assess or quantify the amount of resources required to solve a computational problem?", 104 | "id": "56e16839cd28a01900c67888", 105 | "answers": [{ 106 | "text": "mathematical models of computation", 107 | "answer_start": 176 108 | }], 109 | "is_impossible": false 110 | }, { 111 | "question": "What are two basic primary resources used to guage complexity?", 112 | "id": "56e16839cd28a01900c67889", 113 | "answers": [{ 114 | "text": "time and storage", 115 | "answer_start": 305 116 | }], 117 | "is_impossible": false 118 | }, { 119 | "plausible_answers": [{ 120 | "text": "the number of gates in a circuit", 121 | "answer_start": 436 122 | }], 123 | "question": "What unit is measured to determine circuit simplicity?", 124 | "id": "5ad532575b96ef001a10ab7f", 125 | "answers": [], 126 | "is_impossible": true 127 | }, { 128 | "plausible_answers": [{ 129 | "text": "the number of processors", 130 | "answer_start": 502 131 | }], 132 | "question": "What number is used in perpendicular computing?", 133 | "id": "5ad532575b96ef001a10ab80", 134 | "answers": [], 135 | "is_impossible": true 136 | }], 137 | "context": "A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do." 138 | }] 139 | }] 140 | } -------------------------------------------------------------------------------- /labs-exercises/multiclass_perceptron.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/labs-exercises/multiclass_perceptron.png -------------------------------------------------------------------------------- /labs-exercises/neural-encoding-fever.md: -------------------------------------------------------------------------------- 1 | # Lab - Neural Encoding for Text Classification 2 | 3 | ## Introduction 4 | 5 | This lab will introduce continuous representations for NLP. 6 | We will work on the task of Natural Language Inference (also known as Textual Entailment) in the context of the Fact Extraction and Verification dataset introduced by [Thorne et al. (2018)](https://arxiv.org/abs/1803.05355). 7 | We will focus on the subtask of deciding whether a claim is supported or refuted given a set of evidence sentences. 8 | The dataset also contains claims for which no appropriate evidence was found in Wikipedia; we will ignore these in this lab. 9 | 10 | To simplify the task, we have prepared a _Lite_ version of the dataset that has the Wikipedia evidence bundled together with each dataset instance. 
The full task requires searching for this evidence.
11 | 
12 | ## Requirements
13 | 
14 | - Use a subset of the FEVER data (provided in `data/fever/`) to predict whether textual _claims_ are SUPPORTED or REFUTED from _evidence_ from Wikipedia. More details about how it was prepared can be found [here](https://github.com/j6mes/feverlite/releases).
15 | - Use the [AllenNLP](https://allennlp.org/) framework for implementing your neural models. **(If you installed the required Python packages for the summer school, AllenNLP should be installed for you.)**
16 | - It is highly recommended to use an IDE, such as PyCharm, for working on this project.
17 | 
18 | 
19 | ## AllenNLP Primer
20 | There are four key parts you'll interact with when developing with the AllenNLP framework:
21 | 
22 | * Dataset Reader
23 | * Model
24 | * Configuration File
25 | * Command Line Interface / Python Module
26 | 
27 | ### Dataset Reader and Sample Data
28 | Each labeled dataset instance consists of a `claim` sentence accompanied by one or more `evidence` sentences.
29 | 
30 | ```
31 | {
32 |     'label': 'SUPPORTS',
33 |     'claim': 'Ryan Gosling has been to a country in Africa.',
34 |     'evidence': [
35 |         'He is a supporter of PETA , Invisible Children and the Enough Project and has traveled to Chad , Uganda and eastern Congo to raise awareness about conflicts in the regions .',
36 |         "Chad -LRB- -LSB- tʃæd -RSB- تشاد ; Tchad -LSB- tʃa -LRB- d -RRB- -RSB- -RRB- , officially the Republic of Chad -LRB- ; `` Republic of the Chad '' -RRB- , is a landlocked country in Central Africa ."
37 |     ]
38 | }
39 | ```
40 | 
41 | We provide code to read through the dataset files in `athnlp/readers/fever_reader.py`.
42 | The dataset reader we provide does all the necessary preprocessing before we pass the data to the model. For example, in our implementation, we tokenize the sentences.
43 | 
44 | This returns an `Instance` that consists of a `claim` and `evidence` for the model.
45 | Notice that the instance contains a `TextField` for the tokenized sentences and a `LabelField` for the label.
46 | The framework will construct a vocabulary using the words in the TextField for you.
47 | If you want to add hand-crafted features, this might be a good place to add them (you could add an array of features in an `ArrayField`).
48 | 
49 | Also notice that above the `FEVERLiteDatasetReader` there is a decorator `@DatasetReader.register("feverlite")`. This associates the type `feverlite` with the class -- it will come in handy later, when we write the configuration file for our model!
50 | 
51 | 
52 | ### Model
53 | Just like we registered our dataset reader, we can also register a `Model`. In the file `athnlp/models/fever_text_classification.py`, we have built a skeleton model that you can adapt for the exercises. We have registered it with the name `fever` by using the decorator `@Model.register("fever")` above the class name. If you plan on adding more models, you should think of a unique name for each.
54 | 
55 | The model has a `forward(...)` method: this is the main method for prediction, just like we'd expect to find in other models written in `PyTorch`.
56 | Notice how in our model, the argument names match up with the values returned by the dataset reader: AllenNLP will match these up for you during training and model evaluation.
57 | While the variable names are the same, the data types are different.
AllenNLP will convert a `TextField` into a LongTensor - each element in this tensor corresponds to the index of the token in the vocabulary.
58 | AllenNLP will automatically generate batches for you: this means that all variables here are batch-first tensors.
59 | 
60 | The model returns quite a bit of information to the trainer that is calling it.
61 | It is quite common to see the following code in a lot of AllenNLP models.
62 | The loss is computed by the model (if a `label` is passed in) and this is what is used for error backpropagation.
63 | If we need to compute any metrics, such as accuracy or F1 score, this would be the place to do it.
64 | ```
65 | label_probs = F.softmax(label_logits, dim=-1)
66 | output_dict = {"label_logits": label_logits,
67 |                "label_probs": label_probs}
68 | 
69 | if label is not None:
70 |     loss = self._loss(label_logits, label.long().view(-1))
71 |     self._accuracy(label_logits, label)
72 |     output_dict["loss"] = loss
73 | 
74 | return output_dict
75 | ```
76 | 
77 | The core of the model performs a sequence of operations on the input data, returning the label logits and loss.
78 | It is possible to mix torch and AllenNLP operations.
79 | 
80 | Operations that might be helpful for the exercises are:
81 | 
82 | * Embedding Lookup (TextFieldEmbedder)
83 | * Feed-forward Neural Networks (FeedForward)
84 | * Summing tensors (torch.sum())
85 | * Concatenating tensors (torch.cat())
86 | 
87 | 
88 | ### Configuration
89 | Parameters for the model are stored in a JSON file. For this example, you can adapt `athnlp/experiments/fever.json`.
90 | In this configuration file, there are separate configurations for the `datasetreader`, `model` and `trainer`.
91 | Notice how the `type` of the datasetreader and model match the values we specified earlier.
92 | 
93 | The values in this configuration are passed to the constructor of our model and dataset reader and also match their parameters:
94 | 
95 | ```json
96 | "model": {
97 |     "type": "fever",
98 |     "text_field_embedder": {
99 |         ...
100 |     },
101 |     "final_feedforward": {
102 |         ...
103 |     },
104 |     "initializer": [
105 |         ...
106 |     ]
107 | }
108 | ```
109 | And in the Python code for the model, the `__init__` method takes these parameters. Note that `vocab` is auto-filled by another part of AllenNLP.
110 | You can find examples of configs from real-world models on [GitHub](https://github.com/allenai/allennlp/tree/master/training_config).
111 | ```python
112 | @Model.register("fever")
113 | class FEVERTextClassificationModel(Model):
114 |     def __init__(self,
115 |                  vocab: Vocabulary,
116 |                  text_field_embedder: TextFieldEmbedder,
117 |                  final_feedforward: FeedForward,
118 |                  initializer: InitializerApplicator = InitializerApplicator()):
119 | ```
120 | 
121 | ### Running the model
122 | AllenNLP will install itself as a bash script that you can call when you want to train/evaluate your model using the config specified in your json file. Using the `--include-package` option will load the custom models, dataset readers and other Python modules in that package.
123 | 
124 | ```bash
125 | allennlp train --force --include-package athnlp --serialization-dir mymodel myconfig.json
126 | ```
127 | 
128 | - `train` tells AllenNLP to train the model; there are other subcommands for fine-tuning, evaluation, prediction, etc.
129 | - `--serialization-dir` defines the location where the model will be saved.
130 | - `--force` will overwrite any existing model saved in the serialization-dir. 
You could use `--recover` if you wish to continue training a model from a checkpoint.
131 | - `--include-package` will import the specified Python package.
132 | 
133 | This is an alias that just runs Python with the following command: `python -m allennlp.run [args]`.
134 | If you are using an IDE, you can debug AllenNLP models by running the Python module `allennlp.run`. **Note: this runs a module, not a script - in the run configuration dropdown, select "Module name", NOT "Script path".**
135 | 
136 | ![](/data/run_fever.png)
137 | 
138 | If you are using `pdb`, you will have to write a simple 2-line wrapper script:
139 | ```
140 | from allennlp.commands import main
141 | main(prog="allennlp")
142 | ```
143 | 
144 | ### Debugging
145 | 
146 | #### ConfigurationError
147 | 
148 | If you encounter this error:
149 | ```
150 | allennlp.common.checks.ConfigurationError: "feverlite not in acceptable choices for dataset_reader.type: ['ccgbank', 'conll2003', 'conll2000', 'ontonotes_ner', 'coref', 'winobias', 'event2mind', 'interleaving', 'language_modeling', 'multiprocess', 'ptb_trees', 'drop', 'squad', 'quac', 'triviaqa', 'qangaroo', 'srl', 'semantic_dependencies', 'seq2seq', 'sequence_tagging', 'snli', 'universal_dependencies', 'universal_dependencies_multilang', 'sst_tokens', 'quora_paraphrase', 'atis', 'nlvr', 'wikitables', 'template_text2sql', 'grammar_based_text2sql', 'quarel', 'simple_language_modeling', 'babi', 'copynet_seq2seq', 'text_classification_json']"
151 | ```
152 | Check that `--include-package athnlp` is included in the arguments when calling AllenNLP.
153 | 
154 | 
155 | #### ModuleNotFoundError
156 | 
157 | If you encounter this error:
158 | ```
159 | ModuleNotFoundError: No module named 'athnlp'
160 | ```
161 | Check that the athnlp folder is in the `PYTHONPATH`. Is your current working directory the `athnlp-labs` folder?
162 | 
163 | #### FileNotFoundError
164 | 
165 | ```
166 | FileNotFoundError: file resources/glove.6B.50d.txt.gz not found
167 | ```
168 | 
169 | Run `setup_dependencies.sh` to download the file (it will then call `wget https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.50d.txt.gz -P resources/;`).
170 | 
171 | 
172 | #### NotImplementedError
173 | This is where you should add your solution to the exercise! Go and edit `athnlp/models/fever_text_classification.py` and delete this line.
174 | ```
175 | NotImplementedError: Compute label logits (for supported and refuted) for the given Claim and Evidence input
176 | ```
177 | 
178 | 
179 | #### It is running all my scripts!
180 | If you have put your code in the `athnlp` package, the `--include-package` flag will try to find and import it. You should change the code from previous labs and wrap it in the if statement `if __name__ == "__main__":` so that it only runs when it is your main Python script.
181 | 
182 | ### Self-Help
183 | There are a large number of (more complex) models already available for AllenNLP: check out the [models package](https://github.com/allenai/allennlp/tree/master/allennlp/models) on GitHub for inspiration if you are stuck.
184 | 
185 | If you are getting errors about size mismatch (`RuntimeError: size mismatch, m1: [32 x 100], m2: [256 x 100]`), check that the dimensions of your MLP are compatible. This error is raised when PyTorch tries to multiply incompatible matrices. Check that the input dimension for the MLP in the config file is the same size as the input representation you generate. 
Use the debugger to inspect this, or print the shape of your variables with `print(my_variable.shape)`.
186 | 
187 | 
188 | ### Using GPU
189 | If your laptop has a CUDA-enabled GPU and you have the appropriate drivers installed, you can speed up training by setting `"cuda_device": 0` in your configuration file.
190 | 
191 | ## Exercises
192 | For the exercises, we have provided a dataset reader (`athnlp/readers/fever_reader.py`), a configuration file (`athnlp/experiments/fever.json`), and a sample model (`athnlp/models/fever_text_classification.py`). You can complete these exercises by filling in the code in the sample model.
193 | 
194 | ### 1. Average Word Embedding Model
195 | 1. Implement a model that
196 | - represents the claim and the evidence by averaging their word embeddings;
197 | - concatenates the two representations;
198 | - uses a multilayer perceptron to decide the label (a minimal sketch of such a forward pass is given at the end of this document).
199 | 
200 | 2. Experiment with the number and the size of hidden layers to find the best settings using the train/dev set and assess your accuracy on the test set. (Note: this model may not get high accuracy.)
201 | 
202 | 3. Explore: How does fine-tuning the word embeddings affect performance? You can make the word embeddings layer trainable by changing the `text_field_embedder` section in the `fever.json` config file.
203 | 
204 | ### 2. Discrete Feature Baseline
205 | Start by making a new config file and a new model file based on `fever.json` and `fever_text_classification.py`. Don't forget to register the new model under a unique name.
206 | 
207 | 
208 | 1. Compare against a discrete feature baseline. Instead of embedding the claim and evidence, we build an n-hot bag-of-words vector. (Hint: edit the type of the `text_field_embedder` to be `bag_of_word_counts` - you will have to make changes to your model too!)
209 | 
210 | 2. How does limiting the vocabulary size affect the model accuracy? (Hint: adding `"vocabulary": {"max_vocab_size": 10000}` to the main section of the config file will limit the vocab size to 10000 tokens.)
211 | 
212 | 
213 | ### 3. Convolution
214 | Averaging word embeddings is an example of a CBOW model. An alternative way to combine the representations is to use CNNs (see slides 110/111 in Ryan McDonald's talk: [SLIDES](https://github.com/athnlp/athnlp-labs/blob/master/slides/McDonald_classification.pdf)).
215 | 
216 | 1. Use a `CnnEncoder()` ([documentation](https://allenai.github.io/allennlp-docs/api/allennlp.modules.seq2vec_encoders.html#allennlp.modules.seq2vec_encoders.cnn_encoder.CnnEncoder)) to generate convolutional sentence representations. (Debugging hint: this encoder expects the input to be padded; you may get errors if the filter size is longer than the sentence. You will need to set `"token_min_padding_length": 5` or higher in the `tokens` object in `token_indexers` for large filter sizes.) Filter sizes between 2 and 5 should be sufficient. More filters will cause training to be slower (perhaps just train for 1 or 2 epochs).
217 | 
218 | ### 4. Hypothesis-Only NLI and Biases
219 | 1. Implement a _[hypothesis only](https://www.aclweb.org/anthology/S18-2023)_ version of the model that ignores the evidence and only uses the claim for predicting the label. What accuracy does this model get? Why do you think this is? Think back to slide 7 of Ryan's talk.
220 | 
221 | 2. Take a look at the training/dev data. Can you design claims that would "fool" your models? You can see this report ([Thorne and Vlachos, 2019](https://arxiv.org/abs/1903.05543)) for inspiration.
222 | What do you conclude about the ability of your model to understand language?
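223 | 
224 | To make Exercise 1 concrete, below is a minimal, hypothetical sketch of what the model's `forward(...)` method could look like. It assumes the `text_field_embedder` and `final_feedforward` modules from the constructor shown earlier (stored as `self._text_field_embedder` and `self._final_feedforward`); the names and details are illustrative, a starting point rather than the reference solution.
225 | 
226 | ```python
227 | import torch
228 | from allennlp.nn import util
229 | 
230 | def forward(self, claim, evidence, label=None):
231 |     # Embed the token ids: (batch, num_tokens, embedding_dim)
232 |     claim_emb = self._text_field_embedder(claim)
233 |     evidence_emb = self._text_field_embedder(evidence)
234 |     # Masks mark real tokens (1) vs. padding (0)
235 |     claim_mask = util.get_text_field_mask(claim).float()
236 |     evidence_mask = util.get_text_field_mask(evidence).float()
237 |     # Average the word embeddings, ignoring padded positions
238 |     claim_avg = (claim_emb * claim_mask.unsqueeze(-1)).sum(1) / claim_mask.sum(1, keepdim=True).clamp(min=1)
239 |     evidence_avg = (evidence_emb * evidence_mask.unsqueeze(-1)).sum(1) / evidence_mask.sum(1, keepdim=True).clamp(min=1)
240 |     # Concatenate the two sentence representations and classify with the MLP
241 |     label_logits = self._final_feedforward(torch.cat([claim_avg, evidence_avg], dim=-1))
242 |     # ... then compute label_probs, the loss and the accuracy exactly as in the snippet above
243 |     return {"label_logits": label_logits}
244 | ```
245 | 
246 | Note that the MLP's input dimension in the config file must then be twice the embedding size; if it is not, you will hit exactly the size-mismatch error discussed in the Self-Help section.
247 | 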
-------------------------------------------------------------------------------- /labs-exercises/neural-language-model.md: --------------------------------------------------------------------------------
# Lab - Neural Language Modeling


## Introduction

In this lab we will create a Language Model using Recurrent Neural Networks with PyTorch.

## Requirements
We will train our model on the following toy dataset:

```
The thief stole .
The thief stole the suitcase .
The crook stole the suitcase .
The cop took a bribe .
The thief was arrested by the detective .
```

## Exercises


#### 1. Language Modeller

Implement an LSTM-based RNN language model that takes each word of a sentence as input and
predicts the next one (the original RNNLM demo paper can be found
[here](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)).
In particular, the input to the RNN is the previous word and the previous hidden state, and the output is the next
predicted word. (A minimal model sketch is given at the end of this lab.)

**Note**: Consider each sentence as a separate example, where each sentence is represented as a list of tokens.

Things to try out:
- Run a sanity check: make sure your model can learn to predict your training data correctly. After training your
model, take the sentence
```
The thief stole the suitcase .
```
and check that for every word and context (i.e., last hidden state of the RNN) you get the right answer. Does it work?
For example, given the context ``The`` the model should be predicting ``thief``.
Why is this happening instead of predicting ``crook``?

**Note**: You might need to play with the hyper-parameters, such as the learning rate, the number of epochs, etc.

#### 2. Sentence Completion

Given a sentence with a gap
```
The ______ was arrested by the detective .
```
implement a decoder that returns the most likely word to fill it in.
In more detail, you can develop a k-best ranker that scores the top-k derivations that a) all start with the prefix
``The``, b) each contain one of the top-k candidate words from the vocabulary, and c) continue with the remaining words of the given
sentence.

Things to try out:
- Which is more likely to fill in the gap: ``cop`` or ``crook``?
Get the model to predict this correctly by changing the hyper-parameters.
- Ensure that the model is predicting correctly for the right reason,
i.e., that the embeddings for ``thief`` and ``crook`` are closer to each other than the embeddings
for ``thief`` and ``cop``. Why is that?

**Hint**: Use cosine similarity to compute the distance of two embedding vectors.
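Here is a minimal sketch of such a model, assuming the toy vocabulary above; the class and variable names are illustrative and this is not the provided `rnn_language_model.py`:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal LSTM language model: embed the previous word, update the
    hidden state, and score every vocabulary item as the next word."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of word ids
        hidden_states, _ = self.lstm(self.embed(tokens))
        return self.out(hidden_states)  # (batch, seq_len, vocab_size)

# Toy usage: ids for "The thief stole the suitcase ." under a hypothetical vocabulary.
vocab = ["The", "thief", "crook", "cop", "stole", "took", "was", "arrested",
         "by", "the", "detective", "suitcase", "a", "bribe", "."]
word2id = {w: i for i, w in enumerate(vocab)}
sentence = torch.tensor([[word2id[w] for w in
                          "The thief stole the suitcase .".split()]])

model = RNNLanguageModel(len(vocab))
logits = model(sentence[:, :-1])  # predict word t+1 from the prefix up to t
loss = nn.CrossEntropyLoss()(logits.reshape(-1, len(vocab)),
                             sentence[:, 1:].reshape(-1))

# For Exercise 2: cosine similarity between two learned word embeddings.
sim = nn.functional.cosine_similarity(
    model.embed.weight[word2id["thief"]],
    model.embed.weight[word2id["crook"]], dim=0)
```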
-------------------------------------------------------------------------------- /labs-exercises/neural-machine-translation.md: --------------------------------------------------------------------------------
# Lab - Neural Machine Translation


## Introduction

In this lab we will familiarise ourselves with the popular sequence-to-sequence (seq2seq) architecture for Neural Machine
Translation and will implement the attention mechanism.

## Requirements
We will train our models on the Multi30k dataset (well, just a small part of it, as we will be
running things on your laptop; you are more than welcome to try out the full dataset on a GPU-enabled machine too!).

1. Clone the dataset from [here](https://github.com/multi30k/dataset).
2. Extract the first 1000 examples of the already tokenized version of the validation set:
``data/task1/tok/val.lc.norm.tok.*``.
3. Create a 75%/25% train/val split.
4. We will focus only on the ``en-fr`` pairs.

## Exercises

We provide an implementation of a basic sequence-to-sequence (seq2seq) architecture with beam search,
adapted from the original AllenNLP toolkit, that you will have to extend: ``athnlp/models/nmt_seq2seq.py``.
There are placeholders in the code that are left empty for you to fill in. We are also giving you
a dataset reader for Multi30k: ``athnlp/readers/multi30k_reader.py``.

**Note**: We recommend that you train and predict with the built-in commands using ``allennlp train/predict``. If you
need to debug your code, you can programmatically execute the training process from ``athnlp/nmt.py``.
We will be reporting performance using [BLEU](https://www.aclweb.org/anthology/P02-1040).

#### 1. Playing around

Have a good look at the provided code and make sure you understand how it works.

Things to try out:

- Overfit a (very) small portion of the training set. What hyperparameters do you need to use?
- Train a model on the bigger dataset for a few epochs and compute the BLEU score for the baseline model.
**Note**: You are most likely not going to get state-of-the-art performance. Why?
- Switch the RNN cell from an LSTM to a GRU.
- Use pre-trained embeddings like [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) vectors. Does it help?
Is that always applicable in MT?
- Consider switching the metric used as the early-stopping criterion (currently it is the validation loss).
- Try using beam search instead of greedy decoding. Does it help?



#### 2. Attention Mechanism

Implement at least one attention mechanism ([dot product](https://arxiv.org/abs/1508.04025),
[bilinear](https://arxiv.org/abs/1508.04025), [MLP](https://arxiv.org/abs/1409.0473))
in the methods ``_prepare_output_projections()`` and ``_compute_attention()``. (A minimal dot-product sketch is given after the list below.)

**Important**: to keep things uniform, assume that the attended encoder outputs
(aka the *context vector*) get concatenated with the previous predicted word embedding *before* being
fed as input to the decoder RNN.

Things to try out:

- Convince yourself that attention helps boost the performance of your model by computing
BLEU on the dev set (if it does not, you most probably have a bug!).
- Predict the output for some examples using the default ``seq2seq`` predictor from AllenNLP.
You can find a small set of examples here: ``data/multi30k/val.lc.norm.tok.head-5.fr.jsonl``.
How does the output compare to the model without attention?
- Visualise the attention scores as a heatmap, e.g., with ``matplotlib``'s ``imshow``. We have created a custom predictor
in ``athnlp/predictors/nmt_seq2seq.py`` for you that already prints out heatmaps; you will just need to extract the attention
scores from your model in ``forward_loop()``. You can execute it via ``athnlp/nmt.py``.
Then visualize the attention scores between the input and predicted output for the examples
found in ``data/multi30k/val.lc.norm.tok.head-5.fr.jsonl``. What do you observe?

- (Bonus) Instead of concatenating the context vector with the previous predicted word embedding *before*
feeding it as input to the decoder RNN, try concatenating it to the output hidden state of the decoder
(i.e., *after* the RNN). Does that change anything? (**Note**: you might need to train on the original corpus.)
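Here is a minimal PyTorch sketch of the dot-product variant. The shapes and names are illustrative; in the provided model the equivalent computation would live in ``_compute_attention()``:

```python
import torch

def dot_product_attention(decoder_state, encoder_outputs, encoder_mask):
    """One step of dot-product attention (a sketch; names are illustrative).

    decoder_state:   (batch, hidden)           current decoder hidden state
    encoder_outputs: (batch, src_len, hidden)  all encoder hidden states
    encoder_mask:    (batch, src_len)          1 for real source tokens, 0 for padding
    """
    # Unnormalised scores: one dot product per source position.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    # Mask out padding positions before the softmax.
    scores = scores.masked_fill(encoder_mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                   # (batch, src_len)
    # Context vector: attention-weighted average of the encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights  # keep the weights for the heatmap visualisation

# The context vector would then be concatenated with the previous predicted
# word embedding before being fed to the decoder RNN, as described above.
```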
#### 3. (Bonus) Sampling during decoding
Implement a sampling algorithm for your decoder. As an alternative heuristic to beam search during decoding,
the idea is to *sample* from the vocabulary distribution instead of taking the ``argmax`` at each time step (see the sketch after the list below).

Things to try out:

- How does sampling affect the performance (BLEU score) of your model?
- Inspect the output of your model by drawing several samples for a few examples; what do you observe?
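A minimal sketch of the sampling step, assuming you already have the next-word logits from the decoder; the temperature parameter is an optional extra knob, not part of the exercise:

```python
import torch

def sample_next_word(logits, temperature=1.0):
    """Draw the next word id from the softmax distribution instead of
    taking the argmax.

    logits: (batch, vocab_size) unnormalised scores for the next word.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)  # (batch,)

# Greedy decoding would instead be: logits.argmax(dim=-1)
```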
-------------------------------------------------------------------------------- /labs-exercises/pos-tagging-perceptron.md: --------------------------------------------------------------------------------
# Lab - Part-of-Speech tagging with the Perceptron Algorithm


## Introduction

In this lab we will create a Part-of-Speech (PoS) tagger for English using the Perceptron algorithm.
In particular, we will train a PoS tagger on the [Brown corpus](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM) annotated with the [Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086).

Although PoS-tagging is an inherently sequential problem, for the purposes of this lab we will keep things simple: our model will predict a PoS-tag for every word of a sentence *independently*.
More concretely, given a sentence (sequence of words) as input, your model needs to predict a PoS-tag for each of them.

Here is an example:

| **PoS**: | DET | VERB | NOUN | VERB | ADV | ADJ | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|:---:|
| **Words**: | The | scalloped | edge | is | particularly | appealing | . |


## Requirements
You need to download the Brown corpus through NLTK first if you don't have it already.
Just execute the following in a Python CLI:

```python
import nltk
nltk.download('brown')
```

## Exercises


#### 1. Perceptron Algorithm

Implement the standard perceptron algorithm (a sketch of the multiclass update is given at the end of this lab). Use the first 10000/1000/1000 sentences for training/dev/test.
In order to speed up the process for you, we have implemented a simple dataset reader that automatically converts the Brown corpus using the Universal PoS Tagset: `athnlp/readers/brown_pos_corpus.py` (you may use your own implementation if you want; `athnlp/readers/en-brown.map` provides the mapping from the Brown to the Universal Tagset).

**Important**: Recall that the perceptron has to predict multiple classes (PoS tags) instead of binary ones:
![Multiclass Perceptron](multiclass_perceptron.png)

You should represent each example of the corpus (i.e., every word of each sentence) in vector form. In order to keep things simple, let's assume a simple **bag-of-words** representation.
In order to evaluate your model, compute the **accuracy** (i.e., number-of-correctly-labelled-words / total-number-of-labelled-words) on the dev set.
Here are a few things to try out:
- Does it help if you **randomize** the order of the training instances?
- Does it help if you perform **multiple passes** over the training set? What is a reasonable number?
- Instead of using the last weight vector for computing the error, try taking the **average of all
the weight vectors** calculated for each label. Does that help?

#### 2. Feature Engineering

- Implement different feature types beyond bag-of-words. *Hint*: One very common feature type is to
introduce some local context for every word via **n-grams**, usually with n=2,3. Another is to
look at the previous/next **word** (not **tag**; why?). A third option is to look at subword features,
i.e., short character sequences such as suffixes.
- (Bonus) What are the most **positively-weighted** features for each label? Give the
top 10 for each class and comment on whether they make sense (if they
don't, you might have a bug!).
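To make the update rule in Exercise 1 concrete, here is a minimal NumPy sketch of the multiclass perceptron; the variable names and the dense feature representation are illustrative:

```python
import numpy as np

def train_multiclass_perceptron(X, y, num_labels, epochs=5):
    """Sketch of the multiclass perceptron.

    X: (num_examples, num_features) feature vectors (e.g., bag-of-words)
    y: (num_examples,) gold label ids
    """
    W = np.zeros((num_labels, X.shape[1]))   # one weight vector per label
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):  # randomise instance order
            scores = W @ X[i]
            y_hat = int(np.argmax(scores))
            if y_hat != y[i]:                 # mistake-driven update
                W[y[i]] += X[i]               # promote the gold label
                W[y_hat] -= X[i]              # demote the wrong prediction
    return W
```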
-------------------------------------------------------------------------------- /labs-exercises/pos-tagging-structured-perceptron.md: --------------------------------------------------------------------------------
# Lab - Part-of-Speech tagging with the Structured Perceptron Algorithm

In this lab we will create a Part-of-Speech (PoS) tagger for English using the Structured Perceptron algorithm.
In particular, we will train a PoS tagger on the [Brown corpus](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM)
annotated with the [Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086).

[Last time](pos-tagging-perceptron.md) we made each tagging decision independently. In this lab we will make
each decision at the sequence level, i.e., by choosing the PoS tag for each word so that they collectively *maximize*
the score of the sequence of labels for the whole sentence. Why does that matter? Let's have a look at the
following example:

| **PoS**: | | | | | | |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |

If we predict with a trained model using the simple averaged perceptron implementation with unigram features from the
[previous lab](pos-tagging-perceptron.md), we get the following predictions (wrong labels are marked with an asterisk):


| **PoS**: | DET | ***ADJ** | ***NOUN** | DET | NOUN | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |

Why is this the case? **NOUN** is the highest scoring label for the word 'man'. Therefore, when making the prediction
for this word *independently*, the model will make an error (still not convinced this is wrong? Read the sentence carefully!).
(Why is ADJ also the wrong label for the word 'old'?)

The idea of the structured perceptron is that it keeps track of several alternative hypotheses for sequences of labels
(in this case PoS-tags):
the sequence in the example above contains 'locally' high scoring labels (ADJ, NOUN), but has a much lower 'global'
score compared to the (correct) sequence below:

| **PoS**: | DET | NOUN | VERB | DET | NOUN | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |


## Requirements
You need to download the Brown corpus through NLTK first if you don't have it already.
Just execute the following in a Python CLI:

```python
import nltk
nltk.download('brown')
```

## Exercises


#### 1. Structured Perceptron Algorithm

Implement the structured perceptron algorithm. Use the first 1000/100/100 sentences with < 5 words for training/dev/test.
You can re-use the implemented simple dataset reader: `athnlp/readers/brown_pos_corpus.py`.

The algorithm is strikingly similar to the original perceptron algorithm; the two major differences are:
1. You need to find the optimal (a.k.a. *argmax*) path through the input sequence;
2. You need to update the weights for each label where the optimal predicted path and the ground truth don't agree.

Things to try out:
- (Sanity check) What is the accuracy score of the averaged perceptron algorithm using unigrams for this dataset?
- First implement the *argmax* using brute force, i.e., explore all the possible labeled paths.
- Brute force is a really inefficient approach to finding the optimal path. Quite often applying a heuristic
such as beam search (i.e., keeping the top-n scoring partial hypotheses and discarding the rest that don't exceed
a predefined threshold, aka *fall out of the beam*) speeds up the process immensely, usually at the price of a small
loss in accuracy. Of course, this process introduces two extra hyper-parameters: ``beam size``, i.e., how many hypotheses to keep, and
``beam width``, i.e., the threshold below which hypotheses are discarded. (A minimal beam-search sketch is given at the end of this lab.)

You should evaluate your models by computing the **accuracy** (i.e., number-of-correctly-labelled-words / total-number-of-labelled-words) on the dev set.
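As referenced above, here is a minimal sketch of beam search over tag sequences. The scoring function is an assumed placeholder for whatever feature-based score your perceptron computes; for simplicity it only keeps the top-n hypotheses (``beam size``), without the ``beam width`` threshold:

```python
def beam_search_tags(score_word, words, labels, beam_size=3):
    """Sketch of beam search for sequence tagging (names are illustrative).

    score_word(word, label, prev_label) -> float is assumed to return the
    model score of assigning `label` to `word` after `prev_label`.
    """
    beam = [([], 0.0)]  # list of (partial tag sequence, cumulative score)
    for word in words:
        candidates = []
        for tags, score in beam:
            prev = tags[-1] if tags else None
            for label in labels:
                candidates.append((tags + [label],
                                   score + score_word(word, label, prev)))
        # Keep only the top-n scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][0]  # highest-scoring complete tag sequence
```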
-------------------------------------------------------------------------------- /labs-exercises/question-answering.md: --------------------------------------------------------------------------------
# Lab - Question Answering

## Introduction

In this lab we will build a Question Answering model for SQuAD based on BERT using AllenNLP.

## Requirements
Make sure that the script `setup_dependencies.sh` installed every package and downloaded every data file. In particular,
make sure that `pytorch-transformers` is installed and that you have the following files in your `resources` folder:
1. `bert-base-uncased/pytorch_model.bin` (pretrained BERT model)
2. `bert-base-uncased/config.json` (pretrained BERT model parameters)
3. `bert-base-uncased/vocab.txt` (vocabulary for the pretrained model)

We will fine-tune the BERT model on the SQuAD 2.0 dataset. In this tutorial we will use just a small part of it, as we will be
running things on your laptop; you are more than welcome to try out the [full dataset](https://rajpurkar.github.io/SQuAD-explorer/) on a GPU-enabled machine too.

The portion of the data that we will be using in this tutorial is available in the folder `data/squad/`. We will use the file
`train.json` for model training and `test.json` for the evaluation phase.

## Exercises

BERT is a large-scale language model trained using several masking-based loss functions. In this tutorial we will show you
how to fine-tune a BERT model trained on millions of text documents to complete a question answering task like SQuAD. In particular,
we will rely on BERT's encoder to represent both the question and the reference paragraph. Imagine that you have the following
example:

- Paragraph:
```
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
```
- Question: In what country is Normandy located?

We will encode the question and the paragraph following the BERT encoding scheme:

```
[CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries (...)

```

More details about this fine-tuning procedure and the BERT encoding scheme can be found in the original paper by [Devlin et al. 2018](https://arxiv.org/pdf/1810.04805.pdf).
We created an AllenNLP dataset reader that is able to process the SQuAD dataset examples and format them following the
BERT encoding scheme.

**Note**: BERT by default performs word-piece tokenization. See how words like ``normans`` get split into ``norman ##s``,
and ``nourmands`` gets split into more than two wordpieces: ``no ##ur ##man ##ds``.
The dataset reader provided takes care of that.
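If you want to reproduce this encoding yourself, a minimal sketch with `pytorch-transformers` looks roughly like the following; the truncated paragraph string is illustrative, and the local vocab path assumes you ran `setup_dependencies.sh`:

```python
from pytorch_transformers import BertTokenizer

# Load the vocabulary downloaded by setup_dependencies.sh.
tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/vocab.txt")

question = "In what country is Normandy located?"
paragraph = ("The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
             "were the people who in the 10th and 11th centuries ...")

# BERT input: [CLS] question wordpieces [SEP] paragraph wordpieces [SEP]
tokens = (["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
          + tokenizer.tokenize(paragraph) + ["[SEP]"])
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens[:9])  # ['[CLS]', 'in', 'what', 'country', 'is', 'normandy', ...]
```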
#### 1. Span prediction for Question Answering

In this exercise you will be creating a model that is able to predict the boundary of the answer span given
an encoded representation generated by BERT. The task consists of _merely_ predicting two integer values:
- `start_position`: start position of the answer in the reference document
- `end_position`: end position of the answer in the reference document

To complete the exercise we provide you with a basic template of the QA model in the file `athnlp/models/qa_bert.py`. Every method should be implemented to complete the exercise. The model definition contains the following main methods:

- `__init__`: the constructor of the main class, used to initialise all the model parameters. You are supposed to initialise the
layer used to predict the span here.
- `forward`: the forward pass of the model. We want to encode the input representation using BERT and then use a linear layer to predict
the start and the end of the answer.
- `decode`: given the model predictions, convert them to tokens for visualisation and evaluation purposes.
- `get_metrics`: evaluates the metrics `start position accuracy`, `end position accuracy` and `span position accuracy`.

In this tutorial, we will be using the BERT API provided by `pytorch-transformers`. In particular, we are interested
in using the class [BertModel](https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertmodel). With the configuration file
that we created (see `resources/bert-base-uncased/config.json` for details), the BERT model will generate the following outputs:

- **last_hidden_state**: sequence of hidden states at the output of the last layer of the model. ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
- **pooler_output**: last-layer hidden state of the first token of the sequence (the classification token),
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained on the next-sentence prediction (classification)
objective during BERT pretraining. ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
- **attentions**: attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
List of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``

Given the output of the BERT model, it is possible to use two different strategies to predict the answer span:
1. Learn a Linear layer that predicts the answer span given the hidden states contained in `last_hidden_state`;
2. Learn a Linear layer that predicts the answer span given the `pooler_output`.

You are free to experiment with both strategies, but we recommend the first one. The reason behind our preference is that
the BERT `pooler_output` is usually *not* a good summary of the semantic content of the input, because it is used
during the original BERT training phase for a different task.

For the sake of consistency we have already provided a specific signature for the forward pass that your model implementation
should follow. We define the required inputs and outputs as follows:

#### Parameters
- tokens : Dict[str, torch.LongTensor]
    From a ``TextField`` (that has a bert-pretrained token indexer)
- span_start : torch.IntTensor, optional (default = None)
    A tensor of shape (batch_size, 1) which contains the start_position of the answer
    in the passage, or 0 if impossible. This is an `inclusive` token index.
    If this is given, we will compute a loss that gets included in the output dictionary.
- span_end : torch.IntTensor, optional (default = None)
    A tensor of shape (batch_size, 1) which contains the end_position of the answer
    in the passage, or 0 if impossible. This is an `inclusive` token index.
    If this is given, we will compute a loss that gets included in the output dictionary.

#### Returns

An output dictionary consisting of:
- logits : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    unnormalized log probabilities of the label.
- start_probs : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    probabilities of the label, obtained by applying a softmax to the predicted logits.
- end_probs : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    probabilities of the label, obtained by applying a softmax to the predicted logits.
- best_span : torch.LongTensor
    A tensor of shape ``(batch_size, 2)`` representing the predicted start and end position of the answer
    for each element of the batch. We suggest using the function [get_best_span](https://allenai.github.io/allennlp-docs/api/allennlp.models.reading_comprehension.html?highlight=get_best_span#allennlp.models.reading_comprehension.util.get_best_span) already
    implemented in AllenNLP.
- loss : torch.FloatTensor, optional
    The loss function to be optimised: the average of the losses computed for the start-position predictions
    and for the end-position predictions.
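To make the expected shapes concrete, here is a minimal, self-contained sketch of strategy 1, read as a per-token linear head over `last_hidden_state` (consistent with the ``(batch_size, num_tokens)`` shapes above). It is not the exact structure of the provided template, the variable names are illustrative, and the computation of `best_span` via AllenNLP's `get_best_span` is omitted:

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertModel

bert = BertModel.from_pretrained("resources/bert-base-uncased")
span_head = nn.Linear(bert.config.hidden_size, 2)  # start and end logit per token

def forward_sketch(input_ids, span_start=None, span_end=None):
    last_hidden_state = bert(input_ids)[0]   # (batch, num_tokens, hidden)
    logits = span_head(last_hidden_state)    # (batch, num_tokens, 2)
    start_logits, end_logits = logits.split(1, dim=-1)
    start_logits = start_logits.squeeze(-1)  # (batch, num_tokens)
    end_logits = end_logits.squeeze(-1)      # (batch, num_tokens)
    output = {
        "start_probs": torch.softmax(start_logits, dim=-1),
        "end_probs": torch.softmax(end_logits, dim=-1),
    }
    if span_start is not None and span_end is not None:
        loss_fn = nn.CrossEntropyLoss()
        # Average the start- and end-position losses, as specified above.
        output["loss"] = (loss_fn(start_logits, span_start.squeeze(-1))
                          + loss_fn(end_logits, span_end.squeeze(-1))) / 2
    return output
```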
**Note**: We recommend that you train and predict with the built-in commands using ``allennlp train/predict``. If you
need to debug your code, you can programmatically execute the training process from ``athnlp/qa.py``.
We will be reporting performance using the official SQuAD evaluation metrics (please see [Rajpurkar and Jia et al. 2018](http://arxiv.org/abs/1806.03822) for details).


#### 2. Attention Mechanism

BERT incorporates a stack of multi-head attention layers (12 layers) which are used to learn a contextualised representation
of every token in the input utterance. In this second exercise we want to add an additional output to our model
that represents the attention values for every layer of the BERT model. The default implementation of [BertModel](https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertmodel) does not return the attention scores generated by BERT. In order to have access to the attention scores, you need to add the following key-value pair to the BERT configuration file (`resources/bert-base-uncased/config.json`): `"output_attentions":true`.

In your AllenNLP model implementation you will add a new key to the output dictionary:

- question_passage_attentions: list of ``torch.FloatTensor`` of shape ``(batch_size, num_heads, sequence_length, sequence_length)``

Use some of the test examples to visualise the model attentions. You might want to visualise the attention values
for just the last layer of BERT, or you can create a grid containing the attention scores for
all 12 BERT layers. In order to visualise the attention values, you can reuse the code provided with the
Neural Machine Translation predictor in Lab 5 and adapt it for BERT (a minimal visualisation sketch is given below).
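A minimal sketch of such a heatmap with `matplotlib`; the names are illustrative, `attentions` is the list of per-layer tensors described above, and `tokens` is the list of wordpiece strings for the example:

```python
import matplotlib.pyplot as plt

def plot_attention(attentions, tokens, layer=-1, head=0, example=0):
    """Plot one attention head of one layer as a heatmap."""
    weights = attentions[layer][example, head].detach().numpy()
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(weights)                 # (sequence_length, sequence_length)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)
    plt.show()
```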
-------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
nltk
allennlp
numpy
ipykernel
pytorch-transformers==1.1.0
-------------------------------------------------------------------------------- /setup_dependencies.sh: --------------------------------------------------------------------------------
#!/usr/bin/env bash

conda activate athnlp;

pip install -r requirements.txt;

python -m nltk.downloader brown;

mkdir resources;

# We download in advance all the models/data that are required by AllenNLP and BERT
wget -c https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.50d.txt.gz -P resources/;

mkdir resources/bert-base-uncased;

wget -c https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O resources/bert-base-uncased/vocab.txt;

wget -c https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin -O resources/bert-base-uncased/pytorch_model.bin;

wget -c "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json" -O resources/bert-base-uncased/config.json;
-------------------------------------------------------------------------------- /setup_dependencies_Docker.sh: --------------------------------------------------------------------------------
#!/usr/bin/env bash
pip install -r requirements.txt;

python -m nltk.downloader brown;

mkdir resources;
-------------------------------------------------------------------------------- /slides/AthensNLP-MT-23Sept2019-ABisazza.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/AthensNLP-MT-23Sept2019-ABisazza.pdf -------------------------------------------------------------------------------- /slides/Carreras_morning_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/Carreras_morning_2.pdf -------------------------------------------------------------------------------- /slides/DialogueSystem_VivianChen.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/DialogueSystem_VivianChen.pdf -------------------------------------------------------------------------------- /slides/MORNING_LECTURE_SLIDES_HERE: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/MORNING_LECTURE_SLIDES_HERE -------------------------------------------------------------------------------- /slides/McDonald_classification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/McDonald_classification.pdf -------------------------------------------------------------------------------- /slides/Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf -------------------------------------------------------------------------------- /slides/athNLP-Lec3-BPlank.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/athNLP-Lec3-BPlank.pdf --------------------------------------------------------------------------------