├── .gitignore ├── Dockerfile ├── README.md ├── athnlp ├── __init__.py ├── create_squad_data.py ├── experiments │ ├── fever.json │ ├── nmt_multi30k.jsonnet │ └── qa_bert.jsonnet ├── models │ ├── __init__.py │ ├── fever_text_classification.py │ ├── nmt_seq2seq.py │ ├── qa_bert.py │ └── rnn_language_model.py ├── nlm.py ├── nmt.py ├── qa.py └── readers │ ├── __init__.py │ ├── bert_squad_reader.py │ ├── brown_pos_corpus.py │ ├── en-brown.map │ ├── fever_predictor.py │ ├── fever_reader.py │ ├── label_dictionary.py │ ├── lm_corpus.py │ ├── multi30k_reader.py │ ├── sequence.py │ ├── sequence_dictionary.py │ └── token_indexers │ ├── __init__.py │ └── bert_squad_indexer.py ├── data ├── fever │ ├── test.jsonl │ ├── train.jsonl │ └── validation.jsonl ├── lm │ ├── test.txt │ ├── train.txt │ └── valid.txt ├── multi30k │ ├── val.lc.norm.tok.head-250.en │ ├── val.lc.norm.tok.head-250.fr │ ├── val.lc.norm.tok.head-5.en │ ├── val.lc.norm.tok.head-5.en.jsonl │ ├── val.lc.norm.tok.head-5.fr │ ├── val.lc.norm.tok.head-750.en │ └── val.lc.norm.tok.head-750.fr ├── run_fever.png └── squad │ ├── dev-v2.0-small.json │ ├── test.json │ └── train.json ├── labs-exercises ├── AV_struct_perceptron.html ├── multiclass_perceptron.png ├── neural-encoding-fever.md ├── neural-language-model.md ├── neural-machine-translation.md ├── pos-tagging-perceptron.md ├── pos-tagging-structured-perceptron.md └── question-answering.md ├── requirements.txt ├── setup_dependencies.sh ├── setup_dependencies_Docker.sh └── slides ├── AthensNLP-MT-23Sept2019-ABisazza.pdf ├── Carreras_morning_2.pdf ├── DialogueSystem_VivianChen.pdf ├── MORNING_LECTURE_SLIDES_HERE ├── McDonald_classification.pdf ├── Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf └── athNLP-Lec3-BPlank.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *.swo 3 | .idea/ 4 | __pycache__/ 5 | .DS_Store 6 | 7 | models/* 8 | #external_resources/* 9 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6.9 2 | MAINTAINER Andreas Vlachos 3 | 4 | RUN apt-get update -y 5 | RUN apt-get install -y git 6 | 7 | RUN git clone https://github.com/athnlp/athnlp-labs.git 8 | WORKDIR /athnlp-labs 9 | 10 | RUN sh setup_dependencies_Docker.sh 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ΑθNLP 2019 2 | 3 | 4 | ## *Important*: This repository is now archived! 5 | 6 | Exercises for the lab sessions of ΑθNLP 2019. 7 | The labs will cover the following: 8 | 9 | 1. [Part-of-Speech Tagging with the Perceptron algorithm](labs-exercises/pos-tagging-perceptron.md) 10 | 2. [Part-of-Speech Tagging with the Structured Perceptron algorithm](labs-exercises/pos-tagging-structured-perceptron.md) 11 | 3. [Neural Encoding for Text Classification](labs-exercises/neural-encoding-fever.md) 12 | 4. [Neural Language Modeling](labs-exercises/neural-language-model.md) 13 | 5. [Neural Machine Translation](labs-exercises/neural-machine-translation.md) 14 | 6. [Question Answering](labs-exercises/question-answering.md) 15 | 16 | ## Setup 17 | 18 | You will need to have Python 3 installed on your machine; we recommend using [Anaconda](https://www.anaconda.com/), 19 | which is available for the most common OS distributions. 
20 | 21 | For the first two labs we will be using vanilla Python (along with the standard scientific libraries, e.g., NumPy, SciPy, 22 | etc.), while for the rest we will additionally be using [PyTorch](https://pytorch.org/) and 23 | [AllenNLP](https://allennlp.org/). 24 | 25 | Use the Anaconda command-line tools to create a new virtual environment with Python 3.6: 26 | ``` 27 | conda create --name athnlp python=3.6 28 | ``` 29 | After the installation is complete, you should have a new virtual environment called `athnlp` in your Anaconda installation 30 | that you can *activate* using the following command: `conda activate athnlp`. Remember to execute this command before 31 | running the scripts in this repository. 32 | 33 | Next, clone the repository to your computer: 34 | ``` 35 | git clone https://github.com/athnlp/athnlp-labs 36 | ``` 37 | 38 | Finally, install all required dependencies. 39 | We provide a script that will help you set up your environment. Run the command `sh setup_dependencies.sh` and 40 | it will automatically install the project dependencies for you. The script downloads several data dependencies, which might 41 | take some time to install. 42 | 43 | 44 | **Note**: Installing AllenNLP on Mac OS can be tricky; check [here](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave) 45 | for a possible solution. 46 | 47 | ## Docker 48 | 49 | If you prefer (or you are on Windows), you can install Docker and create a Docker image with the following commands: 50 | - build it by running `docker build -t athnlp - < Dockerfile` 51 | - get an interactive terminal on the image with `docker run -i -t athnlp bash` 52 | - run commands as you normally would (remember this is a very minimal Linux installation) 53 | If you want to run the image with a new version of the code, add the `--no-cache` option to the build. 54 | You will need to run the `wget` commands from `setup_dependencies.sh` yourself.
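For reference, the experiment configs in this repository expect those downloads under `resources/` (e.g. `resources/glove.6B.50d.txt.gz` for the FEVER lab and `resources/bert-base-uncased/` for the QA lab); the exact URLs are listed in `setup_dependencies.sh`.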
Make sure you give Docker enough disk space and memory 55 | 56 | -------------------------------------------------------------------------------- /athnlp/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/__init__.py -------------------------------------------------------------------------------- /athnlp/create_squad_data.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import json 3 | import os 4 | from argparse import ArgumentParser 5 | from operator import itemgetter 6 | 7 | import numpy as np 8 | 9 | parser = ArgumentParser() 10 | 11 | parser.add_argument("-dataset_file", default="data/squad/dev-v2.0.json", 12 | help="Path to the Devset of SQuAD 2.0", type=str) 13 | parser.add_argument("-output_dir", default="data/squad", 14 | help="Output folder were the files train.json and test.json will be saved") 15 | parser.add_argument("--percentage_train", 16 | type=float, 17 | help="Percentage of questions associated to a given paragraph to retain for training", default=70.0) 18 | parser.add_argument("--wiki_title", help="Wikipedia page title used as reference to create the dataset", 19 | default="Normans") 20 | parser.add_argument("--remove_impossible", action='store_true') 21 | 22 | 23 | def create_dataset_splits(dataset, percentage_train, remove_impossible=True): 24 | train = {'version': dataset['version'], 'data': []} 25 | test = {'version': dataset['version'], 'data': []} 26 | 27 | for example_set in dataset["data"]: 28 | curr_train = {"title": example_set["title"], "paragraphs": []} 29 | curr_test = {"title": example_set["title"], "paragraphs": []} 30 | for paragraph in example_set["paragraphs"]: 31 | num_questions = len(paragraph["qas"]) 32 | 33 | question_ids = np.arange(num_questions) 34 | 35 | np.random.shuffle(question_ids) 36 | 37 | ref_index = int(percentage_train * num_questions) 38 | train_indexes = question_ids[:ref_index] 39 | test_indexes = question_ids[ref_index:] 40 | 41 | train_paragraph = copy.copy(paragraph) 42 | train_qas = itemgetter(*train_indexes)(paragraph["qas"]) 43 | if isinstance(train_qas, dict): 44 | train_qas = [train_qas] 45 | if remove_impossible: 46 | train_qas = [x for x in train_qas if not x['is_impossible']] 47 | train_paragraph["qas"] = train_qas 48 | test_paragraph = copy.copy(paragraph) 49 | test_qas = itemgetter(*test_indexes)(paragraph["qas"]) 50 | if isinstance(test_qas, dict): 51 | test_qas = [test_qas] 52 | if remove_impossible: 53 | test_qas = [x for x in test_qas if not x['is_impossible']] 54 | test_paragraph["qas"] = test_qas 55 | 56 | curr_train["paragraphs"].append(train_paragraph) 57 | curr_test["paragraphs"].append(test_paragraph) 58 | 59 | train["data"].append(curr_train) 60 | test["data"].append(curr_test) 61 | 62 | return train, test 63 | 64 | 65 | def main(args): 66 | with open(args.dataset_file) as in_file: 67 | dataset = json.load(in_file) 68 | 69 | # We extract only data associated to the Wikipedia page of the Normans 70 | filtered_dataset = {'version': dataset['version'], 'data': []} 71 | 72 | for example in dataset["data"]: 73 | if example["title"] == args.wiki_title: 74 | filtered_dataset["data"].append(example) 75 | 76 | total_num_paragraphs = 0 77 | total_num_questions = 0 78 | 79 | for example_set in filtered_dataset["data"]: 80 | total_num_paragraphs += len(example_set["paragraphs"]) 81 | 82 | for paragraph in 
example_set["paragraphs"]: 83 | total_num_questions += len(paragraph["qas"]) 84 | 85 | print("Wikipedia page title: {}".format(args.wiki_title)) 86 | print("Total number of paragraphs: {}".format(total_num_paragraphs)) 87 | print("Total number of questions: {}".format(total_num_questions)) 88 | 89 | train, test = create_dataset_splits(filtered_dataset, args.percentage_train / 100, args.remove_impossible) 90 | 91 | print("-- Saving training and test files to directory: {}".format(args.output_dir)) 92 | with open(os.path.join(args.output_dir, "train.json"), mode="w") as out_file: 93 | json.dump(train, out_file) 94 | 95 | with open(os.path.join(args.output_dir, "test.json"), mode="w") as out_file: 96 | json.dump(test, out_file) 97 | 98 | 99 | if __name__ == "__main__": 100 | args = parser.parse_args() 101 | 102 | main(args) 103 | -------------------------------------------------------------------------------- /athnlp/experiments/fever.json: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "type": "feverlite", 4 | "token_indexers": { 5 | "tokens": { 6 | "type": "single_id", 7 | "lowercase_tokens": true 8 | } 9 | }, 10 | "wiki_tokenizer": { 11 | "type":"word", 12 | "word_splitter": { 13 | "type": "just_spaces" 14 | } 15 | }, 16 | "claim_tokenizer": { 17 | "type":"word", 18 | "word_splitter": { 19 | "type": "simple" 20 | } 21 | } 22 | }, 23 | "train_data_path": "data/fever/train.jsonl", 24 | "validation_data_path": "data/fever/validation.jsonl", 25 | "model": { 26 | "type": "fever", 27 | "text_field_embedder": { 28 | "tokens": { 29 | "type": "embedding", 30 | "pretrained_file": "resources/glove.6B.50d.txt.gz", 31 | "embedding_dim": 50, 32 | "trainable": false 33 | } 34 | }, 35 | "final_feedforward": { 36 | "input_dim": 100, 37 | "num_layers": 3, 38 | "hidden_dims": [100, 100, 2], 39 | "activations": ["relu","relu","linear"], 40 | "dropout": 0.0 41 | }, 42 | "initializer": [ 43 | [".*linear_layers.*weight", {"type": "xavier_normal"}] 44 | ] 45 | }, 46 | "iterator": { 47 | "type": "bucket", 48 | "sorting_keys": [["claim", "num_tokens"], ["evidence", "num_tokens"]], 49 | "batch_size": 32, 50 | "instances_per_epoch": 16000 51 | }, 52 | "trainer": { 53 | "num_epochs": 20, 54 | "cuda_device": -1, 55 | "validation_metric": "+accuracy", 56 | "optimizer": { 57 | "type": "sgd", 58 | "lr": 0.01 59 | 60 | } 61 | } 62 | } 63 | -------------------------------------------------------------------------------- /athnlp/experiments/nmt_multi30k.jsonnet: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "type": "multi30k", 4 | "language_pairs": { 5 | "source": "en", 6 | "target": "fr" 7 | }, 8 | "source_token_indexers": { 9 | "source_tokens": { 10 | "type": "single_id", 11 | "namespace": "source_tokens" 12 | } 13 | }, 14 | "target_token_indexers": { 15 | "target_tokens": { 16 | "type": "single_id", 17 | "namespace": "target_tokens" 18 | } 19 | } 20 | }, 21 | "train_data_path": "data/multi30k/val.lc.norm.tok.head-750", 22 | "validation_data_path": "data/multi30k/val.lc.norm.tok.head-250", 23 | "model": { 24 | "type": "nmt_seq2seq", 25 | "source_embedder": { 26 | "token_embedders": { 27 | "source_tokens": { 28 | "type": "embedding", 29 | "embedding_dim": 50, 30 | "trainable": true, 31 | "vocab_namespace": "source_tokens" 32 | } 33 | } 34 | }, 35 | "target_namespace": "target_tokens", 36 | // "attention" : { 37 | // "type" : "dot_product" 38 | // }, 39 | "encoder": { 40 | "type": "lstm", 
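// Note: with "bidirectional": true below, the encoder output dimension is
// 2 * hidden_size = 400, matching the decoder's hidden_size, since the decoder
// hidden state is initialised from the final encoder state.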
41 | "input_size": 50, 42 | "hidden_size": 200, 43 | "num_layers": 1, 44 | "dropout": 0.3, 45 | "bidirectional": true 46 | }, 47 | "decoder": { 48 | "type": "lstm", 49 | "input_size": 50, 50 | "hidden_size": 400 51 | }, 52 | "max_decoding_steps": 15, 53 | "beam_size": 1 54 | }, 55 | "iterator": { 56 | "type": "bucket", 57 | "sorting_keys": [ 58 | [ 59 | "source_tokens", 60 | "num_tokens" 61 | ] 62 | ], 63 | "batch_size": 1 64 | }, 65 | "trainer": { 66 | "optimizer": "adam", 67 | "num_epochs": 100, 68 | "patience": 10, 69 | "validation_metric": "-loss", 70 | "cuda_device": -1 71 | } 72 | } 73 | -------------------------------------------------------------------------------- /athnlp/experiments/qa_bert.jsonnet: -------------------------------------------------------------------------------- 1 | { 2 | "dataset_reader": { 3 | "lazy": false, 4 | "type": "bert_squad", 5 | "tokenizer": { 6 | "word_splitter": { 7 | "type": "bert-basic-wordpiece", 8 | "pretrained_model": "resources/bert-base-uncased/vocab.txt" 9 | } 10 | }, 11 | "token_indexers": { 12 | "bert": { 13 | "type": "bert-squad-indexer", 14 | "pretrained_model": "resources/bert-base-uncased/vocab.txt" 15 | } 16 | }, 17 | "version_2": true, 18 | "max_sequence_length": 384, 19 | "question_length_limit": 64, 20 | "doc_stride": 128 21 | }, 22 | "train_data_path": "data/squad/train.json", 23 | "validation_data_path": "data/squad/test.json", 24 | "model": { 25 | "type": "qa_bert", 26 | "bert_model": "resources/bert-base-uncased/", 27 | "dropout": 0.1 28 | }, 29 | "iterator": { 30 | "type": "bucket", 31 | "sorting_keys": [["tokens", "num_tokens"]], 32 | "batch_size": 2 33 | }, 34 | "trainer": { 35 | "optimizer": { 36 | "type": "adam", 37 | "lr": 0.0001 38 | }, 39 | "validation_metric": "+f1", 40 | "num_serialized_models_to_keep": 1, 41 | "num_epochs": 3, 42 | "grad_norm": 1.0, 43 | "patience": 5, 44 | "cuda_device": -1 45 | } 46 | } -------------------------------------------------------------------------------- /athnlp/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/models/__init__.py -------------------------------------------------------------------------------- /athnlp/models/fever_text_classification.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Dict, List, Any 2 | 3 | import allennlp 4 | import torch 5 | from allennlp.nn.util import get_text_field_mask 6 | from torch import nn 7 | from torch.nn import functional as F 8 | from allennlp.data import Vocabulary 9 | from allennlp.models import Model 10 | from allennlp.modules import TextFieldEmbedder, FeedForward 11 | from allennlp.nn import InitializerApplicator, RegularizerApplicator 12 | from allennlp.training.metrics import CategoricalAccuracy 13 | 14 | 15 | @Model.register("fever") 16 | class FEVERTextClassificationModel(Model): 17 | 18 | def __init__(self, 19 | vocab: Vocabulary, 20 | text_field_embedder: TextFieldEmbedder, 21 | final_feedforward: FeedForward, 22 | initializer: InitializerApplicator = InitializerApplicator(), 23 | regularizer: Optional[RegularizerApplicator] = None, 24 | ) -> None: 25 | 26 | super().__init__(vocab,regularizer) 27 | 28 | # Model components 29 | self._embedder = text_field_embedder 30 | self._feed_forward = final_feedforward 31 | 32 | # For accuracy and loss for training/evaluation of model 33 | self._accuracy = CategoricalAccuracy() 
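# Note: nn.CrossEntropyLoss applies log-softmax internally, so it expects
# the raw, unnormalised label logits computed in forward() below.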
34 | self._loss = nn.CrossEntropyLoss() 35 | 36 | # Initialize weights 37 | initializer(self) 38 | 39 | 40 | def forward(self, 41 | claim: Dict[str, torch.LongTensor], 42 | evidence: Dict[str, torch.LongTensor], 43 | label: torch.IntTensor = None, 44 | metadata: List[Dict[str, Any]] = None) -> Dict[str, torch.Tensor]: 45 | # pylint: disable=arguments-differ 46 | """ 47 | Parameters 48 | ---------- 49 | claim : Dict[str, torch.LongTensor] 50 | From a ``TextField`` 51 | The LongTensor Shape is typically ``(batch_size, sent_length)` 52 | evidence : Dict[str, torch.LongTensor] 53 | From a ``TextField`` 54 | The LongTensor Shape is typically ``(batch_size, sent_length)` 55 | label : torch.IntTensor, optional, (default = None) 56 | From a ``LabelField`` 57 | metadata : ``List[Dict[str, Any]]``, optional, (default = None) 58 | Metadata containing the original tokenization of the claim and 59 | evidence sentences with 'claim_tokens' and 'premise_tokens' keys respectively. 60 | Returns 61 | ------- 62 | An output dictionary consisting of: 63 | 64 | label_logits : torch.FloatTensor 65 | A tensor of shape ``(batch_size, num_labels)`` representing unnormalised log 66 | probabilities of the entailment label. 67 | label_probs : torch.FloatTensor 68 | A tensor of shape ``(batch_size, num_labels)`` representing probabilities of the 69 | entailment label. 70 | loss : torch.FloatTensor, optional 71 | A scalar loss to be optimised. 72 | """ 73 | 74 | 75 | # TODO - Delete this line when you start working on your solution 76 | raise NotImplementedError("Compute label logits (for supported and refuted) for the given Claim and Evidence input") 77 | 78 | # TODO - Uncomment the code below 79 | 80 | #label_logits = # TODO compute label logits for input 81 | #label_probs = F.softmax(label_logits, dim=-1) 82 | 83 | #output_dict = {"label_logits": label_logits, 84 | # "label_probs": label_probs} 85 | 86 | #if label is not None: 87 | # loss = self._loss(label_logits, label.long().view(-1)) 88 | # self._accuracy(label_logits, label) 89 | # output_dict["loss"] = loss 90 | 91 | #if metadata is not None: 92 | # output_dict["claim_tokens"] = [x["claim_tokens"] for x in metadata] 93 | # output_dict["evidence_tokens"] = [x["evidence_tokens"] for x in metadata] 94 | 95 | #return output_dict 96 | 97 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 98 | return { 99 | 'accuracy': self._accuracy.get_metric(reset), 100 | } 101 | -------------------------------------------------------------------------------- /athnlp/models/nmt_seq2seq.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Tuple 2 | 3 | import numpy 4 | from overrides import overrides 5 | import torch 6 | import torch.nn.functional as F 7 | from torch.nn.modules.linear import Linear 8 | from torch.nn.modules.rnn import LSTMCell 9 | from torch.nn.modules.rnn import GRUCell 10 | from allennlp.common.checks import ConfigurationError 11 | from allennlp.common.util import START_SYMBOL, END_SYMBOL 12 | from allennlp.data.vocabulary import Vocabulary 13 | from allennlp.modules import TextFieldEmbedder, Seq2SeqEncoder 14 | from allennlp.models.model import Model 15 | from allennlp.modules.token_embedders import Embedding 16 | from allennlp.nn import util 17 | from allennlp.nn.beam_search import BeamSearch 18 | from allennlp.training.metrics import BLEU 19 | from allennlp.nn.util import masked_softmax 20 | 21 | @Model.register("nmt_seq2seq") 22 | class NmtSeq2Seq(Model): 23 | """ 24 | This 
``NmtSeq2Seq`` class is an adaptation from the SimpleSeq2Seq :class:`Model` from the AllenNLP toolkit, 25 | which takes a sequence, encodes it, and then uses the encoded representations to decode another sequence. 26 | We have removed some functionality . 27 | 28 | Parameters 29 | ---------- 30 | vocab : ``Vocabulary``, required 31 | Vocabulary containing source and target vocabularies. They may be under the same namespace 32 | (`tokens`) or the target tokens can have a different namespace, in which case it needs to 33 | be specified as `target_namespace`. 34 | source_embedder : ``TextFieldEmbedder``, required 35 | Embedder for source side sequences 36 | target_namespace : ``str``, 37 | If the target side vocabulary is different from the source side's, you need to specify the 38 | target's namespace here. If not, we'll assume it is "tokens", which is also the default 39 | choice for the source side, and this might cause them to share vocabularies. 40 | target_embedding_dim : ``int``, optional (default = source_embedding_dim) 41 | You can specify an embedding dimensionality for the target side. If not, we'll use the same 42 | value as the source embedder's. 43 | encoder : ``Seq2SeqEncoder``, required 44 | The encoder of the "encoder/decoder" model 45 | decoder : ``Dict``, required 46 | The parameters for the decoder RNN cell of the "encoder/decoder" model 47 | max_decoding_steps : ``int`` 48 | Maximum length of decoded sequences. 49 | You can specify an embedding dimensionality for the target side. If not, we'll use the same 50 | value as the source embedder's. 51 | attention : ``Dict``, optional (default = None) 52 | If you want to use attention to get a dynamic summary of the encoder outputs at each step 53 | of decoding, this is the Dict that holds parameters for the appropriate attention function to compute similarity 54 | between the decoder hidden state and encoder outputs. 55 | beam_size : ``int``, optional (default = None) 56 | Width of the beam for beam search. If not specified, greedy decoding is used. 57 | scheduled_sampling_ratio : ``float``, optional (default = 0.) 58 | At each timestep during training, we sample a random number between 0 and 1, and if it is 59 | not less than this value, we use the ground truth labels for the whole batch. Else, we use 60 | the predictions from the previous time step for the whole batch. If this value is 0.0 61 | (default), this corresponds to teacher forcing, and if it is 1.0, it corresponds to not 62 | using target side ground truth labels. See the following paper for more information: 63 | `Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Bengio et al., 64 | 2015 `_. 65 | use_bleu : ``bool``, optional (default = True) 66 | If True, the BLEU metric will be calculated during validation. 
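    visualize_attention : ``bool``, optional (default = False)
        If True, validation runs the greedy ``_forward_loop`` instead of beam search,
        e.g. so that the computed attention weights can be inspected.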
67 | """ 68 | 69 | def __init__(self, 70 | vocab: Vocabulary, 71 | source_embedder: TextFieldEmbedder, 72 | target_namespace: str, 73 | encoder: Seq2SeqEncoder, 74 | decoder: Dict, 75 | max_decoding_steps: int, 76 | target_embedding_dim: int = None, 77 | attention: Dict = None, 78 | beam_size: int = None, 79 | scheduled_sampling_ratio: float = 0., 80 | use_bleu: bool = True, 81 | visualize_attention: bool = False) -> None: 82 | super(NmtSeq2Seq, self).__init__(vocab) 83 | 84 | self._scheduled_sampling_ratio = scheduled_sampling_ratio 85 | self._target_namespace = target_namespace 86 | # We need the start symbol to provide as the input at the first timestep of decoding, and 87 | # end symbol as a way to indicate the end of the decoded sequence. 88 | self._start_index = self.vocab.get_token_index(START_SYMBOL, self._target_namespace) 89 | self._end_index = self.vocab.get_token_index(END_SYMBOL, self._target_namespace) 90 | 91 | if use_bleu: 92 | pad_index = self.vocab.get_token_index(self.vocab._padding_token, self._target_namespace) # pylint: disable=protected-access 93 | self._bleu = BLEU(exclude_indices={pad_index, self._end_index, self._start_index}) 94 | else: 95 | self._bleu = None 96 | 97 | # At prediction time, we use a beam search to find the most likely sequence of target tokens. 98 | beam_size = beam_size or 1 99 | self._max_decoding_steps = max_decoding_steps 100 | self._beam_search = BeamSearch(self._end_index, max_steps=max_decoding_steps, beam_size=beam_size) 101 | 102 | # Dense embedding of source vocab tokens. 103 | self._source_embedder = source_embedder 104 | 105 | # Encodes the sequence of source embeddings into a sequence of hidden states. 106 | self._encoder = encoder 107 | 108 | num_classes = self.vocab.get_vocab_size(self._target_namespace) 109 | 110 | # Attention mechanism params applied to the encoder output for each step. 111 | self._attention = attention 112 | 113 | self._visualize_attention = visualize_attention 114 | 115 | # Dense embedding of vocab words in the target space. 116 | target_embedding_dim = target_embedding_dim or source_embedder.get_output_dim() 117 | self._target_embedder = Embedding(num_classes, target_embedding_dim) 118 | 119 | # Decoder output dim needs to be the same as the encoder output dim since we initialize the 120 | # hidden state of the decoder with the final hidden state of the encoder. 121 | self._encoder_output_dim = self._encoder.get_output_dim() 122 | # self._decoder_output_dim = self._encoder_output_dim 123 | 124 | self._decoder_input_dim = decoder["input_size"] 125 | # If using attention make sure the .jsonnet params reflect this architecture: 126 | # input_to_decoder_rnn = [prev_word + attended_context_vector] 127 | self._decoder_output_dim = decoder['hidden_size'] 128 | 129 | # We'll use an RNN cell as the recurrent cell that produces a hidden state 130 | # for the decoder at each time step. 131 | decoder_cell_type = decoder["type"] 132 | 133 | if decoder_cell_type == "gru": 134 | self._decoder_cell = GRUCell(self._decoder_input_dim, self._decoder_output_dim) 135 | elif decoder_cell_type == "lstm": 136 | self._decoder_cell = LSTMCell(self._decoder_input_dim, self._decoder_output_dim) 137 | else: 138 | raise ValueError("Dialogue encoder of type {} not supported yet!".format(decoder_cell_type)) 139 | 140 | # We project the hidden state from the decoder into the output vocabulary space 141 | # in order to get log probabilities of each target token, at each time step. 
142 | self._output_projection_layer = Linear(self._decoder_output_dim, num_classes) 143 | 144 | def take_step(self, 145 | last_predictions: torch.Tensor, 146 | state: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]: 147 | """ 148 | Take a decoding step. This is called by the beam search class. 149 | 150 | Parameters 151 | ---------- 152 | last_predictions : ``torch.Tensor`` 153 | A tensor of shape ``(group_size,)``, which gives the indices of the predictions 154 | during the last time step. 155 | state : ``Dict[str, torch.Tensor]`` 156 | A dictionary of tensors that contain the current state information 157 | needed to predict the next step, which includes the encoder outputs, 158 | the source mask, and the decoder hidden state and context. Each of these 159 | tensors has shape ``(group_size, *)``, where ``*`` can be any other number 160 | of dimensions. 161 | 162 | Returns 163 | ------- 164 | Tuple[torch.Tensor, Dict[str, torch.Tensor]] 165 | A tuple of ``(log_probabilities, updated_state)``, where ``log_probabilities`` 166 | is a tensor of shape ``(group_size, num_classes)`` containing the predicted 167 | log probability of each class for the next step, for each item in the group, 168 | while ``updated_state`` is a dictionary of tensors containing the encoder outputs, 169 | source mask, and updated decoder hidden state and context. 170 | 171 | Notes 172 | ----- 173 | We treat the inputs as a batch, even though ``group_size`` is not necessarily 174 | equal to ``batch_size``, since the group may contain multiple states 175 | for each source sentence in the batch. 176 | """ 177 | # shape: (group_size, num_classes) 178 | output_projections, state = self._prepare_output_projections(last_predictions, state) 179 | 180 | # shape: (group_size, num_classes) 181 | class_log_probabilities = F.log_softmax(output_projections, dim=-1) 182 | 183 | return class_log_probabilities, state 184 | 185 | @overrides 186 | def forward(self, # type: ignore 187 | source_tokens: Dict[str, torch.LongTensor], 188 | target_tokens: Dict[str, torch.LongTensor] = None) -> Dict[str, torch.Tensor]: 189 | # pylint: disable=arguments-differ 190 | """ 191 | Make foward pass with decoder logic for producing the entire target sequence. 192 | 193 | Parameters 194 | ---------- 195 | source_tokens : ``Dict[str, torch.LongTensor]`` 196 | The output of `TextField.as_array()` applied on the source `TextField`. This will be 197 | passed through a `TextFieldEmbedder` and then through an encoder. 198 | target_tokens : ``Dict[str, torch.LongTensor]``, optional (default = None) 199 | Output of `Textfield.as_array()` applied on target `TextField`. We assume that the 200 | target tokens are also represented as a `TextField`. 201 | 202 | Returns 203 | ------- 204 | Dict[str, torch.Tensor] 205 | """ 206 | state = self._encode(source_tokens) 207 | 208 | if target_tokens: 209 | state = self._init_decoder_state(state) 210 | # The `_forward_loop` decodes the input sequence and computes the loss during training 211 | # and validation. 
212 | output_dict = self._forward_loop(state, target_tokens) 213 | else: 214 | output_dict = {} 215 | 216 | if not self.training: 217 | state = self._init_decoder_state(state) 218 | if self._visualize_attention: 219 | output_dict = self._forward_loop(state, target_tokens) 220 | else: 221 | predictions = self._forward_beam_search(state) 222 | output_dict.update(predictions) 223 | if target_tokens and self._bleu: 224 | # shape: (batch_size, beam_size, max_sequence_length) 225 | top_k_predictions = output_dict["predictions"] 226 | # shape: (batch_size, max_predicted_sequence_length) 227 | best_predictions = top_k_predictions[:, 0, :] 228 | self._bleu(best_predictions, target_tokens[self._target_namespace]) 229 | 230 | return output_dict 231 | 232 | @overrides 233 | def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 234 | """ 235 | Finalize predictions. 236 | 237 | This method overrides ``Model.decode``, which gets called after ``Model.forward``, at test 238 | time, to finalize predictions. The logic for the decoder part of the encoder-decoder lives 239 | within the ``forward`` method. 240 | 241 | This method trims the output predictions to the first end symbol, replaces indices with 242 | corresponding tokens, and adds a field called ``predicted_tokens`` to the ``output_dict``. 243 | """ 244 | predicted_indices = output_dict["predictions"] 245 | if not isinstance(predicted_indices, numpy.ndarray): 246 | predicted_indices = predicted_indices.detach().cpu().numpy() 247 | all_predicted_tokens = [] 248 | for indices in predicted_indices: 249 | # Beam search gives us the top k results for each source sentence in the batch 250 | # but we just want the single best. 251 | if len(indices.shape) > 1: 252 | indices = indices[0] 253 | indices = list(indices) 254 | # Collect indices till the first end_symbol 255 | if self._end_index in indices: 256 | indices = indices[:indices.index(self._end_index)] 257 | predicted_tokens = [self.vocab.get_token_from_index(x, namespace=self._target_namespace) 258 | for x in indices] 259 | all_predicted_tokens.append(predicted_tokens) 260 | output_dict["predicted_tokens"] = all_predicted_tokens 261 | return output_dict 262 | 263 | def _encode(self, source_tokens: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 264 | # shape: (batch_size, max_input_sequence_length, encoder_input_dim) 265 | embedded_input = self._source_embedder(source_tokens) 266 | # shape: (batch_size, max_input_sequence_length) 267 | source_mask = util.get_text_field_mask(source_tokens) 268 | # shape: (batch_size, max_input_sequence_length, encoder_output_dim) 269 | encoder_outputs = self._encoder(embedded_input, source_mask) 270 | return { 271 | "source_mask": source_mask, 272 | "encoder_outputs": encoder_outputs, 273 | } 274 | 275 | def _init_decoder_state(self, state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 276 | batch_size = state["source_mask"].size(0) 277 | # shape: (batch_size, encoder_output_dim) 278 | final_encoder_output = util.get_final_encoder_states( 279 | state["encoder_outputs"], 280 | state["source_mask"], 281 | self._encoder.is_bidirectional()) 282 | # Initialize the decoder hidden state with the final output of the encoder. 
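# (For a bidirectional encoder, ``get_final_encoder_states`` concatenates the final
# forward state with the first-timestep backward state, so this vector spans the
# full encoder output dim; that is why decoder_output_dim must match it.)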
283 | # shape: (batch_size, decoder_output_dim) 284 | state["decoder_hidden"] = final_encoder_output 285 | # shape: (batch_size, decoder_output_dim) 286 | state["decoder_context"] = state["encoder_outputs"].new_zeros(batch_size, self._decoder_output_dim) 287 | return state 288 | 289 | def _forward_loop(self, 290 | state: Dict[str, torch.Tensor], 291 | target_tokens: Dict[str, torch.LongTensor] = None) -> Dict[str, torch.Tensor]: 292 | """ 293 | Make forward pass during training or do greedy search during prediction. 294 | 295 | Notes 296 | ----- 297 | We really only use the predictions from the method to test that beam search 298 | with a beam size of 1 gives the same results. 299 | """ 300 | # shape: (batch_size, max_input_sequence_length) 301 | source_mask = state["source_mask"] 302 | 303 | batch_size = source_mask.size()[0] 304 | 305 | if target_tokens: 306 | # shape: (batch_size, max_target_sequence_length) 307 | targets = target_tokens[self._target_namespace] 308 | 309 | _, target_sequence_length = targets.size() 310 | 311 | # The last input from the target is either padding or the end symbol. 312 | # Either way, we don't have to process it. 313 | num_decoding_steps = target_sequence_length - 1 314 | else: 315 | num_decoding_steps = self._max_decoding_steps 316 | 317 | # Initialize target predictions with the start index. 318 | # shape: (batch_size,) 319 | last_predictions = source_mask.new_full((batch_size,), fill_value=self._start_index) 320 | 321 | step_logits: List[torch.Tensor] = [] 322 | step_predictions: List[torch.Tensor] = [] 323 | 324 | for timestep in range(num_decoding_steps): 325 | if self.training and torch.rand(1).item() < self._scheduled_sampling_ratio: 326 | # Use gold tokens at test time and at a rate of 1 - _scheduled_sampling_ratio 327 | # during training. 328 | # shape: (batch_size,) 329 | input_choices = last_predictions 330 | elif not target_tokens: 331 | # shape: (batch_size,) 332 | input_choices = last_predictions 333 | else: 334 | # shape: (batch_size,) 335 | input_choices = targets[:, timestep] 336 | 337 | # shape: (batch_size, num_classes) 338 | output_projections, state = self._prepare_output_projections(input_choices, state) 339 | 340 | # list of tensors, shape: (batch_size, 1, num_classes) 341 | step_logits.append(output_projections.unsqueeze(1)) 342 | 343 | # shape: (batch_size, num_classes) 344 | class_probabilities = F.softmax(output_projections, dim=-1) 345 | 346 | # shape (predicted_classes): (batch_size,) 347 | _, predicted_classes = torch.max(class_probabilities, 1) 348 | 349 | # shape (predicted_classes): (batch_size,) 350 | last_predictions = predicted_classes 351 | 352 | step_predictions.append(last_predictions.unsqueeze(1)) 353 | 354 | # shape: (batch_size, num_decoding_steps) 355 | predictions = torch.cat(step_predictions, 1) 356 | 357 | output_dict = {"predictions": predictions} 358 | 359 | if target_tokens: 360 | # shape: (batch_size, num_decoding_steps, num_classes) 361 | logits = torch.cat(step_logits, 1) 362 | 363 | # Compute loss. 
364 | target_mask = util.get_text_field_mask(target_tokens) 365 | loss = self._get_loss(logits, targets, target_mask) 366 | output_dict["loss"] = loss 367 | 368 | return output_dict 369 | 370 | def _forward_beam_search(self, state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 371 | """Make forward pass during prediction using a beam search.""" 372 | batch_size = state["source_mask"].size()[0] 373 | start_predictions = state["source_mask"].new_full((batch_size,), fill_value=self._start_index) 374 | 375 | # shape (all_top_k_predictions): (batch_size, beam_size, num_decoding_steps) 376 | # shape (log_probabilities): (batch_size, beam_size) 377 | all_top_k_predictions, log_probabilities = self._beam_search.search( 378 | start_predictions, state, self.take_step) 379 | 380 | output_dict = { 381 | "class_log_probabilities": log_probabilities, 382 | "predictions": all_top_k_predictions, 383 | } 384 | return output_dict 385 | 386 | def _prepare_output_projections(self, 387 | last_predictions: torch.Tensor, 388 | state: Dict[str, torch.Tensor]) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]: # pylint: disable=line-too-long 389 | """ 390 | Decode current state and last prediction to produce projections 391 | into the target space, which can then be used to get probabilities of 392 | each target token for the next step. 393 | 394 | Inputs are the same as for `take_step()`. 395 | """ 396 | # shape: (group_size, max_input_sequence_length, encoder_output_dim) 397 | encoder_outputs = state["encoder_outputs"] 398 | 399 | # shape: (group_size, max_input_sequence_length) 400 | source_mask = state["source_mask"] 401 | 402 | # shape: (group_size, decoder_output_dim) 403 | decoder_hidden = state["decoder_hidden"] 404 | 405 | # shape: (group_size, decoder_output_dim) 406 | decoder_context = state["decoder_context"] 407 | 408 | # shape: (group_size, target_embedding_dim) 409 | embedded_input = self._target_embedder(last_predictions) 410 | 411 | # TODO: Compute attention right about here... 412 | decoder_input = embedded_input 413 | 414 | # shape (decoder_hidden): (batch_size, decoder_output_dim) 415 | # shape (decoder_context): (batch_size, decoder_output_dim) 416 | decoder_hidden, decoder_context = self._decoder_cell( 417 | decoder_input, 418 | (decoder_hidden, decoder_context)) 419 | 420 | state["decoder_hidden"] = decoder_hidden 421 | state["decoder_context"] = decoder_context 422 | 423 | # shape: (group_size, num_classes) 424 | output_projections = self._output_projection_layer(decoder_hidden) 425 | 426 | return output_projections, state 427 | 428 | # TODO: Implement attention mechanisms here 429 | def _compute_attention(self, 430 | decoder_hidden_state: torch.LongTensor = None, 431 | encoder_outputs: torch.LongTensor = None, 432 | encoder_outputs_mask: torch.LongTensor = None) -> torch.Tensor: 433 | """Apply attention over encoder outputs and decoder state. 434 | Parameters 435 | ---------- 436 | decoder_hidden_state : ``torch.LongTensor`` 437 | A tensor of shape ``(batch_size, decoder_output_dim)``, which contains the current decoder hidden state to be used 438 | as the 'query' to the attention computation 439 | during the last time step. 
440 | encoder_outputs : ``torch.LongTensor`` 441 | A tensor of shape ``(batch_size, max_input_sequence_length, encoder_output_dim)``, which contains all the 442 | encoder hidden states of the source tokens, i.e., the 'keys' to the attention computation 443 | encoder_mask : ``torch.LongTensor`` 444 | A tensor of shape (batch_size, max_input_sequence_length), which contains the mask of the encoded input. 445 | We want to avoid computing an attention score for positions of the source with zero-values (remember not all 446 | input sentences have the same length) 447 | 448 | Returns 449 | ------- 450 | torch.Tensor 451 | A tensor of shape (batch_size, encoder_output_dim) that contains the attended encoder outputs (aka context vector), 452 | i.e., we have ``applied`` the attention scores on the encoder hidden states. 453 | 454 | Notes 455 | ----- 456 | Don't forget to apply the final softmax over the **masked** encoder outputs! 457 | """ 458 | 459 | # Ensure mask is also a FloatTensor. Or else the multiplication within 460 | # attention will complain. 461 | # shape: (batch_size, max_input_sequence_length) 462 | encoder_outputs_mask = encoder_outputs_mask.float() 463 | 464 | # Main body of attention weights computation here 465 | 466 | return None 467 | 468 | @staticmethod 469 | def _get_loss(logits: torch.LongTensor, 470 | targets: torch.LongTensor, 471 | target_mask: torch.LongTensor) -> torch.Tensor: 472 | """ 473 | Compute loss. 474 | 475 | Takes logits (unnormalized outputs from the decoder) of size (batch_size, 476 | num_decoding_steps, num_classes), target indices of size (batch_size, num_decoding_steps+1) 477 | and corresponding masks of size (batch_size, num_decoding_steps+1) steps and computes cross 478 | entropy loss while taking the mask into account. 479 | 480 | The length of ``targets`` is expected to be greater than that of ``logits`` because the 481 | decoder does not need to compute the output corresponding to the last timestep of 482 | ``targets``. This method aligns the inputs appropriately to compute the loss. 483 | 484 | During training, we want the logit corresponding to timestep i to be similar to the target 485 | token from timestep i + 1. That is, the targets should be shifted by one timestep for 486 | appropriate comparison. Consider a single example where the target has 3 words, and 487 | padding is to 7 tokens. 488 | The complete sequence would correspond to w1 w2 w3
<E> <P> <P> 489 | and the mask would be 1 1 1 1 1 0 0 490 | and let the logits be l1 l2 l3 l4 l5 l6 491 | We actually need to compare: 492 | the sequence w1 w2 w3 <E> <P> <P> 493 | with masks 1 1 1 1 0 0 494 | against l1 l2 l3 l4 l5 l6 495 | (where the input was) <S> w1 w2 w3 <E> <P>
496 | """ 497 | # shape: (batch_size, num_decoding_steps) 498 | relevant_targets = targets[:, 1:].contiguous() 499 | 500 | # shape: (batch_size, num_decoding_steps) 501 | relevant_mask = target_mask[:, 1:].contiguous() 502 | 503 | return util.sequence_cross_entropy_with_logits(logits, relevant_targets, relevant_mask) 504 | 505 | @overrides 506 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 507 | all_metrics: Dict[str, float] = {} 508 | if self._bleu and not self.training: 509 | all_metrics.update(self._bleu.get_metric(reset=reset)) 510 | return all_metrics 511 | -------------------------------------------------------------------------------- /athnlp/models/qa_bert.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, Optional 2 | 3 | import torch 4 | from allennlp.data.vocabulary import Vocabulary 5 | from allennlp.models.model import Model 6 | from allennlp.nn import RegularizerApplicator 7 | from allennlp.nn.initializers import InitializerApplicator 8 | from overrides import overrides 9 | from pytorch_transformers.modeling_bert import BertModel 10 | 11 | 12 | @Model.register("qa_bert") 13 | class BertQuestionAnswering(Model): 14 | """ 15 | A QA model for SQuAD based on the AllenNLP Model ``BertForClassification`` that runs pretrained BERT, 16 | takes the pooled output, adds a Linear layer on top, and predicts two numbers: start and end span. 17 | 18 | Note that this is a somewhat non-AllenNLP-ish model architecture, 19 | in that it essentially requires you to use the "bert-pretrained" 20 | token indexer, rather than configuring whatever indexing scheme you like. 21 | See `allennlp/tests/fixtures/bert/bert_for_classification.jsonnet` 22 | for an example of what your config might look like. 23 | Parameters 24 | ---------- 25 | vocab : ``Vocabulary`` 26 | bert_model : ``Union[str, BertModel]`` 27 | The BERT model to be wrapped. If a string is provided, we will call 28 | ``BertModel.from_pretrained(bert_model)`` and use the result. 29 | num_labels : ``int``, optional (default: None) 30 | How many output classes to predict. If not provided, we'll use the 31 | vocab_size for the ``label_namespace``. 32 | index : ``str``, optional (default: "bert") 33 | The index of the token indexer that generates the BERT indices. 34 | label_namespace : ``str``, optional (default : "labels") 35 | Used to determine the number of classes if ``num_labels`` is not supplied. 36 | trainable : ``bool``, optional (default : True) 37 | If True, the weights of the pretrained BERT model will be updated during training. 38 | Otherwise, they will be frozen and only the final linear layer will be trained. 39 | initializer : ``InitializerApplicator``, optional 40 | If provided, will be used to initialize the final linear layer *only*. 41 | regularizer : ``RegularizerApplicator``, optional (default=``None``) 42 | If provided, will be used to calculate the regularization penalty during training. 
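    dropout : ``float``, optional (default = 0.0)
        Dropout probability intended for the additional layer(s) added on top of BERT;
        the provided skeleton accepts it but leaves wiring it up as part of the exercise.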
43 | """ 44 | 45 | def __init__(self, 46 | vocab: Vocabulary, 47 | bert_model: BertModel, 48 | dropout: float = 0.0, 49 | index: str = "bert", 50 | trainable: bool = True, 51 | initializer: InitializerApplicator = InitializerApplicator(), 52 | regularizer: Optional[RegularizerApplicator] = None, ) -> None: 53 | super().__init__(vocab, regularizer) 54 | 55 | self._index = index 56 | self.bert_model = PretrainedBertModel.load(bert_model) 57 | hidden_size = self.bert_model.config.hidden_size 58 | 59 | for param in self.bert_model.parameters(): 60 | param.requires_grad = trainable 61 | 62 | # 1. Instantiate any additional parts of your network 63 | 64 | # 2. DON'T FORGET TO INITIALIZE the additional parts of your network. 65 | 66 | # 3. Instantiate your metrics 67 | 68 | def forward(self, # type: ignore 69 | metadata: Dict, 70 | tokens: Dict[str, torch.LongTensor], 71 | span_start: torch.IntTensor = None, 72 | span_end: torch.IntTensor = None 73 | ) -> Dict[str, torch.Tensor]: 74 | # pylint: disable=arguments-differ 75 | """ 76 | Parameters 77 | ---------- 78 | tokens : Dict[str, torch.LongTensor] 79 | From a ``TextField`` (that has a bert-pretrained token indexer) 80 | span_start : torch.IntTensor, optional (default = None) 81 | A tensor of shape (batch_size, 1) which contains the start_position of the answer 82 | in the passage, or 0 if impossible. This is an `inclusive` token index. 83 | If this is given, we will compute a loss that gets included in the output dictionary. 84 | span_end : torch.IntTensor, optional (default = None) 85 | A tensor of shape (batch_size, 1) which contains the end_position of the answer 86 | in the passage, or 0 if impossible. This is an `inclusive` token index. 87 | If this is given, we will compute a loss that gets included in the output dictionary. 88 | Returns 89 | ------- 90 | An output dictionary consisting of: 91 | logits : torch.FloatTensor 92 | A tensor of shape ``(batch_size, num_labels)`` representing 93 | unnormalized log probabilities of the label. 94 | start_probs: torch.FloatTensor 95 | A tensor of shape ``(batch_size, num_labels)`` representing 96 | probabilities of the label. 97 | end_probs : torch.FloatTensor 98 | A tensor of shape ``(batch_size, num_labels)`` representing 99 | probabilities of the label. 100 | best_span: 101 | loss : torch.FloatTensor, optional 102 | A scalar loss to be optimised. 103 | """ 104 | input_ids = tokens[self._index] 105 | token_type_ids = tokens[f"{self._index}-type-ids"] 106 | input_mask = (input_ids != 0).long() 107 | 108 | # 1. Build model here 109 | 110 | # 2. Compute start_position and end_position and then get the best span 111 | # using allennlp.models.reading_comprehension.util.get_best_span() 112 | 113 | output_dict = {} 114 | 115 | # 4. Compute loss and accuracies. You should compute at least: 116 | # span_start accuracy, span_end accuracy and full span accuracy. 117 | 118 | # UNCOMMENT THIS LINE 119 | # output_dict["loss"] = 120 | 121 | # 5. Optionally you can compute the official squad metrics (exact match, f1). 
122 | # Instantiate the metric object in __init__ using allennlp.training.metrics.SquadEmAndF1() 123 | # When you call it, you need to give it the word tokens of the span (implement and call decode() below) 124 | # and the gold tokens found in metadata[i]['answer_texts'] 125 | 126 | return output_dict 127 | 128 | @overrides 129 | def decode(self, output_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]: 130 | """ 131 | Does a simple argmax over the probabilities, converts index to string label, and 132 | add ``"label"`` key to the dictionary with the result. 133 | """ 134 | pass 135 | 136 | def get_metrics(self, reset: bool = False) -> Dict[str, float]: 137 | # UNCOMMENT if you want to report official SQuAD metrics 138 | # exact_match, f1_score = self._squad_metrics.get_metric(reset) 139 | 140 | metrics = { 141 | 'start_acc': self._span_start_accuracy.get_metric(reset), 142 | 'end_acc': self._span_end_accuracy.get_metric(reset), 143 | 'span_acc': self._span_accuracy.get_metric(reset), 144 | # 'em': exact_match, 145 | # 'f1': f1_score, 146 | } 147 | return metrics 148 | 149 | 150 | class PretrainedBertModel: 151 | """ 152 | In some instances you may want to load the same BERT model twice 153 | (e.g. to use as a token embedder and also as a pooling layer). 154 | This factory provides a cache so that you don't actually have to load the model twice. 155 | """ 156 | _cache: Dict[str, BertModel] = {} 157 | 158 | @classmethod 159 | def load(cls, model_name: str, cache_model: bool = True) -> BertModel: 160 | if model_name in cls._cache: 161 | return PretrainedBertModel._cache[model_name] 162 | 163 | model = BertModel.from_pretrained(model_name) 164 | if cache_model: 165 | cls._cache[model_name] = model 166 | 167 | return model 168 | -------------------------------------------------------------------------------- /athnlp/models/rnn_language_model.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | 4 | class RNNModel(nn.Module): 5 | """Container module with an encoder, a recurrent module, and a decoder.""" 6 | 7 | def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5): 8 | """ 9 | Initialises the parameters of the RNN Language Model 10 | 11 | :param rnn_type: type of RNN cell 12 | :param ntoken: number of tokens in the vocabulary 13 | :param ninp: Dimensionality of the input vector 14 | :param nhid: Hidden size of the RNN cell 15 | :param nlayers: Number of layers of the RNN cell 16 | :param dropout: Dropout value applied to the RNN cell connections 17 | """ 18 | super(RNNModel, self).__init__() 19 | self.drop = nn.Dropout(dropout) 20 | self.encoder = nn.Embedding(ntoken, ninp) 21 | if rnn_type in ['LSTM', 'GRU']: 22 | self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout) 23 | else: 24 | try: 25 | nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type] 26 | except KeyError: 27 | raise ValueError("""An invalid option for `--model` was supplied, 28 | options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""") 29 | self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout) 30 | self.decoder = nn.Linear(nhid, ntoken) 31 | 32 | self.init_weights() 33 | 34 | self.rnn_type = rnn_type 35 | self.nhid = nhid 36 | self.nlayers = nlayers 37 | 38 | def init_weights(self): 39 | """ 40 | Initialises the parameters of the RNN model. 41 | 42 | N.B. 
This is optional because you may want to use the default PyTorch weight initialisation 43 | """ 44 | initrange = 0.1 45 | self.encoder.weight.data.uniform_(-initrange, initrange) 46 | self.decoder.bias.data.zero_() 47 | self.decoder.weight.data.uniform_(-initrange, initrange) 48 | 49 | def forward(self, input, hidden): 50 | """ 51 | Forward pass of the RNN language model. Useful information about how to use 52 | an RNNCell can be found in the PyTorch documentation: 53 | https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM 54 | https://pytorch.org/docs/stable/nn.html#torch.nn.GRU 55 | 56 | :param input: input features 57 | :param hidden: previous hidden state of the RNN language model 58 | :return: (output, updated_hidden_state) 59 | """ 60 | pass 61 | 62 | def init_hidden(self, bsz): 63 | """ 64 | Returns the initial hidden state of the RNN language model. It is a function that should be called before 65 | unrolling the RNN decoder. 66 | 67 | :param bsz: batch size 68 | :return: first hidden state of the RNN language model 69 | """ 70 | weight = next(self.parameters()) 71 | if self.rnn_type == 'LSTM': 72 | return (weight.new_zeros(self.nlayers, bsz, self.nhid), 73 | weight.new_zeros(self.nlayers, bsz, self.nhid)) 74 | else: 75 | return weight.new_zeros(self.nlayers, bsz, self.nhid) 76 | -------------------------------------------------------------------------------- /athnlp/nlm.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import math 3 | import time 4 | 5 | import torch 6 | 7 | from athnlp.readers.lm_corpus import Corpus 8 | 9 | parser = argparse.ArgumentParser(description='RNN/LSTM Language Model') 10 | parser.add_argument('--data', type=str, default='data/lm', 11 | help='location of the data corpus') 12 | parser.add_argument('--model_type', type=str, default='LSTM', 13 | help='type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)') 14 | parser.add_argument("--model_path", type=str, default='models/lm/default.pt', 15 | help='Path where to store the trained language model.') 16 | parser.add_argument('--emsize', type=int, default=50, 17 | help='size of word embeddings') 18 | parser.add_argument('--nhid', type=int, default=100, 19 | help='number of hidden units per layer') 20 | parser.add_argument('--nlayers', type=int, default=1, 21 | help='number of layers') 22 | parser.add_argument('--lr', type=float, default=20, 23 | help='initial learning rate') 24 | parser.add_argument('--clip', type=float, default=0.25, 25 | help='gradient clipping') 26 | parser.add_argument('--epochs', type=int, default=100, 27 | help='upper epoch limit') 28 | parser.add_argument('--batch_size', type=int, default=1, metavar='N', 29 | help='batch size') 30 | parser.add_argument('--bptt', type=int, default=10, 31 | help='sequence length') 32 | parser.add_argument('--dropout', type=float, default=0.2, 33 | help='dropout applied to layers (0 = no dropout)') 34 | parser.add_argument('--seed', type=int, default=1111, 35 | help='random seed') 36 | parser.add_argument('--cuda', action='store_true', 37 | help='use CUDA') 38 | parser.add_argument('--log-interval', type=int, default=200, metavar='N', 39 | help='report interval') 40 | parser.add_argument("--sentence_compl", action='store_true') 41 | 42 | 43 | # Starting from sequential data, batchify arranges the dataset into columns. 
44 | # For instance, with the alphabet as the sequence and batch size 4, we'd get 45 | # ┌ a g m s ┐ 46 | # │ b h n t │ 47 | # │ c i o u │ 48 | # │ d j p v │ 49 | # │ e k q w │ 50 | # └ f l r x ┘. 51 | # These columns are treated as independent by the model, which means that the 52 | # dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient 53 | # batch processing. 54 | def batchify(data, batch_size, device): 55 | # Work out how cleanly we can divide the dataset into bsz parts. 56 | num_batches = data.size(0) // batch_size 57 | # Trim off any extra elements that wouldn't cleanly fit (remainders). 58 | data = data.narrow(0, 0, num_batches * batch_size) 59 | # Evenly divide the data across the batch_size batches. 60 | data = data.view(batch_size, -1).t().contiguous() 61 | return data.to(device) 62 | 63 | 64 | ############################################################################### 65 | # Training code 66 | ############################################################################### 67 | 68 | def repackage_hidden(h): 69 | """Wraps hidden states in new Tensors, to detach them from their history.""" 70 | 71 | if isinstance(h, torch.Tensor): 72 | return h.detach() 73 | else: 74 | return tuple(repackage_hidden(v) for v in h) 75 | 76 | 77 | # get_batch subdivides the source data into chunks of length args.bptt. 78 | # If source is equal to the example output of the batchify function, with 79 | # a bptt-limit of 2, we'd get the following two Variables for i = 0: 80 | # ┌ a g m s ┐ ┌ b h n t ┐ 81 | # └ b h n t ┘ └ c i o u ┘ 82 | # Note that despite the name of the function, the subdivison of data is not 83 | # done along the batch dimension (i.e. dimension 1), since that was handled 84 | # by the batchify function. The chunks are along dimension 0, corresponding 85 | # to the seq_len dimension in the LSTM. 86 | 87 | def get_batch(source, i, bptt): 88 | seq_len = min(bptt, len(source) - 1 - i) 89 | data = source[i:i + seq_len] 90 | target = source[i + 1:i + 1 + seq_len].view(-1) 91 | return data, target 92 | 93 | 94 | def evaluate(model, criterion, eval_batch_size, corpus, data_source): 95 | """ 96 | Evaluates the performance of the model according to the specified criterion on the provided data source 97 | 98 | :param model: RNN language model 99 | :param criterion: criterion to be evaluated 100 | :param eval_batch_size: batch size (you can assume 1 for simplicity) 101 | :param corpus: instance of the reference corpus 102 | :param data_source: reference data for evaluation 103 | :return: the average score evaluated using the specified criterion 104 | """ 105 | # Turn on evaluation mode which disables dropout. 106 | model.eval() 107 | total_loss = 0. 
108 | ntokens = len(corpus.dictionary) 109 | hidden = model.init_hidden(eval_batch_size) 110 | with torch.no_grad(): 111 | for i in range(0, data_source.size(0) - 1, args.bptt): 112 | data, targets = get_batch(data_source, i, args.bptt) 113 | output, hidden = model(data, hidden) 114 | hidden = repackage_hidden(hidden) 115 | output_flat = output.view(-1, ntokens) 116 | # We multiply by the number of examples in the batch in order 117 | # to get the total loss and not the average (which is what 118 | # by default PyTorch Cross-entropy loss is doing behind 119 | # the scenes for us) 120 | total_loss += len(data) * criterion(output_flat, targets).item() 121 | return total_loss / (len(data_source) - 1) 122 | 123 | 124 | def train(model, criterion, corpus, train_data, lr, bptt, epoch): 125 | """ 126 | Trains the specified language model by minimising the provided criterion using as the training data. It trains the 127 | model for a given number of epoch with a fixed learning rate. 128 | 129 | :param model: RNN language model 130 | :param criterion: LM loss function 131 | :param corpus: Reference corpus 132 | :param train_data: training data for the LM task 133 | :param lr: SGD learning rate 134 | :param bptt: Sequence length 135 | :param epoch: Number of training epochs 136 | :return: Average training loss 137 | """ 138 | # Turn on training mode which enables dropout. 139 | model.train() 140 | total_loss = 0. 141 | start_time = time.time() 142 | ntokens = len(corpus.dictionary) 143 | hidden = model.init_hidden(args.batch_size) 144 | for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)): 145 | data, targets = get_batch(train_data, i, bptt) 146 | # Starting each batch, we detach the hidden state from how it was previously produced. 147 | # If we didn't, the model would try backpropagating all the way to start of the dataset. 148 | model.zero_grad() 149 | hidden = repackage_hidden(hidden) 150 | # TODO: run model forward pass obtaining '(output, hidden)' 151 | output, hidden = None, None 152 | # TODO: compute loss using the defined criterion obtaining 'loss'. 153 | loss = None 154 | # TODO: compute backpropagation calling the backward pass 155 | # N.B.: Here you should also update your model's weights 156 | 157 | # TODO (optional): implement gradient clipping to prevent 158 | # the exploding gradient problem in RNNs / LSTMs 159 | # check the PyTorch function `clip_grad_norm` 160 | 161 | total_loss += loss.item() 162 | 163 | if batch % args.log_interval == 0 and batch > 0: 164 | cur_loss = total_loss / args.log_interval 165 | elapsed = time.time() - start_time 166 | print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | ' 167 | 'loss {:5.2f} | ppl {:8.2f}'.format( 168 | epoch, batch, len(train_data) // args.bptt, lr, 169 | elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))) 170 | total_loss = 0 171 | start_time = time.time() 172 | 173 | 174 | def main(args): 175 | # Set the random seed manually for reproducibility. 
176 | torch.manual_seed(args.seed) 177 | if torch.cuda.is_available(): 178 | if not args.cuda: 179 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 180 | 181 | device = torch.device("cuda" if args.cuda else "cpu") 182 | 183 | ############################################################################### 184 | # Load data 185 | ############################################################################### 186 | 187 | corpus = Corpus(args.data) 188 | 189 | # training mode selected 190 | # Trains the model and then runs the evaluation on the test set 191 | if not args.sentence_compl: 192 | eval_batch_size = 1 193 | ############################################################################### 194 | # Load your train, valid and test data 195 | ############################################################################### 196 | train_data = batchify(corpus.train, args.batch_size, device) 197 | val_data = batchify(corpus.valid, eval_batch_size, device) 198 | test_data = batchify(corpus.test, eval_batch_size, device) 199 | 200 | ############################################################################### 201 | # Build the model 202 | ############################################################################### 203 | # TODO: model definition and loss definition 204 | model = None 205 | criterion = None 206 | # In order to optimise your model weights you have two options: 207 | # 1. Write the SGD update rule that uses the computed gradients to update the model's weights 208 | # 2. Use a PyTorch optimiser that computes the update step (https://pytorch.org/docs/stable/optim.html) 209 | 210 | # Loop over epochs. 211 | lr = args.lr 212 | best_val_loss = None 213 | 214 | # At any point you can hit Ctrl + C to break out of training early. 215 | try: 216 | for epoch in range(1, args.epochs + 1): 217 | epoch_start_time = time.time() 218 | train(model, criterion, corpus, train_data, lr, args.bptt, epoch) 219 | val_loss = evaluate(model, criterion, eval_batch_size, corpus, val_data) 220 | print('-' * 89) 221 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | ' 222 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time), 223 | val_loss, math.exp(val_loss))) 224 | print('-' * 89) 225 | # Save the model if the validation loss is the best we've seen so far. 226 | if not best_val_loss or val_loss < best_val_loss: 227 | with open(args.model_path, 'wb') as f: 228 | torch.save(model, f) 229 | best_val_loss = val_loss 230 | 231 | # HINT: when the loss is not decreasing anymore on the validation set, can you think of any method 232 | # to prevent the model from overfitting? 233 | except KeyboardInterrupt: 234 | print('-' * 89) 235 | print('Exiting from training early') 236 | 237 | # Load the best saved model. 238 | with open(args.model_path, 'rb') as f: 239 | model = torch.load(f) 240 | # After loading, the RNN params are not a contiguous chunk of memory; 241 | # this makes them contiguous, which speeds up the forward pass. 242 | model.rnn.flatten_parameters() 243 | 244 | # Run on test data. 245 | test_loss = evaluate(model, criterion, eval_batch_size, corpus, test_data) 246 | print('=' * 89) 247 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format( 248 | test_loss, math.exp(test_loss))) 249 | print('=' * 89) 250 | else: 251 | # we enabled the sentence completion mode 252 | 253 | # we first load the model 254 | # Load the best saved model.
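# Editor's sketch for the sentence-completion TODO further below (greedy,
# one next word; `prefix_ids` is a hypothetical 1-D tensor of word ids for
# the sentence prefix):
#     hidden = model.init_hidden(1)
#     for w in prefix_ids:
#         output, hidden = model(w.view(1, 1), hidden)
#     next_id = output.view(-1).argmax().item()
#     print(corpus.dictionary.idx2word[next_id])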
255 | with open(args.model_path, 'rb') as f: 256 | model = torch.load(f) 257 | # After loading, the RNN params are not a contiguous chunk of memory; 258 | # this makes them contiguous, which speeds up the forward pass. 259 | model.rnn.flatten_parameters() 260 | 261 | ############################################################################### 262 | # Use the pretrained LM at inference time 263 | ############################################################################### 264 | # TODO: Sentence completion solution 265 | 266 | 267 | if __name__ == "__main__": 268 | args = parser.parse_args() 269 | 270 | main(args) 271 | -------------------------------------------------------------------------------- /athnlp/nmt.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use,invalid-name 2 | from argparse import ArgumentParser 3 | import json 4 | import shutil 5 | import sys 6 | 7 | from allennlp.commands import main 8 | 9 | if __name__ == "__main__": 10 | argparse = ArgumentParser() 11 | argparse.add_argument('-c', "--config_file", type=str, default='athnlp/experiments/nmt_multi30k.jsonnet') 12 | argparse.add_argument('-m', "--model_path", default="/tmp/debugger_train") 13 | argparse.add_argument('-i', "--input_file", default="data/multi30k/val.lc.norm.tok.head-5.en.jsonl") 14 | argparse.add_argument("--predict", action='store_true') 15 | 16 | args = argparse.parse_args() 17 | config_file = args.config_file 18 | serialization_dir = args.model_path 19 | 20 | if args.predict: 21 | overrides = json.dumps({"model": {"visualize_attention": "false"}}) 22 | 23 | sys.argv = [ 24 | "allennlp", # command name, not used by main 25 | "predict", 26 | "--predictor", "seq2seq", 27 | "--include-package", "athnlp", 28 | "-o", overrides, 29 | serialization_dir, 30 | args.input_file, 31 | ] 32 | else: 33 | # Training will fail if the serialization directory already 34 | # has stuff in it. If you are running the same training loop 35 | # over and over again for debugging purposes, it will. 36 | # Hence we wipe it out in advance. 37 | # BE VERY CAREFUL NOT TO DO THIS FOR ACTUAL TRAINING! 38 | shutil.rmtree(serialization_dir, ignore_errors=True) 39 | 40 | # Use overrides to train on CPU.
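# (Editor's note: the "-o"/"--overrides" flag takes a JSON string that is
# merged on top of the jsonnet config, so passing e.g.
# '{"trainer": {"cuda_device": 0}}' here would request GPU 0 instead.)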
41 | overrides = json.dumps({"trainer": {"cuda_device": -1}}) 42 | 43 | # Assemble the command into sys.argv 44 | sys.argv = [ 45 | "allennlp", # command name, not used by main 46 | "train", 47 | config_file, 48 | "-s", serialization_dir, 49 | "--include-package", "athnlp", 50 | "-o", overrides, 51 | ] 52 | 53 | main() 54 | -------------------------------------------------------------------------------- /athnlp/qa.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use,invalid-name 2 | from argparse import ArgumentParser 3 | import json 4 | import shutil 5 | import sys 6 | 7 | from allennlp.commands import main 8 | 9 | if __name__ == "__main__": 10 | argparse = ArgumentParser() 11 | argparse.add_argument('-c', "--config_file", type=str, default='athnlp/experiments/qa_bert.jsonnet') 12 | argparse.add_argument('-m', "--model_path", default="/tmp/debugger_train") 13 | argparse.add_argument('-i', "--input_file", default="data/squad/dev-v2.0-small.json") 14 | argparse.add_argument("--predict", action='store_true') 15 | 16 | args = argparse.parse_args() 17 | config_file = args.config_file 18 | serialization_dir = args.model_path 19 | 20 | if args.predict: 21 | overrides = json.dumps({"model": {"visualize_attention": "true"}}) 22 | 23 | sys.argv = [ 24 | "allennlp", # command name, not used by main 25 | "predict", 26 | "--predictor", "qa_bert", 27 | "--include-package", "athnlp", 28 | "-o", overrides, 29 | serialization_dir, 30 | args.input_file, 31 | ] 32 | else: 33 | # Training will fail if the serialization directory already 34 | # has stuff in it. If you are running the same training loop 35 | # over and over again for debugging purposes, it will. 36 | # Hence we wipe it out in advance. 37 | # BE VERY CAREFUL NOT TO DO THIS FOR ACTUAL TRAINING! 38 | shutil.rmtree(serialization_dir, ignore_errors=True) 39 | 40 | # Use overrides to train on CPU. 
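# (Editor's note: hypothetical invocations from the repository root —
# `python athnlp/qa.py` to train with the default config, or
# `python athnlp/qa.py --predict` to predict with a previously saved model.)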
41 | overrides = json.dumps({"trainer": {"cuda_device": -1}}) 42 | 43 | # Assemble the command into sys.argv 44 | sys.argv = [ 45 | "allennlp", # command name, not used by main 46 | "train", 47 | config_file, 48 | "-s", serialization_dir, 49 | "--include-package", "athnlp", 50 | "-o", overrides, 51 | ] 52 | 53 | main() -------------------------------------------------------------------------------- /athnlp/readers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/readers/__init__.py -------------------------------------------------------------------------------- /athnlp/readers/bert_squad_reader.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | from collections import namedtuple 4 | from typing import Dict, List, Tuple 5 | from typing import Optional 6 | 7 | from allennlp.data.dataset_readers.dataset_reader import DatasetReader 8 | from allennlp.data.fields import Field, TextField, IndexField, MetadataField 9 | from allennlp.data.instance import Instance 10 | from allennlp.data.token_indexers import TokenIndexer 11 | from allennlp.data.tokenizers import Token 12 | from allennlp.data.tokenizers import WordTokenizer 13 | from allennlp.data.tokenizers.word_splitter import WordSplitter 14 | from overrides import overrides 15 | from pytorch_transformers.tokenization_bert import whitespace_tokenize, BertTokenizer 16 | 17 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name 18 | cls_token='[CLS]' 19 | sep_token='[SEP]' 20 | sequence_a_segment_id=0 21 | sequence_b_segment_id=1 22 | cls_token_segment_id=0 23 | 24 | 25 | @DatasetReader.register("bert_squad") 26 | class BertSquadReader(DatasetReader): 27 | 28 | def __init__(self, 29 | max_sequence_length: int, 30 | doc_stride: int, 31 | question_length_limit: int, 32 | lazy: bool = False, 33 | version_2: bool = False, 34 | token_indexers: Dict[str, TokenIndexer] = None, 35 | tokenizer: WordTokenizer = None) -> None: 36 | super().__init__(lazy) 37 | self._token_indexers = token_indexers or {} 38 | self._tokenizer = tokenizer or WordTokenizer() 39 | self._version_2 = version_2 40 | self.max_sequence_length = max_sequence_length 41 | self.doc_stride = doc_stride 42 | self.question_length_limit = question_length_limit 43 | 44 | def _read(self, file_path: str): 45 | """Reads a SQuAD json file and yields one AllenNLP Instance per question/document-span pair.""" 46 | with open(file_path, "r", encoding='utf-8') as reader: 47 | input_data = json.load(reader)["data"] 48 | 49 | def is_whitespace(c): 50 | if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: 51 | return True 52 | return False 53 | 54 | for entry in input_data: 55 | for paragraph in entry["paragraphs"]: 56 | paragraph_text = paragraph["context"] 57 | doc_tokens = [] 58 | char_to_word_offset = [] 59 | prev_is_whitespace = True 60 | for c in paragraph_text: 61 | if is_whitespace(c): 62 | prev_is_whitespace = True 63 | else: 64 | if prev_is_whitespace: 65 | doc_tokens.append(c) 66 | else: 67 | doc_tokens[-1] += c 68 | prev_is_whitespace = False 69 | char_to_word_offset.append(len(doc_tokens) - 1) 70 | 71 | for qa in paragraph["qas"]: 72 | qas_id = qa["id"] 73 | question_text = qa["question"] 74 | start_position = None 75 | end_position = None 76 | orig_answer_text = None 77 | is_impossible = False 78 | if self._version_2: 79 | is_impossible = qa["is_impossible"] 80 | # if
(len(qa["answers"]) != 1) and (not is_impossible): 81 | # raise ValueError( 82 | # "For training, each question should have exactly 1 answer.") 83 | if not is_impossible: 84 | answer = qa["answers"][0] 85 | orig_answer_text = answer["text"] 86 | answer_offset = answer["answer_start"] 87 | answer_length = len(orig_answer_text) 88 | start_position = char_to_word_offset[answer_offset] 89 | end_position = char_to_word_offset[answer_offset + answer_length - 1] 90 | # Only add answers where the text can be exactly recovered from the 91 | # document. If this CAN'T happen it's likely due to weird Unicode 92 | # stuff so we will just skip the example. 93 | # 94 | # Note that this means for training mode, every example is NOT 95 | # guaranteed to be preserved. 96 | actual_text = " ".join(doc_tokens[start_position:(end_position + 1)]) 97 | cleaned_answer_text = " ".join( 98 | whitespace_tokenize(orig_answer_text)) 99 | if actual_text.find(cleaned_answer_text) == -1: 100 | logger.warning("Could not find answer: '%s' vs. '%s'", 101 | actual_text, cleaned_answer_text) 102 | continue 103 | else: 104 | start_position = -1 105 | end_position = -1 106 | orig_answer_text = "" 107 | 108 | query_tokens = self._tokenizer.tokenize(question_text) 109 | 110 | if self.question_length_limit is not None and len(query_tokens) > self.question_length_limit: 111 | query_tokens = query_tokens[0:self.question_length_limit] 112 | 113 | tok_to_orig_index = [] 114 | orig_to_tok_index = [] 115 | all_doc_tokens = [] 116 | for (i, token) in enumerate(doc_tokens): 117 | orig_to_tok_index.append(len(all_doc_tokens)) 118 | sub_tokens = self._tokenizer.tokenize(token) 119 | for sub_token in sub_tokens: 120 | tok_to_orig_index.append(i) 121 | all_doc_tokens.append(sub_token) 122 | 123 | tok_start_position = None 124 | tok_end_position = None 125 | if is_impossible: 126 | tok_start_position = -1 127 | tok_end_position = -1 128 | else: 129 | tok_start_position = orig_to_tok_index[start_position] 130 | if end_position < len(doc_tokens) - 1: 131 | tok_end_position = orig_to_tok_index[end_position + 1] - 1 132 | else: 133 | tok_end_position = len(all_doc_tokens) - 1 134 | (tok_start_position, tok_end_position) = _improve_answer_span( 135 | all_doc_tokens, tok_start_position, tok_end_position, self._tokenizer, 136 | orig_answer_text) 137 | 138 | # The -3 accounts for [CLS], [SEP] and [SEP] 139 | max_tokens_for_doc = self.max_sequence_length - len(query_tokens) - 3 140 | 141 | # We can have documents that are longer than the maximum sequence length. 142 | # To deal with this we do a sliding window approach, where we take chunks 143 | # of up to our max length with a stride of `doc_stride`. 144 | _DocSpan = namedtuple( # pylint: disable=invalid-name 145 | "DocSpan", ["start", "length"]) 146 | doc_spans = [] 147 | start_offset = 0 148 | while start_offset < len(all_doc_tokens): 149 | length = len(all_doc_tokens) - start_offset 150 | if length > max_tokens_for_doc: 151 | length = max_tokens_for_doc 152 | doc_spans.append(_DocSpan(start=start_offset, length=length)) 153 | if start_offset + length == len(all_doc_tokens): 154 | break 155 | start_offset += min(length, self.doc_stride) 156 | 157 | for (doc_span_index, doc_span) in enumerate(doc_spans): 158 | tokens = [] 159 | token_to_orig_map = {} 160 | token_is_max_context = {} 161 | 162 | # p_mask: mask with 1 for tokens that cannot be in the answer (0 for tokens which can be in an answer) 163 | # Original TF implem also keeps the classification token (set to 0) (not sure why...)
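# (Editor's note: the p_mask bookkeeping from the original implementation is
# kept below only as commented-out reference; it is unused in this port.)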
164 | # p_mask = [] 165 | 166 | # CLS token at the beginning 167 | # tokens.append(Token(cls_token)) 168 | # p_mask.append(0) 169 | cls_index = 0 170 | 171 | # Query 172 | for token in query_tokens: 173 | tokens.append(token) 174 | # p_mask.append(1) 175 | 176 | # SEP token 177 | tokens.append(Token(sep_token)) 178 | # p_mask.append(1) 179 | 180 | # Paragraph 181 | paragraph_start_id = len(tokens) 182 | for i in range(doc_span.length): 183 | split_token_index = doc_span.start + i 184 | token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index] 185 | 186 | is_max_context = _check_is_max_context(doc_spans, doc_span_index, 187 | split_token_index) 188 | token_is_max_context[len(tokens)] = is_max_context 189 | tokens.append(all_doc_tokens[split_token_index]) 190 | # p_mask.append(0) 191 | paragraph_len = doc_span.length 192 | 193 | # SEP token 194 | tokens.append(Token(sep_token)) 195 | # p_mask.append(1) 196 | 197 | span_is_impossible = is_impossible 198 | start_position = None 199 | end_position = None 200 | if not span_is_impossible: 201 | # For training, if our document chunk does not contain an annotation 202 | # we throw it out, since there is nothing to predict. 203 | doc_start = doc_span.start 204 | doc_end = doc_span.start + doc_span.length - 1 205 | out_of_span = False 206 | if not (tok_start_position >= doc_start and 207 | tok_end_position <= doc_end): 208 | out_of_span = True 209 | if out_of_span: 210 | span_is_impossible = True 211 | else: 212 | # we offset by 2 to account for the [CLS] and [SEP] tokens (before/after the question) 213 | # at the beginning of the sequence. NOTE: we don't add the [CLS] token, as it will get 214 | # added later in the indexer process, therefore start_position will be off by 1. 215 | doc_offset = len(query_tokens) + 2 216 | start_position = tok_start_position - doc_start + doc_offset 217 | end_position = tok_end_position - doc_start + doc_offset 218 | 219 | if span_is_impossible: 220 | start_position = cls_index 221 | end_position = cls_index 222 | 223 | passage_offsets = [] 224 | token_idx = 0 225 | 226 | for token in doc_tokens: 227 | passage_offsets.append((token_idx, token_idx + len(token))) 228 | token_idx += len(token) 229 | 230 | instance = self.text_to_instance( 231 | qas_id=qas_id, 232 | question_text=question_text, 233 | passage_tokens=tokens[paragraph_start_id: paragraph_start_id + paragraph_len], 234 | bert_tokens=tokens, 235 | orig_answer_text=orig_answer_text, 236 | start_position=start_position, 237 | end_position=end_position, 238 | answer_texts=[answer["text"] for answer in qa["answers"]], 239 | passage_offsets=passage_offsets, 240 | passage_text=paragraph["context"] 241 | ) 242 | 243 | yield instance 244 | 245 | @overrides 246 | def text_to_instance(self, # type: ignore 247 | qas_id: str, 248 | question_text: str, 249 | bert_tokens: List[Token], 250 | passage_tokens: List[Token], 251 | orig_answer_text: str, 252 | start_position: int, 253 | end_position: int, 254 | answer_texts: List[str], 255 | passage_offsets: List[Tuple[int, int]], 256 | passage_text: str) -> Optional[Instance]: 257 | fields: Dict[str, Field] = {} 258 | tokens_field = TextField(bert_tokens, self._token_indexers) 259 | fields['tokens'] = tokens_field 260 | 261 | fields['span_start'] = IndexField(start_position, tokens_field) 262 | fields['span_end'] = IndexField(end_position, tokens_field) 263 | metadata = { 264 | 'question_text': question_text, 265 | 'qas_id': qas_id, 266 | 'token_offsets': passage_offsets, 267 | 'original_passage': passage_text 268 | } 269 | 270 | if
answer_texts: 271 | metadata['answer_texts'] = answer_texts 272 | 273 | fields['metadata'] = MetadataField(metadata) 274 | 275 | return Instance(fields) 276 | 277 | 278 | def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, 279 | orig_answer_text): 280 | """Returns tokenized answer spans that better match the annotated answer.""" 281 | 282 | # The SQuAD annotations are character based. We first project them to 283 | # whitespace-tokenized words. But then after WordPiece tokenization, we can 284 | # often find a "better match". For example: 285 | # 286 | # Question: What year was John Smith born? 287 | # Context: The leader was John Smith (1895-1943). 288 | # Answer: 1895 289 | # 290 | # The original whitespace-tokenized answer will be "(1895-1943).". However 291 | # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match 292 | # the exact answer, 1895. 293 | # 294 | # However, this is not always possible. Consider the following: 295 | # 296 | # Question: What country is the top exporter of electronics? 297 | # Context: The Japanese electronics industry is the largest in the world. 298 | # Answer: Japan 299 | # 300 | # In this case, the annotator chose "Japan" as a character sub-span of 301 | # the word "Japanese". Since our WordPiece tokenizer does not split 302 | # "Japanese", we just use "Japanese" as the annotation. This is fairly rare 303 | # in SQuAD, but does happen. 304 | tok_answer_text = " ".join(map(lambda x: x.text, tokenizer.tokenize(orig_answer_text))) 305 | 306 | for new_start in range(input_start, input_end + 1): 307 | for new_end in range(input_end, new_start - 1, -1): 308 | text_span = " ".join(map(lambda x: x.text, doc_tokens[new_start:(new_end + 1)])) 309 | if text_span == tok_answer_text: 310 | return (new_start, new_end) 311 | 312 | return (input_start, input_end) 313 | 314 | 315 | def _check_is_max_context(doc_spans, cur_span_index, position): 316 | """Check if this is the 'max context' doc span for the token.""" 317 | 318 | # Because of the sliding window approach taken to scoring documents, a single 319 | # token can appear in multiple document spans. E.g. 320 | # Doc: the man went to the store and bought a gallon of milk 321 | # Span A: the man went to the 322 | # Span B: to the store and bought 323 | # Span C: and bought a gallon of 324 | # ... 325 | # 326 | # Now the word 'bought' will have two scores from spans B and C. We only 327 | # want to consider the score with "maximum context", which we define as 328 | # the *minimum* of its left and right context (the *sum* of left and 329 | # right context will always be the same, of course). 330 | # 331 | # In the example the maximum context for 'bought' would be span C since 332 | # it has 1 left context and 3 right context, while span B has 4 left context 333 | # and 0 right context.
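# (Editor's arithmetic for the example above, using the score formula in the
# code below: both spans contain 5 tokens, so span B scores
# min(4, 0) + 0.01 * 5 = 0.05 and span C scores min(1, 3) + 0.01 * 5 = 1.05;
# span C wins as the 'max context' span for 'bought'.)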
334 | best_score = None 335 | best_span_index = None 336 | for (span_index, doc_span) in enumerate(doc_spans): 337 | end = doc_span.start + doc_span.length - 1 338 | if position < doc_span.start: 339 | continue 340 | if position > end: 341 | continue 342 | num_left_context = position - doc_span.start 343 | num_right_context = end - position 344 | score = min(num_left_context, num_right_context) + 0.01 * doc_span.length 345 | if best_score is None or score > best_score: 346 | best_score = score 347 | best_span_index = span_index 348 | 349 | return cur_span_index == best_span_index 350 | 351 | 352 | @WordSplitter.register("bert-basic-wordpiece") 353 | class BertBasicWordSplitter(WordSplitter): 354 | """ 355 | The ``BasicWordSplitter`` from the BERT implementation. 356 | This is used to split a sentence into words. 357 | Then the ``BertTokenIndexer`` converts each word into wordpieces. 358 | """ 359 | def __init__(self, 360 | pretrained_model: str, 361 | do_lower_case: bool = True, 362 | never_split: Optional[List[str]] = None) -> None: 363 | self.bert_tokenizer = BertTokenizer.from_pretrained(pretrained_model, do_lower_case=do_lower_case) 364 | 365 | @overrides 366 | def split_words(self, sentence: str) -> List[Token]: 367 | return [Token(text) for text in self.bert_tokenizer.tokenize(sentence)] 368 | 369 | 370 | -------------------------------------------------------------------------------- /athnlp/readers/brown_pos_corpus.py: -------------------------------------------------------------------------------- 1 | from nltk.corpus import brown 2 | from athnlp.readers.sequence_dictionary import SequenceDictionary 3 | from athnlp.readers.sequence import Sequence 4 | 5 | 6 | class BrownPosTag: 7 | 8 | def __init__( 9 | self, 10 | max_sent_len: int = 15, 11 | num_train_sents=10000, 12 | num_dev_sents=1000, 13 | num_test_sents=1000, 14 | mapping_file="athnlp/readers/en-brown.map"): 15 | 16 | self.train = [] 17 | self.dev = [] 18 | self.test = [] 19 | self.dictionary = SequenceDictionary() 20 | 21 | # Build mapping of POS tags 22 | self.mapping = {} 23 | if mapping_file is not None: 24 | for line in open(mapping_file): 25 | coarse, fine = line.strip().split("\t") 26 | self.mapping[coarse.lower()] = fine.lower() 27 | 28 | # Initialize noun to be tag zero so that it is the default tag 29 | self.dictionary.y_dict.add("noun") 30 | 31 | # Preprocess dataset splits 32 | sents = brown.tagged_sents() 33 | last_id = 0 34 | self.train, last_id = self.preprocess_split(sents, last_id, num_train_sents, max_sent_len, "train_") 35 | self.dev, last_id = self.preprocess_split(sents, last_id, num_dev_sents, max_sent_len, prefix_id="dev_") 36 | self.test, _ = self.preprocess_split(sents, last_id, num_test_sents, max_sent_len, prefix_id="test_") 37 | 38 | def preprocess_split(self, input_dataset, last_id, num_sents, max_sent_len, prefix_id=""): 39 | """Add necessary pre-processing (e.g., convert to universal tagset) to sentences of the dataset.""" 40 | dataset = [] 41 | for sent in input_dataset[last_id:]: 42 | last_id += 1 43 | if type(sent) == tuple or len(sent) > max_sent_len or len(sent) <= 1: 44 | continue 45 | dataset.append(self.preprocess_sent(sent, prefix_id + str(len(dataset)))) 46 | if len(dataset) == num_sents: 47 | break 48 | 49 | return dataset, last_id 50 | 51 | def preprocess_sent(self, sent, sent_id): 52 | """Every word and tag of the sentence gets mapped to a unique id stored in a SequenceDictionary instance.""" 53 | ids_x = [] 54 | ids_y = [] 55 | for word, tag in sent: 56 | tag = tag.lower() 57 | if
tag not in self.mapping: 58 | # Add unk tags to mapping dict 59 | self.mapping[tag] = "noun" 60 | universal_tag = self.mapping[tag] 61 | word_id = self.dictionary.x_dict.add(word) 62 | tag_id = self.dictionary.y_dict.add(universal_tag) 63 | ids_x.append(word_id) 64 | ids_y.append(tag_id) 65 | return Sequence(self.dictionary, ids_x, ids_y, sent_id) 66 | 67 | 68 | if __name__ == '__main__': 69 | corpus = BrownPosTag() 70 | print("vocabulary size: ", len(corpus.dictionary.x_dict)) 71 | print("train/dev/test set length: ", len(corpus.train), len(corpus.dev), len(corpus.test)) 72 | print("First train sentence: ", corpus.train[0]) 73 | print("First dev sentence: ", corpus.dev[0]) 74 | print("First test sentence: ", corpus.test[0]) 75 | -------------------------------------------------------------------------------- /athnlp/readers/en-brown.map: -------------------------------------------------------------------------------- 1 | ' . 2 | '' . 3 | ( . 4 | (-HL . 5 | ) . 6 | )-HL . 7 | * ADV 8 | *-HL ADV 9 | *-NC ADV 10 | *-TL ADV 11 | , . 12 | ,-HL . 13 | ,-NC . 14 | ,-TL . 15 | -- . 16 | ---HL . 17 | . . 18 | .-HL . 19 | .-NC . 20 | .-TL . 21 | : . 22 | :-HL . 23 | :-TL . 24 | ABL PRT 25 | ABN PRT 26 | ABN-HL PRT 27 | ABN-NC PRT 28 | ABN-TL PRT 29 | ABX DET 30 | AP ADJ 31 | AP$ PRT 32 | AP+AP-NC ADJ 33 | AP-HL ADJ 34 | AP-NC ADJ 35 | AP-TL ADJ 36 | AT DET 37 | AT-HL DET 38 | AT-NC DET 39 | AT-TL DET 40 | AT-TL-HL DET 41 | BE VERB 42 | BE-HL VERB 43 | BE-TL VERB 44 | BED VERB 45 | BED* VERB 46 | BED-NC VERB 47 | BEDZ VERB 48 | BEDZ* VERB 49 | BEDZ-HL VERB 50 | BEDZ-NC VERB 51 | BEG VERB 52 | BEM VERB 53 | BEM* VERB 54 | BEM-NC VERB 55 | BEN VERB 56 | BEN-TL VERB 57 | BER VERB 58 | BER* VERB 59 | BER*-NC VERB 60 | BER-HL VERB 61 | BER-NC VERB 62 | BER-TL VERB 63 | BEZ VERB 64 | BEZ* VERB 65 | BEZ-HL VERB 66 | BEZ-NC VERB 67 | BEZ-TL VERB 68 | CC CONJ 69 | CC-HL CONJ 70 | CC-NC CONJ 71 | CC-TL CONJ 72 | CC-TL-HL CONJ 73 | CD NUM 74 | CD$ NOUN 75 | CD-HL NUM 76 | CD-NC NUM 77 | CD-TL NUM 78 | CD-TL-HL NUM 79 | CS ADP 80 | CS-HL ADP 81 | CS-NC ADP 82 | CS-TL ADP 83 | DO VERB 84 | DO* VERB 85 | DO*-HL VERB 86 | DO+PPSS X 87 | DO-HL VERB 88 | DO-NC VERB 89 | DO-TL VERB 90 | DOD VERB 91 | DOD* VERB 92 | DOD*-TL VERB 93 | DOD-NC VERB 94 | DOZ VERB 95 | DOZ* VERB 96 | DOZ*-TL VERB 97 | DOZ-HL VERB 98 | DOZ-TL VERB 99 | DT DET 100 | DT$ DET 101 | DT+BEZ PRT 102 | DT+BEZ-NC PRT 103 | DT+MD PRT 104 | DT-HL DET 105 | DT-NC DET 106 | DT-TL DET 107 | DTI DET 108 | DTI-HL DET 109 | DTI-TL DET 110 | DTS DET 111 | DTS+BEZ PRT 112 | DTS-HL DET 113 | DTX DET 114 | EX PRT 115 | EX+BEZ PRT 116 | EX+HVD PRT 117 | EX+HVZ PRT 118 | EX+MD PRT 119 | EX-HL PRT 120 | EX-NC PRT 121 | FW-* X 122 | FW-*-TL X 123 | FW-AT X 124 | FW-AT+NN-TL X 125 | FW-AT+NP-TL X 126 | FW-AT-HL X 127 | FW-AT-TL X 128 | FW-BE X 129 | FW-BER X 130 | FW-BEZ X 131 | FW-CC X 132 | FW-CC-TL X 133 | FW-CD X 134 | FW-CD-TL X 135 | FW-CS X 136 | FW-DT X 137 | FW-DT+BEZ X 138 | FW-DTS X 139 | FW-HV X 140 | FW-IN X 141 | FW-IN+AT X 142 | FW-IN+AT-T X 143 | FW-IN+AT-TL X 144 | FW-IN+NN X 145 | FW-IN+NN-TL X 146 | FW-IN+NP-TL X 147 | FW-IN-TL X 148 | FW-JJ X 149 | FW-JJ-NC X 150 | FW-JJ-TL X 151 | FW-JJR X 152 | FW-JJT X 153 | FW-NN X 154 | FW-NN$ X 155 | FW-NN$-TL X 156 | FW-NN-NC X 157 | FW-NN-TL X 158 | FW-NN-TL-NC X 159 | FW-NNS X 160 | FW-NNS-NC X 161 | FW-NNS-TL X 162 | FW-NP X 163 | FW-NP-TL X 164 | FW-NPS X 165 | FW-NPS-TL X 166 | FW-NR X 167 | FW-NR-TL X 168 | FW-OD-NC X 169 | FW-OD-TL X 170 | FW-PN X 171 | FW-PP$ X 172 | FW-PP$-NC X 173 | FW-PP$-TL X 
174 | FW-PPL X 175 | FW-PPL+VBZ X 176 | FW-PPO X 177 | FW-PPO+IN X 178 | FW-PPS X 179 | FW-PPSS X 180 | FW-PPSS+HV X 181 | FW-QL X 182 | FW-RB X 183 | FW-RB+CC X 184 | FW-RB-TL X 185 | FW-TO+VB X 186 | FW-UH X 187 | FW-UH-NC X 188 | FW-UH-TL X 189 | FW-VB X 190 | FW-VB-NC X 191 | FW-VB-TL X 192 | FW-VBD X 193 | FW-VBD-TL X 194 | FW-VBG X 195 | FW-VBG-TL X 196 | FW-VBN X 197 | FW-VBZ X 198 | FW-WDT X 199 | FW-WPO X 200 | FW-WPS X 201 | HV VERB 202 | HV* VERB 203 | HV+TO VERB 204 | HV-HL VERB 205 | HV-NC VERB 206 | HV-TL VERB 207 | HVD VERB 208 | HVD* VERB 209 | HVD-HL VERB 210 | HVG VERB 211 | HVG-HL VERB 212 | HVN VERB 213 | HVZ VERB 214 | HVZ* VERB 215 | HVZ-NC VERB 216 | HVZ-TL VERB 217 | IN ADP 218 | IN+IN ADP 219 | IN+PPO ADP 220 | IN-HL ADP 221 | IN-NC ADP 222 | IN-TL ADP 223 | IN-TL-HL ADP 224 | JJ ADJ 225 | JJ$-TL PRT 226 | JJ+JJ-NC ADJ 227 | JJ-HL ADJ 228 | JJ-NC ADJ 229 | JJ-TL ADJ 230 | JJ-TL-HL ADJ 231 | JJ-TL-NC ADJ 232 | JJR ADJ 233 | JJR+CS ADJ 234 | JJR-HL ADJ 235 | JJR-NC ADJ 236 | JJR-TL ADJ 237 | JJS ADJ 238 | JJS-HL ADJ 239 | JJS-TL ADJ 240 | JJT ADJ 241 | JJT-HL ADJ 242 | JJT-NC ADJ 243 | JJT-TL ADJ 244 | MD VERB 245 | MD* VERB 246 | MD*-HL VERB 247 | MD+HV VERB 248 | MD+PPSS VERB 249 | MD+TO VERB 250 | MD-HL VERB 251 | MD-NC VERB 252 | MD-TL VERB 253 | NIL X 254 | NN NOUN 255 | NN$ NOUN 256 | NN$-HL NOUN 257 | NN$-TL NOUN 258 | NN+BEZ PRT 259 | NN+BEZ-TL PRT 260 | NN+HVD-TL PRT 261 | NN+HVZ PRT 262 | NN+HVZ-TL PRT 263 | NN+IN NOUN 264 | NN+MD PRT 265 | NN+NN-NC NOUN 266 | NN-HL NOUN 267 | NN-NC NOUN 268 | NN-TL NOUN 269 | NN-TL-HL NOUN 270 | NN-TL-NC NOUN 271 | NNS NOUN 272 | NNS$ NOUN 273 | NNS$-HL NOUN 274 | NNS$-NC NOUN 275 | NNS$-TL NOUN 276 | NNS$-TL-HL NOUN 277 | NNS+MD PRT 278 | NNS-HL NOUN 279 | NNS-NC NOUN 280 | NNS-TL NOUN 281 | NNS-TL-HL NOUN 282 | NNS-TL-NC NOUN 283 | NP NOUN 284 | NP$ NOUN 285 | NP$-HL NOUN 286 | NP$-TL NOUN 287 | NP+BEZ PRT 288 | NP+BEZ-NC PRT 289 | NP+HVZ PRT 290 | NP+HVZ-NC PRT 291 | NP+MD PRT 292 | NP-HL NOUN 293 | NP-NC NOUN 294 | NP-TL NOUN 295 | NP-TL-HL NOUN 296 | NPS NOUN 297 | NPS$ NOUN 298 | NPS$-HL NOUN 299 | NPS$-TL NOUN 300 | NPS-HL NOUN 301 | NPS-NC NOUN 302 | NPS-TL NOUN 303 | NR NOUN 304 | NR$ NOUN 305 | NR$-TL NOUN 306 | NR+MD PRT 307 | NR-HL NOUN 308 | NR-NC NOUN 309 | NRS NOUN 310 | NRS-TL NOUN 311 | OD ADJ 312 | OD-HL ADJ 313 | OD-NC ADJ 314 | OD-TL ADJ 315 | PN NOUN 316 | PN$ NOUN 317 | PN+BEZ PRT 318 | PN+HVD PRT 319 | PN+HVZ PRT 320 | PN+MD PRT 321 | PN-HL NOUN 322 | PN-NC NOUN 323 | PN-TL NOUN 324 | PP$ DET 325 | PP$$ PRON 326 | PP$-HL DET 327 | PP$-NC DET 328 | PP$-TL DET 329 | PPL PRON 330 | PPL-HL PRON 331 | PPL-NC PRON 332 | PPL-TL PRON 333 | PPLS PRON 334 | PPO PRON 335 | PPO-HL PRON 336 | PPO-NC PRON 337 | PPO-TL PRON 338 | PPS PRON 339 | PPS+BEZ PRT 340 | PPS+BEZ-HL PRT 341 | PPS+BEZ-NC PRT 342 | PPS+HVD PRT 343 | PPS+HVZ PRT 344 | PPS+MD PRT 345 | PPS-HL PRON 346 | PPS-NC PRON 347 | PPS-TL PRON 348 | PPSS PRON 349 | PPSS+BEM PRT 350 | PPSS+BER PRT 351 | PPSS+BER-N PRT 352 | PPSS+BER-NC PRT 353 | PPSS+BER-TL PRT 354 | PPSS+BEZ PRT 355 | PPSS+BEZ* PRT 356 | PPSS+HV PRT 357 | PPSS+HV-TL PRT 358 | PPSS+HVD PRT 359 | PPSS+MD PRT 360 | PPSS+MD-NC PRT 361 | PPSS+VB PRT 362 | PPSS-HL PRON 363 | PPSS-NC PRON 364 | PPSS-TL PRON 365 | QL ADV 366 | QL-HL ADV 367 | QL-NC ADV 368 | QL-TL ADV 369 | QLP ADV 370 | RB ADV 371 | RB$ PRT 372 | RB+BEZ PRT 373 | RB+BEZ-HL PRT 374 | RB+BEZ-NC PRT 375 | RB+CS ADV 376 | RB-HL ADV 377 | RB-NC ADV 378 | RB-TL ADV 379 | RBR ADV 380 | RBR+CS ADV 381 | RBR-NC ADV 382 | RBT ADV 383 | RN 
ADV 384 | RP PRT 385 | RP+IN PRT 386 | RP-HL PRT 387 | RP-NC PRT 388 | RP-TL PRT 389 | TO PRT 390 | TO+VB PRT 391 | TO-HL PRT 392 | TO-NC PRT 393 | TO-TL PRT 394 | UH PRT 395 | UH-HL PRT 396 | UH-NC PRT 397 | UH-TL PRT 398 | VB VERB 399 | VB+AT VERB 400 | VB+IN VERB 401 | VB+JJ-NC VERB 402 | VB+PPO VERB 403 | VB+RP VERB 404 | VB+TO VERB 405 | VB+VB-NC VERB 406 | VB-HL VERB 407 | VB-NC VERB 408 | VB-TL VERB 409 | VBD VERB 410 | VBD-HL VERB 411 | VBD-NC VERB 412 | VBD-TL VERB 413 | VBG VERB 414 | VBG+TO VERB 415 | VBG-HL VERB 416 | VBG-NC VERB 417 | VBG-TL VERB 418 | VBN VERB 419 | VBN+TO VERB 420 | VBN-HL VERB 421 | VBN-NC VERB 422 | VBN-TL VERB 423 | VBN-TL-HL VERB 424 | VBN-TL-NC VERB 425 | VBZ VERB 426 | VBZ-HL VERB 427 | VBZ-NC VERB 428 | VBZ-TL VERB 429 | WDT DET 430 | WDT+BER PRT 431 | WDT+BER+PP X 432 | WDT+BEZ PRT 433 | WDT+BEZ-HL PRT 434 | WDT+BEZ-NC PRT 435 | WDT+BEZ-TL PRT 436 | WDT+DO+PPS X 437 | WDT+DOD PRT 438 | WDT+HVZ PRT 439 | WDT-HL DET 440 | WDT-NC DET 441 | WP$ DET 442 | WPO PRON 443 | WPO-NC PRON 444 | WPO-TL PRON 445 | WPS PRON 446 | WPS+BEZ PRT 447 | WPS+BEZ-NC PRT 448 | WPS+BEZ-TL PRT 449 | WPS+HVD PRT 450 | WPS+HVZ PRT 451 | WPS+MD PRT 452 | WPS-HL PRON 453 | WPS-NC PRON 454 | WPS-TL PRON 455 | WQL ADV 456 | WQL-TL ADV 457 | WRB ADV 458 | WRB+BER PRT 459 | WRB+BEZ PRT 460 | WRB+BEZ-TL PRT 461 | WRB+DO PRT 462 | WRB+DOD PRT 463 | WRB+DOD* PRT 464 | WRB+DOZ PRT 465 | WRB+IN PRT 466 | WRB+MD PRT 467 | WRB-HL ADV 468 | WRB-NC ADV 469 | WRB-TL ADV 470 | `` . 471 | -------------------------------------------------------------------------------- /athnlp/readers/fever_predictor.py: -------------------------------------------------------------------------------- 1 | from allennlp.common.util import JsonDict 2 | from allennlp.data import Instance 3 | from allennlp.predictors.predictor import Predictor 4 | from overrides import overrides 5 | 6 | 7 | @Predictor.register('fever') 8 | class FeverPredictor(Predictor): 9 | """ 10 | Predictor for the FEVER text classification model: builds an instance from a claim and its evidence sentences. 11 | """ 12 | 13 | @overrides 14 | def _json_to_instance(self, json_dict: JsonDict) -> Instance: 15 | """ 16 | Expects JSON that looks like ``{"claim": "...", "evidence": ["...", "..."]}``.
17 | """ 18 | claim = json_dict["claim"] 19 | evidence = json_dict["evidence"] 20 | return self._dataset_reader.text_to_instance(claim, evidence) 21 | -------------------------------------------------------------------------------- /athnlp/readers/fever_reader.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | from typing import Iterable, Dict, List 4 | 5 | from allennlp.data import DatasetReader, Instance, Tokenizer, TokenIndexer 6 | from allennlp.data.fields import MetadataField, TextField, LabelField 7 | from allennlp.data.token_indexers import SingleIdTokenIndexer 8 | from allennlp.data.tokenizers import WordTokenizer 9 | 10 | 11 | logger = logging.getLogger(__name__) 12 | 13 | 14 | @DatasetReader.register("feverlite") 15 | class FEVERLiteDatasetReader(DatasetReader): 16 | def __init__(self, 17 | wiki_tokenizer: Tokenizer = None, 18 | claim_tokenizer: Tokenizer = None, 19 | token_indexers: Dict[str, TokenIndexer] = None) -> None: 20 | super().__init__() 21 | self._wiki_tokenizer = wiki_tokenizer or WordTokenizer() 22 | self._claim_tokenizer = claim_tokenizer or WordTokenizer() 23 | self._token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()} 24 | 25 | def _read(self, file_path: str) -> Iterable[Instance]: 26 | logger.info("Reading FEVER instances from {}".format(file_path)) 27 | with open(file_path,"r") as file: 28 | for line in file: 29 | json_line = json.loads(line) 30 | yield self.text_to_instance(**json_line) 31 | 32 | def text_to_instance(self, claim:str, evidence:List[str], label:str=None) -> Instance: 33 | # Evidence in the dataset is a list of sentences. We can concatenate these into just one long string 34 | # Extension Exercise: Can you make a new dataset reader and model that handles them individually? 35 | evidence = " ".join(set(evidence)) 36 | 37 | # Tokenize the claim and evidence 38 | claim_tokens = self._claim_tokenizer.tokenize(claim) 39 | evidence_tokens = self._wiki_tokenizer.tokenize(evidence) 40 | 41 | instance_meta = {"claim_tokens": claim_tokens, 42 | "evidence_tokens": evidence_tokens } 43 | 44 | instance_dict = {"claim": TextField(claim_tokens, self._token_indexers), 45 | "evidence": TextField(evidence_tokens, self._token_indexers), 46 | "metadata": MetadataField(instance_meta) 47 | } 48 | 49 | if label is not None: 50 | instance_dict["label"] = LabelField(label) 51 | 52 | return Instance(instance_dict) 53 | 54 | -------------------------------------------------------------------------------- /athnlp/readers/label_dictionary.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import warnings 3 | 4 | 5 | class LabelDictionary(dict): 6 | """This class implements a dictionary of labels. 
Labels are mapped to 7 | integers, as it is more efficient to retrieve the label name from its 8 | integer representation, and vice-versa.""" 9 | 10 | def __init__(self, label_names=[]): 11 | self.names = [] 12 | for name in label_names: 13 | self.add(name) 14 | 15 | def add(self, name): 16 | if name in self: 17 | # warnings.warn('Ignoring duplicated label ' + name) 18 | label_id = self[name] 19 | else: 20 | label_id = len(self.names) 21 | self[name] = label_id 22 | self.names.append(name) 23 | return label_id 24 | 25 | def get_label_name(self, label_id): 26 | return self.names[label_id] 27 | 28 | def get_label_id(self, name): 29 | return self[name] 30 | -------------------------------------------------------------------------------- /athnlp/readers/lm_corpus.py: -------------------------------------------------------------------------------- 1 | import os 2 | from io import open 3 | 4 | import torch 5 | 6 | 7 | class Dictionary(object): 8 | def __init__(self): 9 | self.word2idx = {} 10 | self.idx2word = [] 11 | 12 | def add_word(self, word): 13 | if word not in self.word2idx: 14 | self.idx2word.append(word) 15 | self.word2idx[word] = len(self.idx2word) - 1 16 | return self.word2idx[word] 17 | 18 | def __len__(self): 19 | return len(self.idx2word) 20 | 21 | 22 | class Corpus(object): 23 | def __init__(self, path): 24 | self.dictionary = Dictionary() 25 | self.train = self.tokenize(os.path.join(path, 'train.txt')) 26 | self.valid = self.tokenize(os.path.join(path, 'valid.txt')) 27 | self.test = self.tokenize(os.path.join(path, 'test.txt')) 28 | 29 | def tokenize(self, path): 30 | """Tokenizes a text file.""" 31 | assert os.path.exists(path) 32 | # Add words to the dictionary 33 | with open(path, 'r', encoding="utf8") as f: 34 | for line in f: 35 | words = line.split() + ['<eos>'] 36 | for word in words: 37 | self.dictionary.add_word(word.lower()) 38 | 39 | # Tokenize file content 40 | with open(path, 'r', encoding="utf8") as f: 41 | idss = [] 42 | for line in f: 43 | words = line.split() + ['<eos>'] 44 | ids = [] 45 | for word in words: 46 | word = word.lower() 47 | ids.append(self.dictionary.word2idx[word]) 48 | idss.append(torch.tensor(ids).type(torch.int64)) 49 | ids = torch.cat(idss) 50 | 51 | return ids 52 | -------------------------------------------------------------------------------- /athnlp/readers/multi30k_reader.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | import logging 3 | 4 | from overrides import overrides 5 | 6 | from allennlp.common.checks import ConfigurationError 7 | from allennlp.common.file_utils import cached_path 8 | from allennlp.common.util import START_SYMBOL, END_SYMBOL 9 | from allennlp.data.dataset_readers.dataset_reader import DatasetReader 10 | from allennlp.data.fields import TextField 11 | from allennlp.data.instance import Instance 12 | from allennlp.data.tokenizers import Token, Tokenizer, WordTokenizer 13 | from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer 14 | 15 | 16 | logger = logging.getLogger(__name__) # pylint: disable=invalid-name 17 | 18 | 19 | @DatasetReader.register("multi30k") 20 | class Multi30kReader(DatasetReader): 21 | 22 | def __init__(self, 23 | source_tokenizer: Tokenizer = None, 24 | target_tokenizer: Tokenizer = None, 25 | source_token_indexers: Dict[str, TokenIndexer] = {"tokens": SingleIdTokenIndexer()}, 26 | target_token_indexers: Dict[str, TokenIndexer] = None, 27 | source_add_start_token: bool = True, 28 | language_pairs: Dict = {"source":
"fr", "target": "en"}, 29 | lazy: bool = False) -> None: 30 | super().__init__(lazy) 31 | self._source_tokenizer = source_tokenizer or WordTokenizer() 32 | self._target_tokenizer = target_tokenizer or self._source_tokenizer 33 | self._source_token_indexers = source_token_indexers 34 | self._target_token_indexers = target_token_indexers or self._source_token_indexers 35 | self._source_add_start_token = source_add_start_token 36 | self._language_pairs = language_pairs 37 | 38 | @overrides 39 | def _read(self, file_path): 40 | 41 | with open(cached_path(("%s.%s" % (file_path, self._language_pairs["source"]))), "r", encoding="utf8") as source_file, \ 42 | open(cached_path(("%s.%s" % (file_path, self._language_pairs["target"]))), "r", encoding="utf8") as target_file: 43 | logger.info("Reading instances from lines in source/target files at: %s", file_path) 44 | for source_sequence, target_sequence in zip(source_file, target_file): 45 | yield self.text_to_instance(source_sequence, target_sequence) 46 | 47 | @overrides 48 | def text_to_instance(self, source_string: str, target_string: str = None) -> Instance: # type: ignore 49 | # pylint: disable=arguments-differ 50 | tokenized_source = self._source_tokenizer.tokenize(source_string) 51 | if self._source_add_start_token: 52 | tokenized_source.insert(0, Token(START_SYMBOL)) 53 | tokenized_source.append(Token(END_SYMBOL)) 54 | source_field = TextField(tokenized_source, self._source_token_indexers) 55 | if target_string is not None: 56 | tokenized_target = self._target_tokenizer.tokenize(target_string) 57 | tokenized_target.insert(0, Token(START_SYMBOL)) 58 | tokenized_target.append(Token(END_SYMBOL)) 59 | target_field = TextField(tokenized_target, self._target_token_indexers) 60 | return Instance({"source_tokens": source_field, "target_tokens": target_field}) 61 | else: 62 | return Instance({'source_tokens': source_field}) -------------------------------------------------------------------------------- /athnlp/readers/sequence.py: -------------------------------------------------------------------------------- 1 | from athnlp.readers.sequence_dictionary import SequenceDictionary 2 | 3 | 4 | class Sequence(object): 5 | 6 | def __init__(self, dictionary: SequenceDictionary, x, y, nr): 7 | self.x = x 8 | self.y = y 9 | self.nr = nr 10 | self.dictionary = dictionary 11 | 12 | def size(self): 13 | """Returns the size of the sequence.""" 14 | return len(self.x) 15 | 16 | def __len__(self): 17 | return len(self.x) 18 | 19 | def copy_sequence(self): 20 | """Performs a deep copy of the sequence""" 21 | s = Sequence(self.dictionary, self.x[:], self.y[:], self.nr) 22 | return s 23 | 24 | def update_from_sequence(self, new_y): 25 | """Returns a new sequence equal to the previous but with y set to newy""" 26 | s = Sequence(self.dictionary, self.x, new_y, self.nr) 27 | return s 28 | 29 | def __str__(self): 30 | rep = "" 31 | for i, xi in enumerate(self.x): 32 | yi = self.y[i] 33 | rep += "%s/%s " % (self.dictionary.x_dict.get_label_name(xi), 34 | self.dictionary.y_dict.get_label_name(yi)) 35 | return rep 36 | 37 | def __repr__(self): 38 | rep = "" 39 | for i, xi in enumerate(self.x): 40 | yi = self.y[i] 41 | rep += "%s/%s " % (self.dictionary.x_dict.get_label_name(xi), 42 | self.dictionary.y_dict.get_label_name(yi)) 43 | return rep 44 | -------------------------------------------------------------------------------- /athnlp/readers/sequence_dictionary.py: -------------------------------------------------------------------------------- 1 | from 
athnlp.readers.label_dictionary import LabelDictionary 2 | 3 | 4 | class SequenceDictionary: 5 | 6 | def __init__(self): 7 | self.x_dict = LabelDictionary() 8 | self.y_dict = LabelDictionary() 9 | -------------------------------------------------------------------------------- /athnlp/readers/token_indexers/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/athnlp/readers/token_indexers/__init__.py -------------------------------------------------------------------------------- /athnlp/readers/token_indexers/bert_squad_indexer.py: -------------------------------------------------------------------------------- 1 | # pylint: disable=no-self-use 2 | from typing import List, Dict 3 | from overrides import overrides 4 | 5 | from allennlp.data.tokenizers.token import Token 6 | from allennlp.data.vocabulary import Vocabulary 7 | from allennlp.data.token_indexers.token_indexer import TokenIndexer 8 | from allennlp.data.token_indexers.wordpiece_indexer import PretrainedBertIndexer, WordpieceIndexer 9 | 10 | 11 | @TokenIndexer.register("bert-squad-indexer") 12 | class BertSquadIndexer(PretrainedBertIndexer): 13 | """ 14 | TokenIndexer closely based on AllenNLP's WordpieceIndexer; the only major difference is that we assume that 15 | basic and then wordpiece tokenization have already taken place when reading the SQuAD dataset 16 | (this follows the original methodology of huggingface). The reason we do that is so that start_position and 17 | end_position are correctly offset due to the extra wordpiece tokens introduced. 18 | NOTE: We are unnecessarily checking for len(tokens) > max_pieces, as we have already split the paragraphs 19 | when reading the dataset. The corresponding code below should never be triggered. 20 | """ 21 | def __init__(self, 22 | pretrained_model: str) -> None: 23 | super().__init__(pretrained_model) 24 | 25 | @overrides 26 | def tokens_to_indices(self, 27 | tokens: List[Token], 28 | vocabulary: Vocabulary, 29 | index_name: str) -> Dict[str, List[int]]: 30 | if not self._added_to_vocabulary: 31 | self._add_encoding_to_vocabulary(vocabulary) 32 | self._added_to_vocabulary = True 33 | 34 | # This lowercases tokens if necessary 35 | text = (token.text.lower() 36 | if self._do_lowercase and token.text not in self._never_lowercase 37 | else token.text 38 | for token in tokens) 39 | 40 | # Create nested sequence of wordpieces 41 | nested_wordpiece_tokens = _get_nested_wordpiece_tokens([token for token in text]) 42 | 43 | # Obtain a nested sequence of wordpieces, each represented by a list of wordpiece ids 44 | token_wordpiece_ids = [[self.vocab[wordpiece] for wordpiece in token] 45 | for token in nested_wordpiece_tokens] 46 | 47 | # Flattened list of wordpieces. In the end, the output of the model (e.g., BERT) should 48 | # have a sequence length equal to the length of this list. However, it will first be split into 49 | # chunks of length `self.max_pieces` so that they can be fit through the model. After packing 50 | # and passing through the model, it should be unpacked to represent the wordpieces in this list. 51 | flat_wordpiece_ids = [wordpiece for token in token_wordpiece_ids for wordpiece in token] 52 | 53 | # Similarly, we want to compute the token_type_ids from the flattened wordpiece ids before 54 | # we do the windowing; otherwise [SEP] tokens would get counted multiple times.
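# (Editor's example for _get_token_type_ids, defined at the bottom of this
# file: for wordpiece ids of "[CLS] q1 q2 [SEP] p1 p2 [SEP]" it returns
# [0, 0, 0, 0, 1, 1, 1]; each segment, including its trailing [SEP],
# shares one type id.)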
55 | flat_token_type_ids = _get_token_type_ids(flat_wordpiece_ids, self._separator_ids) 56 | 57 | # The code below will (possibly) pack the wordpiece sequence into multiple sub-sequences by using a sliding 58 | # window `window_length` that overlaps with previous windows according to the `stride`. Suppose we have 59 | # the following sentence: "I went to the store to buy some milk". Then a sliding window of length 4 and 60 | # stride of length 2 will split them up into: 61 | 62 | # "[I went to the] [to the store to] [store to buy some] [buy some milk [PAD]]". 63 | 64 | # This is to ensure that the model has context of as much of the sentence as possible to get accurate 65 | # embeddings. Finally, the sequences will be padded with any start/end piece ids, e.g., 66 | 67 | # "[CLS] I went to the [SEP] [CLS] to the store to [SEP] ...". 68 | 69 | # The embedder should then be able to split this token sequence by the window length, 70 | # pass them through the model, and recombine them. 71 | 72 | # The window length is `self.max_pieces` minus any additional start/end wordpieces; the stride is half of that 73 | window_length = self.max_pieces - len(self._start_piece_ids) - len(self._end_piece_ids) 74 | stride = window_length // 2 75 | 76 | # offsets[i] will give us the index into wordpiece_ids 77 | # for the wordpiece "corresponding to" the i-th input token. 78 | offsets = [] 79 | 80 | # If we're using initial offsets, we want to start at offset = len(self._start_piece_ids) 81 | # so that the first offset is the index of the first wordpiece of tokens[0]. 82 | # Otherwise, we want to start at len(self._start_piece_ids) - 1, so that the "previous" 83 | # offset is the last wordpiece of "tokens[-1]". 84 | offset = len(self._start_piece_ids) if self.use_starting_offsets else len(self._start_piece_ids) - 1 85 | 86 | # Count the number of wordpieces accumulated 87 | pieces_accumulated = 0 88 | for token in token_wordpiece_ids: 89 | # Truncate the sequence if specified, which depends on where the offsets are 90 | next_offset = 1 if self.use_starting_offsets else 0 91 | if self._truncate_long_sequences and offset + len(token) - 1 >= window_length + next_offset: 92 | break 93 | 94 | # For initial offsets, the current value of ``offset`` is the start of 95 | # the current wordpiece, so add it to ``offsets`` and then increment it. 96 | if self.use_starting_offsets: 97 | offsets.append(offset) 98 | offset += len(token) 99 | # For final offsets, the current value of ``offset`` is the end of 100 | # the previous wordpiece, so increment it and then add it to ``offsets``.
101 | else: 102 | offset += len(token) 103 | offsets.append(offset) 104 | 105 | pieces_accumulated += len(token) 106 | 107 | if len(flat_wordpiece_ids) <= window_length: 108 | # If all the wordpieces fit, then we don't need to do anything special 109 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids)] 110 | token_type_ids = self._extend(flat_token_type_ids) 111 | elif self._truncate_long_sequences: 112 | self._warn_about_truncation(tokens) 113 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids[:pieces_accumulated])] 114 | token_type_ids = self._extend(flat_token_type_ids[:pieces_accumulated]) 115 | else: 116 | # Create a sliding window of wordpieces of length `max_pieces` that advances by `stride` steps and 117 | # add start/end wordpieces to each window 118 | # TODO: this currently does not respect word boundaries, so words may be cut in half between windows 119 | # However, this would increase complexity, as sequences would need to be padded/unpadded in the middle 120 | wordpiece_windows = [self._add_start_and_end(flat_wordpiece_ids[i:i + window_length]) 121 | for i in range(0, len(flat_wordpiece_ids), stride)] 122 | 123 | token_type_windows = [self._extend(flat_token_type_ids[i:i + window_length]) 124 | for i in range(0, len(flat_token_type_ids), stride)] 125 | 126 | # Check for overlap in the last window. Throw it away if it is redundant. 127 | last_window = wordpiece_windows[-1][1:] 128 | penultimate_window = wordpiece_windows[-2] 129 | if last_window == penultimate_window[-len(last_window):]: 130 | wordpiece_windows = wordpiece_windows[:-1] 131 | token_type_windows = token_type_windows[:-1] 132 | 133 | token_type_ids = [token_type for window in token_type_windows for token_type in window] 134 | 135 | # Flatten the wordpiece windows 136 | wordpiece_ids = [wordpiece for sequence in wordpiece_windows for wordpiece in sequence] 137 | 138 | 139 | # Our mask should correspond to the original tokens, 140 | # because calling util.get_text_field_mask on the 141 | # "wordpiece_id" tokens will produce the wrong shape. 142 | # However, because of the max_pieces constraint, we may 143 | # have truncated the wordpieces; accordingly, we want the mask 144 | # to correspond to the remaining tokens after truncation, which 145 | # is captured by the offsets. 
146 | mask = [1 for _ in offsets] 147 | 148 | return {index_name: wordpiece_ids, 149 | f"{index_name}-offsets": offsets, 150 | f"{index_name}-type-ids": token_type_ids, 151 | "mask": mask} 152 | 153 | 154 | def _get_token_type_ids(wordpiece_ids: List[int], 155 | separator_ids: List[int]) -> List[int]: 156 | num_wordpieces = len(wordpiece_ids) 157 | token_type_ids: List[int] = [] 158 | type_id = 0 159 | cursor = 0 160 | while cursor < num_wordpieces: 161 | # check length 162 | if num_wordpieces - cursor < len(separator_ids): 163 | token_type_ids.extend(type_id 164 | for _ in range(num_wordpieces - cursor)) 165 | cursor += num_wordpieces - cursor 166 | # check content 167 | # when it is a separator 168 | elif all(wordpiece_ids[cursor + index] == separator_id 169 | for index, separator_id in enumerate(separator_ids)): 170 | token_type_ids.extend(type_id for _ in separator_ids) 171 | type_id += 1 172 | cursor += len(separator_ids) 173 | # when it is not 174 | else: 175 | cursor += 1 176 | token_type_ids.append(type_id) 177 | return token_type_ids 178 | 179 | 180 | def _get_nested_wordpiece_tokens(flat_wordpiece_tokens: List[str]): 181 | nested_wordpiece_tokens = [] 182 | nested = [] 183 | for wordpiece in flat_wordpiece_tokens: 184 | if wordpiece.startswith("##"): 185 | nested.append(wordpiece) 186 | else: 187 | nested = [wordpiece] 188 | nested_wordpiece_tokens.append(nested) 189 | return nested_wordpiece_tokens 190 | -------------------------------------------------------------------------------- /data/lm/test.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/lm/train.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/lm/valid.txt: -------------------------------------------------------------------------------- 1 | The thief stole . 2 | The thief stole the suitcase . 3 | The crook stole the suitcase . 4 | The cop took a bribe . 5 | The thief was arrested by the detective . 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-250.en: -------------------------------------------------------------------------------- 1 | a man in his living room ponders packing for a trip . 2 | an african-american woman sits at a brown table , wearing a purple dress , pink shoes , and black sunglasses . 3 | a man slouched in a chair on a city sidewalk girl watching . 4 | two men , one man selling fruit the other inspecting the fruit and conversing with the seller . 5 | a man and a woman hug on a street . 6 | many people in a stadium dressed in white are conversing with each other . 7 | many people have gathered to look at something that is not in the photo . 8 | a couple sits on a bench talking , while a woman walks a dog in the background . 9 | two women in spotted dresses walking down a sidewalk . 10 | a group of girls are playing in a water fountain in the sun . 11 | people landscaping and gardening the areas around the walkway .
12 | a balding man in red sunglasses wearing a green shirt standing in front of a building . 13 | a barefoot boy with a blue and white striped towel is standing on the beach . 14 | a couple and two girls are looking over a clear railing . 15 | people in a produce store picking produce to buy 16 | a tattoo artist applying tattoo ink to the skin . 17 | a woman standing next to two people is pointing to the sky . 18 | a couple walks down an isles at a store selling art and history books . 19 | a man with a name tag on is sitting in a chair . 20 | a man off in the distance by a buddhist temple . 21 | a man is balancing a metal ball on his arm . 22 | a negro male in a white t-shirt and a black hat sitting on the curb texting . 23 | a boy in a black shirt carries a blue bucket while walking with men dressed in white . 24 | a young dark-haired woman with red sun visor holding an open white umbrella amidst a crowd of people 25 | a man holding a small child who is wearing a backpack . 26 | a young man sets up pool balls , on a purple felt billiard . 27 | girls sitting with their hands on their laps 28 | a girl in a black tank with cargo shorts to what appears to be dancing with several people around . 29 | a boy and girl in black jumpsuits stand facing a girl in a pink jacket , with adults in the background . 30 | a man and woman pushing strollers are walking by some people who are selling items in tents . 31 | a woman in striped tights is being guided with strings . 32 | a construction worker in an orange vest lays down cobblestones . 33 | an asian woman sitting outside an outdoor market stall . 34 | a man lounges in his room without eating for days . 35 | a trailer drives down a red brick road . 36 | a young woman with brown hair and tank top is taking a picture with a camera . 37 | a person laying on a bench in front of a water feature . 38 | a smiling young man walking on next to the beach wearing a baseball cap , blue t-shirt and jeans . 39 | a group of people waving to a person on a balcony . 40 | a uniformed man in the army is training a german shepherd using an arm guard . 41 | a kid on skating ramp practicing cool moves . 42 | some men appear to be discussing something on a boat or ship . 43 | a man wearing an aviation cap and goggles sits in the road . 44 | people are crossing a tree lined street in front of a building . 45 | a man wearing a gray shirt resting his head on a table . 46 | a man in white shirt and dark shorts is working outside . 47 | a woman in black pants is looking at her cellphone . 48 | a young woman in a pink shirt attempting to rope a calf at the rodeo . 49 | a family is standing outdoors on a cloudy day . 50 | a very young child is sitting in the sink with paint on his body and face and is playing with the kitchen faucet . 51 | kids being spun around in a glass spinner . 52 | a man and a woman are holding up signs at a protest . 53 | a woman working on her deck on the weekend . 54 | an elderly man , wearing navy blue , is sitting on a bench along the street . 55 | a woman is taking a picture with her camera . 56 | an older male in blue jeans and brown coat is resting against an orange building . 57 | a man , hand on head , regards a bank of america advertisement . 58 | two women sitting on a bench at night in front of a store 59 | a man in a white shirt is sitting on a crate . 60 | an older man with tattoos and biker regalia lingers a moment on a city street . 61 | a man sits alone fishing along the shoreline . 
62 | an older man is sitting outside on a bench in front a large banner that says , " memoria justicia sin olvido . " 63 | a man on a bike in a gray jacket carries foliage . 64 | a boy with glasses wearing a bright yellow shirt is standing in a parking lot . 65 | two men walking down a dirt path . 66 | a mother in a blue beret and blue shoes with her two sons . 67 | two dogs playing with a blue and green ball . 68 | a man wearing a bright , multi-color helmet is sitting on a motorcycle . 69 | a man crouches in front of a yellow wall . 70 | looks like a farmers market , a few tables with various items displayed . 71 | a blond boy in a blue shirt is sitting with a woman wearing glasses . 72 | a man in a sleek white shirt gazes into the woman 's eyes while holding on to the back of her black and pink dress . 73 | people play in a fountain at twilight . 74 | four boys posing while one boy sets his drink down . 75 | on a busy street a lady carries goods on her head . 76 | a street performer in an orange jumpsuit rides a tall unicycle as a crowd watches 77 | a person sitting on chair in front of a crowd . 78 | a male is waiting for the train to arrive at the platform . 79 | three white men in t-shirt jump into the air . 80 | a singer doing a stage dive into the crowd . 81 | a scuba diving class taking a picture during class time . 82 | a woman in a striped shirt folds her arms while standing in a grocery store . 83 | a smashed car with many firefighters cutting into the car . 84 | several people wait to checkout inside a store with a warehouse looking ceiling . 85 | 2 girls playing volleyball , one striking the ball . 86 | an older man is pouring something out of a bag into the water . 87 | six people are in a gymnasium working on repairing some bicycles . 88 | two young asian boys spar with each other . 89 | a woman prepares ingredients for a bowl of soup . 90 | two men wearing shorts are working on a blue bike . 91 | a man and woman taking a nap on a makeshift rip . 92 | there is a man wrapped in a blanket of some sort sliding down a hill that is covered in snow . 93 | a man and woman enjoying dinner at a party . 94 | a national guard soldier leading a group of other national guard soldiers singing the national anthem . 95 | two youths walk down an inclined street . 96 | a woman in a restaurant is drinking out of a coconut , using a straw . 97 | a woman wearing a blue uniform stands and looks down . 98 | one man in shorts is talking to another man in blue jeans in front of a sink . 99 | a man feeding a baby in a highchair . 100 | a young helmeted biker in blue takes to the air while going over small hills . 101 | a little baby in a pink hat lying naked and sleeping . 102 | two children on their stomachs lay on the ground under a pipe . 103 | two people are talking near a red phone booth while construction workers rest nearby . 104 | a native woman is working on a craft project . 105 | children chasing the ball in a soccer game . 106 | two children jumping on a screened in blue and black trampoline while outside surrounded by trees . 107 | two dogs run in a field looking at an unseen frisbee . 108 | a drummer and guitar player play a show in a dark area . 109 | a trio of people are hiking throughout a heavily snowed path . 110 | a group of people are near a small river in the middle of a city . 111 | a woman peeks into a telescope in the woods . 112 | a little blond girl with a polka dot shirt is giving a stuffed animal a " bath " in a sink . 
113 | a shaggy young male with a nose ring brushes his teeth . 114 | a snow skier flies through the air , while other skiers going up the tow rope look on . 115 | three girls are horseback riding with the focus on the youngest girl . 116 | heavyset woman blowing her hair with a hair dryer smiling all happy 117 | a chef working in a kitchen using a knife . 118 | a man moving flowers while a woman makes a gesture at him . 119 | a woman singing into a microphone while a man plays drums in the background . 120 | the man wearing the cap is handing a freshly caught fish to the boy in the purple hat . 121 | a worker clinging to a tree . 122 | several people standing around a bowl , in which one man is manipulating a brown object . 123 | a rollerblading monk with some nice sunglasses prays before doing some sick tricks . 124 | a woman wearing sunglasses and a blue shirt , selling sea shells , looks at an older man wearing a black shirt and a cap . 125 | an old man sweeps the floor as a lady walks away from the camera . 126 | a man with dreadlocks is playing with the hair of a woman who is sitting on a chair on a cobblestone street . 127 | a man is working at a construction site . 128 | a man is in the middle of hitting a red , white and blue volleyball . 129 | a man stands in a mobile food stand , looking out the half-door . 130 | a baseball player with a red helmet and white pants is tagged by the catcher while running to home base . 131 | a man in an orange robe sweeping outside . 132 | man wearing an purple shirt working in a biology lab . 133 | a man is leading two small ponies on a walk at a park . 134 | 2 females , 1 from germany and 1 from china , compete in a wrestling match on a mat . 135 | a man and woman sleeping on a bench . 136 | a man and a woman are sitting on the floor in front of luggage . 137 | a rickshaw operator waiting for his next costumer . 138 | two girls are seated at a table and working on craft projects . 139 | a man runs through the snow with the aid of snowshoes . 140 | a dancer in a red suit is jumping in the air . 141 | a man is parked while inside of a sanitation truck . 142 | a bearded man in a heavy jacket sits in a corner with a paper cup . 143 | male on a skateboard , using an empty pool as a ramp on a very pretty day . 144 | a man with a large hat in the bushes . 145 | a closeup of a child 's face eating a blue , heart shaped lollipop . 146 | a man and woman are working on replacing a bike tire tube . 147 | a young boy shows his brown and green bead necklace . 148 | a man in a gray t-shirt works the bellows to start a fire on a brick oven inside a wooden shed . 149 | a woman standing in front of trees and smiling . 150 | someone in asian costume is sitting down and holding a sword . 151 | a young football player is setting up for a field goal . 152 | a guy in a bright green hoodie is crossing a crosswalk while looking at an accident between some cars and a bike . 153 | a young man gets ready to kick a soccer ball . 154 | a competitive runner taking her first sprint in a competition . 155 | two men , one in blue and one in red , compete in a boxing match . 156 | a group of friends lay sprawled out on the floor enjoying their time together . 157 | a standing man holds a microphone in front of a man holding a guitar . 158 | roller derby girl skating with others . 159 | a woman getting a bag of ice at a store . 160 | two motorcyclists racing neck and neck around a corner . 161 | one man standing alone on a sidewalk adjusting his hat . 
162 | a man with a disability who doesn 't have legs is walking with another man who is entered into a marathon . 163 | multiple bodies collide in a soccer match . 164 | two people , one dressed as a nun and the other in a roger smith t-shirt , running in a foot race past onlookers in a wooded area . 165 | three men on horses during a race . 166 | a dark-skinned man in white shirts and a black sleeveless shirt flips his skateboard on a cement surface surrounded by tall buildings and palm trees . 167 | one man , wearing a hooded sweatshirt , sitting at a fountain watching the people of the city . 168 | swimmers stand on various levels of a large diving board complex in a room with figures from mythology painted on the wall . 169 | a young indian boy sitting down thinking about his future . 170 | a boy in a hoodie is throwing an object into a dirty swimming pool . 171 | colorful costumed men in a performance . 172 | two boxers are ready for their fight as the crowd watches with anticipation . 173 | children are playing a sport on a field . 174 | a female performer sings and plays the guitar in front of a microphone . 175 | boy 's are competing in martial arts . 176 | a group of black people performing in orange shirts in front of a fenced off park . 177 | schoolgirls in uniforms march in a parade while playing flute-like instruments . 178 | a man stuffs a fowl from ingredients in a blue bowl . 179 | the boy leaps of his bed with a karate kick . 180 | a baby in a bouncy seat and a standing boy surrounded by toys . 181 | these are people gathered around the table playing jenga . 182 | i see a bearded man and elderly lady sharing a bowl of food . 183 | a young man is skateboarding on a cement block wall . 184 | people boating on a lake with the sun through the clouds in the distance . 185 | two men wearing martial arts clothing are practicing martial arts . 186 | a boy band and no one even matches someone should have sent a memo . 187 | a group of go-cart riders are racing around a go-cart track . 188 | a person dressed in winter clothes poses with a snowman surrounded by snow covered landscape . 189 | a man is giving a presentation in front of a crowd . 190 | a middle-aged man is taping up the knee of a younger football player who is sitting on a trainers table . 191 | a tattooed man wearing overalls on a stage holding a microphone . 192 | a little girl plays with an miniature electric circuit consisting of three light bulbs and a battery . 193 | man wearing blue helmet merges into traffic on a bicycle . 194 | a bunch of young adults stare in concentration at their computer monitors as they competitively game . 195 | a young boy in green practices juggling in a parking lot . 196 | a little girl in a dotted dress looks back towards a woman in a black dress . 197 | a man adjusts the engine of a boat near the water . 198 | there is a tennis match being played at night in this stadium . 199 | a child snowboarder coming to a stop 200 | a soccer game is played as two men attempt to reach the ball before their respected opponent . 201 | young women and children in a village , with a single woman focused on the camera . 202 | toddler in a green shirt is brushing his teeth with a yellow toothbrush , while being supervised by mom . 203 | a man in a wheelchair and wearing a red jogging suit is carrying a torch . 204 | a son and his parents are taking a group picture in a church . 205 | three men competing in a hurdle race . 
206 | two men are observing another as he puts the finishing touches on wet cement . 207 | a soldier is looking at binoculars into the mountainous landscape . 208 | a group of young boys race on a snowy day . 209 | a man jumps rope while a crowd of people watch him . 210 | a young boy and girl are laughing together as the girl holds up a hand sign . 211 | two men on opposing teams race toward a soccer ball . 212 | a biker wearing a yellow shirt pulls of an incredible trick in the air . 213 | two female kickboxers , one with a purple sports bra , battle it out in an arena . 214 | two men on fast motorcycle speeding around a corner on a racetrack . 215 | two men guard the man with the basketball during a game at dusk . 216 | a man wearing riding boots and a helmet is riding a white horse , and the horse is jumping a hurdle . 217 | a group of men in costume play music . 218 | man and women look through milk crates full of records or pictures on sale . 219 | a racing catamaran is lifted onto one hull in the water . 220 | a group of marines walking down the road with american flags and other military flags . 221 | a man holding a drinking glass at the camera . 222 | one man on stage , playing a guitar with lights in the background . 223 | four young kids playing with empty canisters . 224 | a group of workers are listening to instruction from a colleague . 225 | a shriner rides a large green tractor down the road during a parade . 226 | three dogs are playing in the water . 227 | a man plays a drum and a little boy hits his own , little drum . 228 | a group of girls playing a game on horseback . 229 | a man watching as a woman fires a gun with a smile at a firing range . 230 | the bike leader pedals for his life as competing countries gain his tail . 231 | a team of soccer players is huddled and having a serious discussion . 232 | a group of people in purple shirts and tan pants all walking in the same direction . 233 | a man in a bright shirt is playing trumpet . 234 | the guy with the jean shorts is at the skate park doing tricks on his bike . 235 | mother and daughter wearing alice in wonderland customs are posing for a picture . 236 | a little boy using a drill to make a hole in a piece of wood . 237 | two bicyclists are racing each other on a dirt track . 238 | a frisbee is being thrown to the girl while the other girl appears to be asking for it . 239 | an orthodontist working on a patient , while a man holds the light . 240 | two people eat hamburgers on lawn chairs while a third drinks a can of soda . 241 | two bicyclists ride down the street past people while talking . 242 | at a bowling alley , a man in a black shirt is holding a bowling ball and looking down the lane . 243 | two men in black clothes with blue and red bowties are performing in front of a crowd . 244 | two men , one black and one white , play their guitars and sing into microphones as they stand outdoors . 245 | a kickboxer lands a flying knee into the face of his opponent . 246 | three people are running a race around a red track . 247 | football players struggle to get plays through a tough line . 248 | several men are praying while standing at the end of a table of food . 249 | two indian children in formal costume happily performing a ritual dance . 250 | three children standing near each other and next to a tall blue wooden post . 
251 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-250.fr: -------------------------------------------------------------------------------- 1 | un homme dans son salon réfléchit aux affaires qu' il va emmener en voyage . 2 | une femme afro-américaine est assise à une table marron , vêtue d' une robe violette , de chaussures roses et de lunettes de soleil noires . 3 | un homme affalé dans une chaise sur un trottoir en ville , regardant une fille . 4 | deux hommes , l' un vendant des fruits et l' autre les inspectant et parlant avec le vendeur . 5 | un homme et une femme s' étreignent dans une rue . 6 | de nombreuses personnes vêtues de blanc dans un stade sont en train de discuter entre elles . 7 | beaucoup de personnes se sont rassemblées pour regarder quelque chose qui n' est pas sur la photo . 8 | un couple est assis sur un banc en train de parler , tandis qu' une femme promène un chien en arrière-plan . 9 | deux femmes en robes à pois marchant sur un trottoir . 10 | un groupe de filles jouent dans une fontaine au soleil . 11 | des paysagistes et des jardiniers travaillant aux alentours de l' allée . 12 | un homme dégarni avec des lunettes de soleil rouges et un t-shirt vert debout devant un bâtiment . 13 | un garçon pieds nus avec une serviette rayée bleue et blanche est debout sur la plage . 14 | un couple et deux filles regardent par-dessus une balustrade transparente . 15 | des gens dans un magasin choisissant des produits à acheter 16 | un tatoueur applique un tatouage à l' encre sur la peau . 17 | une femme debout à côté de deux personnes pointe le doigt vers le ciel . 18 | un couple marche dans un rayon d' un magasin vendant des livres d' art et d' histoire . 19 | un homme avec un badge est assis dans un fauteuil . 20 | un homme au loin près d' un temple bouddhiste . 21 | un homme tient une boule métallique en équilibre sur son bras . 22 | un homme noir en t-shirt blanc et casquette noire assis sur le trottoir , envoyant un texto . 23 | un garçon en chemise noire portant un seau bleu tandis qu' il marche avec des hommes vêtus de blanc . 24 | une jeune femme aux cheveux foncés avec une visière rouge , tenant un parapluie blanc ouvert au milieu d' une foule de personnes 25 | un homme tenant un petit enfant qui porte un sac à dos . 26 | un jeune homme installe des boules de billard , sur un tapis de billard violet . 27 | des filles assises avec les mains sur les genoux 28 | une fille en débardeur noir et short cargo semble être en train de danser avec plusieurs personnes autour . 29 | un garçon et une fille en survêtements noirs sont debout face à une fille en veste rose , avec des adultes en arrière-plan . 30 | un homme et une femme avec des poussettes marchent près de gens qui vendent des articles dans des tentes . 31 | une femme en collants rayés est dirigée par des ficelles . 32 | un ouvrier du bâtiment en gilet orange pose des pavés . 33 | une femme asiatique assise devant un étal de marché extérieur . 34 | un homme paresse dans sa chambre , sans avoir mangé depuis plusieurs jours . 35 | une remorque roule sur une route pavée rouge . 36 | une jeune femme avec des cheveux bruns et un débardeur prend une photo avec un appareil photo . 37 | une personne allongée sur un banc devant de l' eau . 38 | un jeune homme souriant , marchant au bord de la plage avec une casquette , un t-shirt bleu et un jean . 39 | un groupe de personnes saluant quelqu' un sur un balcon . 
40 | un homme en uniforme militaire entraîne un berger allemand en utilisant un protège-bras . 41 | un enfant sur une rampe de skateboard , répétant des mouvements cools . 42 | des hommes semblent discuter de quelque chose sur un bateau ou un navire . 43 | un homme portant un bonnet et des lunettes d' aviateur est assis sur la route . 44 | des gens traversent une rue bordée d' arbres devant un bâtiment . 45 | un homme vêtu d' une chemise grise posant sa tête sur une table . 46 | un homme en t-shirt blanc et short noir travaille dehors . 47 | une femme en pantalon noir regarde son téléphone portable . 48 | une jeune femme en débardeur rose tentant d' attraper un veau au lasso lors d' un rodéo . 49 | une famille est debout dehors lors d' une journée nuageuse . 50 | un très jeune enfant est assis dans l' évier avec de la peinture sur son corps et son visage , et il joue avec le robinet de la cuisine . 51 | des enfants tournant dans un tourniquet en verre . 52 | un homme et une femme brandissent des pancartes lors d' une manifestation . 53 | une femme travaillant sur sa terrasse le week-end . 54 | un homme âgé , habillé en bleu marine , est assis sur un banc le long de la rue . 55 | une femme prend une photo avec son appareil . 56 | un homme âgé en jean et manteau marron se repose contre un bâtiment orange . 57 | un homme , une main sur la tête , regarde une publicité pour la bank of america . 58 | deux femmes assises sur un banc la nuit devant un magasin 59 | un homme en t-shirt blanc est assis sur une caisse . 60 | un vieil homme avec des tatouages et des insignes de motard s' attarde un moment dans une rue en ville . 61 | un homme est assis seul , pêchant le long du littoral . 62 | un vieil homme est assis dehors sur un banc devant une grande banderole où est écrit " memoria justicia sin olvido " 63 | un homme en veste grise sur un vélo transporte de la verdure . 64 | un garçon avec des lunettes vêtu d' un t-shirt jaune vif est debout sur un parking . 65 | deux hommes marchant sur un chemin en terre . 66 | une mère portant un béret et des chaussures bleus avec ses deux fils . 67 | deux chiens jouant avec un ballon bleu et vert . 68 | un homme portant un casque multicolore brillant est assis sur une moto . 69 | un homme est accroupi devant un mur jaune . 70 | on dirait un marché paysan , et quelques tables avec divers produits exposés . 71 | un garçon blond en t-shirt bleu est assis avec une femme portant des lunettes . 72 | un homme avec une chemise blanche élégante regarde fixement les yeux de la femme , tout en tenant le dos de sa robe rose et noire . 73 | des gens jouent dans une fontaine au crépuscule . 74 | quatre garçons posant tandis que l' un pose sa boisson . 75 | dans une rue très fréquentée , une femme porte des produits sur sa tête . 76 | un artiste de rue en combinaison orange est sur un grand monocycle tandis qu' une foule regarde 77 | une personne assise sur une chaise devant une foule . 78 | un homme attend que le train arrive sur le quai . 79 | trois hommes blancs en t-shirts sautent en l' air . 80 | un chanteur plongeant de la scène dans la foule . 81 | une classe de plongée prenant une photo pendant le cours . 82 | une femme en polo rayé croise les bras tout en étant debout dans un supermarché . 83 | une voiture écrasée avec de nombreux pompiers la découpant . 84 | plusieurs personnes attendent pour passer à la caisse dans un magasin avec un plafond ressemblant à celui d' un entrepôt . 85 | deux filles jouant au volleyball , l' une frappant le ballon . 
86 | un homme âgé verse quelque chose se trouvant dans un sac dans l' eau . 87 | six personnes sont dans un gymnase , en train de réparer des vélos . 88 | deux jeunes garçons asiatiques se battent l' un contre l' autre . 89 | une femme prépare des ingrédients pour un bol de soupe . 90 | deux hommes en shorts travaillant sur un vélo bleu . 91 | un homme et une femme faisant une sieste sur un radeau de fortune . 92 | il y a un homme enveloppé dans une sorte de couverture , glissant sur une pente recouverte de neige . 93 | un homme et une femme appréciant un dîner lors d' une fête . 94 | un soldat de la garde nationale menant un groupe d' autres soldats de la garde nationale qui chantent l' hymne national . 95 | deux jeunes marchent dans une rue en pente . 96 | une femme dans un restaurant , est en train de boire une noix de coco , à l' aide d' une paille . 97 | une femme portant un uniforme bleu est debout et regarde vers le bas . 98 | un homme en short parle à un autre homme en jean devant un évier . 99 | un homme nourrissant un bébé dans une chaise haute . 100 | un jeune cycliste portant un casque habillé de bleu s' envole en l' air en passant sur de petites collines . 101 | un petit bébé avec un chapeau rose allongé nue en train de dormir . 102 | deux enfants sont allongés à plat ventre par terre sous un tuyau . 103 | deux personnes parlent près d' une cabine téléphonique rouge tandis que des ouvriers du bâtiment se reposent à proximité . 104 | une femme autochtone travaille sur un projet artisanal . 105 | des enfants courant après le ballon lors d' un match de football . 106 | deux enfants sautant sur un trampoline protégé bleu et noir , situé dehors et entouré d' arbres . 107 | deux chiens courent dans un champ , regardant un frisbee invisible . 108 | un batteur et un guitariste font un concert dans un endroit sombre . 109 | trois personnes font une randonnée sur un chemin très enneigé . 110 | un groupe de gens sont près d' une petite rivière au milieu d' une ville . 111 | une femme jette un coup d " œil dans un télescope dans les bois . 112 | une petite fille blonde avec une t-shirt à pois donne un " bain " à un animal en peluche dans un évier . 113 | un jeune homme hirsute avec un piercing au nez se brosse les dents . 114 | un skieur vole dans les airs , tandis que d' autres skieurs prenant le tire-fesse le regardent . 115 | trois filles font de l' équitation , avec la photo centrée sur la plus jeune . 116 | une femme corpulente soufflant dans ses cheveux avec un sèche-cheveux , souriante et très heureuse 117 | un cuisinier travaillant dans une cuisine avec un couteau . 118 | un homme déplaçant des fleurs tandis qu' une femme lui fait un geste . 119 | une femme chantant dans un micro tandis qu' un homme joue de la batterie en arrière-plan . 120 | l' homme avec la casquette donne un poisson fraîchement pêché au garçon en chapeau violet . 121 | un ouvrier se cramponnant à un arbre . 122 | plusieurs personnes debout autour d' un récipient , dans lequel un homme manipule un objet marron . 123 | un moine faisant du roller avec de belles lunettes de soleil prie avant de faire des figures insensées . 124 | une femme avec des lunettes de soleil et un t-shirt bleu , qui vend des coquillages , regarde un vieil homme portant un t-shirt noir et une casquette . 125 | un vieil homme balaie le sol tandis qu' une femme s' éloigne de l' objectif . 126 | un homme avec des dreadlocks joue avec les cheveux d' une femme qui est assise sur une chaise dans une rue pavée . 
127 | un homme travaille sur un chantier . 128 | un homme est en train de frapper un ballon de volley rouge , blanc et bleu . 129 | un homme est debout dans un food truck , regardant par la demi-porte . 130 | un joueur de baseball avec un casque rouge et un pantalon blanc est touché par le receveur tandis qu' il court vers le marbre . 131 | un homme en toge orange balayant dehors . 132 | un homme vêtu d' un t-shirt violet travaillant dans un laboratoire de biologie . 133 | un homme promène deux petits poneys dans un parc . 134 | deux filles , une allemande et une chinoise , s' affrontent lors d' un combat de judo sur un tapis . 135 | un homme et une femme dormant sur un banc . 136 | un homme et une femme sont assis sur le sol devant des bagages . 137 | un conducteur de pousse-pousse attendant son prochain client . 138 | deux filles sont assises à une table et travaillent sur des projets artisanaux . 139 | un homme court dans la neige avec des raquettes . 140 | une danseuse en costume rouge saute en l' air . 141 | un homme est garé tout en étant à l' intérieur d' un camion d' assainissement . 142 | un homme barbu avec une veste épaisse est assis dans un coin avec un gobelet en carton . 143 | un homme en skateboard , utilisant une piscine vide comme rampe lors d' une très belle journée . 144 | un homme avec un grand chapeau dans les buissons . 145 | un gros plan du visage d' un enfant mangeant une sucette bleue en forme de cœur . 146 | un homme et une femme remplacent une chambre à air de vélo . 147 | un jeune garçon montre son collier de perles marron et vertes . 148 | un homme vêtu d' un t-shirt gris actionne le soufflet pour démarrer un feu sur un four en briques dans une cabane en bois . 149 | une femme debout devant des arbres et souriant . 150 | quelqu' un en costume asiatique est assis et tient une épée . 151 | un jeune footballeur américain se prépare pour un field goal . 152 | un mec dans un sweat à capuche vert vif traverse un passage piéton en regardant un accident entre une voiture et un vélo . 153 | un jeune homme s' apprête à tirer dans un ballon de foot . 154 | un coureur de compétition faisant son premier sprint dans une compétition . 155 | deux hommes , l' un en bleu et l' autre en rouge , combattent dans un match de boxe . 156 | un groupe d' amis gis sur le plancher de s' amusant ensemble . 157 | un homme debout tient un micro devant un homme tenant une guitare . 158 | une fille faisant du roller derby patine avec d' autres . 159 | une femme prenant un sac de glace dans un magasin . 160 | deux motards font la course au coude à coude dans un virage . 161 | un homme seul sur un trottoir réajustant son chapeau . 162 | un homme handicapé qui n' a pas de jambes marche avec un autre homme qui se lance dans un marathon . 163 | plusieurs mecs entrent en collision dans un match de football . 164 | deux personnes , l' une vêtue comme une religieuse et l' autre en t-shirt " roger smith " , engagées dans une course à pied , dépassant les spectateurs dans une zone boisée . 165 | trois jockeys pendant une course . 166 | un homme de couleur en chemise blanche et un t-shirt sans manches noir fait sauter son skateboard sur une surface cimentée entourée de hauts bâtiments et de palmiers . 167 | un homme , vêtu d' un sweat-shirt à capuche , assis à une fontaine à regarder les gens dans la ville . 168 | des nageurs sont à différents niveaux d' un grand complexe de plongées dans une pièce avec des représentations de la mythologie peintes sur le mur . 
169 | un jeune garçon indien assis à réfléchir sur son avenir . 170 | un garçon en sweat à capuche est en train de jeter un objet dans une piscine sale . 171 | des hommes en costumes colorés pendant un spectacle . 172 | deux boxeurs sont prêts à combattre pendant que le public regarde avec impatience . 173 | les enfants jouent à un sport sur un terrain . 174 | une artiste chante et joue de la guitare devant un micro . 175 | des garçons concourent à un art martial . 176 | un groupe de personnes noires en chemises oranges en face d' un parc clôturé . 177 | des écolières en uniforme défilent dans un défilé en jouant d' un instrument ressemblant à une flute . 178 | un homme fourre une volaille avec des ingrédients qui sont dans un bol bleu . 179 | le garçon saute de son lit en faisant un coup de pied de karaté . 180 | un bébé dans un siège rebondissant et un garçon entourés de jouets . 181 | ce sont des gens réunis autour de la table jouant à jenga . 182 | je vois un homme barbu et une dame âgée partageant un bol de nourriture . 183 | un jeune homme fait du skateboard sur un mur de parpaings . 184 | des gens faisant du bateau sur un lac avec le soleil traversant les nuages au loin . 185 | deux hommes portant des kimonos s' exercent aux arts martiaux . 186 | a boys band et ils ne sont même pas assortis , quelqu' un aurait pu leur dire . 187 | un groupe de pilotes de kart font une course autour d' une piste de karting . 188 | une personne vêtue de vêtements d' hiver prend la pose avec un bonhomme de neige au milieu d' un paysage enneigé . 189 | un homme est en train de donner une présentation devant un public . 190 | un homme d' âge mûr est tapote le genou d' un jeune joueur de football qui est assis sur une table d' entraînement . 191 | un homme tatoué , vêtu d' une salopette , sur une scène tenant un micro . 192 | une petite fille joue avec un circuit électrique miniature , composé de trois ampoules électriques et d' une batterie 193 | un homme à vélo portant un casque bleu s' insère dans la circulation . 194 | une bande de jeunes adultes , en pleine concentration , regardent fixement leurs écrans d' ordinateur pendant une compétition de jeu . 195 | un jeune garçon en vert jongle dans un parking . 196 | une petite fille dans une robe à poids regarde en arrière vers une femme en robe noire . 197 | un homme règle le moteur d' un bateau près de l' eau . 198 | il y a un match de tennis en train d' être joué de nuit dans ce stade . 199 | un enfant faisant du snowboard en train de s' arrêter 200 | un match de football se joue et deux hommes tentent d' atteindre le ballon avant leur adversaire respectif . 201 | de jeunes femmes et des enfants dans un village , avec une femme au centre de l' objectif . 202 | un jeune enfant en maillot verte se brosse les dents avec une brosse à dents jaune , tout en étant supervisé par maman . 203 | un homme en fauteuil roulant et portant un costume de jogging rouge porte une torche olympique . 204 | un fils et ses parents prennent une photo de groupe dans une église . 205 | trois hommes participent à une course d' obstacles . 206 | deux hommes observent un autre alors qu' il met la touche finale au ciment humide . 207 | un soldat regarde dans les jumelles vers le paysage montagneux . 208 | un groupe de jeunes garçons courant pendant un jour enneigé . 209 | un homme saute à la corde tandis qu' une foule de personnes le regardent . 210 | un jeune garçon et une jeune fille rient ensemble tandis que la fille fait un signe de main . 
211 | deux hommes de deux équipes adverses courent vers un ballon de football . 212 | un cycliste portant un maillot jaune réalise un tour incroyable dans l' air . 213 | deux kickboxers femelles , l' une avec un bustier pourpre , s' affrontent dans une arène . 214 | deux hommes sur des motos rapides accélèrent autour d' un coin sur un circuit . 215 | deux hommes défendent contre l' homme avec le basket-ball pendant un jeu au crépuscule . 216 | un homme portant des bottes d' équitation et un casque montant sur un cheval blanc qui saute un obstacle . 217 | un groupe d' hommes en costume jouent de la musique . 218 | l' homme et les femmes regardent à travers des caisses de lait pleines de disques ou d' images en vente . 219 | un catamaran de course est soulevé sur une seule coque dans l' eau . 220 | un groupe de marines marchant dans la route avec des drapeaux américains et d' autres drapeaux militaires . 221 | un homme tenant un verre d' alcool face au caméra . 222 | un homme sur scène , jouant de la guitare avec des lumières en arrière-plan . 223 | quatre jeunes enfants jouent avec des bidons vides . 224 | un groupe de travailleurs écoute les instructions d' un collègue . 225 | un mannequin monte un grand tracteur vert sur la route lors d' un défilé . 226 | trois chiens jouent dans l' eau . 227 | un homme joue du tambour et un petit garçon frappe sur son propre petit tambour . 228 | un groupe de filles jouant un jeu à cheval . 229 | un homme qui regarde pendant qu' une femme tire un fusil avec un sourire à un champ de tir . 230 | le leader de vélo pédales pour sa vie puisque les pays concurrents le rejoignent . 231 | une équipe de joueurs de football sont entassés et discutent sérieusement . 232 | un groupe de personnes en maillots violettes et pantalons de couleur marron marchant tous dans la même direction . 233 | un homme dans une chemise brillante joue de la trompette . 234 | le mec avec un short en jean est au planchodrome faisant des tours sur son vélo . 235 | mère et fille portant un costume d' alice au pays des merveilles posent pour une photo . 236 | un petit garçon utilisant une perceuse pour faire un trou dans un morceau de bois . 237 | deux cyclistes se battent sur une piste de terre . 238 | un frisbee est jeté vers la jeune fille tandis que l' autre fille semble le demander . 239 | un orthodontiste soigne un patient , tandis qu' un homme tient la lumière . 240 | deux personnes mangent des hamburgers sur des chaises de jardin tandis qu' un troisième boit une canette de soda . 241 | deux cyclistes dans la rue dépassent les gens tout en parlant . 242 | dans un bowling , un homme en maillot noire tient une boule de bowling et regarde en bas de la voie . 243 | deux hommes en tenue noire avec des nœuds papillons bleues et rouges se produisent devant une foule . 244 | deux hommes , un noir et un blanc , jouent avec leurs guitares et chantent dans les microphones pendant qu' ils se tiennent à l' extérieur . 245 | un kickboxer atterrit avec son genou battant dans le visage de son adversaire . 246 | trois personnes courent dans une course autour d' une piste rouge . 247 | les joueurs de football luttent pour obtenir des jeux à travers une ligne dure . 248 | plusieurs hommes prient en se tenant au bout d' une table de nourriture . 249 | deux enfants indiens en costume officiel accomplissant heureusement une danse rituelle . 250 | trois enfants debout les uns à côté des autres , à côté d' un grand poteau en bois bleu . 
251 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.en: -------------------------------------------------------------------------------- 1 | a group of men are loading cotton onto a truck 2 | a man sleeping in a green room on a couch . 3 | a boy wearing headphones sits on a woman 's shoulders . 4 | two men setting up a blue ice fishing hut on an iced over lake 5 | a balding man wearing a red life jacket is sitting in a small boat . 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.en.jsonl: -------------------------------------------------------------------------------- 1 | {"source": "a man in his living room ponders packing for a trip .", "target": "un homme dans son salon réfléchit aux affaires qu' il va emmener en voyage ."} 2 | {"source": "an african-american woman sits at a brown table , wearing a purple dress , pink shoes , and black sunglasses .", "target": "une femme afro-américaine est assise à une table marron , vêtue d' une robe violette , de chaussures roses et de lunettes de soleil noires ."} 3 | {"source": "a man slouched in a chair on a city sidewalk girl watching .", "target": "un homme affalé dans une chaise sur un trottoir en ville , regardant une fille ."} 4 | {"source": "two men , one man selling fruit the other inspecting the fruit and conversing with the seller .", "target": "deux hommes , l' un vendant des fruits et l' autre les inspectant et parlant avec le vendeur ."} 5 | {"source": "a man and a woman hug on a street .", "target": "un homme et une femme s' étreignent dans une rue ."} 6 | -------------------------------------------------------------------------------- /data/multi30k/val.lc.norm.tok.head-5.fr: -------------------------------------------------------------------------------- 1 | un groupe d' hommes chargent du coton dans un camion 2 | un homme dormant dans une chambre verte sur un canapé . 3 | un garçon avec un casque est assis sur les épaules d' une femme . 4 | deux hommes installant une tente de pêche sur glace bleue sur un lac gelé 5 | un homme chauve vêtu d' un gilet de sauvetage rouge est assis dans un petit bateau . 
-------------------------------------------------------------------------------- /data/run_fever.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/data/run_fever.png -------------------------------------------------------------------------------- /data/squad/dev-v2.0-small.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "v2.0", 3 | "data": [{ 4 | "title": "Normans", 5 | "paragraphs": [{ 6 | "qas": [{ 7 | "question": "In what country is Normandy located?", 8 | "id": "56ddde6b9a695914005b9628", 9 | "answers": [{ 10 | "text": "France", 11 | "answer_start": 159 12 | }], 13 | "is_impossible": false 14 | }, { 15 | "question": "When were the Normans in Normandy?", 16 | "id": "56ddde6b9a695914005b9629", 17 | "answers": [{ 18 | "text": "10th and 11th centuries", 19 | "answer_start": 94 20 | }], 21 | "is_impossible": false 22 | }, { 23 | "question": "From which countries did the Norse originate?", 24 | "id": "56ddde6b9a695914005b962a", 25 | "answers": [{ 26 | "text": "Denmark, Iceland and Norway", 27 | "answer_start": 256 28 | }], 29 | "is_impossible": false 30 | }, { 31 | "plausible_answers": [{ 32 | "text": "Rollo", 33 | "answer_start": 308 34 | }], 35 | "question": "Who did King Charles III swear fealty to?", 36 | "id": "5ad39d53604f3c001a3fe8d3", 37 | "answers": [], 38 | "is_impossible": true 39 | }, { 40 | "plausible_answers": [{ 41 | "text": "10th century", 42 | "answer_start": 671 43 | }], 44 | "question": "When did the Frankish identity emerge?", 45 | "id": "5ad39d53604f3c001a3fe8d4", 46 | "answers": [], 47 | "is_impossible": true 48 | }], 49 | "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries." 50 | }, { 51 | "qas": [{ 52 | "question": "Who was the duke in the battle of Hastings?", 53 | "id": "56dddf4066d3e219004dad5f", 54 | "answers": [{ 55 | "text": "William the Conqueror", 56 | "answer_start": 1022 57 | }], 58 | "is_impossible": false 59 | }, { 60 | "plausible_answers": [{ 61 | "text": "Antioch", 62 | "answer_start": 1295 63 | }], 64 | "question": "What principality did William the conquerer found?", 65 | "id": "5ad3a266604f3c001a3fea2b", 66 | "answers": [], 67 | "is_impossible": true 68 | }], 69 | "context": "The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. 
The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands." 70 | }] 71 | }, { 72 | "title": "Computational_complexity_theory", 73 | "paragraphs": [{ 74 | "qas": [{ 75 | "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?", 76 | "id": "56e16182e3433e1400422e28", 77 | "answers": [{ 78 | "text": "Computational complexity theory", 79 | "answer_start": 0 80 | }], 81 | "is_impossible": false 82 | }, { 83 | "plausible_answers": [{ 84 | "text": "algorithm", 85 | "answer_start": 472 86 | }], 87 | "question": "What is a manual application of mathematical steps?", 88 | "id": "5ad5316b5b96ef001a10ab76", 89 | "answers": [], 90 | "is_impossible": true 91 | }], 92 | "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm." 
93 | }, { 94 | "qas": [{ 95 | "question": "What measure of a computational problem broadly defines the inherent difficulty of the solution?", 96 | "id": "56e16839cd28a01900c67887", 97 | "answers": [{ 98 | "text": "if its solution requires significant resources", 99 | "answer_start": 46 100 | }], 101 | "is_impossible": false 102 | }, { 103 | "question": "What method is used to intuitively assess or quantify the amount of resources required to solve a computational problem?", 104 | "id": "56e16839cd28a01900c67888", 105 | "answers": [{ 106 | "text": "mathematical models of computation", 107 | "answer_start": 176 108 | }], 109 | "is_impossible": false 110 | }, { 111 | "question": "What are two basic primary resources used to guage complexity?", 112 | "id": "56e16839cd28a01900c67889", 113 | "answers": [{ 114 | "text": "time and storage", 115 | "answer_start": 305 116 | }], 117 | "is_impossible": false 118 | }, { 119 | "plausible_answers": [{ 120 | "text": "the number of gates in a circuit", 121 | "answer_start": 436 122 | }], 123 | "question": "What unit is measured to determine circuit simplicity?", 124 | "id": "5ad532575b96ef001a10ab7f", 125 | "answers": [], 126 | "is_impossible": true 127 | }, { 128 | "plausible_answers": [{ 129 | "text": "the number of processors", 130 | "answer_start": 502 131 | }], 132 | "question": "What number is used in perpendicular computing?", 133 | "id": "5ad532575b96ef001a10ab80", 134 | "answers": [], 135 | "is_impossible": true 136 | }], 137 | "context": "A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. The theory formalizes this intuition, by introducing mathematical models of computation to study these problems and quantifying the amount of resources needed to solve them, such as time and storage. Other complexity measures are also used, such as the amount of communication (used in communication complexity), the number of gates in a circuit (used in circuit complexity) and the number of processors (used in parallel computing). One of the roles of computational complexity theory is to determine the practical limits on what computers can and cannot do." 138 | }] 139 | }] 140 | } -------------------------------------------------------------------------------- /labs-exercises/multiclass_perceptron.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/labs-exercises/multiclass_perceptron.png -------------------------------------------------------------------------------- /labs-exercises/neural-encoding-fever.md: -------------------------------------------------------------------------------- 1 | # Lab - Neural Encoding for Text Classification 2 | 3 | ## Introduction 4 | 5 | This lab will introduce continuous representations for NLP. 6 | We will work on the task of Natural Language Inference (also known as Textual Entailment) in the context of the Fact Extraction and Verification dataset introduced by [Thorne et al. (2018)](https://arxiv.org/abs/1803.05355). 7 | We will focus on the subtask of deciding whether a claim is supported or refuted given a set of evidence sentences. 8 | The dataset also contains claims for which no appropriate evidence was found in Wikipedia; we will ignore these in this lab. 9 | 10 | To simplify the task, we have prepared a _Lite_ version of the dataset that has the Wikipedia evidence bundled together with each dataset instance. 
The full task requires searching for this evidence.
11 | 
12 | ## Requirements
13 | 
14 | - Use a subset of the FEVER data (provided in `data/fever/`) to predict whether textual _claims_ are SUPPORTED or REFUTED from _evidence_ from Wikipedia. More details about how it was prepared can be found [here](https://github.com/j6mes/feverlite/releases).
15 | - Use the [AllenNLP](https://allennlp.org/) framework for implementing your neural models. **(If you installed the required Python packages for the summer school, AllenNLP should be installed for you.)**
16 | - It is highly recommended to use an IDE, such as PyCharm, for working on this project.
17 | 
18 | 
19 | ## AllenNLP Primer
20 | There are four key parts you'll interact with when developing with the AllenNLP framework:
21 | 
22 | * Dataset Reader
23 | * Model
24 | * Configuration File
25 | * Command Line Interface / Python Module
26 | 
27 | ### Dataset Reader and Sample Data
28 | Each labeled dataset instance consists of a `claim` sentence accompanied by one or more `evidence` sentences.
29 | 
30 | ```
31 | {
32 |     'label': 'SUPPORTS',
33 |     'claim': 'Ryan Gosling has been to a country in Africa.',
34 |     'evidence': [
35 |         'He is a supporter of PETA , Invisible Children and the Enough Project and has traveled to Chad , Uganda and eastern Congo to raise awareness about conflicts in the regions .',
36 |         "Chad -LRB- -LSB- tʃæd -RSB- تشاد ; Tchad -LSB- tʃa -LRB- d -RRB- -RSB- -RRB- , officially the Republic of Chad -LRB- ; `` Republic of the Chad '' -RRB- , is a landlocked country in Central Africa ."
37 |     ]
38 | }
39 | ```
40 | 
41 | We provide code to read through the dataset files in `athnlp/readers/fever_reader.py`.
42 | The dataset reader we provide does all the necessary preprocessing before we pass the data to the model. For example, in our implementation, we tokenize the sentences.
43 | 
44 | This returns an `Instance` that consists of a `claim` and `evidence` for the model.
45 | Notice that the instance contains a `TextField` for the tokenized sentences and a `LabelField` for the label.
46 | The framework will construct a vocabulary using the words in the TextField for you.
47 | If you want to add hand-crafted features, this might be a good place to add them (you could add an array of features in an `ArrayField`).
48 | 
49 | Also notice that above the `FEVERLiteDatasetReader` there is a decorator `@DatasetReader.register("feverlite")`. This associates the type `feverlite` with the class -- it will come in handy later, when we write the configuration file for our model!
50 | 
51 | 
52 | ### Model
53 | Just like we registered our dataset reader, we can also register a `Model`. In the file `athnlp/models/fever_text_classification.py`, we have built a skeleton model that you can adapt for the exercises. We have registered it with the name `fever` by using the decorator `@Model.register("fever")` above the class name. If you plan on adding more models, you should think of a unique name for each.
54 | 
55 | The model has a `forward(...)` method: this is the main method for prediction, just like we'd expect to find in other models written in `PyTorch`.
56 | Notice how in our model, the argument names match up with the values returned by the dataset reader: AllenNLP will match these up for you during training and model evaluation.
57 | While the variable names are the same, the data types are different.
AllenNLP will convert a `TextField` into a LongTensor - each element in this tensor corresponds to the index of the token in the vocabulary.
58 | AllenNLP will automatically generate batches for you: this means that all variables here are batch-first tensors.
59 | 
60 | The model returns quite a bit of information to the trainer that is calling it.
61 | It is quite common to see the following code in a lot of AllenNLP models.
62 | The loss is computed by the model (if a `label` is passed in) and this is what is used for error backpropagation.
63 | If we need to compute any metrics, such as accuracy or F1 score, this would be the place to do it.
64 | ```
65 | label_probs = F.softmax(label_logits, dim=-1)
66 | output_dict = {"label_logits": label_logits,
67 |                "label_probs": label_probs}
68 | 
69 | if label is not None:
70 |     loss = self._loss(label_logits, label.long().view(-1))
71 |     self._accuracy(label_logits, label)
72 |     output_dict["loss"] = loss
73 | 
74 | return output_dict
75 | ```
76 | 
77 | The core of the model performs a sequence of operations on the input data, returning the label logits and loss.
78 | It is possible to mix torch and AllenNLP operations.
79 | 
80 | Operations that might be helpful for the exercises are:
81 | 
82 | * Embedding Lookup (TextFieldEmbedder)
83 | * Feed-forward Neural Networks (FeedForward)
84 | * Summing tensors (torch.sum())
85 | * Concatenating tensors (torch.cat())
86 | 
87 | 
88 | ### Configuration
89 | Parameters for the model are stored in a JSON file. For this example, you can adapt `athnlp/experiments/fever.json`.
90 | In this configuration file, there are separate configurations for the `datasetreader`, `model` and `trainer`.
91 | Notice how the `type` of the datasetreader and model match the values we specified earlier.
92 | 
93 | The values in this configuration are passed to the constructor of our model and dataset reader and also match their parameters:
94 | 
95 | ```json
96 | "model": {
97 |     "type": "fever",
98 |     "text_field_embedder": {
99 |         ...
100 |     },
101 |     "final_feedforward": {
102 |         ...
103 |     },
104 |     "initializer": [
105 |         ...
106 |     ]
107 | }
108 | ```
109 | And in the Python code for the model, the `__init__` method takes these parameters. Note that `vocab` is auto-filled by another part of AllenNLP.
110 | You can find examples of configs from real-world models on [GitHub](https://github.com/allenai/allennlp/tree/master/training_config).
111 | ```python
112 | @Model.register("fever")
113 | class FEVERTextClassificationModel(Model):
114 |     def __init__(self,
115 |                  vocab: Vocabulary,
116 |                  text_field_embedder: TextFieldEmbedder,
117 |                  final_feedforward: FeedForward,
118 |                  initializer: InitializerApplicator = InitializerApplicator()):
119 | ```
120 | 
121 | ### Running the model
122 | AllenNLP will install itself as a bash script that you can call when you want to train/evaluate your model using the config specified in your json file. Using the `--include-package` option will load the custom models, dataset readers and other Python modules in that package.
123 | 
124 | ```bash
125 | allennlp train --force --include-package athnlp --serialization-dir mymodel myconfig.json
126 | ```
127 | 
128 | - `train` tells AllenNLP to train the model; there are other subcommands for fine-tuning, evaluation, prediction, etc.
129 | - `--serialization-dir` defines the location where the model will be saved.
130 | - `--force` will overwrite any existing model saved in the serialization-dir. 
You could use `--recover` if you wish to continue training a model from a checkpoint.
131 | - `--include-package` will import the specified Python package.
132 | 
133 | This is an alias that just runs Python with the following command: `python -m allennlp.run [args]`.
134 | If you are using an IDE, you can debug AllenNLP models by running the Python module `allennlp.run`. **Note: this runs a module, not a script - in the run configuration dropdown, select "Module name", NOT "Script path".**
135 | 
136 | ![](/data/run_fever.png)
137 | 
138 | If you are using `pdb`, you will have to write a simple 2-line wrapper script:
139 | ```
140 | from allennlp.commands import main
141 | main(prog="allennlp")
142 | ```
143 | 
144 | ### Debugging
145 | 
146 | #### ConfigurationError
147 | 
148 | If you encounter this error:
149 | ```
150 | allennlp.common.checks.ConfigurationError: "feverlite not in acceptable choices for dataset_reader.type: ['ccgbank', 'conll2003', 'conll2000', 'ontonotes_ner', 'coref', 'winobias', 'event2mind', 'interleaving', 'language_modeling', 'multiprocess', 'ptb_trees', 'drop', 'squad', 'quac', 'triviaqa', 'qangaroo', 'srl', 'semantic_dependencies', 'seq2seq', 'sequence_tagging', 'snli', 'universal_dependencies', 'universal_dependencies_multilang', 'sst_tokens', 'quora_paraphrase', 'atis', 'nlvr', 'wikitables', 'template_text2sql', 'grammar_based_text2sql', 'quarel', 'simple_language_modeling', 'babi', 'copynet_seq2seq', 'text_classification_json']"
151 | ```
152 | Check that `--include-package athnlp` is included in the arguments when calling AllenNLP.
153 | 
154 | 
155 | #### ModuleNotFoundError
156 | 
157 | If you encounter this error:
158 | ```
159 | ModuleNotFoundError: No module named 'athnlp'
160 | ```
161 | Check that the athnlp folder is in the `PYTHONPATH`. Is your current working directory the `athnlp-labs` folder?
162 | 
163 | #### FileNotFoundError
164 | 
165 | ```
166 | FileNotFoundError: file resources/glove.6B.50d.txt.gz not found
167 | ```
168 | 
169 | Run `setup_dependencies.sh` to download the file (it will then call `wget https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.50d.txt.gz -P resources/;`).
170 | 
171 | 
172 | #### NotImplementedError
173 | This is where you should add your solution to the exercise! Go and edit `athnlp/models/fever_text_classification.py` and delete this line.
174 | ```
175 | NotImplementedError: Compute label logits (for supported and refuted) for the given Claim and Evidence input
176 | ```
177 | 
178 | 
179 | #### It is running all my scripts!
180 | If you have put your code in the `athnlp` package, the `--include-package` flag will try to find and import it. You should change the code from previous labs and wrap it in the if statement `if __name__ == "__main__":` so that it only runs when it is your main Python script.
181 | 
182 | ### Self-Help
183 | There are a large number of (more complex) models already available for AllenNLP: check out the [models package](https://github.com/allenai/allennlp/tree/master/allennlp/models) on GitHub for inspiration if you are stuck.
184 | 
185 | If you are getting errors about size mismatch (`RuntimeError: size mismatch, m1: [32 x 100], m2: [256 x 100]`), check that the dimensions of your MLP are compatible. This error is raised when PyTorch tries to multiply incompatible matrices. Check that the input dimension for the MLP in the config file is the same size as the input representation you generate. 
Use the debugger to inspect this, or print the shape of your variables with `print(my_variable.shape)`.
186 | 
187 | 
188 | ### Using GPU
189 | If your laptop has a CUDA-enabled GPU and you have the appropriate drivers installed, you can speed up training by setting `"cuda_device": 0` in your configuration file.
190 | 
191 | ## Exercises
192 | For the exercises, we have provided a dataset reader (`athnlp/readers/fever_reader.py`), a configuration file (`athnlp/experiments/fever.json`), and a sample model (`athnlp/models/fever_text_classification.py`). You can complete these exercises by filling in the code in the sample model.
193 | 
194 | ### 1. Average Word Embedding Model
195 | 1. Implement a model that
196 | - represents the claim and the evidence by averaging their word embeddings;
197 | - concatenates the two representations;
198 | - uses a multilayer perceptron to decide the label (a minimal sketch of such a forward pass is given at the end of this document).
199 | 
200 | 2. Experiment with the number and the size of hidden layers to find the best settings using the train/dev set and assess your accuracy on the test set. (Note: this model may not get high accuracy.)
201 | 
202 | 3. Explore: How does fine-tuning the word embeddings affect performance? You can make the word embeddings layer trainable by changing the `text_field_embedder` section in the `fever.json` config file.
203 | 
204 | ### 2. Discrete Feature Baseline
205 | Start by making a new config file and a new model file based on `fever.json` and `fever_text_classification.py`. Don't forget to register the new model under a unique name.
206 | 
207 | 
208 | 1. Compare against a discrete feature baseline. Instead of embedding the claim and evidence, we build an n-hot bag-of-words vector. (Hint: edit the type of the `text_field_embedder` to be `bag_of_word_counts` - you will have to make changes to your model too!)
209 | 
210 | 2. How does limiting the vocabulary size affect the model accuracy? (Hint: adding `"vocabulary": {"max_vocab_size": 10000}` to the main section of the config file will limit the vocab size to 10000 tokens.)
211 | 
212 | 
213 | ### 3. Convolution
214 | Averaging word embeddings is an example of a CBOW model. An alternative way to combine the representations is to use CNNs (see slides 110/111 in Ryan McDonald's talk: [SLIDES](https://github.com/athnlp/athnlp-labs/blob/master/slides/McDonald_classification.pdf)).
215 | 
216 | 1. Use a `CnnEncoder()` ([documentation](https://allenai.github.io/allennlp-docs/api/allennlp.modules.seq2vec_encoders.html#allennlp.modules.seq2vec_encoders.cnn_encoder.CnnEncoder)) to generate convolutional sentence representations. (Debugging hint: this encoder expects the input to be padded; you may get errors if the filter size is longer than the sentence. You will need to set `"token_min_padding_length": 5` or higher in the `tokens` object in `token_indexers` for large filter sizes.) Filter sizes between 2 and 5 should be sufficient. More filters will cause training to be slower (perhaps just train for 1 or 2 epochs).
217 | 
218 | ### 4. Hypothesis-Only NLI and Biases
219 | 1. Implement a _[hypothesis only](https://www.aclweb.org/anthology/S18-2023)_ version of the model that ignores the evidence and only uses the claim for predicting the label. What accuracy does this model get? Why do you think this is? Think back to slide 7 of Ryan's talk.
220 | 
221 | 2. Take a look at the training/dev data. Can you design claims that would "fool" your models? You can see this report ([Thorne and Vlachos, 2019](https://arxiv.org/abs/1903.05543)) for inspiration.
222 | What do you conclude about the ability of your model to understand language?
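223 | 
224 | To make Exercise 1 concrete, below is a minimal, hypothetical sketch of what the model's `forward(...)` method could look like. It assumes the `text_field_embedder` and `final_feedforward` modules from the constructor shown earlier (stored as `self._text_field_embedder` and `self._final_feedforward`); the names and details are illustrative, a starting point rather than the reference solution.
225 | 
226 | ```python
227 | import torch
228 | from allennlp.nn import util
229 | 
230 | def forward(self, claim, evidence, label=None):
231 |     # Embed the token ids: (batch, num_tokens, embedding_dim)
232 |     claim_emb = self._text_field_embedder(claim)
233 |     evidence_emb = self._text_field_embedder(evidence)
234 |     # Masks mark real tokens (1) vs. padding (0)
235 |     claim_mask = util.get_text_field_mask(claim).float()
236 |     evidence_mask = util.get_text_field_mask(evidence).float()
237 |     # Average the word embeddings, ignoring padded positions
238 |     claim_avg = (claim_emb * claim_mask.unsqueeze(-1)).sum(1) / claim_mask.sum(1, keepdim=True).clamp(min=1)
239 |     evidence_avg = (evidence_emb * evidence_mask.unsqueeze(-1)).sum(1) / evidence_mask.sum(1, keepdim=True).clamp(min=1)
240 |     # Concatenate the two sentence representations and classify with the MLP
241 |     label_logits = self._final_feedforward(torch.cat([claim_avg, evidence_avg], dim=-1))
242 |     # ... then compute label_probs, the loss and the accuracy exactly as in the snippet above
243 |     return {"label_logits": label_logits}
244 | ```
245 | 
246 | Note that the MLP's input dimension in the config file must then be twice the embedding size; if it is not, you will hit exactly the size-mismatch error discussed in the Self-Help section.
247 | 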
-------------------------------------------------------------------------------- /labs-exercises/neural-language-model.md: --------------------------------------------------------------------------------
# Lab - Neural Language Modeling


## Introduction

In this lab we will create a Language Model using Recurrent Neural Networks with PyTorch.

## Requirements
We will train our model on the following toy dataset:

```
The thief stole .
The thief stole the suitcase .
The crook stole the suitcase .
The cop took a bribe .
The thief was arrested by the detective .
```

## Exercises


#### 1. Language Modeller

Implement an LSTM-based RNN language model that takes each word of a sentence as input and
predicts the next one (the original RNNLM demo paper can be found
[here](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)).
In particular, the input to the RNN is the previous word and the previous hidden state, and the output is the next
predicted word. (A minimal model sketch is given at the end of this lab.)

**Note**: Consider each sentence as a separate example, where each sentence is represented as a list of tokens.

Things to try out:
- Run a sanity check: make sure your model can learn to predict your training data correctly. After training your
model, take the sentence
```
The thief stole the suitcase .
```
and check that for every word and context (i.e., last hidden state of the RNN) you get the right answer. Does it work?
For example, given the context ``The`` the model should be predicting ``thief``.
Why is this happening instead of predicting ``crook``?

**Note**: You might need to play with the hyper-parameters, such as the learning rate, the number of epochs, etc.

#### 2. Sentence Completion

Given a sentence with a gap
```
The ______ was arrested by the detective .
```
implement a decoder that returns the most likely word to fill it in.
In more detail, you can develop a k-best ranker that scores the top-k derivations that a) all start with the prefix
``The``, b) each contain one of the top-k candidate words from the vocabulary, and c) continue with the remaining words of the given
sentence.

Things to try out:
- Which is more likely to fill in the gap: ``cop`` or ``crook``?
Get the model to predict this correctly by changing the hyper-parameters.
- Ensure that the model is predicting correctly for the right reason,
i.e., that the embeddings for ``thief`` and ``crook`` are closer to each other than the embeddings
for ``thief`` and ``cop``. Why is that?

**Hint**: Use cosine similarity to compute the distance of two embedding vectors.
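Here is a minimal sketch of such a model, assuming the toy vocabulary above; the class and variable names are illustrative and this is not the provided `rnn_language_model.py`:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal LSTM language model: embed the previous word, update the
    hidden state, and score every vocabulary item as the next word."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of word ids
        hidden_states, _ = self.lstm(self.embed(tokens))
        return self.out(hidden_states)  # (batch, seq_len, vocab_size)

# Toy usage: ids for "The thief stole the suitcase ." under a hypothetical vocabulary.
vocab = ["The", "thief", "crook", "cop", "stole", "took", "was", "arrested",
         "by", "the", "detective", "suitcase", "a", "bribe", "."]
word2id = {w: i for i, w in enumerate(vocab)}
sentence = torch.tensor([[word2id[w] for w in
                          "The thief stole the suitcase .".split()]])

model = RNNLanguageModel(len(vocab))
logits = model(sentence[:, :-1])  # predict word t+1 from the prefix up to t
loss = nn.CrossEntropyLoss()(logits.reshape(-1, len(vocab)),
                             sentence[:, 1:].reshape(-1))

# For Exercise 2: cosine similarity between two learned word embeddings.
sim = nn.functional.cosine_similarity(
    model.embed.weight[word2id["thief"]],
    model.embed.weight[word2id["crook"]], dim=0)
```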
-------------------------------------------------------------------------------- /labs-exercises/neural-machine-translation.md: --------------------------------------------------------------------------------
# Lab - Neural Machine Translation


## Introduction

In this lab we will familiarise ourselves with the popular sequence-to-sequence (seq2seq) architecture for Neural Machine
Translation and will implement the attention mechanism.

## Requirements
We will train our models on the Multi30k dataset (well, just a small part of it, as we will be
running things on your laptop; you are more than welcome to try out the full dataset on a GPU-enabled machine too!).

1. Clone the dataset from [here](https://github.com/multi30k/dataset).
2. Extract the first 1000 examples of the already tokenized version of the validation set:
``data/task1/tok/val.lc.norm.tok.*``.
3. Create a 75%/25% train/val split.
4. We will focus only on the ``en-fr`` pairs.

## Exercises

We provide an implementation of a basic sequence-to-sequence (seq2seq) architecture with beam search,
adapted from the original AllenNLP toolkit, that you will have to extend: ``athnlp/models/nmt_seq2seq.py``.
There are placeholders in the code that are left empty for you to fill in. We are also giving you
a dataset reader for Multi30k: ``athnlp/readers/multi30k_reader.py``.

**Note**: We recommend that you train and predict with the built-in commands using ``allennlp train/predict``. If you
need to debug your code, you can programmatically execute the training process from ``athnlp/nmt.py``.
We will be reporting performance using [BLEU](https://www.aclweb.org/anthology/P02-1040).

#### 1. Playing around

Have a good look at the provided code and make sure you understand how it works.

Things to try out:

- Overfit a (very) small portion of the training set. What hyperparameters do you need to use?
- Train a model on the bigger dataset for a few epochs and compute the BLEU score for the baseline model.
**Note**: You are most likely not going to get state-of-the-art performance. Why?
- Switch the RNN cell from an LSTM to a GRU.
- Use pre-trained embeddings like [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) vectors. Does it help?
Is that always applicable in MT?
- Consider switching the metric used as the early-stopping criterion (currently it is the validation loss).
- Try using beam search instead of greedy decoding. Does it help?



#### 2. Attention Mechanism

Implement at least one attention mechanism ([dot product](https://arxiv.org/abs/1508.04025),
[bilinear](https://arxiv.org/abs/1508.04025), [MLP](https://arxiv.org/abs/1409.0473))
in the methods ``_prepare_output_projections()`` and ``_compute_attention()``. (A minimal dot-product sketch is given after the list below.)

**Important**: to keep things uniform, assume that the attended encoder outputs
(aka the *context vector*) get concatenated with the previous predicted word embedding *before* being
fed as input to the decoder RNN.

Things to try out:

- Convince yourself that attention helps boost the performance of your model by computing
BLEU on the dev set (if it does not, you most probably have a bug!).
- Predict the output for some examples using the default ``seq2seq`` predictor from AllenNLP.
You can find a small set of examples here: ``data/multi30k/val.lc.norm.tok.head-5.fr.jsonl``.
How does the output compare to the model without attention?
- Visualise the attention scores as a heatmap, e.g., with ``matplotlib``'s ``imshow``. We have created a custom predictor
in ``athnlp/predictors/nmt_seq2seq.py`` for you that already prints out heatmaps; you will just need to extract the attention
scores from your model in ``forward_loop()``. You can execute it via ``athnlp/nmt.py``.
Then visualize the attention scores between the input and predicted output for the examples
found in ``data/multi30k/val.lc.norm.tok.head-5.fr.jsonl``. What do you observe?

- (Bonus) Instead of concatenating the context vector with the previous predicted word embedding *before*
feeding it as input to the decoder RNN, try concatenating it to the output hidden state of the decoder
(i.e., *after* the RNN). Does that change anything? (**Note**: you might need to train on the original corpus.)
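Here is a minimal PyTorch sketch of the dot-product variant. The shapes and names are illustrative; in the provided model the equivalent computation would live in ``_compute_attention()``:

```python
import torch

def dot_product_attention(decoder_state, encoder_outputs, encoder_mask):
    """One step of dot-product attention (a sketch; names are illustrative).

    decoder_state:   (batch, hidden)           current decoder hidden state
    encoder_outputs: (batch, src_len, hidden)  all encoder hidden states
    encoder_mask:    (batch, src_len)          1 for real source tokens, 0 for padding
    """
    # Unnormalised scores: one dot product per source position.
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    # Mask out padding positions before the softmax.
    scores = scores.masked_fill(encoder_mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                   # (batch, src_len)
    # Context vector: attention-weighted average of the encoder outputs.
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights  # keep the weights for the heatmap visualisation

# The context vector would then be concatenated with the previous predicted
# word embedding before being fed to the decoder RNN, as described above.
```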
#### 3. (Bonus) Sampling during decoding
Implement a sampling algorithm for your decoder. As an alternative heuristic to beam search during decoding,
the idea is to *sample* from the vocabulary distribution instead of taking the ``argmax`` at each time step (see the sketch after the list below).

Things to try out:

- How does sampling affect the performance (BLEU score) of your model?
- Inspect the output of your model by drawing several samples for a few examples; what do you observe?
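A minimal sketch of the sampling step, assuming you already have the next-word logits from the decoder; the temperature parameter is an optional extra knob, not part of the exercise:

```python
import torch

def sample_next_word(logits, temperature=1.0):
    """Draw the next word id from the softmax distribution instead of
    taking the argmax.

    logits: (batch, vocab_size) unnormalised scores for the next word.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)  # (batch,)

# Greedy decoding would instead be: logits.argmax(dim=-1)
```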
-------------------------------------------------------------------------------- /labs-exercises/pos-tagging-perceptron.md: --------------------------------------------------------------------------------
# Lab - Part-of-Speech tagging with the Perceptron Algorithm


## Introduction

In this lab we will create a Part-of-Speech (PoS) tagger for English using the Perceptron algorithm.
In particular, we will train a PoS tagger on the [Brown corpus](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM) annotated with the [Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086).

Although PoS-tagging is an inherently sequential problem, for the purposes of this lab we will keep things simple: our model will predict a PoS-tag for every word of a sentence *independently*.
More concretely, given a sentence (sequence of words) as input, your model needs to predict a PoS-tag for each of them.

Here is an example:

| **PoS**: | DET | VERB | NOUN | VERB | ADV | ADJ | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|:---:|
| **Words**: | The | scalloped | edge | is | particularly | appealing | . |


## Requirements
You need to download the Brown corpus through NLTK first if you don't have it already.
Just execute the following in a Python CLI:

```python
import nltk
nltk.download('brown')
```

## Exercises


#### 1. Perceptron Algorithm

Implement the standard perceptron algorithm (a sketch of the multiclass update is given at the end of this lab). Use the first 10000/1000/1000 sentences for training/dev/test.
In order to speed up the process for you, we have implemented a simple dataset reader that automatically converts the Brown corpus using the Universal PoS Tagset: `athnlp/readers/brown_pos_corpus.py` (you may use your own implementation if you want; `athnlp/readers/en-brown.map` provides the mapping from the Brown to the Universal Tagset).

**Important**: Recall that the perceptron has to predict multiple classes (PoS tags) instead of binary ones:
![Multiclass Perceptron](multiclass_perceptron.png)

You should represent each example of the corpus (i.e., every word of each sentence) in vector form. In order to keep things simple, let's assume a simple **bag-of-words** representation.
In order to evaluate your model, compute the **accuracy** (i.e., number-of-correctly-labelled-words / total-number-of-labelled-words) on the dev set.
Here are a few things to try out:
- Does it help if you **randomize** the order of the training instances?
- Does it help if you perform **multiple passes** over the training set? What is a reasonable number?
- Instead of using the last weight vector for computing the error, try taking the **average of all
the weight vectors** calculated for each label. Does that help?

#### 2. Feature Engineering

- Implement different feature types beyond bag-of-words. *Hint*: One very common feature type is to
introduce some local context for every word via **n-grams**, usually with n=2,3. Another is to
look at the previous/next **word** (not **tag**; why?). A third option is to look at subword features,
i.e., short character sequences such as suffixes.
- (Bonus) What are the most **positively-weighted** features for each label? Give the
top 10 for each class and comment on whether they make sense (if they
don't, you might have a bug!).
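To make the update rule in Exercise 1 concrete, here is a minimal NumPy sketch of the multiclass perceptron; the variable names and the dense feature representation are illustrative:

```python
import numpy as np

def train_multiclass_perceptron(X, y, num_labels, epochs=5):
    """Sketch of the multiclass perceptron.

    X: (num_examples, num_features) feature vectors (e.g., bag-of-words)
    y: (num_examples,) gold label ids
    """
    W = np.zeros((num_labels, X.shape[1]))   # one weight vector per label
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):  # randomise instance order
            scores = W @ X[i]
            y_hat = int(np.argmax(scores))
            if y_hat != y[i]:                 # mistake-driven update
                W[y[i]] += X[i]               # promote the gold label
                W[y_hat] -= X[i]              # demote the wrong prediction
    return W
```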
-------------------------------------------------------------------------------- /labs-exercises/pos-tagging-structured-perceptron.md: --------------------------------------------------------------------------------
# Lab - Part-of-Speech tagging with the Structured Perceptron Algorithm

In this lab we will create a Part-of-Speech (PoS) tagger for English using the Structured Perceptron algorithm.
In particular, we will train a PoS tagger on the [Brown corpus](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM)
annotated with the [Universal Part-of-Speech Tagset](https://arxiv.org/abs/1104.2086).

[Last time](pos-tagging-perceptron.md) we made each tagging decision independently. In this lab we will make
each decision at the sequence level, i.e., by choosing the PoS tag for each word so that they collectively *maximize*
the score of the sequence of labels for the whole sentence. Why does that matter? Let's have a look at the
following example:

| **PoS**: | | | | | | |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |

If we predict with a trained model using the simple averaged perceptron implementation with unigram features from the
[previous lab](pos-tagging-perceptron.md), we get the following predictions (wrong labels are marked with an asterisk):


| **PoS**: | DET | ***ADJ** | ***NOUN** | DET | NOUN | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |

Why is this the case? **NOUN** is the highest scoring label for the word 'man'. Therefore, when making the prediction
for this word *independently*, the model will make an error (still not convinced this is wrong? Read the sentence carefully!).
(Why is ADJ also the wrong label for the word 'old'?)

The idea of the structured perceptron is that it keeps track of several alternative hypotheses for sequences of labels
(in this case PoS-tags):
the sequence in the example above contains 'locally' high scoring labels (ADJ, NOUN), but has a much lower 'global'
score compared to the (correct) sequence below:

| **PoS**: | DET | NOUN | VERB | DET | NOUN | . |
|:-------|:-----:|:-----------:|------|:------:|:--------------:|:-----------:|
| **Words**: | The | old | man | the | boat | . |


## Requirements
You need to download the Brown corpus through NLTK first if you don't have it already.
Just execute the following in a Python CLI:

```python
import nltk
nltk.download('brown')
```

## Exercises


#### 1. Structured Perceptron Algorithm

Implement the structured perceptron algorithm. Use the first 1000/100/100 sentences with < 5 words for training/dev/test.
You can re-use the implemented simple dataset reader: `athnlp/readers/brown_pos_corpus.py`.

The algorithm is strikingly similar to the original perceptron algorithm; the two major differences are:
1. You need to find the optimal (a.k.a. *argmax*) path through the input sequence;
2. You need to update the weights for each label where the optimal predicted path and the ground truth don't agree.

Things to try out:
- (Sanity check) What is the accuracy score of the averaged perceptron algorithm using unigrams for this dataset?
- First implement the *argmax* using brute force, i.e., explore all the possible labeled paths.
- Brute force is a really inefficient approach to finding the optimal path. Quite often applying a heuristic
such as beam search (i.e., keeping the top-n scoring partial hypotheses and discarding the rest that don't exceed
a predefined threshold, aka *fall out of the beam*) speeds up the process immensely, usually at the price of a small
loss in accuracy. Of course, this process introduces two extra hyper-parameters: ``beam size``, i.e., how many hypotheses to keep, and
``beam width``, i.e., the threshold below which hypotheses are discarded. (A minimal beam-search sketch is given at the end of this lab.)

You should evaluate your models by computing the **accuracy** (i.e., number-of-correctly-labelled-words / total-number-of-labelled-words) on the dev set.
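As referenced above, here is a minimal sketch of beam search over tag sequences. The scoring function is an assumed placeholder for whatever feature-based score your perceptron computes; for simplicity it only keeps the top-n hypotheses (``beam size``), without the ``beam width`` threshold:

```python
def beam_search_tags(score_word, words, labels, beam_size=3):
    """Sketch of beam search for sequence tagging (names are illustrative).

    score_word(word, label, prev_label) -> float is assumed to return the
    model score of assigning `label` to `word` after `prev_label`.
    """
    beam = [([], 0.0)]  # list of (partial tag sequence, cumulative score)
    for word in words:
        candidates = []
        for tags, score in beam:
            prev = tags[-1] if tags else None
            for label in labels:
                candidates.append((tags + [label],
                                   score + score_word(word, label, prev)))
        # Keep only the top-n scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][0]  # highest-scoring complete tag sequence
```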
-------------------------------------------------------------------------------- /labs-exercises/question-answering.md: --------------------------------------------------------------------------------
# Lab - Question Answering

## Introduction

In this lab we will build a Question Answering model for SQuAD based on BERT using AllenNLP.

## Requirements
Make sure that the script `setup_dependencies.sh` installed every package and downloaded every data file. In particular,
make sure that `pytorch-transformers` is installed and that you have the following files in your `resources` folder:
1. `bert-base-uncased/pytorch_model.bin` (pretrained BERT model)
2. `bert-base-uncased/config.json` (pretrained BERT model parameters)
3. `bert-base-uncased/vocab.txt` (vocabulary for the pretrained model)

We will fine-tune the BERT model on the SQuAD 2.0 dataset. In this tutorial we will use just a small part of it, as we will be
running things on your laptop; you are more than welcome to try out the [full dataset](https://rajpurkar.github.io/SQuAD-explorer/) on a GPU-enabled machine too.

The portion of the data that we will be using in this tutorial is available in the folder `data/squad/`. We will use the file
`train.json` for model training and `test.json` for the evaluation phase.

## Exercises

BERT is a large-scale language model trained using several masking-based loss functions. In this tutorial we will show you
how to fine-tune a BERT model trained on millions of text documents to complete a question answering task like SQuAD. In particular,
we will rely on BERT's encoder to represent both the question and the reference paragraph. Imagine that you have the following
example:

- Paragraph:
```
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
```
- Question: In what country is Normandy located?

We will encode the question and the paragraph following the BERT encoding scheme:

```
[CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries (...)

```

More details about this fine-tuning procedure and the BERT encoding scheme can be found in the original paper by [Devlin et al. 2018](https://arxiv.org/pdf/1810.04805.pdf).
We created an AllenNLP dataset reader that is able to process the SQuAD dataset examples and format them following the
BERT encoding scheme.

**Note**: BERT by default performs word-piece tokenization. See how words like ``normans`` get split into ``norman ##s``,
and ``nourmands`` gets split into more than two wordpieces: ``no ##ur ##man ##ds``.
The dataset reader provided takes care of that.
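If you want to reproduce this encoding yourself, a minimal sketch with `pytorch-transformers` looks roughly like the following; the truncated paragraph string is illustrative, and the local vocab path assumes you ran `setup_dependencies.sh`:

```python
from pytorch_transformers import BertTokenizer

# Load the vocabulary downloaded by setup_dependencies.sh.
tokenizer = BertTokenizer.from_pretrained("resources/bert-base-uncased/vocab.txt")

question = "In what country is Normandy located?"
paragraph = ("The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
             "were the people who in the 10th and 11th centuries ...")

# BERT input: [CLS] question wordpieces [SEP] paragraph wordpieces [SEP]
tokens = (["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
          + tokenizer.tokenize(paragraph) + ["[SEP]"])
input_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens[:9])  # ['[CLS]', 'in', 'what', 'country', 'is', 'normandy', ...]
```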
#### 1. Span prediction for Question Answering

In this exercise you will be creating a model that is able to predict the boundary of the answer span given
an encoded representation generated by BERT. The task consists of _merely_ predicting two integer values:
- `start_position`: start position of the answer in the reference document
- `end_position`: end position of the answer in the reference document

To complete the exercise we provide you with a basic template of the QA model in the file `athnlp/models/qa_bert.py`. Every method should be implemented to complete the exercise. The model definition contains the following main methods:

- `__init__`: the constructor of the main class, used to initialise all the model parameters. You are supposed to initialise the
layer used to predict the span here.
- `forward`: the forward pass of the model. We want to encode the input representation using BERT and then use a linear layer to predict
the start and the end of the answer.
- `decode`: given the model predictions, convert them to tokens for visualisation and evaluation purposes.
- `get_metrics`: evaluates the metrics `start position accuracy`, `end position accuracy` and `span position accuracy`.

In this tutorial, we will be using the BERT API provided by `pytorch-transformers`. In particular, we are interested
in using the class [BertModel](https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertmodel). With the configuration file
that we created (see `resources/bert-base-uncased/config.json` for details), the BERT model will generate the following outputs:

- **last_hidden_state**: sequence of hidden states at the output of the last layer of the model. ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
- **pooler_output**: last-layer hidden state of the first token of the sequence (the classification token),
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained on the next-sentence prediction (classification)
objective during BERT pretraining. ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
- **attentions**: attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
List of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``

Given the output of the BERT model, it is possible to use two different strategies to predict the answer span:
1. Learn a Linear layer that predicts the answer span given the hidden states contained in `last_hidden_state`;
2. Learn a Linear layer that predicts the answer span given the `pooler_output`.

You are free to experiment with both strategies, but we recommend the first one. The reason behind our preference is that
the BERT `pooler_output` is usually *not* a good summary of the semantic content of the input, because it is used
during the original BERT training phase for a different task.

For the sake of consistency we have already provided a specific signature for the forward pass that your model implementation
should follow. We define the required inputs and outputs as follows:

#### Parameters
- tokens : Dict[str, torch.LongTensor]
    From a ``TextField`` (that has a bert-pretrained token indexer)
- span_start : torch.IntTensor, optional (default = None)
    A tensor of shape (batch_size, 1) which contains the start_position of the answer
    in the passage, or 0 if impossible. This is an `inclusive` token index.
    If this is given, we will compute a loss that gets included in the output dictionary.
- span_end : torch.IntTensor, optional (default = None)
    A tensor of shape (batch_size, 1) which contains the end_position of the answer
    in the passage, or 0 if impossible. This is an `inclusive` token index.
    If this is given, we will compute a loss that gets included in the output dictionary.

#### Returns

An output dictionary consisting of:
- logits : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    unnormalized log probabilities of the label.
- start_probs : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    probabilities of the label, obtained by applying a softmax to the predicted logits.
- end_probs : torch.FloatTensor
    A tensor of shape ``(batch_size, num_tokens)`` representing
    probabilities of the label, obtained by applying a softmax to the predicted logits.
- best_span : torch.LongTensor
    A tensor of shape ``(batch_size, 2)`` representing the predicted start and end position of the answer
    for each element of the batch. We suggest using the function [get_best_span](https://allenai.github.io/allennlp-docs/api/allennlp.models.reading_comprehension.html?highlight=get_best_span#allennlp.models.reading_comprehension.util.get_best_span) already
    implemented in AllenNLP.
- loss : torch.FloatTensor, optional
    The loss function to be optimised: the average of the losses computed for the start-position predictions
    and for the end-position predictions.
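To make the expected shapes concrete, here is a minimal, self-contained sketch of strategy 1, read as a per-token linear head over `last_hidden_state` (consistent with the ``(batch_size, num_tokens)`` shapes above). It is not the exact structure of the provided template, the variable names are illustrative, and the computation of `best_span` via AllenNLP's `get_best_span` is omitted:

```python
import torch
import torch.nn as nn
from pytorch_transformers import BertModel

bert = BertModel.from_pretrained("resources/bert-base-uncased")
span_head = nn.Linear(bert.config.hidden_size, 2)  # start and end logit per token

def forward_sketch(input_ids, span_start=None, span_end=None):
    last_hidden_state = bert(input_ids)[0]   # (batch, num_tokens, hidden)
    logits = span_head(last_hidden_state)    # (batch, num_tokens, 2)
    start_logits, end_logits = logits.split(1, dim=-1)
    start_logits = start_logits.squeeze(-1)  # (batch, num_tokens)
    end_logits = end_logits.squeeze(-1)      # (batch, num_tokens)
    output = {
        "start_probs": torch.softmax(start_logits, dim=-1),
        "end_probs": torch.softmax(end_logits, dim=-1),
    }
    if span_start is not None and span_end is not None:
        loss_fn = nn.CrossEntropyLoss()
        # Average the start- and end-position losses, as specified above.
        output["loss"] = (loss_fn(start_logits, span_start.squeeze(-1))
                          + loss_fn(end_logits, span_end.squeeze(-1))) / 2
    return output
```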
**Note**: We recommend that you train and predict with the built-in commands using ``allennlp train/predict``. If you
need to debug your code, you can programmatically execute the training process from ``athnlp/qa.py``.
We will be reporting performance using the official SQuAD evaluation metrics (please see [Rajpurkar and Jia et al. 2018](http://arxiv.org/abs/1806.03822) for details).


#### 2. Attention Mechanism

BERT incorporates a stack of multi-head attention layers (12 layers) which are used to learn a contextualised representation
of every token in the input utterance. In this second exercise we want to add an additional output to our model
that represents the attention values for every layer of the BERT model. The default implementation of [BertModel](https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertmodel) does not return the attention scores generated by BERT. In order to have access to the attention scores, you need to add the following key-value pair to the BERT configuration file (`resources/bert-base-uncased/config.json`): `"output_attentions":true`.

In your AllenNLP model implementation you will add a new key to the output dictionary:

- question_passage_attentions: list of ``torch.FloatTensor`` of shape ``(batch_size, num_heads, sequence_length, sequence_length)``

Use some of the test examples to visualise the model attentions. You might want to visualise the attention values
for just the last layer of BERT, or you can create a grid containing the attention scores for
all 12 BERT layers. In order to visualise the attention values, you can reuse the code provided with the
Neural Machine Translation predictor in Lab 5 and adapt it for BERT (a minimal visualisation sketch is given below).
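A minimal sketch of such a heatmap with `matplotlib`; the names are illustrative, `attentions` is the list of per-layer tensors described above, and `tokens` is the list of wordpiece strings for the example:

```python
import matplotlib.pyplot as plt

def plot_attention(attentions, tokens, layer=-1, head=0, example=0):
    """Plot one attention head of one layer as a heatmap."""
    weights = attentions[layer][example, head].detach().numpy()
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(weights)                 # (sequence_length, sequence_length)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)
    plt.show()
```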
-------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
nltk
allennlp
numpy
ipykernel
pytorch-transformers==1.1.0
-------------------------------------------------------------------------------- /setup_dependencies.sh: --------------------------------------------------------------------------------
#!/usr/bin/env bash

conda activate athnlp;

pip install -r requirements.txt;

python -m nltk.downloader brown;

mkdir resources;

# We download in advance all the models/data that are required by AllenNLP and BERT
wget -c https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.50d.txt.gz -P resources/;

mkdir resources/bert-base-uncased;

wget -c https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O resources/bert-base-uncased/vocab.txt;

wget -c https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin -O resources/bert-base-uncased/pytorch_model.bin;

wget -c "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json" -O resources/bert-base-uncased/config.json;
-------------------------------------------------------------------------------- /setup_dependencies_Docker.sh: --------------------------------------------------------------------------------
#!/usr/bin/env bash
pip install -r requirements.txt;

python -m nltk.downloader brown;

mkdir resources;
-------------------------------------------------------------------------------- /slides/AthensNLP-MT-23Sept2019-ABisazza.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/AthensNLP-MT-23Sept2019-ABisazza.pdf -------------------------------------------------------------------------------- /slides/Carreras_morning_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/Carreras_morning_2.pdf -------------------------------------------------------------------------------- /slides/DialogueSystem_VivianChen.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/DialogueSystem_VivianChen.pdf -------------------------------------------------------------------------------- /slides/MORNING_LECTURE_SLIDES_HERE: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/MORNING_LECTURE_SLIDES_HERE -------------------------------------------------------------------------------- /slides/McDonald_classification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/McDonald_classification.pdf -------------------------------------------------------------------------------- /slides/Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/Riedel_Machine Reading Tutorial at AthensNLP Summer School.pdf -------------------------------------------------------------------------------- /slides/athNLP-Lec3-BPlank.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/athnlp/athnlp-labs/1c3f0a8595cd8e4a7b7d62fc4e608ade6ac092fd/slides/athNLP-Lec3-BPlank.pdf --------------------------------------------------------------------------------