├── .gitignore ├── 404.html ├── Gemfile ├── Gemfile.lock ├── README.md ├── _config.yml ├── _includes ├── default-tutorial.html ├── demo-url.html ├── footer.html ├── getstarted-url.html ├── header.html ├── highlight.html ├── logo-list.html ├── meta.html ├── more-tutorials.html ├── nav-main.html ├── scripts.html └── svg-sprite.html ├── _layouts └── tutorial.html ├── allennlp-social-card.png ├── assets ├── ai2-logo-full.png ├── ai2-logo-header.png ├── ai2-logo.svg ├── allennlp-logo-color.png ├── allennlp-logo-dark.png ├── allennlp-logo-dark.svg ├── allennlp-logo-white.png ├── allennlp-logo-white.svg ├── banner-blurred.jpg ├── banner.jpg ├── chevron.svg ├── interpret-photos │ ├── bert-combined.jpg │ ├── eric.jpg │ ├── hotflip.png │ ├── hotflip_sentiment.png │ ├── jens.jpg │ ├── junlin.png │ ├── matt.jpg │ ├── ner.pdf │ ├── ner.png │ ├── reduction.png │ ├── saliency.png │ ├── sameer.jpg │ ├── sanjay.png │ ├── software.pdf │ └── squad_screenshot.png ├── logo-list.png ├── logo_only_icon.png └── mocha-photos │ ├── anthony.jpg │ ├── gabriel.jpg │ ├── matt.jpg │ └── sameer.jpg ├── contrast-sets.html ├── css ├── _includes │ ├── _alert.scss │ ├── _banner.scss │ ├── _base.scss │ ├── _btn.scss │ ├── _callouts.scss │ ├── _col-layout.scss │ ├── _colors.scss │ ├── _flex.scss │ ├── _footer.scss │ ├── _functions.scss │ ├── _header.scss │ ├── _icons.scss │ ├── _logo-list.scss │ ├── _models-table.scss │ ├── _publications.scss │ ├── _tab.scss │ ├── _text.scss │ ├── _toolbar.scss │ ├── _tutorial.scss │ ├── _utils.scss │ ├── _vars.scss │ └── vendor │ │ ├── _pure-custom-min.scss │ │ └── _rouge--pygments--autumn.scss └── main.scss ├── drop.html ├── elmo.html ├── favicon.ico ├── iirc.html ├── index.html ├── interpret.html ├── js └── scripts.js ├── mocha.html ├── orb.html ├── papers └── AllenNLP_white_paper.pdf ├── publications.html ├── quoref.html ├── ropes.html ├── torque.html └── tutorials.html /.gitignore: -------------------------------------------------------------------------------- 1 | _site/ 2 | .sass-cache 3 | .idea/ 4 | -------------------------------------------------------------------------------- /404.html: -------------------------------------------------------------------------------- 1 | --- 2 | --- 3 | 4 | 5 |
6 | {% include meta.html %} 7 |Please check your path and try again.
21 |Please follow this link.
8 | 9 | 10 | -------------------------------------------------------------------------------- /_includes/demo-url.html: -------------------------------------------------------------------------------- 1 | href="http://demo.allennlp.org" target="_blank" 2 | -------------------------------------------------------------------------------- /_includes/footer.html: -------------------------------------------------------------------------------- 1 | 40 | -------------------------------------------------------------------------------- /_includes/getstarted-url.html: -------------------------------------------------------------------------------- 1 | href="https://guide.allennlp.org/" 2 | -------------------------------------------------------------------------------- /_includes/header.html: -------------------------------------------------------------------------------- 1 |{{ include.code }}
33 | {% endif %}
34 |
--------------------------------------------------------------------------------
/_includes/logo-list.html:
--------------------------------------------------------------------------------
1 | 24 | Standard test sets for supervised learning evaluate in-distribution 25 | generalization. Unfortunately, when a dataset has systematic gaps 26 | (e.g., annotation artifacts), these evaluations are misleading: a model 27 | can learn simple decision rules that perform well on the test set but 28 | do not capture a dataset's intended capabilities. We propose a new 29 | annotation paradigm for NLP that helps to close systematic gaps in the 30 | test data. In particular, after a dataset is constructed, we recommend 31 | that the dataset authors manually perturb the test instances in small 32 | but meaningful ways that (typically) change the gold label, creating 33 | contrast sets. Contrast sets provide a local view of a model's 34 | decision boundary, which can be used to more accurately evaluate a 35 | model's true linguistic capabilities. We demonstrate the efficacy of 36 | contrast sets by creating them for 10 diverse NLP datasets 37 | (e.g., DROP reading comprehension, UD parsing, IMDb sentiment 38 | analysis). Although our contrast sets are not explicitly adversarial, 39 | model performance is significantly lower on them than on the original 40 | test sets---up to 25% in some cases. We release our contrast sets as 41 | new evaluation benchmarks and encourage future dataset construction 42 | efforts to follow similar annotation processes. 43 |
44 | 45 |Here is the paper.
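The paper also reports a consistency measure: a contrast set only counts as solved if the model answers the original test instance and every one of its perturbations correctly. A minimal sketch of that evaluation loop is below; the data layout and the model_predict callable are placeholders, not the paper's released evaluation code.

```python
from typing import Callable, Dict, List

def evaluate_contrast_sets(
    contrast_sets: List[List[Dict]],      # each inner list: original instance + its perturbations
    model_predict: Callable[[str], str],  # placeholder: maps input text -> predicted label
) -> Dict[str, float]:
    """Report per-instance accuracy and per-set consistency on contrast sets."""
    total_instances, correct_instances, consistent_sets = 0, 0, 0
    for contrast_set in contrast_sets:
        all_correct = True
        for instance in contrast_set:
            is_correct = model_predict(instance["text"]) == instance["gold_label"]
            correct_instances += is_correct
            total_instances += 1
            all_correct = all_correct and is_correct
        consistent_sets += all_correct
    return {
        "accuracy": correct_instances / total_instances,
        "consistency": consistent_sets / len(contrast_sets),  # whole set answered correctly
    }
```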
46 | 47 || Dataset | 51 |Contrast Sets | 52 |Type of NLP Task | 53 |
|---|---|---|
|
56 | BoolQ (Clark et al., 2019) 57 | |
58 | 59 | Data 60 | | 61 |Reading Comprehension | 62 |
|
65 | DROP (Dua et al., 2019) 66 | |
67 | 68 | Data 69 | | 70 |Reading Comprehension | 71 |
|
74 | MC-TACO (Zhou et al., 2019) 75 | |
76 | 77 | Data 78 | | 79 |Reading Comprehension | 80 |
|
83 | ROPES (Lin et al., 2019) 84 | |
85 | 86 | Data 87 | | 88 |Reading Comprehension | 89 |
|
92 | Quoref (Dasigi et al., 2019) 93 | |
94 | 95 | Data 96 | | 97 |Reading Comprehension | 98 |
|
101 | IMDb Sentiment Analysis (Maas et al., 2011) 102 | |
103 | 104 | Data 105 | | 106 |Classification | 107 |
|
110 | MATRES (Ning et al., 2018) 111 | |
112 | 113 | Data 114 | | 115 |Classification | 116 |
|
119 | NLVR2 (Suhr et al., 2019) 120 | |
121 | 122 | Data 123 | | 124 |Classification | 125 |
|
128 | PERSPECTRUM (Chen et al., 2019) 129 | |
130 | 131 | Data 132 | | 133 |Classification | 134 |
|
137 | UD English (Nivre et al., 2016) 138 | |
139 | 140 | Data 141 | | 142 |Parsing | 143 |
148 | @article{Gardner2020Evaluating,
149 | title={Evaluating NLP Models via Contrast Sets},
150 | author={Gardner, Matt and Artzi, Yoav and Basmova, Victoria and Berant, Jonathan and Bogin, Ben and Chen, Sihao
151 | and Dasigi, Pradeep and Dua, Dheeru and Elazar, Yanai and Gottumukkala, Ananth and Gupta, Nitish
152 | and Hajishirzi, Hanna and Ilharco, Gabriel and Khashabi, Daniel and Lin, Kevin and Liu, Jiangming
153 | and Liu, Nelson F. and Mulcaire, Phoebe and Ning, Qiang and Singh, Sameer and Smith, Noah A.
154 | and Subramanian, Sanjay and Tsarfaty, Reut and Wallace, Eric and Zhang, Ally and Zhou, Ben},
155 | journal={arXiv preprint},
156 | year={2020}
157 | }
158 |
159 | 24 | With system performance on existing reading comprehension benchmarks nearing or surpassing human performance, we need a new, hard 25 | dataset that improves systems' capabilities to actually read paragraphs of text. DROP is a crowdsourced, 26 | adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input 27 | positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much 28 | more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. 29 |
30 | 31 |32 | AllenNLP provides an easy way for you to get started with this dataset, with a dataset reader that can be used with any model you 33 | design, and a reference implementation of the NAQANet model that was introduced in the DROP paper. Find more details in the links 34 | below. 35 |
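Before wiring the dataset into a model, it can help to look at the raw data. The sketch below iterates over the official DROP JSON release; the filename and field names follow the public release format and should be treated as assumptions if your copy differs.

```python
import json

# drop_dataset_train.json is the training file from the official DROP release (assumed name).
with open("drop_dataset_train.json") as f:
    dataset = json.load(f)

for passage_id, passage_data in dataset.items():
    passage = passage_data["passage"]
    for qa in passage_data["qa_pairs"]:
        # Each gold answer is a number, one or more text spans, or a date.
        print(qa["question"])
        print("  answer:", qa["answer"])
```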
36 | 37 |Citation: 59 |
60 | @inproceedings{Dua2019DROP,
61 | author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
62 | title={ {DROP}: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
63 | booktitle={Proc. of NAACL},
64 | year={2019}
65 | }
66 |
67 |
68 | ELMo is a deep contextualized word representation that models 23 | both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary 24 | across linguistic contexts (i.e., to model polysemy). 25 | These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. 26 | They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis. 27 |
ELMo representations are:
- Contextual: the representation of each word depends on the entire context in which it is used.
- Deep: the word representations combine all layers of a deep pre-trained neural network.
- Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
Adding ELMo to existing NLP systems significantly improves the state-of-the-art for every considered task. In most cases, they can be simply swapped for pre-trained GloVe or other word vectors. 37 |
| Task | Previous SOTA | Previous SOTA Score | Our baseline | ELMo + Baseline | Increase (Absolute/Relative) |
| SQuAD | SAN | 84.4 | 81.1 | 85.8 | 4.7 / 24.9% |
| SNLI | Chen et al (2017) | 88.6 | 88.0 | 88.7 +/- 0.17 | 0.7 / 5.8% |
| SRL | He et al (2017) | 81.7 | 81.4 | 84.6 | 3.2 / 17.2% |
| Coref | Lee et al (2017) | 67.2 | 67.2 | 70.4 | 3.2 / 9.8% |
| NER | Peters et al (2017) | 91.93 +/- 0.19 | 90.15 | 92.22 +/- 0.10 | 2.06 / 21% |
| Sentiment (5-class) | McCann et al (2017) | 53.7 | 51.4 | 54.7 +/- 0.5 | 3.3 / 6.8% |
| Model | Link (Weights/Options File) | # Parameters (Millions) | LSTM Hidden Size/Output Size | # Highway Layers | SRL F1 | Constituency Parsing F1 |
| Small | weights / options | 13.6 | 1024/128 | 1 | 83.62 | 93.12 |
| Medium | weights / options | 28.0 | 2048/256 | 1 | 84.04 | 93.60 |
| Original | weights / options | 93.6 | 4096/512 | 2 | 84.63 | 93.85 |
| Original (5.5B) | weights / options | 93.6 | 4096/512 | 2 | 84.93 | 94.01 |
The baseline models described are from the original ELMo paper for SRL and from Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Joshi et al, 2018) for the Constituency Parser. We do not include GloVe vectors in these models to provide a direct comparison between ELMo representations - in some cases, this results in a small drop in performance (0.5 F1 for the Constituency Parser, > 0.1 for the SRL model).
All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as a default model.
| Model | Link (Weights/Options File) | Contributor/Notes |
| Portuguese (Wikipedia corpus) | weights / options | Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs. |
| Portuguese (brWaC corpus) | weights / options | |
| Japanese | weights / options | ExaWizards Inc. Enkhbold Bataa, Joshua Wu. (paper) |
| German | code and weights | Philip May & T-Systems onsite |
| Basque | code and weights | Stefan Schweter |
| PubMed | weights / options | Matthew Peters |
| Transformer ELMo | model archive | Joel Grus and Brendan Roof |
There are reference implementations of the pre-trained bidirectional language model available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP. The TensorFlow version is also available in bilm-tf.
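For example, computing ELMo representations with the AllenNLP PyTorch implementation looks roughly like the sketch below. The options and weights paths are placeholders; point them at any of the links in the tables above.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholders: download or link the options/weights files from the tables above.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations controls how many task-specific weighted mixes of the
# biLM layers the module returns.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "cat", "sat", "."], ["ELMo", "is", "contextual", "."]]
character_ids = batch_to_ids(sentences)          # (batch, max_len, 50) character ids
outputs = elmo(character_ids)
embeddings = outputs["elmo_representations"][0]  # (batch, max_len, 1024) for the original model
mask = outputs["mask"]                           # which positions are real tokens
```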
You can retrain ELMo models using the TensorFlow code in bilm-tf.
See our paper Deep contextualized word representations for more information about the algorithm and a detailed analysis. 169 |
Citation: 170 |
171 | @inproceedings{Peters:2018,
172 | author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
173 | title={Deep contextualized word representations},
174 | booktitle={Proc. of NAACL},
175 | year={2018}
176 | }
177 |
178 | 24 | Humans often have to read multiple documents to address their information needs. However, most existing reading comprehension tasks only focus on questions for which the contexts provide all the information required to answer them, thus not evaluating a system's performance at identifying a potential lack of sufficient information and locating the sources for that information. IIRC is a crowdsourced dataset consisting of information-seeking questions requiring models to identify and then retrieve necessary information that is missing from the original context. 25 |
26 | 27 |28 | Find more details in the links below. 29 |
30 | 31 |Citation: 41 |
42 | @inproceedings{Ferguson2020IIRC,
43 | author={James Ferguson and Matt Gardner and Hannaneh Hajishirzi and Tushar Khot and Pradeep Dasigi},
44 | title={ {IIRC}: A Dataset of Incomplete Information Reading Comprehension Questions},
45 | booktitle={Proc. of EMNLP},
46 | year={2020}
47 | }
48 |
49 |
50 | 46 | Despite constant advances and seemingly super-human performance on constrained domains, state-of-the-art models for NLP are imperfect. These imperfections, coupled with today's advances being driven by (seemingly black-box) neural models, leave researchers and practitioners scratching their heads asking, why did my model make this prediction? 47 |
We present AllenNLP Interpret, a toolkit built on top of AllenNLP for interactive model interpretations. The toolkit makes it easy to apply gradient-based saliency maps and adversarial attacks to new models, as well as to develop new interpretation methods. AllenNLP Interpret contains three components: a suite of interpretation techniques applicable to most models, APIs for developing new interpretation methods (e.g., APIs to obtain input gradients), and reusable front-end components for visualizing the interpretation results.
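As a rough sketch of the Python API (details vary across AllenNLP versions), a gradient-based saliency map can be computed for any trained predictor; the model archive path below is a placeholder.

```python
from allennlp.predictors import Predictor
from allennlp.interpret.saliency_interpreters import SimpleGradient

predictor = Predictor.from_path("path/to/your/model.tar.gz")  # placeholder archive
interpreter = SimpleGradient(predictor)

# Returns normalized input-gradient magnitudes per token, i.e. a saliency map.
saliency = interpreter.saliency_interpret_from_json(
    {"sentence": "a very well-made, funny and entertaining picture"}
)
print(saliency)

# Adversarial attacks (e.g. HotFlip, input reduction) follow the same pattern via
# allennlp.interpret.attackers.
```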
52 | 53 |This page presents links to:
54 | 55 |



Citation: 80 |
@inproceedings{Wallace2019AllenNLP,
81 | Author = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian
82 | and Matt Gardner and Sameer Singh},
83 | Booktitle = {Empirical Methods in Natural Language Processing},
84 | Year = {2019},
85 | Title = { {AllenNLP Interpret}: A Framework for Explaining Predictions of {NLP} Models}}
86 |
87 |
88 |
Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations (MOCHA). MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train an evaluation metric, LERC (a Learned Evaluation metric for Reading Comprehension), to mimic human judgement scores.
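A learned metric such as LERC is judged by how well its scores track the human judgements MOCHA collects. The sketch below computes that correlation for any candidate scorer; the field names and score_fn are illustrative placeholders, not the released MOCHA format or the official LERC code.

```python
from scipy.stats import pearsonr

def correlate_with_humans(examples, score_fn):
    """examples: dicts with a question, reference answer, candidate answer, and a human
    correctness judgement; score_fn returns the metric's score for a candidate answer."""
    metric_scores = [score_fn(ex["question"], ex["reference"], ex["candidate"]) for ex in examples]
    human_scores = [ex["human_score"] for ex in examples]
    correlation, _ = pearsonr(metric_scores, human_scores)
    return correlation
```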
51 | 52 |53 | Find out more in the links below. 54 |
55 | 56 |87 | Citation: 88 |
89 | @inproceedings{Chen2020MOCHAAD,
90 | author={Anthony Chen and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
91 | title={MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics},
92 | booktitle={EMNLP},
93 | year={2020}
94 | }
95 |
96 |
97 |
98 |
135 |
Many diverse reading comprehension datasets have recently been introduced to study various phenomena in natural language, ranging from simple paraphrase matching and entity typing to entity tracking and understanding the implications of the context. Given the availability of many such datasets, comprehensive and reliable evaluation is tedious and time-consuming. ORB is an evaluation server that reports performance on diverse reading comprehension datasets, encouraging and facilitating the testing of a single model's capability in understanding a wide variety of reading phenomena. It also includes a suite of synthetic augmentations that test a model's ability to generalize to out-of-distribution syntactic structures.
31 | 32 |33 | Find more details in the links below. 34 |
This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding. AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily. It is built on top of PyTorch, allowing for dynamic computation graphs, and provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations...
36 |We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals. 39 |
40 |24 | Machine comprehension of texts longer than a single sentence often requires coreference resolution. 25 | However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail 26 | to evaluate the ability of models to resolve coreference. We present a new crowdsourced 27 | dataset containing more than 24K span-selection questions that require resolving coreference among 28 | entities in about 4.7K English paragraphs from Wikipedia. 29 |
AllenNLP provides an easy way for you to get started with this dataset, with a dataset reader that can be used with any model you design, and a reference implementation of the baseline models used in the paper. Find more details in the links below.
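To inspect the data before plugging in the dataset reader, you can walk the SQuAD-style JSON layout that Quoref is distributed in; the filename and field names below are assumptions to check against your copy.

```python
import json

with open("quoref-train-v0.1.json") as f:   # placeholder filename
    articles = json.load(f)["data"]

for article in articles:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            answers = [answer["text"] for answer in qa["answers"]]
            print(qa["question"], "->", answers)
```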
36 | 37 |Citation: 46 |
47 | @inproceedings{Dasigi2019Quoref,
48 | author={Pradeep Dasigi and Nelson F. Liu and Ana Marasovi\'{c} and Noah A. Smith and Matt Gardner},
49 | title={Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning},
50 | booktitle={Proc. of EMNLP-IJCNLP},
51 | year={2019}
52 | }
53 |
54 |
ROPES is a reading comprehension dataset that tests a system's ability to apply knowledge from reading to novel situations. In this crowdsourced, 14k question-answering benchmark, a system is required to read a passage of expository text (e.g., Wikipedia, textbooks) and use the causal relationships in the text to answer questions about a novel situation.
28 | 29 |Citation: 37 |
38 | @inproceedings{Lin2019ReasoningOP,
39 | author={Kevin Lin and Oyvind Tafjord and Peter Clark and Matt Gardner},
40 | title={Reasoning Over Paragraph Effects in Situations},
41 | booktitle={MRQA@EMNLP},
42 | year={2019}
43 | }
44 |
45 |
46 | 24 | A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance. 25 |
26 | 27 |
32 | @inproceedings{NWHPGR20,
33 | author={Qiang Ning and Hao Wu and Rujun Han and Nanyun Peng and Matt Gardner and Dan Roth},
34 | title={ {TORQUE}: A Reading Comprehension Dataset of Temporal Ordering Questions},
35 | booktitle={Proc. of EMNLP},
36 | year={2020}
37 | }
38 |
39 |
40 |