├── CNAME ├── structured ├── requirements.txt ├── README.md └── export.py ├── _config.yml ├── .gitignore ├── Gemfile ├── img ├── edit_file.png └── propose_file_change.png ├── english ├── automatic_speech_recognition.md ├── multi-task_learning.md ├── stance_detection.md ├── semantic_role_labeling.md ├── coreference_resolution.md ├── domain_adaptation.md ├── ccg_supertagging.md ├── missing_elements.md ├── shallow_syntax.md ├── information_extraction.md ├── machine_translation.md ├── constituency_parsing.md ├── lexical_normalization.md ├── multimodal.md ├── natural_language_inference.md ├── part-of-speech_tagging.md ├── semantic_textual_similarity.md ├── common_sense.md ├── taxonomy_learning.md ├── text_classification.md ├── word_sense_disambiguation.md ├── grammatical_error_correction.md ├── relation_prediction.md ├── temporal_processing.md ├── named_entity_recognition.md ├── relationship_extraction.md ├── dependency_parsing.md ├── dialogue.md ├── entity_linking.md └── summarization.md ├── _includes ├── chart.html └── table.html ├── spanish └── entity_linking.md ├── chinese ├── chinese.md └── chinese_word_segmentation.md ├── jekyll_instructions.md ├── LICENSE ├── hindi └── hindi.md ├── README.md └── vietnamese └── vietnamese.md /CNAME: -------------------------------------------------------------------------------- 1 | nlpprogress.com -------------------------------------------------------------------------------- /structured/requirements.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-slate -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | _site/ 2 | Gemfile* 3 | venv 4 | .idea 5 | structured.json 6 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | gem 'github-pages', group: :jekyll_plugins -------------------------------------------------------------------------------- /img/edit_file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binhetech/NLP-progress/master/img/edit_file.png -------------------------------------------------------------------------------- /img/propose_file_change.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/binhetech/NLP-progress/master/img/propose_file_change.png -------------------------------------------------------------------------------- /english/automatic_speech_recognition.md: -------------------------------------------------------------------------------- 1 | # Automatic speech recognition (ASR) 2 | 3 | Automatic speech recognition is the task of automatically recognizing speech. You 4 | can find a repository tracking the state-of-the-art [here](https://github.com/syhw/wer_are_we). 5 | -------------------------------------------------------------------------------- /_includes/chart.html: -------------------------------------------------------------------------------- 1 | 20 | 21 |
22 | {% for result in include.results %} 23 | {% assign score = result[include.score] %} 24 | 25 | {{ result.authors }} ({{ result.year }}) 26 | {{ score }} 27 | 28 | {% endfor %} 29 |
30 | -------------------------------------------------------------------------------- /spanish/entity_linking.md: -------------------------------------------------------------------------------- 1 | # Entity Linking 2 | 3 | See [here](../english/entity_linking.md) for more information about the task. 4 | 5 | ### Datasets 6 | 7 | #### AIDA CoNLL-YAGO Dataset 8 | 9 | ##### Disambiguation-Only Models 10 | 11 | | Model | Micro-Precision | Paper / Source | Code | 12 | | ------------- | :-----:| :----: | :----: | --- | 13 | | Sil et al. (2018) | 82.3 | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | | 14 | | Tsai & Roth (2016) | 80.9 | [Cross-lingual wikification using multilingual embeddings](http://cogcomp.org/papers/TsaiRo16b.pdf) | | 15 | 16 | [Go back to the README](../README.md) 17 | -------------------------------------------------------------------------------- /chinese/chinese.md: -------------------------------------------------------------------------------- 1 | # Chinese NLP tasks 2 | 3 | ## Entity linking 4 | 5 | See [here](../english/entity_linking.md) for more information about the task. 6 | 7 | ### Datasets 8 | 9 | #### AIDA CoNLL-YAGO Dataset 10 | 11 | ##### Disambiguation-Only Models 12 | 13 | | Model | Micro-Precision | Paper / Source | Code | 14 | | ------------- | :-----:| :----: | :----: | 15 | | Sil et al. (2018) | 84.4 | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | | 16 | | Tsai & Roth (2016) | 83.6 | [Cross-lingual wikification using multilingual embeddings](http://cogcomp.org/papers/TsaiRo16b.pdf) | | 17 | 18 | [Go back to the README](../README.md) 19 | -------------------------------------------------------------------------------- /structured/README.md: -------------------------------------------------------------------------------- 1 | # Exporting NLP-progress into a structure format 2 | 3 | Parse and export the unstructured information from Markdown into a structured JSON format. 4 | 5 | ## Installation 6 | 7 | Requires Python 3.6+. 8 | 9 | Create a virtualenv and install requirements (you can also use conda): 10 | 11 | ```shell 12 | virtualenv -p python3 venv 13 | source venv/bin/activate 14 | 15 | pip install -r requirements.txt 16 | ``` 17 | 18 | ## Running 19 | 20 | From the NLP-progress root directly (where the LICENCE file is), run: 21 | 22 | ```shell 23 | python structured/export.py 24 | ``` 25 | 26 | For example, to export all the data in the `english/` directory: 27 | 28 | ```shell 29 | python structured/export.py english 30 | ``` 31 | 32 | By default the output will be written into `structured.json`, but you can override this with the `--output` parameter. 33 | 34 | 35 | -------------------------------------------------------------------------------- /_includes/table.html: -------------------------------------------------------------------------------- 1 | {% assign scores = include.scores | split: "," %} 2 | 3 | 4 | 5 | 6 | 7 | {% for score in scores %} 8 | 9 | {% endfor %} 10 | 11 | 12 | 13 | 14 | 15 | {% for result in include.results %} 16 | 17 | 18 | {% for score in scores %} 19 | 20 | {% endfor %} 21 | 22 | 27 | 28 | {% endfor %} 29 | 30 |
Model | {{ score }} | Paper / Source | Code
{% if result.model %} {{ result.model }} by {% endif %} {{ result.authors }} ({{ result.year }}) | {{ result[score] }} | {{ result.paper }} | 23 | {% for el in result.code %} 24 | {{ el.name }} 25 | {% endfor %} 26 |
31 | -------------------------------------------------------------------------------- /english/multi-task_learning.md: -------------------------------------------------------------------------------- 1 | # Multi-task learning 2 | 3 | Multi-task learning aims to learn multiple different tasks simultaneously while maximizing 4 | performance on one or all of the tasks. 5 | 6 | ### DecaNLP 7 | 8 | The [Natural Language Decathlon](https://arxiv.org/abs/1806.08730) (decaNLP) is a benchmark for studying general NLP 9 | models that can perform a variety of complex, natural language tasks. 10 | It evaluates performance on ten disparate natural language tasks. 11 | 12 | Results can be seen on the [public leaderboard](https://decanlp.com/). 13 | 14 | ### GLUE 15 | 16 | The [General Language Understanding Evaluation benchmark](https://arxiv.org/abs/1804.07461) (GLUE) 17 | is a tool for evaluating and analyzing the performance of models across a diverse 18 | range of existing natural language understanding tasks. Models are evaluated based on their 19 | average accuracy across all tasks. 20 | 21 | The state-of-the-art results can be seen on the public [GLUE leaderboard](https://gluebenchmark.com/leaderboard). 22 | 23 | [Go back to the README](../README.md) 24 | -------------------------------------------------------------------------------- /jekyll_instructions.md: -------------------------------------------------------------------------------- 1 | # Instructions for building the site locally 2 | 3 | You can build the site locally using Jekyll by following the steps detailed 4 | [here](https://help.github.com/articles/setting-up-your-github-pages-site-locally-with-jekyll/#requirements): 5 | 6 | 1. Check whether you have Ruby 2.1.0 or higher installed with `ruby --version`, otherwise [install it](https://www.ruby-lang.org/en/downloads/). 7 | On OS X for instance, this can be done with `brew install ruby`. Make sure you also have `ruby-dev` and `zlib1g-dev` installed. 8 | 1. Install Bundler `gem install bundler`. If you run into issues with installing bundler on OS X, have a look 9 | [here](https://bundler.io/v1.16/guides/rubygems_tls_ssl_troubleshooting_guide.html) for troubleshooting tips. Also try refreshing 10 | the terminal. 11 | 1. Clone the repo locally: `git clone https://github.com/sebastianruder/NLP-progress` 12 | 1. Navigate to the repo with `cd NLP-progress` 13 | 1. Install Jekyll: `bundle install` 14 | 1. Run the Jekyll site locally: `bundle exec jekyll serve` 15 | 1. You can now preview the local Jekyll site in your browser at `http://localhost:4000`. 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Sebastian Ruder 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /english/stance_detection.md: -------------------------------------------------------------------------------- 1 | # Stance detection 2 | 3 | Stance detection is the extraction of a subject's reaction to a claim made by a primary actor. It is a core part of a set of approaches to fake news assessment. 4 | 5 | Example: 6 | 7 | * Source: "Apples are the most delicious fruit in existence" 8 | * Reply: "Obviously not, because that is a reuben from Katz's" 9 | * Stance: deny 10 | 11 | ### RumourEval 12 | 13 | The [RumourEval 2017](http://www.aclweb.org/anthology/S/S17/S17-2006.pdf) dataset has been used for stance detection in English (subtask A). It features multiple stories and thousands of reply:response pairs, with train, test and evaluation splits each containing a distinct set of over-arching narratives. 14 | 15 | This dataset subsumes the large [PHEME collection of rumors and stance](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0150989), which includes German. 16 | 17 | | Model | Accuracy | Paper / Source | 18 | | ------------- | ----- | --- | 19 | | Kochkina et al. 2017 | 0.784 | [Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM](http://www.aclweb.org/anthology/S/S17/S17-2083.pdf)| 20 | | Bahuleyan and Vechtomova 2017| 0.780 | [UWaterloo at SemEval-2017 Task 8: Detecting Stance towards Rumours with Topic Independent Features](http://www.aclweb.org/anthology/S/S17/S17-2080.pdf) | 21 | 22 | [Go back to the README](../README.md) 23 | -------------------------------------------------------------------------------- /english/semantic_role_labeling.md: -------------------------------------------------------------------------------- 1 | # Semantic role labeling 2 | 3 | Semantic role labeling aims to model the predicate-argument structure of a sentence 4 | and is often described as answering "Who did what to whom". BIO notation is typically 5 | used for semantic role labeling. 6 | 7 | Example: 8 | 9 | | Housing | starts | are | expected | to | quicken | a | bit | from | August’s | pace | 10 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 11 | | B-ARG1 | I-ARG1 | O | O | O | V | B-ARG2 | I-ARG2 | B-ARG3 | I-ARG3 | I-ARG3 | 12 | 13 | ### OntoNotes 14 | 15 | Models are typically evaluated on the [OntoNotes benchmark](http://www.aclweb.org/anthology/W13-3516) based on F1. 16 | 17 | | Model | F1 | Paper / Source | 18 | | ------------- | :-----:| --- | 19 | | He et al., (2018) + ELMO | 85.5 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) | 20 | | (He et al., 2017) + ELMo (Peters et al., 2018) | 84.6 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | 21 | | Tan et al. (2018) | 82.7 | [Deep Semantic Role Labeling with Self-Attention](https://arxiv.org/abs/1712.01586) | 22 | | He et al. 
(2018) | 82.1 | [Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling](http://aclweb.org/anthology/P18-2058) | 23 | | He et al. (2017) | 81.7 | [Deep Semantic Role Labeling: What Works and What’s Next](http://aclweb.org/anthology/P17-1044) | 24 | 25 | [Go back to the README](../README.md) 26 | -------------------------------------------------------------------------------- /hindi/hindi.md: -------------------------------------------------------------------------------- 1 | # Hindi 2 | 3 | ## Chunking 4 | 5 | | Model | Dev accuracy | Test F1 | Paper / Source | Code | 6 | | ------------- | :-----:| :-----:| --- | --- | 7 | | Dalal et al. (2006) | 87.40 | 82.40 | [Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach](https://www.researchgate.net/publication/241211496_Hindi_Part-of-Speech_Tagging_and_Chunking_A_Maximum_Entropy_Approach) | | 8 | 9 | ## Part-of-speech tagging 10 | 11 | | Model | Dev accuracy | Test F1 | Paper / Source | Code | 12 | | ------------- | :-----:| :-----:| --- | --- | 13 | | Jha et al. (2018) | 99.30 | 99.06 | [Multi-Task Deep Morphological Analyzer: Context-Aware Joint Morphological Tagging and Lemma Prediction](https://arxiv.org/ftp/arxiv/papers/1811/1811.08619.pdf) | [mt-dma](https://github.com/Saurav0074/mt-dma) 14 | | Dalal et al. (2006) | 89.35 | 82.22 | [Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach](https://www.researchgate.net/publication/241211496_Hindi_Part-of-Speech_Tagging_and_Chunking_A_Maximum_Entropy_Approach) | | 15 | 16 | ## Machine Translation 17 | 18 | The IIT Bombay English-Hindi Parallel Corpus used by Kunchukuttan et al. (2018) can be accessed [here](http://www.cfilt.iitb.ac.in/iitb_parallel/). 19 | 20 | | Model | BLEU | METEOR | Paper / Source | Code | 21 | | ------------- | :-----:| :-----:| --- | --- | 22 | | Kunchukuttan et al. (2018) | 12.23 (eng-hin) & 12.83 (hin-eng) | 0.308 | [The IIT Bombay English-Hindi Parallel Corpus](http://www.lrec-conf.org/proceedings/lrec2018/pdf/847.pdf) | | 23 | -------------------------------------------------------------------------------- /english/coreference_resolution.md: -------------------------------------------------------------------------------- 1 | # Coreference resolution 2 | 3 | Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities. 4 | 5 | Example: 6 | 7 | ``` 8 | +-----------+ 9 | | | 10 | I voted for Obama because he was most aligned with my values", she said. 11 | | | | 12 | +-------------------------------------------------+------------+ 13 | ``` 14 | 15 | "I", "my", and "she" belong to the same cluster and "Obama" and "he" belong to the same cluster. 16 | 17 | ### CoNLL 2012 18 | 19 | Experiments are conducted on the data of the [CoNLL-2012 shared task](http://www.aclweb.org/anthology/W12-4501), which 20 | uses OntoNotes coreference annotations. Papers 21 | report the precision, recall, and F1 of the MUC, B3, and CEAFφ4 metrics using the official 22 | CoNLL-2012 evaluation scripts. The main evaluation metric is the average F1 of the three metrics. 
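The CoNLL score is simply the unweighted mean of the three F1 values produced by the official scorer; a minimal sketch (a hypothetical helper for illustration, not the official evaluation script):

```python
def conll_avg_f1(muc_f1, b_cubed_f1, ceaf_phi4_f1):
    """Average F1 of the MUC, B-cubed and CEAF-phi4 metrics (the CoNLL-2012 score)."""
    return (muc_f1 + b_cubed_f1 + ceaf_phi4_f1) / 3.0

# Illustrative numbers only:
# conll_avg_f1(80.4, 73.6, 67.6)  ->  73.87 (rounded)
```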
23 | 24 | | Model | Avg F1 | Paper / Source | Code | 25 | | ------------- | :-----:| --- | --- | 26 | | (Lee et al., 2017)+ELMo (Peters et al., 2018)+coarse-to-fine & second-order inference (Lee et al., 2018) | 73.0 | [Higher-order Coreference Resolution with Coarse-to-fine Inference](http://aclweb.org/anthology/N18-2108) | [Official](https://github.com/kentonl/e2e-coref) | 27 | | (Lee et al., 2017)+ELMo (Peters et al., 2018) | 70.4 | [Deep contextualized word representatIions](https://arxiv.org/abs/1802.05365) | | 28 | | Lee et al. (2017) | 67.2 | [End-to-end Neural Coreference Resolution](https://arxiv.org/abs/1707.07045) | | 29 | 30 | [Go back to the README](../README.md) 31 | -------------------------------------------------------------------------------- /english/domain_adaptation.md: -------------------------------------------------------------------------------- 1 | # Domain adaptation 2 | 3 | ## Sentiment analysis 4 | 5 | ### Multi-Domain Sentiment Dataset 6 | 7 | The [Multi-Domain Sentiment Dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) is a common 8 | evaluation dataset for domain adaptation for sentiment analysis. It contains product reviews from 9 | Amazon.com from different product categories, which are treated as distinct domains. 10 | Reviews contain star ratings (1 to 5 stars) that are generally converted into binary labels. Models are 11 | typically evaluated on a target domain that is different from the source domain they were trained on, while only 12 | having access to unlabeled examples of the target domain (unsupervised domain adaptation). The evaluation 13 | metric is accuracy and scores are averaged across each domain. 14 | 15 | | Model | DVD | Books | Electronics | Kitchen | Average | Paper / Source | 16 | | ------------- | :-----:| :-----:| :-----:| :-----:| :-----:| --- | 17 | | Multi-task tri-training (Ruder and Plank, 2018) | 78.14 | 74.86 | 81.45 | 82.14 | 79.15 | [Strong Baselines for Neural Semi-supervised Learning under Domain Shift](https://arxiv.org/abs/1804.09530) | 18 | | Asymmetric tri-training (Saito et al., 2017) | 76.17 | 72.97 | 80.47 | 83.97 | 78.39 | [Asymmetric Tri-training for Unsupervised Domain Adaptation](https://arxiv.org/abs/1702.08400) | 19 | | VFAE (Louizos et al., 2015) | 76.57 | 73.40 | 80.53 | 82.93 | 78.36 | [The Variational Fair Autoencoder](https://arxiv.org/abs/1511.00830) | 20 | | DANN (Ganin et al., 2016) | 75.40 | 71.43 | 77.67 | 80.53 | 76.26 | [Domain-Adversarial Training of Neural Networks](https://arxiv.org/abs/1505.07818) | 21 | 22 | [Go back to the README](../README.md) 23 | -------------------------------------------------------------------------------- /english/ccg_supertagging.md: -------------------------------------------------------------------------------- 1 | # CCG supertagging 2 | 3 | Combinatory Categorical Grammar (CCG; [Steedman, 2000](http://www.citeulike.org/group/14833/article/8971002)) is a 4 | highly lexicalized formalism. The standard parsing model of [Clark and Curran (2007)](https://www.mitpressjournals.org/doi/abs/10.1162/coli.2007.33.4.493) 5 | uses over 400 lexical categories (or _supertags_), compared to about 50 part-of-speech tags for typical parsers. 
6 | 7 | Example: 8 | 9 | | Vinken | , | 61 | years | old | 10 | | --- | ---| --- | --- | --- | 11 | | N| , | N/N | N | (S[adj]\ NP)\ NP | 12 | 13 | ### CCGBank 14 | 15 | The CCGBank is a corpus of CCG derivations and dependency structures extracted from the Penn Treebank by 16 | [Hockenmaier and Steedman (2007)](http://www.aclweb.org/anthology/J07-3004). Sections 2-21 are used for training, 17 | section 00 for development, and section 23 as in-domain test set. 18 | Performance is only calculated on the 425 most frequent labels. Models are evaluated based on accuracy. 19 | 20 | | Model | Accuracy | Paper / Source | 21 | | ------------- | :-----:| --- | 22 | | Clark et al. (2018) | 96.1 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | 23 | | Lewis et al. (2016) | 94.7 | [LSTM CCG Parsing](https://aclweb.org/anthology/N/N16/N16-1026.pdf) | 24 | | Vaswani et al. (2016) | 94.24 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) | 25 | | Low supervision (Søgaard and Goldberg, 2016) | 93.26 | [Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038) | 26 | | Xu et al. (2015) | 93.00 | [CCG Supertagging with a Recurrent Neural Network](http://www.aclweb.org/anthology/P15-2041) | 27 | 28 | [Go back to the README](../README.md) 29 | -------------------------------------------------------------------------------- /english/missing_elements.md: -------------------------------------------------------------------------------- 1 | # Missing Elements 2 | 3 | Missing elements are a collection of phenomenon that deals with things that are meant, but not explicitly mentioned in the text. 4 | There are different kinds of missing elements, which have different aspects and behaviour. 5 | For example, [Ellipsis](https://en.wikipedia.org/wiki/Ellipsis_(linguistics)), Fused-Head, Bridging Anaphora, etc. 6 | 7 | 8 | ### Numeric Fused-Head (NFH) 9 | FHs constructions are noun phrases (NPs) in which the head noun is missing and is said to be “fused” with its dependent modifier. 10 | This missing information is implicit and is important for sentence understanding. 11 | 12 | The Numeric [Fused-Head dataset](https://github.com/yanaiela/num_fh/tree/master/data/resolution/processed) 13 | consists of ~10K examples of crowd-sourced classified examples, labeled into 7 different categories, from two types. 14 | In the first type, *Reference*, the missing head is referenced explicitly somewhere else in the discourse, either in the 15 | same sentence or in surrounding sentences. 16 | In the second type, *Implicit*, the missing head does not appear in the text and needs to be inferred by the reader or 17 | hearer based on the context or world knowledge. This category was labeled into the 6 most common categories of the dataset. 18 | Models are evaluated based on accuracy. 19 | 20 | Annotated Examples: 21 | 22 | #### Reference 23 | 24 | | I | bought | 5 | apples | but | got | only | 4 | . | 25 | | --- | --- | --- | --- | --- | --- | --- | --- | --- | 26 | | | | | HEAD | | | | NFH-REFERENCE | | 27 | 28 | #### Implicit 29 | 30 | | Let | 's | meet | at | 5 | tomorrow | ? | 31 | | --- | --- | --- | --- | --- | --- | --- | 32 | | | | | | NFH-TIME | | | 33 | 34 | 35 | | Model | Accuracy | Paper / Source | Code | 36 | | ------------- | :-----:| --- | :-----: | 37 | | Bi-LSTM + Scoring (Elazar and Goldberg, 2019) | 60.8 | [Where’s My Head? 
Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution](https://arxiv.org/abs/1905.10886) | [Official](https://github.com/yanaiela/num_fh) | 38 | | Bi-LSTM + Elmo + Scoring (Elazar and Goldberg, 2019) | 74.0 | [Where’s My Head? Definition, Dataset and Models for Numeric Fused-Heads Identification and Resolution](https://arxiv.org/abs/1905.10886) | [Official](https://github.com/yanaiela/num_fh) | 39 | -------------------------------------------------------------------------------- /english/shallow_syntax.md: -------------------------------------------------------------------------------- 1 | # Shallow syntax 2 | 3 | Shallow syntactic tasks provide an analysis of a text on the level of the syntactic structure 4 | of the text. 5 | 6 | ## Chunking 7 | 8 | Chunking, also known as shallow parsing, identifies continuous spans of tokens that form syntactic units such as noun phrases or verb phrases. 9 | 10 | Example: 11 | 12 | | Vinken | , | 61 | years | old | 13 | | --- | ---| --- | --- | --- | 14 | | B-NLP| I-NP | I-NP | I-NP | I-NP | 15 | 16 | ### Penn Treebank 17 | 18 | The [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is typically used for evaluating chunking. 19 | Sections 15-18 are used for training, section 19 for development, and and section 20 20 | for testing. Models are evaluated based on F1. 21 | 22 | | Model | F1 score | Paper / Source | 23 | | ------------- | :-----:| --- | 24 | | Flair embeddings (Akbik et al., 2018) | 96.72 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | 25 | | JMT (Hashimoto et al., 2017) | 95.77 | [A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks](https://www.aclweb.org/anthology/D17-1206) | 26 | | Low supervision (Søgaard and Goldberg, 2016) | 95.57 | [Deep multi-task learning with low level tasks supervised at lower layers](http://anthology.aclweb.org/P16-2038) | 27 | | Suzuki and Isozaki (2008) | 95.15 | [Semi-Supervised Sequential Labeling and Segmentation using Giga-word Scale Unlabeled Data](https://aclanthology.info/pdf/P/P08/P08-1076.pdf) | 28 | | NCRF++ (Yang and Zhang, 2018)| 95.06 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) | 29 | 30 | ## Resolving the Scope and focus of negation 31 | 32 | Scope of negation is the part of the meaning that is negated and focus the part of the scope that is most prominently negated (Huddleston and Pullum 2002). 33 | 34 | Example: 35 | 36 | `[John had] never [said %as much% before].` 37 | 38 | Scope is enclosed in square brackets and focus marked between % signs. 39 | 40 | The [CD-SCO (Conan Doyle Scope) dataset](https://www.clips.uantwerpen.be/sem2012-st-neg/data.html) is for scope detection. 41 | The [PB-FOC (PropBank Focus) dataset](https://www.clips.uantwerpen.be/sem2012-st-neg/data.html) is for focus detection. 42 | The public leaderboard is available on the [*SEM Shared Task 2012 website](https://www.clips.uantwerpen.be/sem2012-st-neg/results.html). 43 | 44 | [Go back to the README](../README.md) 45 | -------------------------------------------------------------------------------- /english/information_extraction.md: -------------------------------------------------------------------------------- 1 | # Information Extraction 2 | 3 | ## Open Knowledge Graph Canonicalization 4 | 5 | Open Information Extraction approaches leads to creation of large Knowledge bases (KB) from the web. 
The problem with such methods is that their entities and relations are not canonicalized, which leads to storage of redundant and ambiguous facts. For example, an Open KB storing *\* and *\* doesn't know that *Barack Obama* and *Obama* mean the same entity. Similarly, *took birth in* and *was born in* also refer to the same relation. Problem of Open KB canonicalization involves identifying groups of equivalent entities and relations in the KB. 6 | 7 | ### Datasets 8 | 9 | | Datasets | # Gold Entities | #NPs | #Relations | #Triples | 10 | | ---------------------------------------- | :-------------: | ----- | ---------- | -------- | 11 | | [Base](https://suchanek.name/work/publications/cikm2014.pdf) | 150 | 290 | 3K | 9K | 12 | | [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) | 446 | 717 | 11K | 37K | 13 | | [ReVerb45K](https://github.com/malllabiisc/cesi) | 7.5K | 15.5K | 22K | 45K | 14 | 15 | ### Noun Phrase Canonicalization 16 | 17 | | **Model** | | Base Dataset | | | Ambiguous dataset | | | ReVerb45k | | **Paper**/Source | 18 | | :---------------------------- | :-----------: | :----------: | :----: | :-----------: | :---------------: | ------ | :-----------: | :--------: | :----: | ---------------------------------------- | 19 | | | **Precision** | **Recall** | **F1** | **Precision** | **Recall** | **F1** | **Precision** | **Recall** | **F1** | | 20 | | CESI (Vashishth et al., 2018) | 98.2 | 99.8 | 99.9 | 66.2 | 92.4 | 91.9 | 62.7 | 84.4 | 81.9 | [CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information](https://github.com/malllabiisc/cesi) | 21 | | Galárraga et al., 2014 ( IDF) | 94.8 | 97.9 | 98.3 | 67.9 | 82.9 | 79.3 | 71.6 | 50.8 | 0.5 | [Canonicalizing Open Knowledge Bases](https://suchanek.name/work/publications/cikm2014.pdf) | 22 | 23 | [Go back to the README](../README.md) 24 | -------------------------------------------------------------------------------- /english/machine_translation.md: -------------------------------------------------------------------------------- 1 | # Machine translation 2 | 3 | Machine translation is the task of translating a sentence in a source language to a different target language. 4 | 5 | Results with a * indicate that the mean test score over the the best window based on average dev-set BLEU score over 6 | 21 consecutive evaluations is reported as in [Chen et al. (2018)](https://arxiv.org/abs/1804.09849). 7 | 8 | ### WMT 2014 EN-DE 9 | 10 | Models are evaluated on the English-German dataset of the Ninth Workshop on Statistical Machine Translation (WMT 2014) based 11 | on BLEU. 
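The BLEU scores below are taken from the respective papers and can differ slightly depending on tokenization and the scoring script used. A minimal sketch of scoring detokenized system output with the `sacrebleu` package (an assumption for illustration, not necessarily what each paper used):

```python
import sacrebleu  # pip install sacrebleu

# One detokenized hypothesis per source sentence, and one list per set of references.
hypotheses = ["The cat sits on the mat.", "He went to the store."]
references = [["The cat is sitting on the mat.", "He went to the shop."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))  # corpus-level BLEU
```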
12 | 13 | | Model | BLEU | Paper / Source | 14 | | ------------- | :-----:| --- | 15 | | Transformer Big + BT (Edunov et al., 2018) | 35.0 | [Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381.pdf) | 16 | | DeepL | 33.3 | [DeepL Press release](https://www.deepl.com/press.html) | 17 | | DynamicConv (Wu et al., 2019)| 29.7 | [Pay Less Attention With Lightweight and Dynamic Convolutions](https://arxiv.org/abs/1901.10430) | 18 | | Transformer Big (Ott et al., 2018) | 29.3 | [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) | 19 | | RNMT+ (Chen et al., 2018) | 28.5* | [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849) | 20 | | Transformer Big (Vaswani et al., 2017) | 28.4 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | 21 | | Transformer Base (Vaswani et al., 2017) | 27.3 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | 22 | | MoE (Shazeer et al., 2017) | 26.03 | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) | 23 | | ConvS2S (Gehring et al., 2017) | 25.16 | [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122) | 24 | 25 | ### WMT 2014 EN-FR 26 | 27 | Similarly, models are evaluated on the English-French dataset of the Ninth Workshop on Statistical Machine Translation (WMT 2014) based 28 | on BLEU. 29 | 30 | | Model | BLEU | Paper / Source | 31 | | ------------- | :-----:| --- | 32 | | DeepL | 45.9 | [DeepL Press release](https://www.deepl.com/press.html) | 33 | | Transformer Big + BT (Edunov et al., 2018) | 45.6 | [Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381.pdf) | 34 | | DynamicConv (Wu et al., 2019)| 43.2 | [Pay Less Attention With Lightweight and Dynamic Convolutions](https://arxiv.org/abs/1901.10430) | 35 | | Transformer Big (Ott et al., 2018) | 43.2 | [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) | 36 | | RNMT+ (Chen et al., 2018) | 41.0* | [The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849) | 37 | | Transformer Big (Vaswani et al., 2017) | 41.0 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | 38 | | MoE (Shazeer et al., 2017) | 40.56 | [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538) | 39 | | ConvS2S (Gehring et al., 2017) | 40.46 | [Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122) | 40 | | Transformer Base (Vaswani et al., 2017) | 38.1 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | 41 | 42 | [Go back to the README](../README.md) 43 | -------------------------------------------------------------------------------- /english/constituency_parsing.md: -------------------------------------------------------------------------------- 1 | # Constituency parsing 2 | 3 | Constituency parsing aims to extract a constituency-based parse tree from a sentence that 4 | represents its syntactic structure according to a [phrase structure grammar](https://en.wikipedia.org/wiki/Phrase_structure_grammar). 
5 | 6 | Example: 7 | 8 | Sentence (S) 9 | | 10 | +-------------+------------+ 11 | | | 12 | Noun (N) Verb Phrase (VP) 13 | | | 14 | John +-------+--------+ 15 | | | 16 | Verb (V) Noun (N) 17 | | | 18 | sees Bill 19 | 20 | [Recent approaches](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf) 21 | convert the parse tree into a sequence following a depth-first traversal in order to 22 | be able to apply sequence-to-sequence models to it. The linearized version of the 23 | above parse tree looks as follows: (S (N) (VP V N)). 24 | 25 | ### Penn Treebank 26 | 27 | The Wall Street Journal section of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is used for 28 | evaluating constituency parsers. Section 22 is used for development and Section 23 is used for evaluation. 29 | Models are evaluated based on F1. Most of the below models incorporate external data or features. 30 | For a comparison of single models trained only on WSJ, refer to [Kitaev and Klein (2018)](https://arxiv.org/abs/1805.01052). 31 | 32 | | Model | F1 score | Paper / Source | 33 | | ------------- | :-----:| --- | 34 | | Self-attentive encoder + ELMo (Kitaev and Klein, 2018) | 95.13 | [Constituency Parsing with a Self-Attentive Encoder](https://arxiv.org/abs/1805.01052) | 35 | | Model combination (Fried et al., 2017) | 94.66 | [Improving Neural Parsing by Disentangling Model Combination and Reranking Effects](https://arxiv.org/abs/1707.03058) | 36 | | LSTM Encoder-Decoder + LSTM-LM (Takase et al., 2018) | 94.47 | [Direct Output Connection for a High-Rank Language Model](http://aclweb.org/anthology/D18-1489) 37 | | LSTM Encoder-Decoder + LSTM-LM (Suzuki et al., 2018) | 94.32 | [An Empirical Study of Building a Strong Baseline for Constituency Parsing](http://aclweb.org/anthology/P18-2097) 38 | | In-order (Liu and Zhang, 2017) | 94.2 | [In-Order Transition-based Constituent Parsing](http://aclweb.org/anthology/Q17-1029) | 39 | | Semi-supervised LSTM-LM (Choe and Charniak, 2016) | 93.8 | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257) | 40 | | Stack-only RNNG (Kuncoro et al., 2017) | 93.6 | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774) | 41 | | RNN Grammar (Dyer et al., 2016) | 93.3 | [Recurrent Neural Network Grammars](https://www.aclweb.org/anthology/N16-1024) | 42 | | Transformer (Vaswani et al., 2017) | 92.7 | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) | 43 | | Combining Constituent Parsers (Fossum and Knight, 2009) | 92.4 | [Combining constituent parsers via parse selection or parse hybridization](https://dl.acm.org/citation.cfm?id=1620923) | 44 | | Semi-supervised LSTM (Vinyals et al., 2015) | 92.1 | [Grammar as a Foreign Language](https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf) | 45 | | Self-trained parser (McClosky et al., 2006) | 92.1 | [Effective Self-Training for Parsing](https://pdfs.semanticscholar.org/6f0f/64f0dab74295e5eb139c160ed79ff262558a.pdf) | 46 | 47 | [Go back to the README](../README.md) 48 | -------------------------------------------------------------------------------- /english/lexical_normalization.md: -------------------------------------------------------------------------------- 1 | # Lexical Normalization 2 | 3 | Lexical normalization is the task of translating/transforming a non standard text to a standard register. 
4 | 5 | Example: 6 | 7 | ``` 8 | new pix comming tomoroe 9 | new pictures coming tomorrow 10 | ``` 11 | 12 | Datasets usually consists of tweets, since these naturally contain a fair amount of 13 | these phenomena. 14 | 15 | For lexical normalization, only replacements on the word-level are annotated. 16 | Some corpora include annotation for 1-N and N-1 replacements. However, word 17 | insertion/deletion and reordering is not part of the task. 18 | 19 | ### LexNorm 20 | The [LexNorm](http://people.eng.unimelb.edu.au/tbaldwin/etc/lexnorm_v1.2.tgz) corpus was originally introduced by [Han and Baldwin (2011)](http://aclweb.org/anthology/P/P11/P11-1038.pdf). 21 | Several mistakes in annotation were resolved by [Yang and Eisenstein](http://www.aclweb.org/anthology/D13-1007); 22 | on this page, we only report results on the new dataset. For this dataset, the 2,577 23 | tweets from [Li and Liu(2014)](http://www.aclweb.org/anthology/P14-3012) is often 24 | used as training data, because of its similar annotation style. 25 | 26 | This dataset is commonly evaluated with accuracy on the non-standard words. This 27 | means that the system knows in advance which words are in need of normalization. 28 | 29 | | Model | Accuracy | Paper / Source | Code | 30 | | ------------- | :-----:| --- | --- | 31 | | MoNoise (van der Goot & van Noord, 2017) | 87.63 | [MoNoise: Modeling Noise Using a Modular Normalization System](http://www.let.rug.nl/rob/doc/clin27.paper.pdf) | [Official](https://bitbucket.org/robvanderg/monoise/) | 32 | | Joint POS + Norm in a Viterbi decoding (Li & Liu, 2015) | 87.58* | [Joint POS Tagging and Text Normalization for Informal Text](http://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/download/10839/10838) | | 33 | | Syllable based (Xu et al., 2015) | 86.08 | [Tweet Normalization with Syllables](http://www.aclweb.org/anthology/P15-1089) | | 34 | | unLOL (Yang & Eisenstein, 2013) | 82.06 | [A Log-Linear Model for Unsupervised Text Normalization](http://www.aclweb.org/anthology/D13-1007) | | 35 | 36 | \* used a slightly different version of the data 37 | 38 | #### LexNorm2015 39 | 40 | The 41 | [LexNorm2015](https://github.com/noisy-text/noisy-text.github.io/blob/master/2015/files/lexnorm2015.tgz) 42 | dataset was introduced for the shared task on lexical normalization, hosted at 43 | WNUT2015 ([Baldwin et al(2015)](http://aclweb.org/anthology/W15-4319)). In 44 | this dataset, 1-N and N-1 replacements are included in the annotation. The 45 | evaluation metrics used are precision, recall and F1 score. However, this is 46 | calculated a bit odd: 47 | 48 | Precision: out of all necessary replacements, how many correctly found 49 | 50 | Recall: out of all normalization by system, how many correct 51 | 52 | This means that if the system replaces a word which is in need of normalization, 53 | but chooses the wrong normalization, it is penalized twice. 
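Read literally, the metric above can be sketched as follows (a toy reimplementation assuming gold and system output are aligned token by token; the official WNUT 2015 scorer remains the reference):

```python
def lexnorm2015_scores(gold, pred):
    """Word-level scores following the description above (not the official scorer).

    gold and pred are position-aligned lists of (original_token, normalized_token) pairs.
    """
    needed = sum(1 for orig, norm in gold if orig != norm)    # replacements required by the gold standard
    proposed = sum(1 for orig, norm in pred if orig != norm)  # replacements made by the system
    correct = sum(1 for (o, g), (_, p) in zip(gold, pred) if o != g and p == g)
    precision = correct / needed if needed else 0.0
    recall = correct / proposed if proposed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```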
54 | 55 | | Model | F1 | Precision | Recall | Paper / Source | Code | 56 | | ------------- | :-----:| :-----:| :-----:| --- | --- | 57 | | MoNoise (van der Goot & van Noord, 2017) | 86.39 | 93.53 | 80.26 | [MoNoise: Modeling Noise Using a Modular Normalization System](http://www.let.rug.nl/rob/doc/clin27.paper.pdf) | [Official](https://bitbucket.org/robvanderg/monoise/) | 58 | | Random Forest + novel similarity metric (Jin, 2017) | 84.21 | 90.61 | 78.65 | [NCSU-SAS-Ning: Candidate Generation and Feature Engineering for Supervised Lexical Normalization](http://www.aclweb.org/anthology/W15-4313) | | 59 | 60 | [Go back to the README](../README.md) 61 | -------------------------------------------------------------------------------- /english/multimodal.md: -------------------------------------------------------------------------------- 1 | # Multimodal 2 | 3 | ## Multimodal Emotion Recognition 4 | 5 | ### IEMOCAP 6 | 7 | The IEMOCAP ([Busso et al., 2008](https://link.springer.com/article/10.1007/s10579-008-9076-6)) contains the acts of 10 speakers in a two-way conversation segmented into utterances. The medium of the conversations in all the videos is English. The database contains the following categorical labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise, and other. 8 | 9 | **Monologue:** 10 | 11 | | Model | Accuracy | Paper / Source | 12 | | ------------- | :-----:| --- | 13 | | CHFusion (Poria et al., 2017) | 76.5% | [Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling](https://arxiv.org/pdf/1806.06228.pdf) | 14 | | bc-LSTM (Poria et al., 2017) | 74.10% | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) | 15 | 16 | **Conversational:** 17 | Conversational setting enables the models to capture emotions expressed by the speakers in a conversation. Inter speaker dependencies are considered in this setting. 18 | 19 | | Model | Weighted Accuracy (WAA) | Paper / Source | 20 | | ------------- | :-----:| --- | 21 | | CMN (Hazarika et al., 2018) | 77.62% | [Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos](http://aclweb.org/anthology/N18-1193) | 22 | | Memn2n | 75.08 | [Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos](http://aclweb.org/anthology/N18-1193)| 23 | 24 | ## Multimodal Metaphor Recognition 25 | 26 | [Mohammad et. al, 2016](http://www.aclweb.org/anthology/S16-2003) created a dataset of verb-noun pairs from WordNet that had multiple senses. They annoted these pairs for metaphoricity (metaphor or not a metaphor). Dataset is in English. 27 | 28 | | Model | F1 Score | Paper / Source | Code | 29 | | ------------------------------------------------------------ | :----------------------------------------------------------: | ------------------------------------------------------------ | ----------- | 30 | | 5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.75 | [Shutova et. al, 2016](http://www.aclweb.org/anthology/N16-1020) | Unavailable | 31 | 32 | [Tsvetkov et. al, 2014](http://www.aclweb.org/anthology/P14-1024) created a dataset of adjective-noun pairs that they then annotated for metaphoricity. Dataset is in English. 
33 | 34 | | Model | F1 Score | Paper / Source | Code | 35 | | ------------------------------------------------------------ | :----------------------------------------------------------: | ------------------------------------------------------------ | ----------- | 36 | | 5-layer convolutional network (Krizhevsky et al., 2012), Word2Vec | 0.79 | [Shutova et. al, 2016](http://www.aclweb.org/anthology/N16-1020) | Unavailable | 37 | 38 | ## Multimodal Sentiment Analysis 39 | 40 | ### MOSI 41 | The MOSI dataset ([Zadeh et al., 2016](https://arxiv.org/pdf/1606.06259.pdf)) is a dataset rich in sentimental expressions where 93 people review topics in English. The videos are segmented with each segments sentiment label scored between +3 (strong positive) to -3 (strong negative) by 5 annotators. 42 | 43 | | Model | Accuracy | Paper / Source | 44 | | ------------- | :-----:| --- | 45 | | bc-LSTM (Poria et al., 2017) | 80.3% | [Context-Dependent Sentiment Analysis in User-Generated Videos](http://sentic.net/context-dependent-sentiment-analysis-in-user-generated-videos.pdf) | 46 | | MARN (Zadeh et al., 2018) | 77.1% | [Multi-attention Recurrent Network for Human Communication Comprehension](https://arxiv.org/pdf/1802.00923.pdf) | 47 | 48 | [Go back to the README](../README.md) 49 | -------------------------------------------------------------------------------- /english/natural_language_inference.md: -------------------------------------------------------------------------------- 1 | # Natural language inference 2 | 3 | Natural language inference is the task of determining whether a "hypothesis" is 4 | true (entailment), false (contradiction), or undetermined (neutral) given a "premise". 5 | 6 | Example: 7 | 8 | | Premise | Label | Hypothesis | 9 | | --- | ---| --- | 10 | | A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. | 11 | | An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. | 12 | | A soccer game with multiple males playing. | entailment | Some men are playing a sport. | 13 | 14 | ### SNLI 15 | 16 | The [Stanford Natural Language Inference (SNLI) Corpus](https://arxiv.org/abs/1508.05326) 17 | contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy. 18 | 19 | State-of-the-art results can be seen on the [SNLI website](https://nlp.stanford.edu/projects/snli/). 20 | 21 | ### MultiNLI 22 | 23 | The [Multi-Genre Natural Language Inference (MultiNLI) corpus](https://arxiv.org/abs/1704.05426) 24 | contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but 25 | covers a range of genres of spoken and written text and supports cross-genre evaluation. The data 26 | can be downloaded from the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) website. 27 | 28 | Public leaderboards for [in-genre (matched)](https://www.kaggle.com/c/multinli-matched-open-evaluation/leaderboard) 29 | and [cross-genre (mismatched)](https://www.kaggle.com/c/multinli-mismatched-open-evaluation/leaderboard) 30 | evaluation are available, but entries do not correspond to published models. 
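For reference, matched and mismatched accuracy can be computed directly from the distributed JSONL files; a small sketch (the `multinli_1.0_dev_*.jsonl` file names, the field names, and the `predict` function are assumptions about your local setup):

```python
import json

def dev_accuracy(jsonl_path, predict):
    """Accuracy of predict(premise, hypothesis) -> label on one MultiNLI dev file.

    Assumes the distributed JSONL format with sentence1/sentence2/gold_label fields;
    pairs without an agreed gold label ("-") are skipped.
    """
    correct = total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if ex["gold_label"] == "-":
                continue
            total += 1
            correct += predict(ex["sentence1"], ex["sentence2"]) == ex["gold_label"]
    return correct / total

# matched = dev_accuracy("multinli_1.0_dev_matched.jsonl", my_model)
# mismatched = dev_accuracy("multinli_1.0_dev_mismatched.jsonl", my_model)
```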
31 | 32 | | Model | Matched | Mismatched | Paper / Source | Code | 33 | | ------------- | :-----:| :-----:| --- | --- | 34 | | XLNet-Large (ensemble) (Yang et al., 2019) | 90.2 | 89.8 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 35 | | MT-DNN-ensemble (Liu et al., 2019) | 87.9 | 87.4 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) | 36 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 87.6 | 87.2 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) | 37 | | Finetuned Transformer LM (Radford et al., 2018) | 82.1 | 81.4 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) | | 38 | | Multi-task BiLSTM + Attn (Wang et al., 2018) | 72.2 | 72.1 | [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://arxiv.org/abs/1804.07461) | | 39 | | GenSen (Subramanian et al., 2018) | 71.4 | 71.3 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | | 40 | 41 | ### SciTail 42 | 43 | The [SciTail](http://ai2-website.s3.amazonaws.com/publications/scitail-aaai-2018_cameraready.pdf) 44 | entailment dataset consists of 27k. In contrast to the SNLI and MultiNLI, it was not crowd-sourced 45 | but created from sentences that already exist "in the wild". Hypotheses were created from 46 | science questions and the corresponding answer candidates, while relevant web sentences from a large 47 | corpus were used as premises. Models are evaluated based on accuracy. 48 | 49 | | Model | Accuracy | Paper / Source | 50 | | ------------- | :-----:| --- | 51 | | Finetuned Transformer LM (Radford et al., 2018) | 88.3 | [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) 52 | | Hierarchical BiLSTM Max Pooling (Talman et al., 2018) | 86.0 | [Natural Language Inference with Hierarchical BiLSTM Max Pooling](https://arxiv.org/abs/1808.08762) 53 | | CAFE (Tay et al., 2018) | 83.3 | [A Compare-Propagate Architecture with Alignment Factorization for Natural Language Inference](https://arxiv.org/abs/1801.00102) | 54 | 55 | [Go back to the README](../README.md) 56 | -------------------------------------------------------------------------------- /english/part-of-speech_tagging.md: -------------------------------------------------------------------------------- 1 | # Part-of-speech tagging 2 | 3 | Part-of-speech tagging (POS tagging) is the task of tagging a word in a text with its part of speech. 4 | A part of speech is a category of words with similar grammatical properties. Common English 5 | parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc. 6 | 7 | Example: 8 | 9 | | Vinken | , | 61 | years | old | 10 | | --- | ---| --- | --- | --- | 11 | | NNP | , | CD | NNS | JJ | 12 | 13 | ### Penn Treebank 14 | 15 | A standard dataset for POS tagging is the Wall Street Journal (WSJ) portion of the Penn Treebank, containing 45 16 | different POS tags. 
Sections 0-18 are used for training, sections 19-21 for development, and sections 17 | 22-24 for testing. Models are evaluated based on accuracy. 18 | 19 | | Model | Accuracy | Paper / Source | Code | 20 | | ------------- | :-----:| --- | --- | 21 | | Meta BiLSTM (Bohnet et al., 2018) | 97.96 | [Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings](https://arxiv.org/abs/1805.08237) | | 22 | | Flair embeddings (Akbik et al., 2018) | 97.85 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | [Flair framework](https://github.com/zalandoresearch/flair) | 23 | | Char Bi-LSTM (Ling et al., 2015) | 97.78 | [Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation](https://www.aclweb.org/anthology/D/D15/D15-1176.pdf) | | 24 | | Adversarial Bi-LSTM (Yasunaga et al., 2018) | 97.59 | [Robust Multilingual Part-of-Speech Tagging via Adversarial Training](https://arxiv.org/abs/1711.04903) | | 25 | | Yang et al. (2017) | 97.55 | [Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks](https://arxiv.org/abs/1703.06345) | | 26 | | Ma and Hovy (2016) | 97.55 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) | | 27 | | LM-LSTM-CRF (Liu et al., 2018)| 97.53 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | | 28 | | NCRF++ (Yang and Zhang, 2018)| 97.49 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) | 29 | | Feed Forward (Vaswani et a. 2016) | 97.4 | [Supertagging with LSTMs](https://aclweb.org/anthology/N/N16/N16-1027.pdf) | | 30 | | Bi-LSTM (Ling et al., 2017) | 97.36 | [Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation](https://www.aclweb.org/anthology/D/D15/D15-1176.pdf) | | 31 | | Bi-LSTM (Plank et al., 2016) | 97.22 | [Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss](https://arxiv.org/abs/1604.05529) | | 32 | 33 | 34 | ### Social media 35 | 36 | The [Ritter (2011)](https://aclanthology.coli.uni-saarland.de/papers/D11-1141/d11-1141) dataset has become the benchmark for social media part-of-speech tagging. This is comprised of some 50K tokens of English social media sampled in late 2011, and is tagged using an extended version of the PTB tagset. 37 | 38 | | Model | Accuracy | Paper | 39 | | --- | --- | ---| 40 | | GATE | 88.69 | [Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data](https://aclanthology.coli.uni-saarland.de/papers/R13-1026/r13-1026) | 41 | | CMU | 90.0 ± 0.5 | [Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters](http://www.cs.cmu.edu/~ark/TweetNLP/owoputi+etal.naacl13.pdf) | 42 | 43 | 44 | ### UD 45 | 46 | [Universal Dependencies (UD)](http://universaldependencies.org/) is a framework for 47 | cross-linguistic grammatical annotation, which contains more than 100 treebanks in over 60 languages. 48 | Models are typically evaluated based on the average test accuracy across 21 high-resource languages (♦ evaluated on 17 languages). 
49 | 50 | | Model | Avg accuracy | Paper / Source | 51 | | ------------- | :-----:| --- | 52 | | Multilingual BERT and BPEmb (Heinzerling and Strube, 2019) | 96.77 | [Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation](https://arxiv.org/abs/1906.01569) | 53 | | Adversarial Bi-LSTM (Yasunaga et al., 2018) | 96.65 | [Robust Multilingual Part-of-Speech Tagging via Adversarial Training](https://arxiv.org/abs/1711.04903) | 54 | | MultiBPEmb (Heinzerling and Strube, 2019) | 96.62 | [Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation](https://arxiv.org/abs/1906.01569) | 55 | | Bi-LSTM (Plank et al., 2016) | 96.40 | [Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss](https://arxiv.org/abs/1604.05529) | 56 | | Joint Bi-LSTM (Nguyen et al., 2017)♦ | 95.55 | [A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing](https://arxiv.org/abs/1705.05952) | 57 | 58 | [Go back to the README](../README.md) 59 | -------------------------------------------------------------------------------- /english/semantic_textual_similarity.md: -------------------------------------------------------------------------------- 1 | # Semantic textual similarity 2 | 3 | Semantic textual similarity deals with determining how similar two pieces of texts are. 4 | This can take the form of assigning a score from 1 to 5. Related tasks are paraphrase or duplicate identification. 5 | 6 | ### SentEval 7 | 8 | [SentEval](https://arxiv.org/abs/1803.05449) is an evaluation toolkit for evaluating sentence 9 | representations. It includes 17 downstream tasks, including common semantic textual similarity 10 | tasks. The semantic textual similarity (STS) benchmark tasks from 2012-2016 (STS12, STS13, STS14, STS15, STS16, STS-B) measure the relatedness 11 | of two sentences based on the cosine similarity of the two representations. The evaluation criterion is Pearson correlation. 12 | 13 | The SICK relatedness (SICK-R) task trains a linear model to output a score from 1 to 5 indicating the relatedness of two sentences. For 14 | the same dataset (SICK-E) can be treated as a three-class classification problem using the entailment labels (classes are 'entailment', 'contradiction', and 'neutral'). 15 | The evaluation metric for SICK-R is Pearson correlation and classification accuracy for SICK-E. 16 | 17 | The Microsoft Research Paraphrase Corpus (MRPC) corpus is a paraphrase identification dataset, where systems 18 | aim to identify if two sentences are paraphrases of each other. The evaluation metric is classification accuracy and F1. 19 | 20 | The data can be downloaded from [here](https://github.com/facebookresearch/SentEval). 
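The unsupervised STS setup described above reduces to cosine similarity plus Pearson correlation; a minimal sketch (SentEval wraps this for you; `embed` is a placeholder for whatever sentence encoder is being evaluated):

```python
import numpy as np
from scipy.stats import pearsonr

def sts_pearson(embed, sentence_pairs, gold_scores):
    """Pearson correlation between cosine similarities and gold relatedness scores."""
    sims = []
    for s1, s2 in sentence_pairs:
        a, b = embed(s1), embed(s2)
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return pearsonr(sims, gold_scores)[0]
```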
21 | 22 | | Model | MRPC | SICK-R | SICK-E | STS | Paper / Source | Code | 23 | | ------------- | :-----:| :-----:| :-----:| :-----:| --- | --- | 24 | | XLNet-Large (ensemble) (Yang et al., 2019) | 93.0/90.7 | - | - | 91.6/91.1* | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 25 | | MT-DNN-ensemble (Liu et al., 2019) | 92.7/90.3 | - | - | 91.1/90.7* | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) | 26 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 91.5/88.5 | - | - | 90.1/89.7* | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) | 27 | | GenSen (Subramanian et al., 2018) | 78.6/84.4 | 0.888 | 87.8 | 78.9/78.6 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | [Official](https://github.com/Maluuba/gensen) | 28 | | InferSent (Conneau et al., 2017) | 76.2/83.1 | 0.884 | 86.3 | 75.8/75.5 | [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) | [Official](https://github.com/facebookresearch/InferSent) | 29 | | TF-KLD (Ji and Eisenstein, 2013) | 80.4/85.9 | - | - | - | [Discriminative Improvements to Distributional Sentence Similarity](http://www.aclweb.org/anthology/D/D13/D13-1090.pdf) | | 30 | 31 | \* only evaluated on STS-B 32 | 33 | ## Paraphrase identification 34 | 35 | ### Quora Question Pairs 36 | 37 | The [Quora Question Pairs dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) 38 | consists of over 400,000 pairs of questions on Quora. Systems must identify whether one question is a 39 | duplicate of the other. Models are evaluated based on accuracy. 
40 | 41 | | Model | F1 | Accuracy | Paper / Source | Code | 42 | | ------------- | :-----: | :-----:| --- | --- | 43 | | XLNet-Large (ensemble) (Yang et al., 2019) | 74.2 | 90.3 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 44 | | MT-DNN-ensemble (Liu et al., 2019) | 73.7 | 89.9 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) | 45 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 73.1 | 89.9 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) | 46 | | MwAN (Tan et al., 2018) | | 89.12 | [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf) | | 47 | | DIIN (Gong et al., 2018) | | 89.06 | [Natural Language Inference Over Interaction Space](https://arxiv.org/pdf/1709.04348.pdf) | [Official](https://github.com/YichenGong/Densely-Interactive-Inference-Network) | 48 | | pt-DecAtt (Char) (Tomar et al., 2017) | | 88.40 | [Neural Paraphrase Identification of Questions with Noisy Pretraining](https://arxiv.org/abs/1704.04565) | | 49 | | BiMPM (Wang et al., 2017) | | 88.17 | [Bilateral Multi-Perspective Matching for Natural Language Sentences](https://arxiv.org/abs/1702.03814) | [Official](https://github.com/zhiguowang/BiMPM) | 50 | | GenSen (Subramanian et al., 2018) | | 87.01 | [Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning](https://arxiv.org/abs/1804.00079) | [Official](https://github.com/Maluuba/gensen) | 51 | 52 | [Go back to the README](../README.md) 53 | -------------------------------------------------------------------------------- /english/common_sense.md: -------------------------------------------------------------------------------- 1 | # Common sense 2 | 3 | Common sense reasoning tasks are intended to require the model to go beyond pattern 4 | recognition. Instead, the model should use "common sense" or world knowledge 5 | to make inferences. 6 | 7 | ### Event2Mind 8 | 9 | Event2Mind is a crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. 10 | Given an event described in a short free-form text, a model should reason about the likely intents and reactions of the 11 | event's participants. Models are evaluated based on average cross-entropy (lower is better). 12 | 13 | | Model | Dev | Test | Paper / Source | Code | 14 | | ------------- | :-----:| :-----:|--- | --- | 15 | | BiRNN 100d (Rashkin et al., 2018) | 4.25 | 4.22 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | | 16 | | ConvNet (Rashkin et al., 2018) | 4.44 | 4.40 | [Event2Mind: Commonsense Inference on Events, Intents, and Reactions](https://arxiv.org/abs/1805.06939) | | 17 | 18 | ### SWAG 19 | 20 | Situations with Adversarial Generations (SWAG) is a dataset consisting of 113k multiple 21 | choice questions about a rich spectrum of grounded situations. 
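SWAG is evaluated by accuracy: a model scores the four candidate endings for each context, and its highest-scoring ending must match the gold one. The sketch below shows that loop with an invented word-overlap scorer standing in for a real model; in practice the score would come from something like a fine-tuned language model.

```python
def predict_ending(score_fn, context, endings):
    """Return the index of the candidate ending the scorer finds most plausible."""
    scores = [score_fn(context, ending) for ending in endings]
    return scores.index(max(scores))

def multiple_choice_accuracy(score_fn, examples):
    """Each example is (context, [ending_0, ..., ending_3], gold_index)."""
    correct = sum(predict_ending(score_fn, ctx, ends) == gold for ctx, ends, gold in examples)
    return correct / len(examples)

# Toy scorer: prefer endings that reuse words from the context (a real model would do far better).
def overlap_score(context, ending):
    return len(set(context.lower().split()) & set(ending.lower().split()))

examples = [  # invented example in the spirit of SWAG
    ("The girl picks up the violin.",
     ["She begins to play the violin.", "She eats an apple.",
      "The car drives away.", "He opens an umbrella."],
     0),
]
print(multiple_choice_accuracy(overlap_score, examples))
```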
22 | 23 | | Model | Dev | Test | Paper / Source | Code | 24 | | ------------- | :-----:| :-----:|--- | --- | 25 | | BERT Large (Devlin et al., 2018) | 86.6 | 86.3 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | | 26 | | BERT Base (Devlin et al., 2018) | 81.6 | - | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | | 27 | | ESIM + ELMo (Zellers et al., 2018) | 59.1 | 59.2 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | | 28 | | ESIM + GloVe (Zellers et al., 2018) | 51.9 | 52.7 | [SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference](http://arxiv.org/abs/1808.05326) | | 29 | 30 | ### Winograd Schema Challenge 31 | 32 | The [Winograd Schema Challenge](https://www.aaai.org/ocs/index.php/KR/KR12/paper/view/4492) 33 | is a dataset for common sense reasoning. It employs Winograd Schema questions that 34 | require the resolution of anaphora: the system must identify the antecedent of an ambiguous pronoun in a statement. Models 35 | are evaluated based on accuracy. 36 | 37 | Example: 38 | 39 | The trophy doesn’t fit in the suitcase because _it_ is too big. What is too big? 40 | Answer 0: the trophy. Answer 1: the suitcase 41 | 42 | | Model | Score | Paper / Source | Code | 43 | | ------------- | :-----:| --- | --- | 44 | | XLNet-Large (ensemble) (Yang et al., 2019) | 90.4 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 45 | | MT-DNN-ensemble (Liu et al., 2019) | 89.0 | [Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding](https://arxiv.org/pdf/1904.09482.pdf) | [Official](https://github.com/namisan/mt-dnn/) | 46 | | Snorkel MeTaL(ensemble) (Ratner et al., 2018) | 65.1 | [Training Complex Models with Multi-Task Weak Supervision](https://arxiv.org/pdf/1810.02840.pdf) | [Official](https://github.com/HazyResearch/metal) | 47 | | Word-LM-partial (Trinh and Le, 2018) | 62.6 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | | 48 | | Char-LM-partial (Trinh and Le, 2018) | 57.9 | [A Simple Method for Commonsense Reasoning](https://arxiv.org/abs/1806.02847) | | 49 | | USSM + Supervised DeepNet + KB (Liu et al., 2017) | 52.8 | [Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems](https://aaai.org/ocs/index.php/SSS/SSS17/paper/view/15392) | | 50 | 51 | ### Visual Common Sense 52 | 53 | Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding. 54 | With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered 55 | pancakes). While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring 56 | higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense 57 | Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a 58 | rationale explaining why its answer is true. 59 | 60 | | Model | Q->A | QA->R | Q->AR | Paper / Source | Code | 61 | | ------ | :-------:| :-------: | :-------:| ------ | ------ | 62 | | Human Performance University of Washington (Zellers et al. 
'18) | 91.0 | 93.0 | 85.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) | | 63 | | Recognition to Cognition Networks University of Washington | 65.1 | 67.3 | 44.0 | [From Recognition to Cognition: Visual Commonsense Reasoning](https://arxiv.org/abs/1811.10830) | https://github.com/rowanz/r2c | 64 | | BERT-Base Google AI Language (experiment by Rowan) | 53.9 | 64.5 | 35.0 | | https://github.com/google-research/bert | 65 | | MLB Seoul National University (experiment by Rowan) | 46.2 | 36.8 | 17.2 | | https://github.com/jnhwkim/MulLowBiVQA | 66 | | Random Performance | 25.0 | 25.0 | 6.2 | | | 67 | -------------------------------------------------------------------------------- /english/taxonomy_learning.md: -------------------------------------------------------------------------------- 1 | # Taxonomy Learning 2 | 3 | Taxonomy learning is the task of hierarchically classifying concepts in an automatic manner from text corpora. The process of building taxonomies is usually divided into two main steps: (1) extracting hypernyms for concepts, which may constitute a field of research in itself (see Hypernym Discovery below) and (2) refining the structure into a taxonomy. 4 | 5 | ## Hypernym Discovery 6 | 7 | Given a corpus and a target term (hyponym), the task of hypernym discovery consists of extracting a set of its most appropriate hypernyms from the corpus. For example, for the input word “dog”, some valid hypernyms would be “canine”, “mammal” or “animal”. 8 | 9 | ### SemEval 2018 10 | 11 | The SemEval-2018 hypernym discovery evaluation benchmark ([Camacho-Collados et al. 2018](http://aclweb.org/anthology/S18-1115)), which can be freely downloaded [here](https://competitions.codalab.org/competitions/17119), contains three domains (general, medical and music) and is also available in Italian and Spanish (not in this repository). For each domain a target corpus and vocabulary (i.e. hypernym search space) are provided. The dataset contains both concepts (e.g. dog) and entities (e.g. Manchester United) up to trigrams. The following table lists the number of hyponym-hypernym pairs for each dataset: 12 | 13 | | Partition | General | Medical | Music | 14 | | ------------- | :-----:|:-----:|:-----:| 15 | |Trial | 200 | 101 | 355 | 16 | |Training | 11779 | 3256 | 5455 | 17 | |Test | 7048 | 4116 | 5233 | 18 | 19 | The results for each model and dataset (general, medical and music) are presented below (MFH stands for “Most Frequent Hypernyms” and is used as a baseline). 
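The tables below report MAP, MRR and P@5 over the ranked lists of candidate hypernyms returned for each input term. The functions below are a rough per-term sketch of those metrics, assuming ranked lists truncated to 15 candidates; the candidate ranking and gold hypernyms for "dog" are invented, and the official SemEval-2018 scorer should be used to obtain numbers comparable to the tables.

```python
def precision_at_k(ranked, gold, k=5):
    """Fraction of the top-k ranked candidates that are gold hypernyms."""
    return sum(candidate in gold for candidate in ranked[:k]) / k

def average_precision(ranked, gold, limit=15):
    """Average precision of one ranked list, the per-term quantity behind MAP."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranked[:limit], start=1):
        if candidate in gold:
            hits += 1
            total += hits / rank
    return total / min(len(gold), limit) if gold else 0.0

def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct hypernym, or 0 if none is retrieved."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0

# Invented ranked candidates and gold hypernyms for the hyponym "dog".
ranked = ["canine", "pet food", "mammal", "building", "animal"]
gold = {"canine", "mammal", "animal"}
print(precision_at_k(ranked, gold), round(average_precision(ranked, gold), 3), reciprocal_rank(ranked, gold))
```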
20 | 21 | **General:** 22 | 23 | | Model | MAP | MRR | P@5 | Paper / Source | 24 | | ------------- | :-----:|:-----:|:-----:| --- | 25 | |CRIM (Bernier-Colborne and Barrière, 2018) | 19.78 | 36.10 | 19.03 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) | 26 | |vTE (Espinosa-Anke et al., 2016) | 10.60 | 23.83 | 9.91 | [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) | 27 | |NLP_HZ (Qui et al., 2018) | 9.37 | 17.29 | 9.19 | [A Nearest Neighbor Approach](http://aclweb.org/anthology/S18-1148) | 28 | |300-sparsans (Berend et al., 2018) | 8.95 | 19.44 | 8.63 | [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) | 29 | |MFH | 8.77 | 21.39 | 7.81 | -- | 30 | |SJTU BCMI (Zhang et al., 2018) | 5.77 | 10.56 | 5.96 | [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) | 31 | |Apollo (Onofrei et al., 2018) | 2.68 | 6.01 | 2.69 | [Detecting Hypernymy Relations Using Syntactic Dependencies ](http://aclweb.org/anthology/S18-1146) | 32 | |balAPInc (Shwartz et al., 2017) | 1.36 | 3.18 | 1.30 | [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) | 33 | 34 | 35 | **Medical domain:** 36 | 37 | | Model | MAP | MRR | P@5 | Paper / Source | 38 | | ------------- | :-----:|:-----:|:-----:| --- | 39 | |CRIM (Bernier-Colborne and Barrière, 2018) | 34.05 | 54.64 | 36.77 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) | 40 | |MFH | 28.93 | 35.80 | 34.20 | -- | 41 | |300-sparsans (Berend et al., 2018) | 20.75 | 40.60 | 21.43 | [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) | 42 | |vTE (Espinosa-Anke et al., 2016) | 18.84 | 41.07 | 20.71 | [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) | 43 | |EXPR (Issa Alaa Aldine et al., 2018) | 13.77 | 40.76 | 12.76 | [A Combined Approach for Hypernym Discovery](http://aclweb.org/anthology/S18-1150) | 44 | |SJTU BCMI (Zhang et al., 2018) | 11.69 | 25.95 | 11.69 | [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) | 45 | |ADAPT (Maldonado and Klubička, 2018) | 8.13 | 20.56 | 8.32 | [Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora ](http://aclweb.org/anthology/S18-1151) | 46 | |balAPInc (Shwartz et al., 2017) | 0.91 | 2.10 | 1.08 | [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) | 47 | 48 | 49 | **Music domain:** 50 | 51 | | Model | MAP | MRR | P@5 | Paper / Source | 52 | | ------------- | :-----:|:-----:|:-----:| --- | 53 | |CRIM (Bernier-Colborne and Barrière, 2018) | 40.97 | 60.93 | 41.31 | [A Hybrid Approach to Hypernym Discovery](http://aclweb.org/anthology/S18-1116) | 54 | |MFH | 33.32 | 51.48 | 35.76 | -- | 55 | |300-sparsans (Berend et al., 2018) | 29.54 | 46.43 | 28.86 | [Hypernymy as interaction of sparse attributes ](http://aclweb.org/anthology/S18-1152) | 56 | |vTE (Espinosa-Anke et al., 2016) | 12.99 | 39.36 | 12.41 | [Supervised Distributional Hypernym Discovery via Domain Adaptation](https://aclweb.org/anthology/D16-1041) | 57 | |SJTU BCMI (Zhang et al., 2018) | 4.71 | 9.15 | 4.91 | [Neural Hypernym Discovery with Term Embeddings](http://aclweb.org/anthology/S18-1147) | 58 | |ADAPT (Maldonado and Klubička, 2018) | 2.63 | 7.46 | 2.64 | [Skip-Gram Word 
Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora ](http://aclweb.org/anthology/S18-1151) | 59 | |balAPInc (Shwartz et al., 2017) | 1.95 | 5.01 | 2.15 | [Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection](http://www.aclweb.org/anthology/E17-1007) | 60 | -------------------------------------------------------------------------------- /english/text_classification.md: -------------------------------------------------------------------------------- 1 | # Text classification 2 | 3 | Text classification is the task of assigning a sentence or document an appropriate category. 4 | The categories depend on the chosen dataset and can range from topics. 5 | 6 | ### AG News 7 | 8 | The [AG News corpus](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) 9 | consists of news articles from the [AG's corpus of news articles on the web](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) 10 | pertaining to the 4 largest classes. The dataset contains 30,000 training examples for each class 11 | 1,900 examples for each class for testing. Models are evaluated based on error rate (lower is better). 12 | 13 | | Model | Error | Paper / Source | Code | 14 | | ------------- | :-----:| --- | :-----: | 15 | | XLNet (Yang et al., 2019) | 4.49 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 16 | | ULMFiT (Howard and Ruder, 2018) | 5.01 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) | [Official](http://nlp.fast.ai/ulmfit ) | 17 | | CNN (Johnson and Zhang, 2016) * | 6.57 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) | [Official](https://github.com/riejohnson/ConText ) | 18 | | DPCNN (Johnson and Zhang, 2017) | 6.87 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) | [Official](https://github.com/riejohnson/ConText ) | 19 | | VDCN (Alexis et al., 2016) | 8.67 | [Very Deep Convolutional Networks for Text Classification](https://arxiv.org/abs/1606.01781) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/vdcnn) | 20 | | Char-level CNN (Zhang et al., 2015) | 9.51 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/cnn) | 21 | 22 | \* Results reported in Johnson and Zhang, 2017 23 | 24 | ### DBpedia 25 | 26 | The [DBpedia ontology](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) 27 | dataset contains 560,000 training samples and 70,000 testing samples for each of 14 nonoverlapping classes from DBpedia. 28 | Models are evaluated based on error rate (lower is better). 
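As a point of reference for how these error rates are computed, the snippet below trains a TF-IDF + logistic regression baseline on a tiny invented topic corpus and reports its test error rate. It assumes scikit-learn is available and is purely illustrative; it has no relation to the systems or numbers in the tables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented stand-in for a topic classification corpus such as AG News or DBpedia.
train_texts = ["stocks fell sharply today", "the team won the final match",
               "new phone released by the company", "parliament passed the bill"]
train_labels = ["business", "sports", "tech", "world"]
test_texts = ["the striker scored twice in the match", "shares of the company rose today"]
test_labels = ["sports", "business"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
error_rate = sum(p != g for p, g in zip(predictions, test_labels)) / len(test_labels)
print(f"error rate: {error_rate:.2%}")  # lower is better, as in the tables
```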
29 | 30 | | Model | Error | Paper / Source | Code | 31 | | ------------- | :-----:| --- | :-----: | 32 | | XLNet (Yang et al., 2019) | 0.62 | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/pdf/1906.08237.pdf) | [Official](https://github.com/zihangdai/xlnet/) | 33 | | ULMFiT (Howard and Ruder, 2018) | 0.80 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) | [Official](http://nlp.fast.ai/ulmfit ) | 34 | | CNN (Johnson and Zhang, 2016) | 0.84 | [Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings](https://arxiv.org/abs/1602.02373) | [Official](https://github.com/riejohnson/ConText ) | 35 | | DPCNN (Johnson and Zhang, 2017) | 0.88 | [Deep Pyramid Convolutional Neural Networks for Text Categorization](http://aclweb.org/anthology/P17-1052) | [Official](https://github.com/riejohnson/ConText ) | 36 | | VDCN (Alexis et al., 2016) | 1.29 | [Very Deep Convolutional Networks for Text Classification](https://arxiv.org/abs/1606.01781) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/vdcnn) | 37 | | Char-level CNN (Zhang et al., 2015) | 1.55 | [Character-level Convolutional Networks for Text Classification](https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf) | [Non Official](https://github.com/ArdalanM/nlp-benchmarks/tree/master/src/cnn) | 38 | 39 | ### TREC 40 | 41 | The [TREC dataset](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.2766&rep=rep1&type=pdf) is dataset for 42 | question classification consisting of open-domain, fact-based questions divided into broad semantic categories. 43 | It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, 44 | but TREC-50 has finer-grained labels. Models are evaluated based on accuracy. 
45 | 46 | TREC-6: 47 | 48 | | Model | Error | Paper / Source | Code | 49 | | ------------- | :-----:| --- | :-----: | 50 | | USE_T+CNN (Cer et al., 2018) | 1.93 | [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) | [Official](https://tfhub.dev/google/universal-sentence-encoder/1) | 51 | | ULMFiT (Howard and Ruder, 2018) | 3.6 | [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) | [Official](http://nlp.fast.ai/ulmfit ) | 52 | | LSTM-CNN (Zhou et al., 2016) | 3.9 | [Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling](http://www.aclweb.org/anthology/C16-1329) | 53 | | CNN+MCFA (Amplayo et al., 2018) | 4 | [Translations as Additional Contexts for Sentence Classification](https://arxiv.org/pdf/1806.05516.pdf) | 54 | | TBCNN (Mou et al., 2015) | 4 | [Discriminative Neural Sentence Modeling by Tree-Based Convolution](http://aclweb.org/anthology/D15-1279) | 55 | | CoVe (McCann et al., 2017) | 4.2 | [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107) | 56 | 57 | TREC-50: 58 | 59 | | Model | Error | Paper / Source | Code | 60 | | ------------- | :-----:| --- | :-----: | 61 | | Rules (Madabushi and Lee, 2016) | 2.8 |[High Accuracy Rule-based Question Classification using Question Syntax and Semantics](http://www.aclweb.org/anthology/C16-1116)| | 62 | | SVM (Van-Tu and Anh-Cuong, 2016) | 8.4 | [Improving Question Classification by Feature Extraction and Selection](https://www.researchgate.net/publication/303553351_Improving_Question_Classification_by_Feature_Extraction_and_Selection) | | 63 | 64 | [Go back to the README](../README.md) 65 | -------------------------------------------------------------------------------- /english/word_sense_disambiguation.md: -------------------------------------------------------------------------------- 1 | # Word Sense Disambiguation 2 | 3 | The task of Word Sense Disambiguation (WSD) consists of associating words in context with their most suitable entry in a pre-defined sense inventory. The de-facto sense inventory for English in WSD is [WordNet](https://wordnet.princeton.edu). 4 | For example, given the word “mouse” and the following sentence: 5 | 6 | “A mouse consists of an object held in one's hand, with one or more buttons.” 7 | 8 | we would assign “mouse” with its electronic device sense ([the 4th sense in the WordNet sense inventory](http://wordnetweb.princeton.edu/perl/webwn?c=8&sub=Change&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&i=-1&h=000000&s=mouse)). 9 | 10 | 11 | ### Fine-grained WSD: 12 | 13 | The [Evaluation framework](http://lcl.uniroma1.it/wsdeval/) of [Raganato et al. 2017](http://aclweb.org/anthology/E/E17/E17-1010.pdf) [1] includes two training sets (SemCor-Miller et al., 1993- and OMSTI-Taghipour and Ng, 2015-) and five test sets from the Senseval/SemEval series (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015), standardized to the same format and sense inventory (i.e. WordNet 3.0). 14 | 15 | Typically, there are two kinds of approach for WSD: supervised (which make use of sense-annotated training data) and knowledge-based (which make use of the properties of lexical resources). 16 | 17 | Supervised: The most widely used training corpus used is SemCor, with 226,036 sense annotations from 352 documents manually annotated. All supervised systems in the evaluation table are trained on SemCor. 
Some supervised methods, particularly neural architectures, usually employ the SemEval 2007 dataset as development set (marked by *). The most usual baseline is the Most Frequent Sense (MFS) heuristic, which selects for each target word the most frequent sense in the training data. 18 | 19 | Knowledge-based: Knowledge-based systems usually exploit WordNet or [BabelNet](https://babelnet.org/) as semantic network. The first sense given by the underlying sense inventory (i.e. WordNet 3.0) is included as a baseline. 20 | 21 | The main evaluation measure is F1-score. 22 | 23 | 24 | ### Supervised: 25 | 26 | | Model | Senseval 2 |Senseval 3 |SemEval 2007 |SemEval 2013 |SemEval 2015 | Paper / Source | 27 | | ------------- | :-----:|:-----:|:-----:|:-----:|:-----:| --- | 28 | |MFS baseline | 65.6 | 66.0 | 54.5 | 63.8 | 67.1 | [[1]](http://aclweb.org/anthology/E/E17/E17-1010.pdf) | 29 | |Bi-LSTMatt+LEX | 72.0 | 69.4 |63.7* | 66.4 | 72.4 | [[2]](http://aclweb.org/anthology/D17-1120) | 30 | |Bi-LSTMatt+LEX+POS | 72.0 | 69.1|64.8* | 66.9 | 71.5 | [[2]](http://aclweb.org/anthology/D17-1120) | 31 | |context2vec | 71.8 | 69.1 |61.3 | 65.6 | 71.9 | [[3]](http://www.aclweb.org/anthology/K16-1006) | 32 | |ELMo | 71.6 | 69.6 | 62.2 | 66.2 | 71.3 | [[4]](http://aclweb.org/anthology/N18-1202) | 33 | |GAS (Linear) | 72.0 | 70.0 | --* | 66.7 | 71.6 | [[5]](http://aclweb.org/anthology/P18-1230) | 34 | |GAS (Concatenation) | 72.1 | 70.2 | --* | 67 | 71.8 | [[5]](http://aclweb.org/anthology/P18-1230) | 35 | |GASext (Linear) | 72.4 | 70.1 | --* | 67.1 | 72.1 |[[5]](http://aclweb.org/anthology/P18-1230) | 36 | |GASext (Concatenation) | 72.2 | 70.5 | --* | 67.2 | 72.6 | [[5]](http://aclweb.org/anthology/P18-1230) | 37 | |supWSD | 71.3 | 68.8 | 60.2 | 65.8 | 70.0 | [[6]](https://aclanthology.info/pdf/P/P10/P10-4014.pdf) [[11]](http://aclweb.org/anthology/D17-2018) | 38 | |supWSDemb | 72.7 | 70.6 | 63.1 | 66.8 | 71.8 | [[7]](http://www.aclweb.org/anthology/P16-1085) [[11]](http://aclweb.org/anthology/D17-2018) | 39 | 40 | 41 | ### Knowledge-based: 42 | 43 | | Model | All | Senseval 2 |Senseval 3 |SemEval 2007 |SemEval 2013 |SemEval 2015 | Paper / Source | 44 | | ------------- | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | --- | 45 | |WN 1st sense baseline | 65.2 | 66.8 | 66.2 | 55.2 | 63.0 | 67.8 | [[1]](http://aclweb.org/anthology/E/E17/E17-1010.pdf) | 46 | |Babelfy | 65.5 | 67.0 | 63.5 | 51.6 | 66.4 | **70.3** | [[8]](http://aclweb.org/anthology/Q14-1019) | 47 | |UKBppr_w2w-nf | 57.5 | 64.2 | 54.8 | 40.0 | 64.5 | 64.5 | [[9]](https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00164) [[12]](http://aclweb.org/anthology/W18-2505) | 48 | |UKBppr_w2w | **67.3** | 68.8 | 66.1 | 53.0 | **68.8** | **70.3** | [[9]](https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00164) [[12]](http://aclweb.org/anthology/W18-2505) | 49 | |WSD-TM | 66.9 | **69.0** | **66.9** | **55.6** | 65.3 | 69.6 | [[10]](https://arxiv.org/pdf/1801.01900.pdf) | 50 | 51 | Note: 'All' is the concatenation of all datasets, as described in [10] and [12]. The scores of [6,7] and [9] are not taken from the original papers but from the results of the implementations of [11] and [12], respectively. 
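The WN 1st sense baseline in the table above is straightforward to approximate. The sketch below uses NLTK's WordNet interface, whose synsets are listed roughly in order of sense frequency, so taking the first synset approximates the first-sense choice; it assumes the WordNet data has been downloaded via `nltk.download("wordnet")` and glosses over the lemmatisation and POS handling a real system would need.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def first_sense(lemma, pos=None):
    """WordNet first-sense heuristic: return the first synset listed for a lemma."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0] if synsets else None

print(first_sense("mouse", pos=wn.NOUN))
# Synset('mouse.n.01'), the rodent sense -- not the electronic-device sense,
# which is why this baseline fails on the "mouse" example at the top of this page.
```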
52 | 53 | [1] [Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison](http://aclweb.org/anthology/E/E17/E17-1010.pdf) 54 | 55 | [2] [Neural Sequence Learning Models for Word Sense Disambiguation](http://aclweb.org/anthology/D17-1120) 56 | 57 | [3] [context2vec: Learning generic context embedding with bidirectional lstm](http://www.aclweb.org/anthology/K16-1006) 58 | 59 | [4] [Deep contextualized word representations](http://aclweb.org/anthology/N18-1202) 60 | 61 | [5] [Incorporating Glosses into Neural Word Sense Disambiguation](http://aclweb.org/anthology/P18-1230) 62 | 63 | [6] [It makes sense: A wide-coverage word sense disambiguation system for free text](https://aclanthology.info/pdf/P/P10/P10-4014.pdf) 64 | 65 | [7] [Embeddings for Word Sense Disambiguation: An Evaluation Study](http://www.aclweb.org/anthology/P16-1085) 66 | 67 | [8] [Entity Linking meets Word Sense Disambiguation: A Unified Approach](http://aclweb.org/anthology/Q14-1019) 68 | 69 | [9] [Random walks for knowledge-based word sense disambiguation](https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00164) 70 | 71 | [10] [Knowledge-based Word Sense Disambiguation using Topic Models](https://arxiv.org/pdf/1801.01900.pdf) 72 | 73 | [11] [SupWSD: A Flexible Toolkit for Supervised Word Sense Disambiguation](http://aclweb.org/anthology/D17-2018) 74 | 75 | [12] [The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD](http://aclweb.org/anthology/W18-2505) 76 | -------------------------------------------------------------------------------- /english/grammatical_error_correction.md: -------------------------------------------------------------------------------- 1 | # Grammatical Error Correction 2 | 3 | Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors. 4 | 5 | GEC is typically formulated as a sentence correction task. A GEC system takes a potentially erroneous sentence as input and is expected to transform it to its corrected version. See the example given below: 6 | 7 | | Input (Erroneous) | Output (Corrected) | 8 | | ------------------------- | ---------------------- | 9 | |She see Tom is catched by policeman in park at last night. | She saw Tom caught by a policeman in the park last night.| 10 | 11 | ### CoNLL-2014 Shared Task 12 | 13 | The [CoNLL-2014 shared task test set](https://www.comp.nus.edu.sg/~nlp/conll14st/conll14st-test-data.tar.gz) is the most widely used dataset to benchmark GEC systems. The test set contains 1,312 English sentences with error annotations by 2 expert annotators. Models are evaluated with MaxMatch scorer ([Dahlmeier and Ng, 2012](http://www.aclweb.org/anthology/N12-1067)) which computes a span-based Fβ-score (β set to 0.5 to weight precision twice as recall). 14 | 15 | The shared task setting restricts that systems use only publicly available datasets for training to ensure a fair comparison between systems. The highest published scores on the the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting of training using publicly-available training datasets only (_**Restricted**_) and those that made use of large, non-public datasets (_**Unrestricted**_). 
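The F0.5 scores below combine precision and recall over edits with β = 0.5, which is what makes precision count twice as much as recall. The sketch below shows just that combination; the edit counts in the example are invented, and the real MaxMatch scorer also performs the edit extraction and alignment that are omitted here.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta over edit counts; beta=0.5 weights precision twice as much as recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Invented counts: 10 proposed edits, 7 of which match gold edits, with 5 gold edits missed.
print(round(f_beta(tp=7, fp=3, fn=5), 4))  # 0.6731
```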
16 | 17 | **Restricted**: 18 | 19 | | Model | F0.5 | Paper / Source | Code | 20 | | ------------- | :-----:| --- | :-----: | 21 | | Copy-Augmented Transformer + Pre-train (Zhao and Wang, NAACL 2019) | 61.15 | [Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data](https://arxiv.org/pdf/1903.00138.pdf) | [Official](https://github.com/zhawe01/fairseq-gec) | 22 | | CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | [Neural Quality Estimation of Grammatical Error Correction](http://aclweb.org/anthology/D18-1274) | [Official](https://github.com/nusnlp/neuqe/) | 23 | | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 56.25 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA | 24 | | Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| [Official](https://github.com/grammatical/neural-naacl2018) | 25 | | CNN Seq2Seq (Chollampatt and Ng, 2018)| 54.79 | [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) | 26 | 27 | **Unrestricted**: 28 | 29 | | Model | F0.5 | Paper / Source | Code | 30 | | ------------- | :-----:| --- | :-----: | 31 | | CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 61.34 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA | 32 | 33 | _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets. 34 | 35 | 36 | ### CoNLL-2014 10 Annotations 37 | 38 | [Bryant and Ng, 2015](http://aclweb.org/anthology/P15-1068) released 8 additional annotations (in addition to the two official annotations) for the CoNLL-2014 shared task test set ([link](http://www.comp.nus.edu.sg/~nlp/sw/10gec_annotations.zip)). 39 | 40 | **Restricted**: 41 | 42 | | Model | F0.5 | Paper / Source | Code | 43 | | ------------- | :-----:| --- | :-----: | 44 | | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 72.04 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA | 45 | | CNN Seq2Seq (Chollampatt and Ng, 2018)| 70.14 (measured by Ge et al., 2018) | [ A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) | 46 | 47 | **Unrestricted**: 48 | 49 | | Model | F0.5 | Paper / Source | Code | 50 | | ------------- | :-----:| --- | :-----: | 51 | | CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 76.88 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA | 52 | 53 | _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets. 54 | 55 | 56 | ### JFLEG 57 | 58 | [JFLEG test set](https://github.com/keisks/jfleg) released by [Napoles et al., 2017](http://aclweb.org/anthology/E17-2037) consists of 747 English sentences with 4 references for each sentence. 
Models are evaluated with [GLEU](https://github.com/cnap/gec-ranking/) metric ([Napoles et al., 2016](https://arxiv.org/pdf/1605.02592.pdf)). 59 | 60 | 61 | **Restricted**: 62 | 63 | | Model | GLEU | Paper / Source | Code | 64 | | ------------- | :-----:| --- | :-----: | 65 | | SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 61.50 | [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation](http://aclweb.org/anthology/N18-2046)| NA | 66 | | Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task](http://aclweb.org/anthology/N18-1055)| NA | 67 | | CNN Seq2Seq (Chollampatt and Ng, 2018)| 57.47 | [ A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/17308/16137)| [Official](https://github.com/nusnlp/mlconvgec2018) | 68 | 69 | 70 | **Unrestricted**: 71 | 72 | | Model | GLEU | Paper / Source | Code | 73 | | ------------- | :-----:| --- | :-----: | 74 | | CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) | 62.42 | [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study](https://arxiv.org/pdf/1807.01270.pdf)| NA | 75 | 76 | _**Restricted**_: uses only publicly available datasets. _**Unrestricted**_: uses non-public datasets. 77 | 78 | 79 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tracking Progress in Natural Language Processing 2 | 3 | ## Table of contents 4 | 5 | ### English 6 | 7 | - [Automatic speech recognition](english/automatic_speech_recognition.md) 8 | - [CCG supertagging](english/ccg_supertagging.md) 9 | - [Common sense](english/common_sense.md) 10 | - [Constituency parsing](english/constituency_parsing.md) 11 | - [Coreference resolution](english/coreference_resolution.md) 12 | - [Dependency parsing](english/dependency_parsing.md) 13 | - [Dialogue](english/dialogue.md) 14 | - [Domain adaptation](english/domain_adaptation.md) 15 | - [Entity linking](english/entity_linking.md) 16 | - [Grammatical error correction](english/grammatical_error_correction.md) 17 | - [Information extraction](english/information_extraction.md) 18 | - [Language modeling](english/language_modeling.md) 19 | - [Lexical normalization](english/lexical_normalization.md) 20 | - [Machine translation](english/machine_translation.md) 21 | - [Missing elements](english/missing_elements.md) 22 | - [Multi-task learning](english/multi-task_learning.md) 23 | - [Multi-modal](english/multimodal.md) 24 | - [Named entity recognition](english/named_entity_recognition.md) 25 | - [Natural language inference](english/natural_language_inference.md) 26 | - [Part-of-speech tagging](english/part-of-speech_tagging.md) 27 | - [Question answering](english/question_answering.md) 28 | - [Relation prediction](english/relation_prediction.md) 29 | - [Relationship extraction](english/relationship_extraction.md) 30 | - [Semantic textual similarity](english/semantic_textual_similarity.md) 31 | - [Semantic parsing](english/semantic_parsing.md) 32 | - [Semantic role labeling](english/semantic_role_labeling.md) 33 | - [Sentiment analysis](english/sentiment_analysis.md) 34 | - [Shallow syntax](english/shallow_syntax.md) 35 | - [Simplification](english/simplification.md) 36 | - [Stance detection](english/stance_detection.md) 37 | - 
[Summarization](english/summarization.md) 38 | - [Taxonomy learning](english/taxonomy_learning.md) 39 | - [Temporal processing](english/temporal_processing.md) 40 | - [Text classification](english/text_classification.md) 41 | - [Word sense disambiguation](english/word_sense_disambiguation.md) 42 | 43 | ### Chinese 44 | 45 | - [Entity linking](chinese/chinese.md#entity-linking) 46 | - [Chinese word segmentation](chinese/chinese_word_segmentation.md) 47 | 48 | ### Hindi 49 | 50 | - [Chunking](hindi/hindi.md#chunking) 51 | - [Part-of-speech tagging](hindi/hindi.md#part-of-speech-tagging) 52 | - [Machine Translation](hindi/hindi.md#machine-translation) 53 | 54 | ### Vietnamese 55 | 56 | - [Dependency parsing](vietnamese/vietnamese.md#dependency-parsing) 57 | - [Machine translation](vietnamese/vietnamese.md#machine-translation) 58 | - [Named entity recognition](vietnamese/vietnamese.md#named-entity-recognition) 59 | - [Part-of-speech tagging](vietnamese/vietnamese.md#part-of-speech-tagging) 60 | - [Word segmentation](vietnamese/vietnamese.md#word-segmentation) 61 | 62 | This document aims to track the progress in Natural Language Processing (NLP) and give an overview 63 | of the state-of-the-art (SOTA) across the most common NLP tasks and their corresponding datasets. 64 | 65 | It aims to cover both traditional and core NLP tasks such as dependency parsing and part-of-speech tagging 66 | as well as more recent ones such as reading comprehension and natural language inference. The main objective 67 | is to provide the reader with a quick overview of benchmark datasets and the state-of-the-art for their 68 | task of interest, which serves as a stepping stone for further research. To this end, if there is a 69 | place where results for a task are already published and regularly maintained, such as a public leaderboard, 70 | the reader will be pointed there. 71 | 72 | If you want to find this document again in the future, just go to [`nlpprogress.com`](https://nlpprogress.com/) 73 | or [`nlpsota.com`](http://nlpsota.com/) in your browser. 74 | 75 | ### Contributing 76 | 77 | #### Guidelines 78 | 79 | **Results**   Results reported in published papers are preferred; an exception may be made for influential preprints. 80 | 81 | **Datasets**   Datasets should have been used for evaluation in at least one published paper besides 82 | the one that introduced the dataset. 83 | 84 | **Code**   We recommend to add a link to an implementation 85 | if available. You can add a `Code` column (see below) to the table if it does not exist. 86 | In the `Code` column, indicate an official implementation with [Official](http://link_to_implementation). 87 | If an unofficial implementation is available, use [Link](http://link_to_implementation) (see below). 88 | If no implementation is available, you can leave the cell empty. 89 | 90 | #### Adding a new result 91 | 92 | If you would like to add a new result, you can just click on the small edit button in the top-right 93 | corner of the file for the respective task (see below). 94 | 95 | ![Click on the edit button to add a file](img/edit_file.png) 96 | 97 | This allows you to edit the file in Markdown. Simply add a row to the corresponding table in the 98 | same format. Make sure that the table stays sorted (with the best result on top). 99 | After you've made your change, make sure that the table still looks ok by clicking on the 100 | "Preview changes" tab at the top of the page. 
If everything looks good, go to the bottom of the page, 101 | where you see the below form. 102 | 103 | ![Fill out the file change information](img/propose_file_change.png) 104 | 105 | Add a name for your proposed change, an optional description, indicate that you would like to 106 | "Create a new branch for this commit and start a pull request", and click on "Propose file change". 107 | 108 | #### Adding a new dataset or task 109 | 110 | For adding a new dataset or task, you can also follow the steps above. Alternatively, you can fork the repository. 111 | In both cases, follow the steps below: 112 | 113 | 1. If your task is completely new, create a new file and link to it in the table of contents above. 114 | 1. If not, add your task or dataset to the respective section of the corresponding file (in alphabetical order). 115 | 1. Briefly describe the dataset/task and include relevant references. 116 | 1. Describe the evaluation setting and evaluation metric. 117 | 1. Show how an annotated example of the dataset/task looks like. 118 | 1. Add a download link if available. 119 | 1. Copy the below table and fill in at least two results (including the state-of-the-art) 120 | for your dataset/task (change Score to the metric of your dataset). If your dataset/task 121 | has multiple metrics, add them to the right of `Score`. 122 | 1. Submit your change as a pull request. 123 | 124 | | Model | Score | Paper / Source | Code | 125 | | ------------- | :-----:| --- | --- | 126 | | | | | | 127 | 128 | 129 | ### Wish list 130 | 131 | These are tasks and datasets that are still missing: 132 | 133 | - Bilingual dictionary induction 134 | - Discourse parsing 135 | - Keyphrase extraction 136 | - Knowledge base population (KBP) 137 | - More dialogue tasks 138 | - Semi-supervised learning 139 | - Frame-semantic parsing (FrameNet full-sentence analysis) 140 | 141 | ### Exporting into a structured format 142 | 143 | You can extract all the data into a structured, machine-readable JSON format with parsed tasks, descriptions and SOTA tables. 144 | 145 | The instructions are in [structured/README.md](structured/README.md). 146 | 147 | ### Instructions for building the site locally 148 | 149 | Instructions for building the website locally using Jekyll can be found [here](jekyll_instructions.md). 150 | 151 | 152 | -------------------------------------------------------------------------------- /english/relation_prediction.md: -------------------------------------------------------------------------------- 1 | # Relation Prediction 2 | 3 | ## Task 4 | 5 | Relation Prediction is the task of recognizing a named relation between two named semantic entities. The common test setup is to hide one entity from the relation triplet, asking the system to recover it based on the other entity and the relation type. 6 | 7 | For example, given the triple \<*Roman Jakobson*, *born-in-city*, *?*\>, the system is required to replace the question mark with *Moscow*. 8 | 9 | Relation Prediction datasets are typically extracted from two types of resources: 10 | * *Knowledge Bases*: KBs such as [FreeBase](https://developers.google.com/freebase/) contain hundreds or thousands of relation types pertaining to world-knowledge obtained autmoatically or semi-automatically from various resources on millions of entities. These relations include *born-in*, *nationality*, *is-in* (for geographical entities), *part-of* (for organizations, among others), and more. 
11 | * *Semantic Graphs*: SGs such as [WordNet](https://wordnet.princeton.edu/) are often manually-curated resources of semantic concepts, restricted to more "linguistic" relations compared to free real-world knowledge. The most common semantic relation is *hypernym*, also known as the *is-a* relation (example: \<*cat*, *hypernym*, *feline*\>). 12 | 13 | ## Evaluation 14 | 15 | Evaluation in Relation Prediction hinges on a list of ranked candidates given by the system to the test instance. The metrics below are derived from the location of correct candidate(s) in that list. 16 | 17 | A common action performed before evaluation on a given list is *filtering*, where the list is cleaned of entities whose corresponding triples exist in the knowledge graph. Unless specified otherwise, results here are from filtered lists. 18 | 19 | ### Metrics 20 | 21 | #### Mean Reciprocal Rank (MRR): 22 | 23 | The mean of all reciprocal ranks for the true candidates over the test set (1/rank). 24 | 25 | #### Hits at k (H@k): 26 | 27 | The rate of correct entities appearing in the top *k* entries for each instance list. This number may exceed 1.00 if the average *k*-truncated list contains more than one true entity. 28 | 29 | ### Datasets 30 | 31 | #### Freebase-15K-237 (FB15K-237) 32 | The FB15K dataset was introduced in [Bordes et al., 2013](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf). It is a subset of Freebase which contains about 14,951 entities with 1,345 different relations. This dataset was found to suffer from major test leakage through inverse relations: a large number of test triples can be obtained simply by inverting triples in the training set, as first noted by [Toutanova et al.](http://aclweb.org/anthology/D15-1174). To create a dataset without this property, [Toutanova et al.](http://aclweb.org/anthology/D15-1174) introduced FB15k-237 – a subset of FB15k where inverse relations are removed. 33 | 34 | #### WordNet-18-RR (WN18RR) 35 | 36 | The WN18 dataset was also introduced in [Bordes et al., 2013](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf). It included the full 18 relations scraped from WordNet for roughly 41,000 synsets. Similar to FB15K, this dataset was also found to suffer from test leakage. 37 | 38 | As a way to overcome this problem, [Dettmers et al. (2018)](https://arxiv.org/abs/1707.01476) introduced the [WN18RR](https://github.com/villmow/datasets_knowledge_embedding) dataset, derived from WN18, which features 11 relations only, no pair of which is reciprocal (but still includes four internally-symmetric relations like *verb_group*, allowing a rule-based system to reach 35 on all three metrics). 39 | 40 | ### Experimental Results 41 | 42 | #### WN18RR 43 | 44 | The test set is composed of triplets, each used to create two test instances, one for each entity to be predicted. Since each instance is associated with a single true entity, the maximum value for all metrics is 1.00. 
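To make the metrics above concrete, the sketch below computes a filtered rank for each test instance from per-candidate scores and then aggregates MRR and H@k. The entity names, scores and known-triple sets are invented, and a real evaluation ranks the full entity vocabulary rather than a handful of candidates.

```python
def filtered_rank(scores, true_entity, known_entities):
    """Rank of the true entity after filtering out candidates that already
    form a valid triple in the knowledge graph (the `known_entities` set).
    `scores` maps candidate entity -> model score (higher is better)."""
    true_score = scores[true_entity]
    better = sum(1 for entity, score in scores.items()
                 if score > true_score and entity != true_entity and entity not in known_entities)
    return better + 1

def mrr_and_hits(ranks, k=10):
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_k = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits_at_k

# Invented scores for two test instances (one hidden entity each).
ranks = [
    filtered_rank({"Moscow": 0.9, "Paris": 0.95, "Prague": 0.2}, "Moscow", known_entities={"Paris"}),
    filtered_rank({"Moscow": 0.4, "Paris": 0.5, "Prague": 0.7}, "Moscow", known_entities=set()),
]
print(mrr_and_hits(ranks, k=1))  # MRR ~0.67, H@1 = 0.5 on this toy example
```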
45 | 46 | | Model | H@10 | H@1 | MRR | Paper / Source | Code | 47 | | ------------- | :-----:| :-----:| :-----:| --- | --- | 48 | | Max-Margin Markov Graph Models (Pinter & Eisenstein, 2018) | 59.02 | 45.37 | 49.83 | [Predicting Semantic Relations using Global Graph Properties](https://arxiv.org/abs/1808.08644) | [Official](http://www.github.com/yuvalpinter/m3gm) | 49 | | TransE (reimplementation by Pinter & Eisenstein, 2018) | 55.55 | 42.26 | 46.59 | [Translating Embeddings for Modeling Multi-relational Data. ](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) | 50 | | ConvKB (Nguyen et al., 2018) | 52.50 | - | 24.80 | [A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network](http://www.aclweb.org/anthology/N18-2053) | [Official](https://github.com/daiquocnguyen/ConvKB) | 51 | | ConvE (v6; Dettmers et al., 2018) | 52.00 | 40.00 | 43.00 | [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476) | [Official](https://github.com/TimDettmers/ConvE) | 52 | | ComplEx (Trouillon et al., 2016) | 51.00 | 41.00 | 44.00 | [Complex Embeddings for Simple Link Prediction](http://www.jmlr.org/proceedings/papers/v48/trouillon16.pdf) | [Official](https://github.com/ttrouill/complex) | 53 | | DistMult (reimplementation by Dettmers et al., 2018) | 49.00 | 40.00 | 43.00 | [Embedding Entities and Relations for Learning and Inference in Knowledge Bases.](https://arxiv.org/pdf/1412.6575) | [Link](https://github.com/thunlp/OpenKE) | 54 | 55 | #### FB15K-237 56 | 57 | | Model | H@10 | H@1 | MRR | Paper / Source | Code | 58 | | ------------- | :-----:| :-----:| :-----:| --- | --- | 59 | | TransE (reimplementation by Han et al., 2018) | 47.09 | 19.87 | 29.04 | [Translating Embeddings for Modeling Multi-relational Data. 
](http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) | 60 | | TransH (reimplementation by Han et al., 2018) | 41.32 | 5.79 | 17.66 | [Knowledge Graph Embedding by Translating on Hyperplanes.](http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8531/8546) | [OpenKE](https://github.com/thunlp/OpenKE) | 61 | | TransR (reimplementation by Han et al., 2018) | 40.67 | 16.35 | 24.44 | [ Learning Entity and Relation Embeddings for Knowledge Graph Completion.](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9571/9523/) | [OpenKE](https://github.com/thunlp/OpenKE) | 62 | | TransD (reimplementation by Han et al., 2018) | 46.05 | 14.83 | 25.27 | [Knowledge Graph Embedding via Dynamic Mapping Matrix.](http://anthology.aclweb.org/P/P15/P15-1067.pdf) | [OpenKE](https://github.com/thunlp/OpenKE) | 63 | | ConvKB (Nguyen et al., 2018) | 51.70 | - | 39.60 | [A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network](http://www.aclweb.org/anthology/N18-2053) | [Official](https://github.com/daiquocnguyen/ConvKB) | 64 | | ConvE (v6; Dettmers et al., 2018) | 50.10 | 23.70 | 32.50 | [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476) | [Official](https://github.com/TimDettmers/ConvE) | 65 | | ComplEx (reimplementation by Dettmers et al., 2018) | 42.80 | 15.80 | 24.70 | [Complex Embeddings for Simple Link Prediction](http://www.jmlr.org/proceedings/papers/v48/trouillon16.pdf) | [Official](https://github.com/ttrouill/complex) | 66 | | DistMult (reimplementation by Dettmers et al., 2018) | 41.90 | 15.50 | 24.10 | [Embedding Entities and Relations for Learning and Inference in Knowledge Bases.](https://arxiv.org/pdf/1412.6575) | [Link](https://github.com/thunlp/OpenKE) | 67 | 68 | ## Resources 69 | [OpenKE](http://aclweb.org/anthology/D18-2024) is an open toolkit for relational learning which provides a standard training and testing framework. Currently, the implemented models in OpenKE include TransE, TransH, TransR, TransD, RESCAL, DistMult, ComplEx and HolE. 70 | 71 | [KRLPapers](https://github.com/thunlp/KRLPapers) is a must-read paper list for relational learning. 72 | 73 | [Back to README](../README.md) 74 | -------------------------------------------------------------------------------- /english/temporal_processing.md: -------------------------------------------------------------------------------- 1 | # Temporal Processing 2 | 3 | ## Document Dating (Time-stamping) 4 | 5 | Document Dating is the problem of automatically predicting the date of a document based on its content. Date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as, information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text, among others. 6 | 7 | For example, in the following document, the correct creation year is 1999. This can be inferred by the presence of terms *1995* and *Four years after*. 8 | 9 | *Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. 
Four years after, the IOC….* 10 | 11 | ### Datasets 12 | 13 | | Datasets | # Docs | Start Year | End Year | 14 | | :--------------------------------------: | :----: | :--------: | :------: | 15 | | [APW](https://drive.google.com/file/d/1tll04ZBooB3Mohm6It-v8MBcjMCC3Y1w/view) | 675k | 1995 | 2010 | 16 | | [NYT](https://drive.google.com/file/d/1wqQRFeA1ESAOJqrwUNakfa77n_S9cmBi/view?usp=sharing) | 647k | 1987 | 1996 | 17 | 18 | ### Comparison on year level granularity: 19 | 20 | | | APW Dataset | NYT Dataset | Paper/Source | 21 | | -------------------------------------- | :---------: | :---------: | ---------------------------------------- | 22 | | NeuralDater (Vashishth et. al, 2018) | 64.1 | 58.9 | [Document Dating using Graph Convolution Networks](https://github.com/malllabiisc/NeuralDater) | 23 | | Chambers (2012) | 52.5 | 42.3 | [Labeling Documents with Timestamps: Learning from their Time Expressions](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) | 24 | | BurstySimDater (Kotsakos et. al, 2014) | 45.9 | 38.5 | [A Burstiness-aware Approach for Document Dating](https://www.idi.ntnu.no/~noervaag/papers/SIGIR2014short.pdf) | 25 | 26 | 27 | ## Temporal Information Extraction 28 | 29 | Temporal information extraction is the identification of chunks/tokens corresponding to temporal intervals, and the extraction and determination of the temporal relations between those. The entities extracted may be temporal expressions (timexes), eventualities (events), or auxiliary signals that support the interpretation of an entity or relation. Relations may be temporal links (tlinks), describing the order of events and times, or subordinate links (slinks) describing modality and other subordinative activity, or aspectual links (alinks) around the various influences aspectuality has on event structure. 30 | 31 | The markup scheme used for temporal information extraction is well-described in the ISO-TimeML standard, and also on [www.timeml.org](http://www.timeml.org). 32 | 33 | ``` 34 | 35 | 36 | 37 | 38 | 39 | 40 | PRI20001020.2000.0127 41 | NEWS STORY 42 | 10/20/2000 20:02:07.85 43 | 44 | 45 | The Navy has changed its account of the attack on the USS Cole in Yemen. 46 | Officials now say the ship was hit nearly two hours after it had docked. 47 | Initially the Navy said the explosion occurred while several boats were helping 48 | the ship to tie up. The change raises new questions about how the attackers 49 | were able to get past the Navy security. 50 | 51 | 52 | 10/20/2000 20:02:28.05 53 | 54 | 55 | 56 | 57 | 58 | 59 | ``` 60 | 61 | To avoid leaking knowledge about temporal structure, train, dev and test splits must be made at document level for temporal information extraction. 62 | 63 | ### TimeBank 64 | 65 | TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, and annotated for all asepcts of ISO-TimeML - including temporal expressions. TimeBank is freely distributed by the LDC: [TimeBank 1.2](https://catalog.ldc.upenn.edu/LDC2006T08) 66 | 67 | Evaluation is for both entity chunking and attribute annotation, as well as temporal relation accuracy, typically measured with F1 -- although this metric is not sensitive to inconsistencies or free wins from interval logic induction over the whole set. 
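As a rough illustration of relation scoring, the sketch below computes micro-F1 over sets of temporal links, each represented as a (source, target, relation) triple. It deliberately ignores the interval-logic closure issue mentioned above, so it only approximates how the scorers behind the numbers below behave; the example links are invented.

```python
def relation_f1(predicted, gold):
    """Micro F1 over temporal links, each a (source_id, target_id, relation) triple.
    No closure over interval logic is applied, unlike temporal-awareness scoring."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {("e1", "t1", "AFTER"), ("e2", "e1", "BEFORE"), ("e3", "t1", "INCLUDES")}
predicted = {("e1", "t1", "AFTER"), ("e2", "e1", "AFTER")}  # invented system output
print(round(relation_f1(predicted, gold), 2))  # 0.4
```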
68 | 69 | | Model | F1 score | Paper / Source | 70 | | ------------- | :-----:| --- | 71 | | Catena | 0.511 | [CATENA: CAusal and TEmporal relation extraction from NAtural language texts](http://www.aclweb.org/anthology/C16-1007) | 72 | | CAEVO | 0.507 | [Dense Event Ordering with a Multi-Pass Architecture](https://www.transacl.org/ojs/index.php/tacl/article/download/255/50) | 73 | 74 | ### TempEval-3 75 | 76 | The TempEval-3 corpus accompanied the shared [TempEval-3](http://www.aclweb.org/anthology/S13-2001) SemEval task in 2013. This uses a timelines-based metric to assess temporal relation structure. The corpus is fresh and somewhat more varied than TimeBank, though markedly smaller. [TempEval-3 data](https://www.cs.york.ac.uk/semeval-2013/task1/index.php%3Fid=data.html) 77 | 78 | | Model | Temporal awareness | Paper / Source | 79 | | ------------- | :-----:| --- | 80 | | Ning et al. | 67.2 | [A Structured Learning Approach to Temporal Relation Extraction](http://www.aclweb.org/anthology/D17-1108) | 81 | | ClearTK | 30.98 | [Cleartk-timeml: A minimalist approach to tempeval 2013](http://www.aclweb.org/anthology/S13-2002) | 82 | 83 | ## Timex normalisation 84 | 85 | Temporal expression normalisation is the grounding of a lexicalisation of a time to a calendar date or other formal temporal representation. 86 | 87 | Example: 88 | 10/18/2000 21:01:00.65 89 | Dozens of Palestinians were wounded in 90 | scattered clashes in the West Bank and Gaza Strip, Wednesday, 91 | despite the Sharm el-Sheikh truce accord. 92 | 93 | Chuck Rich reports on entertainment every Saturday 94 | 95 | ### TimeBank 96 | 97 | TimeBank, based on the TIMEX3 standard embedded in ISO-TimeML, is a benchmark corpus containing 64K tokens of English newswire, and annotated for all asepcts of ISO-TimeML - including temporal expressions. TimeBank is freely distributed by the LDC: [TimeBank 1.2](https://catalog.ldc.upenn.edu/LDC2006T08) 98 | 99 | | Model | F1 score | Paper / Source | 100 | | ------------- | :-----:| --- | 101 | | TIMEN | 0.89 | [TIMEN: An Open Temporal Expression Normalisation Resource](http://aclweb.org/anthology/L12-1015) | 102 | | HeidelTime | 0.876 | [A baseline temporal tagger for all languages](http://aclweb.org/anthology/D15-1063) | 103 | 104 | ### PNT 105 | 106 | The [Parsing Time Normalizations corpus](https://github.com/bethard/anafora-annotations/releases) in [SCATE](http://www.lrec-conf.org/proceedings/lrec2016/pdf/288_Paper.pdf) format allows the representation of a wider variety of time expressions than previous approaches. This corpus was release with [SemEval 2018 Task 6](http://aclweb.org/anthology/S18-1011). 107 | 108 | | Model | F1 score | Paper / Source | 109 | | ------------- | :-----:| --- | 110 | | Laparra et al. 
2018 | 0.764 | [From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations](http://aclweb.org/anthology/Q18-1025) | 111 | | HeidelTime | 0.74 | [A baseline temporal tagger for all languages](http://aclweb.org/anthology/D15-1063) | 112 | | Chrono | 0.70 | [Chrono at SemEval-2018 task 6: A system for normalizing temporal expressions](http://aclweb.org/anthology/S18-1012) | 113 | 114 | 115 | [Go back to the README](../README.md) 116 | -------------------------------------------------------------------------------- /chinese/chinese_word_segmentation.md: -------------------------------------------------------------------------------- 1 | # Chinese Word Segmentation 2 | 3 | ## Task 4 | Chinese word segmentation is the task of 5 | splitting Chinese text (a sequence of Chinese characters) 6 | into words. 7 | 8 | Example: 9 | ``` 10 | '上海浦东开发与建设同步' → ['上海', '浦东', '开发', ‘与', ’建设', '同步'] 11 | ``` 12 | 13 | ## Systems 14 | ♠ marks the system that uses character unigram as input. 15 | ♣ marks the system that uses character bigram as input. 16 | 17 | - Huang et al. (2019): BERT + model compression + multi-criterial learing ♠ 18 | - Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣ 19 | - Ma et al. (2018): BiLSTM-CRF + hyper-params search♠♣ 20 | - Yang et al. (2017): Transition-based + Beam-search + Rich pretrain♠♣ 21 | - Zhou et al. (2017): Greedy Search + word context♠ 22 | - Chen et al. (2017): BiLSTM-CRF + adv. loss♠♣ 23 | - Cai et al. (2017): Greedy Search+Span representation♠ 24 | - Kurita et al. (2017): Transition-based + Joint model♠ 25 | - Liu et al. (2016): neural semi-CRF♠ 26 | - Cai and Zhao (2016): Greedy Search♠ 27 | - Chen et al. (2015a): Gated Recursive NN♠♣ 28 | - Chen et al. (2015b): BiLSTM-CRF♠♣ 29 | 30 | ## Evaluation 31 | 32 | ### Metrics 33 | 34 | F1-score 35 | 36 | ### Dataset 37 | #### Chinese Treebank 6 38 | 39 | | Model | F1 | Paper / Source | Code | 40 | | ------------- | :-----: | --- | --- | 41 | | Huang et al. (2019) | 97.6 |[Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)|| 42 | | Ma et al. (2018) | 96.7 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 43 | | Yang et al. (2018) | 96.3 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)| 44 | | Yang et al. (2017) | 96.2 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)| 45 | | Zhou et al. (2017) | 96.2 | [Word-Context Character Embeddings for Chinese Word Segmentation](https://www.aclweb.org/anthology/D17-1079)| | 46 | | Chen et al. (2017) | 96.2 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) | 47 | | Liu et al. (2016) | 95.5 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) | 48 | | Chen et al. 
(2015b) | 96.0 | [Long Short-Term Memory Neural Networks for Chinese Word Segmentation](http://www.aclweb.org/anthology/D15-1141) | [Github](https://github.com/FudanNLP/CWS_LSTM) | 49 | 50 | #### Chinese Treebank 7 51 | 52 | | Model | F1 | Paper / Source | Code | 53 | | ------------- | :-----: | --- | --- | 54 | | Ma et al. (2018) | 96.6 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 55 | | Kurita et al. (2017) | 96.2 | [Neural Joint Model for Transition-based Chinese Syntactic Analysis](http://www.aclweb.org/anthology/P17-1111) | | 56 | #### AS 57 | 58 | | Model | F1 | Paper / Source | Code | 59 | | ------------- | :-----: | --- | --- | 60 | | Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| | 61 | | Ma et al. (2018) | 96.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 62 | | Yang et al. (2017) | 95.7 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) |[Github](https://github.com/jiesutd/RichWordSegmentor) | 63 | | Cai et al. (2017) | 95.3 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) | 64 | | Chen et al. (2017) | 94.8 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) | 65 | 66 | #### CityU 67 | 68 | | Model | F1 | Paper / Source | Code | 69 | | ------------- | :-----: | --- | --- | 70 | | Huang et al. (2019) | 97.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| | 71 | | Ma et al. (2018) | 97.2 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 72 | | Yang et al. (2017) | 96.9 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)| 73 | | Cai et al. (2017) | 95.6 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) | 74 | | Chen et al. (2017) | 95.6 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) | 75 | 76 | #### PKU 77 | 78 | | Model | F1 | Paper / Source | Code | 79 | | ------------- | :-----: | --- | --- | 80 | | Huang et al. (2019) | 96.6 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| | 81 | | Yang et al. (2017) | 96.3 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)| 82 | | Ma et al. (2018) | 96.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 83 | | Yang et al. (2018) | 95.9 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)| 84 | | Cai et al. 
(2017) | 95.8 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) | 85 | | Chen et al. (2017) | 94.3 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) | 86 | | Liu et al. (2016) | 95.7 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) | 87 | | Cai and Zhao (2016) | 95.7 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) | 88 | 89 | #### MSR 90 | 91 | | Model | F1 | Paper / Source | Code | 92 | | ------------- | :-----: | --- | --- | 93 | | Ma et al. (2018) | 98.1 | [State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://aclweb.org/anthology/D18-1529)| | 94 | | Huang et al. (2019) | 97.9 | [Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning](https://arxiv.org/pdf/1903.04190.pdf)| | 95 | | Yang et al. (2018) | 97.8 | [Subword Encoding in Lattice LSTM for Chinese Word Segmentation](https://arxiv.org/pdf/1810.12594.pdf) | [Github](https://github.com/jiesutd/SubwordEncoding-CWS)| 96 | | Yang et al. (2017) | 97.5 | [Neural Word Segmentation with Rich Pretraining](http://aclweb.org/anthology/P17-1078) | [Github](https://github.com/jiesutd/RichWordSegmentor)| 97 | | Cai et al. (2017) | 97.1 | [Fast and Accurate Neural Word Segmentation for Chinese](http://aclweb.org/anthology/P17-2096) | [Github](https://github.com/jcyk/greedyCWS) | 98 | | Chen et al. (2017) | 96.0 | [Adversarial Multi-Criteria Learning for Chinese Word Segmentation](http://aclweb.org/anthology/P17-1110) | [Github](https://github.com/FudanNLP/adversarial-multi-criteria-learning-for-CWS) | 99 | | Liu et al. (2016) | 97.6 | [Exploring Segment Representations for Neural Segmentation Models](https://www.ijcai.org/Proceedings/16/Papers/409.pdf)| [Github](https://github.com/Oneplus/segrep-for-nn-semicrf) | 100 | | Cai and Zhao (2016) | 96.4 | [Neural Word Segmentation Learning for Chinese](http://www.aclweb.org/anthology/P16-1039) | [Github](https://github.com/jcyk/CWS) | 101 | 102 | [Go back to the README](../README.md) 103 | -------------------------------------------------------------------------------- /english/named_entity_recognition.md: -------------------------------------------------------------------------------- 1 | # Named entity recognition 2 | 3 | Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. 4 | Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. 5 | O is used for non-entity tokens. 6 | 7 | Example: 8 | 9 | | Mark | Watney | visited | Mars | 10 | | --- | ---| --- | --- | 11 | | B-PER | I-PER | O | B-LOC | 12 | 13 | ### CoNLL 2003 (English) 14 | 15 | The [CoNLL 2003 NER task](http://www.aclweb.org/anthology/W03-0419.pdf) consists of newswire text from the Reuters RCV1 16 | corpus tagged with four different entity types (PER, LOC, ORG, MISC). Models are evaluated based on span-based F1 on the test set. ♦ used both the train and development splits for training. 
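Span-based F1 only credits an entity when both its boundaries and its type exactly match the gold annotation. The official `conlleval` script is the reference implementation; the snippet below is only a minimal sketch of the same idea over BIO-tagged sequences, with illustrative function names.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) entity spans."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the "O" sentinel flushes the last entity
        starts_new = tag == "O" or tag.startswith("B-") or tag[2:] != ent_type
        if starts_new and ent_type is not None:
            spans.add((ent_type, start, i))
        if starts_new:
            start, ent_type = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

def span_f1(gold_tags, pred_tags):
    """Micro-averaged F1 over exact-match entity spans."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# "Mark Watney visited Mars" from the example above
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]   # the system misses "Mars"
print(span_f1(gold, pred))            # 0.666... (precision 1.0, recall 0.5)
```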
17 | 18 | | Model | F1 | Paper / Source | Code | 19 | | ------------- | :-----:| --- | --- | 20 | | CNN Large + fine-tune (Baevski et al., 2019) | 93.5 | [Cloze-driven Pretraining of Self-attention Networks](https://arxiv.org/pdf/1903.07785.pdf) | | 21 | | Flair embeddings (Akbik et al., 2018)♦ | 93.09 | [Contextual String Embeddings for Sequence Labeling](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view) | [Flair framework](https://github.com/zalandoresearch/flair) 22 | | BERT Large (Devlin et al., 2018) | 92.8 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | | 23 | | CVT + Multi-Task (Clark et al., 2018) | 92.61 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) | 24 | | BERT Base (Devlin et al., 2018) | 92.4 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | | 25 | | BiLSTM-CRF+ELMo (Peters et al., 2018) | 92.22 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | [AllenNLP Project](https://allennlp.org/elmo) [AllenNLP GitHub](https://github.com/allenai/allennlp) | 26 | | Peters et al. (2017) ♦| 91.93 | [Semi-supervised sequence tagging with bidirectional language models](https://arxiv.org/abs/1705.00108) | | 27 | | CRF + AutoEncoder (Wu et al., 2018) | 91.87 | [Evaluating the Utility of Hand-crafted Features in Sequence Labelling](http://aclweb.org/anthology/D18-1310) | [Official](https://github.com/minghao-wu/CRF-AE) | 28 | | Bi-LSTM-CRF + Lexical Features (Ghaddar and Langlais 2018) | 91.73 | [Robust Lexical Features for Improved Neural Network Named-Entity Recognition](https://arxiv.org/pdf/1806.03489.pdf) | [Official](https://github.com/ghaddarAbs/NER-with-LS) | 29 | | Chiu and Nichols (2016) ♦| 91.62 | [Named entity recognition with bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308) | | 30 | | HSCRF (Ye and Ling, 2018)| 91.38 | [Hybrid semi-Markov CRF for Neural Sequence Labeling](http://aclweb.org/anthology/P18-2038) | [HSCRF](https://github.com/ZhixiuYe/HSCRF-pytorch) | 31 | | IXA pipes (Agerri and Rigau 2016) | 91.36 | [Robust multilingual Named Entity Recognition with shallow semi-supervised features](https://doi.org/10.1016/j.artint.2016.05.003)| [Official](https://github.com/ixa-ehu/ixa-pipe-nerc)| 32 | | NCRF++ (Yang and Zhang, 2018)| 91.35 | [NCRF++: An Open-source Neural Sequence Labeling Toolkit](http://www.aclweb.org/anthology/P18-4013) | [NCRF++](https://github.com/jiesutd/NCRFpp) | 33 | | LM-LSTM-CRF (Liu et al., 2018)| 91.24 | [Empowering Character-aware Sequence Labeling with Task-Aware Neural Language Model](https://arxiv.org/pdf/1709.04109.pdf) | [LM-LSTM-CRF](https://github.com/LiyuanLucasLiu/LM-LSTM-CRF) | 34 | | Yang et al. 
(2017) ♦| 91.26 | [Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks](https://arxiv.org/abs/1703.06345) | | 35 | | Ma and Hovy (2016) | 91.21 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](https://arxiv.org/abs/1603.01354) | | 36 | | LSTM-CRF (Lample et al., 2016) | 90.94 | [Neural Architectures for Named Entity Recognition](https://arxiv.org/abs/1603.01360) | | 37 | 38 | ### Long-tail emerging entities 39 | 40 | The [WNUT 2017 Emerging Entities task](http://aclweb.org/anthology/W17-4418) operates over a wide range of English 41 | text and focuses on generalisation beyond memorisation in high-variance environments. Scores are given both over 42 | entity chunk instances, and unique entity surface forms, to normalise the biasing impact of entities that occur frequently. 43 | 44 | | Feature | Train | Dev | Test | 45 | | --- | --- | --- | --- | 46 | | Posts | 3,395 | 1,009 | 1,287 | 47 | | Tokens | 62,729 | 15,733 | 23,394 | 48 | | NE tokens | 3,160 | 1,250 | 1,589 | 49 | 50 | The data is annotated for six classes - person, location, group, creative work, product and corporation. 51 | 52 | Links: [WNUT 2017 Emerging Entity task page](https://noisy-text.github.io/2017/emerging-rare-entities.html) (including direct download links for data and scoring script) 53 | 54 | | Model | F1 | F1 (surface form) | Paper / Source | 55 | | --- | --- | --- | --- | 56 | | Flair embeddings (Akbik et al., 2018) | 49.59 | | [Pooled Contextualized Embeddings for Named Entity Recognition](http://alanakbik.github.io/papers/naacl2019_embeddings.pdf) / [Flair framework](https://github.com/zalandoresearch/flair) | 57 | | Aguilar et al. (2018) | 45.55 | | [Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media](http://aclweb.org/anthology/N18-1127.pdf) | 58 | | SpinningBytes | 40.78 | 39.33 | [Transfer Learning and Sentence Level Features for Named Entity Recognition on Tweets](http://aclweb.org/anthology/W17-4422.pdf) | 59 | 60 | ### Ontonotes v5 (English) 61 | 62 | The [Ontonotes corpus v5](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf) is a richly annotated corpus with several layers of annotation, including named entities, coreference, part of speech, word sense, propositions, and syntactic parse trees. These annotations are over a large number of tokens, a broad cross-section of domains, and 3 languages (English, Arabic, and Chinese). The NER dataset (of interest here) includes 18 tags, consisting of 11 _types_ (PERSON, ORGANIZATION, etc) and 7 _values_ (DATE, PERCENT, etc), and contains 2 million tokens. The common datasplit used in NER is defined in [Pradhan et al 2013](https://www.semanticscholar.org/paper/Towards-Robust-Linguistic-Analysis-using-OntoNotes-Pradhan-Moschitti/a94e4fe6f475e047be5dcc9077f445e496240852) and can be found [here](http://cemantix.org/data/ontonotes.html). 
63 | 64 | | Model | F1 | Paper / Source | Code | 65 | | ------------- | :-----:| --- | --- | 66 | | Flair embeddings (Akbik et al., 2018) | 89.71 | [Contextual String Embeddings for Sequence Labeling](http://aclweb.org/anthology/C18-1139) | [Official](https://github.com/zalandoresearch/flair) | 67 | | CVT + Multi-Task (Clark et al., 2018) | 88.81 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) | 68 | | Bi-LSTM-CRF + Lexical Features (Ghaddar and Langlais 2018) | 87.95 | [Robust Lexical Features for Improved Neural Network Named-Entity Recognition](https://arxiv.org/pdf/1806.03489.pdf) | [Official](https://github.com/ghaddarAbs/NER-with-LS)| 69 | | BiLSTM-CRF (Strubell et al, 2017) | 86.99 | [Fast and Accurate Entity Recognition with Iterated Dilated Convolutions](https://arxiv.org/pdf/1702.02098.pdf) | [Official](https://github.com/iesl/dilated-cnn-ner) | 70 | | Iterated Dilated CNN (Strubell et al, 2017) | 86.84 | [Fast and Accurate Entity Recognition with Iterated Dilated Convolutions](https://arxiv.org/pdf/1702.02098.pdf) | [Official](https://github.com/iesl/dilated-cnn-ner) | 71 | | Chiu and Nichols (2016) | 86.28 | [Named entity recognition with bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308) | | 72 | | Joint Model (Durrett and Klein 2014) | 84.04 | [A Joint Model for Entity Analysis: Coreference, Typing, and Linking](https://pdfs.semanticscholar.org/2eaf/f2205c56378e715d8d12c521d045c0756a76.pdf) | 73 | | Averaged Perceptron (Ratinov and Roth 2009) | 83.45 | [Design Challenges and Misconceptions in Named Entity Recognition](https://www.semanticscholar.org/paper/Design-Challenges-and-Misconceptions-in-Named-Ratinov-Roth/27496a2ee337db705e7c611dea1fd8e6f41437c2) (These scores reported in ([Durrett and Klein 2014](https://pdfs.semanticscholar.org/2eaf/f2205c56378e715d8d12c521d045c0756a76.pdf))) | [Official](https://github.com/CogComp/cogcomp-nlp/tree/master/ner) | 74 | 75 | 76 | 77 | [Go back to the README](../README.md) 78 | -------------------------------------------------------------------------------- /english/relationship_extraction.md: -------------------------------------------------------------------------------- 1 | # Relationship Extraction 2 | 3 | Relationship extraction is the task of extracting semantic relationships from a text. Extracted relationships usually 4 | occur between two or more entities of a certain type (e.g. Person, Organisation, Location) and fall into a number of 5 | semantic categories (e.g. married to, employed by, lives in). 6 | 7 | ### New York Times Corpus 8 | 9 | The standard corpus for distantly supervised relationship extraction is the New York Times (NYT) corpus, published in 10 | [Riedel et al, 2010](http://www.riedelcastro.org//publications/papers/riedel10modeling.pdf). 11 | 12 | This contains text from the [New York Times Annotated Corpus](https://catalog.ldc.upenn.edu/ldc2008t19) with named 13 | entities extracted from the text using the Stanford NER system and automatically linked to entities in the Freebase 14 | knowledge base. Pairs of named entities are labelled with relationship types by aligning them against facts in the 15 | Freebase knowledge base. 
(The process of using a separate database to provide label is known as 'distant supervision') 16 | 17 | Example: 18 | > **Elevation Partners**, the $1.9 billion private equity group that was founded by **Roger McNamee** 19 | 20 | `(founded_by, Elevation_Partners, Roger_McNamee)` 21 | 22 | Different papers have reported various metrics since the release of the dataset, making it difficult to compare systems 23 | directly. The main metrics used are either precision at N results or plots of the precision-recall. The range of recall 24 | has increased over the years as systems improve, with earlier systems having very low precision at 30% recall. 25 | 26 | 27 | | Model | P@10% | P@30% | Paper / Source | Code | 28 | | ----------------------------------- | ----- | ----- | --------------- | -------------- | 29 | | HRERE (Xu et al., 2019) | 84.9 | 72.8 | [Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction](https://arxiv.org/abs/1903.10126) | [HRERE](https://github.com/billy-inn/HRERE) | 30 | | PCNN+noise_convert+cond_opt (Wu et al., 2019) | 81.7 | 61.8 | [Improving Distantly Supervised Relation Extraction with Neural Noise Converter and Conditional Optimal Selector](https://arxiv.org/pdf/1811.05616.pdf) | | 31 | | Intra- and Inter-Bag (Ye and Ling, 2019) | 78.9 | 62.4 | [Distant Supervision Relation Extraction with Intra-Bag and Inter-Bag Attentions](https://arxiv.org/pdf/1904.00143.pdf) | [Code](https://github.com/ZhixiuYe/Intra-Bag-and-Inter-Bag-Attentions) | 32 | | RESIDE (Vashishth et al., 2018) | 73.6 | 59.5 | [RESIDE: Improving Distantly-Supervised Neural Relation Extraction using Side Information](http://malllabiisc.github.io/publications/papers/reside_emnlp18.pdf) | [RESIDE](https://github.com/malllabiisc/RESIDE) | 33 | | PCNN+ATT (Lin et al., 2016) | 69.4 | 51.8 | [Neural Relation Extraction with Selective Attention over Instances](http://www.aclweb.org/anthology/P16-1200) | [OpenNRE](https://github.com/thunlp/OpenNRE/) | 34 | | MIML-RE (Surdeneau et al., 2012) | 60.7+ | - | [Multi-instance Multi-label Learning for Relation Extraction](http://www.aclweb.org/anthology/D12-1042) | [Mimlre](https://nlp.stanford.edu/software/mimlre.shtml) | 35 | | MultiR (Hoffman et al., 2011) | 60.9+ | - | [Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations](http://www.aclweb.org/anthology/P11-1055) | [MultiR](http://aiweb.cs.washington.edu/ai/raphaelh/mr/) | 36 | | (Mintz et al., 2009) | 39.9+ | - | [Distant supervision for relation extraction without labeled data](http://www.aclweb.org/anthology/P09-1113) | | 37 | 38 | 39 | (+) Obtained from results in the paper "Neural Relation Extraction with Selective Attention over Instances" 40 | 41 | ### SemEval-2010 Task 8 42 | 43 | [SemEval-2010](http://www.aclweb.org/anthology/S10-1006) introduced 'Task 8 - Multi-Way Classification of Semantic 44 | Relations Between Pairs of Nominals'. The task is, given a sentence and two tagged nominals, to predict the relation 45 | between those nominals *and* the direction of the relation. The dataset contains nine general semantic relations 46 | together with a tenth 'OTHER' relation. 47 | 48 | Example: 49 | > There were apples, **pears** and oranges in the **bowl**. 50 | 51 | `(content-container, pears, bowl)` 52 | 53 | The main evaluation metric used is macro-averaged F1, averaged across the nine proper relationships (i.e. excluding the 54 | OTHER relation), taking directionality of the relation into account. 
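To make the metric concrete, the sketch below macro-averages F1 over relation families while requiring the predicted direction to match exactly and ignoring the OTHER class. It is a simplification of the official Task 8 scorer; the label strings and helper function are illustrative.

```python
from collections import Counter

def family(label):
    """'Cause-Effect(e1,e2)' -> 'Cause-Effect'; the direction stays in the full label."""
    return label.split("(")[0]

def semeval_macro_f1(gold, pred, other="Other"):
    """Macro-averaged F1 over relation families, counting a prediction as correct
    only when relation *and* direction match, excluding the Other class.
    A simplified sketch of the official scorer."""
    tp, in_gold, in_pred = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if family(g) != other:
            in_gold[family(g)] += 1
        if family(p) != other:
            in_pred[family(p)] += 1
        if g == p and family(g) != other:
            tp[family(g)] += 1
    f1s = []
    for rel in set(in_gold) | set(in_pred):  # the real scorer averages over the nine fixed families
        prec = tp[rel] / in_pred[rel] if in_pred[rel] else 0.0
        rec = tp[rel] / in_gold[rel] if in_gold[rel] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

gold = ["Content-Container(e1,e2)", "Cause-Effect(e2,e1)", "Other"]
pred = ["Content-Container(e1,e2)", "Cause-Effect(e1,e2)", "Other"]
print(semeval_macro_f1(gold, pred))   # 0.5: the direction error on Cause-Effect costs the full point
```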
55 | 56 | Several papers have used additional data (e.g. pre-trained word embeddings, WordNet) to improve performance. The figures 57 | reported here are the highest achieved by the model using any external resources. 58 | 59 | #### End-to-End Models 60 | 61 | | Model | F1 | Paper / Source | Code | 62 | | -------------------------------------- | ----- | --------------- | -------------- | 63 | | *BERT-based Models* | 64 | | R-BERT (Wu et al. 2019) | **89.25** | [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://arxiv.org/abs/1905.08284) | 65 | | *CNN-based Models* | 66 | | Multi-Attention CNN (Wang et al. 2016) | **88.0** | [Relation Classification via Multi-Level Attention CNNs](http://aclweb.org/anthology/P16-1123) | [lawlietAi's Reimplementation](https://github.com/lawlietAi/relation-classification-via-attention-model) | 67 | | Attention CNN (Huang and Y Shen, 2016) | 84.3
85.9[\*](#footnote) | [Attention-Based Convolutional Neural Network for Semantic Relation Extraction](http://www.aclweb.org/anthology/C16-1238) | 68 | | CR-CNN (dos Santos et al., 2015) | 84.1 | [Classifying Relations by Ranking with Convolutional Neural Network](https://www.aclweb.org/anthology/P15-1061) | [pratapbhanu's Reimplementation](https://github.com/pratapbhanu/CRCNN) | 69 | | CNN (Zeng et al., 2014) | 82.7 | [Relation Classification via Convolutional Deep Neural Network](http://www.aclweb.org/anthology/C14-1220) | [roomylee's Reimplementation](https://github.com/roomylee/cnn-relation-extraction) | 70 | | *RNN-based Models* | 71 | | Entity Attention Bi-LSTM (Lee et al., 2019) | **85.2** | [Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing](https://arxiv.org/abs/1901.08163) | [Official](https://github.com/roomylee/entity-aware-relation-classification) | 72 | | Hierarchical Attention Bi-LSTM (Xiao and C Liu, 2016) | 84.3 | [Semantic Relation Classification via Hierarchical Recurrent Neural Network with Attention](http://www.aclweb.org/anthology/C16-1119) | 73 | | Attention Bi-LSTM (Zhou et al., 2016) | 84.0 | [Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification](http://www.aclweb.org/anthology/P16-2034) | [SeoSangwoo's Reimplementation](https://github.com/SeoSangwoo/Attention-Based-BiLSTM-relation-extraction) | 74 | | Bi-LSTM (Zhang et al., 2015) | 82.7
84.3[\*](#footnote) | [Bidirectional long short-term memory networks for relation classification](http://www.aclweb.org/anthology/Y15-1009) | 75 | 76 | *: It uses external lexical resources, such as WordNet, part-of-speech tags, dependency tags, and named entity tags. 77 | 78 | 79 | #### Dependency Models 80 | 81 | | Model | F1 | Paper / Source | Code | 82 | | ----------------------------------- | ----- | --------------- | -------------- | 83 | | BRCNN (Cai et al., 2016) | **86.3** | [Bidirectional Recurrent Convolutional Neural Network for Relation Classification](http://www.aclweb.org/anthology/P16-1072) | 84 | | DRNNs (Xu et al., 2016) | 86.1 | [Improved Relation Classification by Deep Recurrent Neural Networks with Data Augmentation](https://arxiv.org/abs/1601.03651) | 85 | | depLCNN + NS (Xu et al., 2015a) | 85.6 | [Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling](https://www.aclweb.org/anthology/D/D15/D15-1062.pdf) | 86 | | SDP-LSTM (Xu et al., 2015b) | 83.7 | [Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Path](https://arxiv.org/abs/1508.03720) | [Sshanu's Reimplementation](https://github.com/Sshanu/Relation-Classification) | 87 | | DepNN (Liu et al., 2015) | 83.6 | [A Dependency-Based Neural Network for Relation Classification](http://www.aclweb.org/anthology/P15-2047) | 88 | | FCN (Yu et al., 2014) | 83.0 | [Factor-based compositional embedding models](https://www.cs.cmu.edu/~mgormley/papers/yu+gormley+dredze.nipsw.2014.pdf) | 89 | | MVRNN (Socher et al., 2012) | 82.4 | [Semantic Compositionality through Recursive Matrix-Vector Spaces](http://aclweb.org/anthology/D12-1110) | [pratapbhanu's Reimplementation](https://github.com/pratapbhanu/MVRNN) | 90 | 91 | 92 | ### TACRED 93 | 94 | [TACRED](https://nlp.stanford.edu/projects/tacred/) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the [corpus](https://catalog.ldc.upenn.edu/LDC2018T03) used in the yearly [TAC Knowledge Base Population (TAC KBP) challenges](https://tac.nist.gov/2017/KBP/index.html). Examples in TACRED cover 41 relation types as used in the TAC KBP challenges (e.g., _per:schools_attended_ and _org:members_) or are labeled as _no_relation_ if no defined relation is held. These examples are created by combining available human annotations from the TAC KBP challenges and crowdsourcing. 95 | 96 | Example: 97 | > *Billy Mays*, the bearded, boisterious pitchman who, as the undisputed king of TV yell and sell, became an inlikely pop culture icon, died at his home in *Tampa*, Fla, on Sunday. 98 | 99 | `(per:city_of_death, Billy Mays, Tampa)` 100 | 101 | The main evaluation metric used is micro-averaged F1 over instances with proper relationships (i.e. excluding the 102 | _no_relation_ type). 103 | 104 | | Model | F1 | Paper / Source | Code | 105 | | -------------------------------------- | ----- | --------------- | -------------- | 106 | | C-GCN + PA-LSTM (Zhang et al. 
2018) | **68.2** | [Graph Convolution over Pruned Dependency Trees Improves Relation Extraction](http://aclweb.org/anthology/D18-1244) | [Offical](https://github.com/qipeng/gcn-over-pruned-trees) | 107 | | PA-LSTM (Zhang et al, 2017) | 65.1 | [Position-aware Attention and Supervised Data Improve Slot Filling](http://aclweb.org/anthology/D17-1004) | [Official](https://github.com/yuhaozhang/tacred-relation) | 108 | 109 | 110 | 111 | # FewRel 112 | 113 | The Few-Shot Relation Classification Dataset (FewRel) is a different setting from the previous datasets. This dataset consists of 70K sentences expressing 100 relations annotated by crowdworkers on Wikipedia corpus. The few-shot learning task follows the N-way K-shot meta learning setting. It is both the largest supervised relation classification dataset as well as the largest few-shot learning dataset till now. 114 | 115 | The public leaderboard is available on the [FewRel website](http://www.zhuhao.me/fewrel/). 116 | 117 | [Go back to the README](../README.md) 118 | -------------------------------------------------------------------------------- /vietnamese/vietnamese.md: -------------------------------------------------------------------------------- 1 | # Vietnamese NLP tasks 2 | 3 | ## Dependency parsing 4 | 5 | * The last 1020 sentences of the [benchmark Vietnamese dependency treebank VnDT](http://vndp.sourceforge.net) are used for test, while the remaining 9k+ sentences are used for training & development. LAS and UAS scores are computed on all 6 | tokens (i.e. including punctuation). 7 | 8 | 9 | | | Model | LAS | UAS | Paper | Code | 10 | | ----- | ------------- | :-----:| --- | --- | --- | 11 | | **Predicted POS** | Biaffine (2017) | 73.53 | 80.84 | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) | | 12 | | **Predicted POS** | jointWPD (2018) | 72.56 | 79.75 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459) | | 13 | | **Predicted POS** | jPTDP-v2 (2018) | 71.72 | 79.26 | [An improved neural network model for joint POS tagging and dependency parsing](http://aclweb.org/anthology/K18-2008) | | 14 | | **Predicted POS** | VnCoreNLP (2018) | 70.23 | 76.93 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | 15 | | Gold POS | VnCoreNLP (2018) |73.39 |79.02 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | 16 | | Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17|79.39 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/bmstparser/src) | 17 | | Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53| 79.33 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/barchybrid/src) | 18 | | Gold POS | MSTparser (2006) | 70.29 | 76.47 | [Online large-margin training of dependency parsers](http://www.aclweb.org/anthology/P05-1012) | | 19 | | Gold POS | MaltParser (2007) | 69.10 | 74.91 | [MaltParser: A language-independent system for datadriven dependency parsing](https://stp.lingfil.uu.se/~nivre/docs/nle07.pdf) | | 20 | 21 | * Results 
for Biaffine and jPTDP-v2 are reported in "[A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459)." 22 | * Results for the BIST graph/transition-based parsers, MSTparser and MaltParser are reported in "[An empirical study for Vietnamese dependency parsing](http://www.aclweb.org/anthology/U16-1017)." 23 | 24 | ## Machine translation 25 | 26 | ### English-to-Vietnamese translation 27 | * Dataset is from [The IWSLT 2015 Evaluation Campaign](http://workshop2015.iwslt.org/downloads/proceeding.pdf), also be obtained from [https://github.com/tensorflow/nmt](https://github.com/tensorflow/nmt): `tst2012` is used for development while `tst2013` is used for test. Scores are computed for single models. 28 | 29 | | Model | BLEU | Paper | Code | 30 | | ------------- | :-----:| --- | --- | 31 | | CVT (2018) | 29.6 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | | 32 | | ELMo (2018) | 29.3 | [Deep contextualized word representations](http://aclweb.org/anthology/N18-1202)| | 33 | | Transformer (2017) | 28.9 | [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need) | [Link](https://github.com/duyvuleo/Transformer-DyNet) | 34 | | Google (2017) | 26.1 | [Neural machine translation (seq2seq) tutorial](https://github.com/tensorflow/nmt) | [Official](https://github.com/tensorflow/nmt) | 35 | | Stanford (2015) |23.3 | [Stanford Neural Machine Translation Systems for Spoken Language Domains](https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf) | | 36 | 37 | * The ELMo score is reported in [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370). The Transformer score is available at [https://github.com/duyvuleo/Transformer-DyNet](https://github.com/duyvuleo/Transformer-DyNet). 38 | 39 | ## Named entity recognition 40 | * 16,861 sentences for training and development from the VLSP 2016 NER shared task: 41 | * 14,861 sentences are used for training. 42 | * 2k sentences are used for development. 43 | * Test data: 2,831 test sentences from the VLSP 2016 NER shared task. 44 | * **NOTE** that in the VLSP 2016 NER data, each word representing a full personal name are separated into syllables that constitute the word. The VLSP 2016 NER data also consists of gold POS and chunking tags as [reconfirmed by VLSP 2016 organizers](https://drive.google.com/file/d/1XzrgPw13N4C_B6yrQy_7qIxl8Bqf7Uqi/view?usp=sharing). This scheme results in an unrealistic scenario for a pipeline evaluation: 45 | * The standard annotation for Vietnamese word segmentation and POS tagging forms each full name as a word token, thus all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a POS label to the entire full-name. 46 | * Gold POS and chunking tags are NOT available in a real-world application. 47 | * For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags--if used--have to be automatically predicted! 
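A sketch of the syllable-merging step described in the last point is shown below; the entity labels, the underscore-joining convention and the example sentence are illustrative only.

```python
def merge_name_syllables(syllables, tags):
    """Merge contiguous syllables that belong to one named entity into a single
    word token, joining syllables with '_'. Illustrative sketch only."""
    words, word_tags = [], []
    i = 0
    while i < len(syllables):
        tag = tags[i]
        if tag.startswith("B-"):
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + tag[2:]:
                j += 1
            words.append("_".join(syllables[i:j]))
            word_tags.append(tag)
            i = j
        else:
            words.append(syllables[i])
            word_tags.append(tag)
            i += 1
    return words, word_tags

syllables = ["Nguyễn", "Văn", "A", "sống", "tại", "Hà", "Nội"]
tags      = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(merge_name_syllables(syllables, tags))
# (['Nguyễn_Văn_A', 'sống', 'tại', 'Hà_Nội'], ['B-PER', 'O', 'O', 'B-LOC'])
```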
48 | 49 | | Model | F1 | Paper | Code | Note | 50 | | ------------- | :-----:| --- | --- | --- | 51 | | VnCoreNLP (2018) [1] | 91.30 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | Pre-trained embeddings learned from Vietnamese Wikipedia corpus | 52 | | BiLSTM-CRF + CNN-char (2016) [1] | 91.09 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) / [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Vietnamese Wikipedia corpus | 53 | | VNER (2019) | 89.58 | [Attentive Neural Network for Named Entity Recognition in Vietnamese](https://arxiv.org/abs/1810.13097) | | 54 | | VnCoreNLP (2018) | 88.55 | [VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012) | [Official](https://github.com/vncorenlp/VnCoreNLP) | Pre-trained embeddings learned from Baomoi corpus | 55 | | BiLSTM-CRF + CNN-char (2016) [2] | 88.28 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) / [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus | 56 | | BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | [Neural Architectures for Named Entity Recognition](http://www.aclweb.org/anthology/N16-1030) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus | 57 | | BiLSTM-CRF (2015) [2] | 86.48 | [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/abs/1508.01991) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | Pre-trained embeddings learned from Baomoi corpus | 58 | 59 | * [1] denotes that scores are reported in "[ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings](https://arxiv.org/pdf/1903.04433v1.pdf)" 60 | * [2] denotes that BiLSTM-CRF-based scores are reported in "[VnCoreNLP: A Vietnamese Natural Language Processing Toolkit](http://aclweb.org/anthology/N18-5012)" 61 | 62 | ## Part-of-speech tagging 63 | 64 | * 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task: 65 | * 27k sentences are used for training. 66 | * 870 sentences are used for development. 67 | * Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task. 
68 | 69 | | Model | Accuracy | Paper | Code | 70 | | ------------- | :-----:| --- | --- | 71 | | jointWPD (2018) | 95.93 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459) | | 72 | | VnCoreNLP-VnMarMoT (2017) | 95.88 | [From Word Segmentation to POS Tagging for Vietnamese](http://aclweb.org/anthology/U17-1013) | [Official](https://github.com/datquocnguyen/vnmarmot) | 73 | | jPTDP-v2 (2018) | 95.61 | [An improved neural network model for joint POS tagging and dependency parsing](http://aclweb.org/anthology/K18-2008) | | 74 | | BiLSTM-CRF + CNN-char (2016) | 95.40 | [End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF](http://www.aclweb.org/anthology/P16-1101) | [Official](https://github.com/XuezheMax/LasagneNLP) / [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 75 | | BiLSTM-CRF + LSTM-char (2016) | 95.31 | [Neural Architectures for Named Entity Recognition](http://www.aclweb.org/anthology/N16-1030) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 76 | | BiLSTM-CRF (2015) | 95.06 | [Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/abs/1508.01991) | [Link](https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf/) | 77 | | RDRPOSTagger (2014) | 95.11 | [RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger](http://www.aclweb.org/anthology/E14-2005) | [Official](https://github.com/datquocnguyen/rdrpostagger) | 78 | 79 | * Result for jPTDP-v2 is reported in "[A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459)." 80 | * Results for BiLSTM-CRF-based models and RDRPOSTagger are reported in "[From Word Segmentation to POS Tagging for Vietnamese](http://aclweb.org/anthology/U17-1013)." 81 | 82 | ## Word segmentation 83 | 84 | * Training & development data: 75k manually word-segmented training sentences from the [VLSP](http://vlsp.org.vn/) 2013 word segmentation shared task. 85 | * Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task. 86 | 87 | | Model | F1 | Paper | Code | 88 | | ------------- | :-----:| --- | --- | 89 | | VnCoreNLP-RDRsegmenter (2018) | 97.90 | [A Fast and Accurate Vietnamese Word Segmenter](http://www.lrec-conf.org/proceedings/lrec2018/pdf/55.pdf) | [Official](https://github.com/datquocnguyen/RDRsegmenter) | 90 | | UETsegmenter (2016) | 97.87 | [A hybrid approach to Vietnamese word segmentation](http://doi.org/10.1109/RIVF.2016.7800279) | [Official](https://github.com/phongnt570/UETsegmenter) | 91 | | jointWPD (2018) | 97.78 | [A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing](https://arxiv.org/abs/1812.11459) | | 92 | | vnTokenizer (2008) | 97.33 | [A Hybrid Approach to Word Segmentation of Vietnamese Texts](https://link.springer.com/chapter/10.1007/978-3-540-88282-4_23) | | 93 | | JVnSegmenter (2006) | 97.06 | [Vietnamese Word Segmentation with CRFs and SVMs: An Investigation](http://www.aclweb.org/anthology/Y06-1028) | | 94 | | DongDu (2012) | 96.90 | [Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt](https://tiengvietmenyeu.wordpress.com/2013/02/16/ung%C2%B7dung-phuong%C2%B7phap-pointwise-vao-bai%C2%B7toan-tach-tu-cho-tieng%C2%B7viet/) | | 95 | 96 | * Results for VnTokenizer, JVnSegmenter and DongDu are reported in "[A hybrid approach to Vietnamese word segmentation](http://doi.org/10.1109/RIVF.2016.7800279)." 
97 | -------------------------------------------------------------------------------- /english/dependency_parsing.md: -------------------------------------------------------------------------------- 1 | # Dependency parsing 2 | 3 | Dependency parsing is the task of extracting a dependency parse of a sentence that represents its grammatical 4 | structure and defines the relationships between "head" words and words, which modify those heads. 5 | 6 | Example: 7 | 8 | ``` 9 | root 10 | | 11 | | +-------dobj---------+ 12 | | | | 13 | nsubj | | +------det-----+ | +-----nmod------+ 14 | +--+ | | | | | | | 15 | | | | | | +-nmod-+| | | +-case-+ | 16 | + | + | + + || + | + | | 17 | I prefer the morning flight through Denver 18 | ``` 19 | 20 | Relations among the words are illustrated above the sentence with directed, labeled 21 | arcs from heads to dependents (+ indicates the dependent). 22 | 23 | ### Penn Treebank 24 | 25 | Models are evaluated on the [Stanford Dependency](https://nlp.stanford.edu/software/dependencies_manual.pdf) 26 | conversion (**v3.3.0**) of the Penn Treebank with __predicted__ POS-tags. Punctuation symbols 27 | are excluded from the evaluation. Evaluation metrics are unlabeled attachment score (UAS) and labeled attachment score (LAS). UAS does not consider the semantic relation (e.g. Subj) used to label the attachment between the head and the child, while LAS requires a semantic correct label for each attachment.Here, we also mention the predicted POS tagging accuracy. 28 | 29 | | Model | POS | UAS | LAS | Paper / Source | Code | 30 | | ------------------------------------------------------------ | :---: | :---: | :---: | ------------------------------------------------------------ | ------------------------------------------------------------ | 31 | | CVT + Multi-Task (Clark et al., 2018) | 97.74 | 96.61 | 95.02 | [Semi-Supervised Sequence Modeling with Cross-View Training](https://arxiv.org/abs/1809.08370) | [Official](https://github.com/tensorflow/models/tree/master/research/cvt_text) | 32 | | Deep Biaffine (Dozat and Manning, 2017) | 97.3 | 95.74 | 94.08 | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) | [Official](https://github.com/tdozat/Parser-v1) | 33 | | jPTDP (Nguyen and Verspoor, 2018) | 97.97 | 94.51 | 92.87 | [An improved neural network model for joint POS tagging and dependency parsing](https://arxiv.org/abs/1807.03955) | [Official](https://github.com/datquocnguyen/jPTDP) | 34 | | Andor et al. (2016) | 97.44 | 94.61 | 92.79 | [Globally Normalized Transition-Based Neural Networks](https://www.aclweb.org/anthology/P16-1231) | | 35 | | Distilled neural FOG (Kuncoro et al., 2016) | 97.3 | 94.26 | 92.06 | [Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser](https://arxiv.org/abs/1609.07561) | | 36 | | Distilled transition-based parser (Liu et al., 2018) | 97.3 | 94.05 | 92.14 | [Distilling Knowledge for Search-based Structured Prediction](http://aclweb.org/anthology/P18-1129) | [Official](https://github.com/Oneplus/twpipe) | 37 | | Weiss et al. 
(2015) | 97.44 | 93.99 | 92.05 | [Structured Training for Neural Network Transition-Based Parsing](http://anthology.aclweb.org/P/P15/P15-1032.pdf) | | 38 | | BIST transition-based parser (Kiperwasser and Goldberg, 2016) | 97.3 | 93.9 | 91.9 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/barchybrid/src) | 39 | | Arc-hybrid (Ballesteros et al., 2016) | 97.3 | 93.56 | 91.42 | [Training with Exploration Improves a Greedy Stack-LSTM Parser](https://arxiv.org/abs/1603.03793) | | 40 | | BIST graph-based parser (Kiperwasser and Goldberg, 2016) | 97.3 | 93.1 | 91.0 | [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023) | [Official](https://github.com/elikip/bist-parser/tree/master/bmstparser/src) | 41 | 42 | ### Universal Dependencies 43 | 44 | The focus of the task is learning syntactic dependency parsers that can work in a real-world setting, starting from raw text, and that can work over many typologically different languages, even low-resource languages for which there is little or no training data, by exploiting a common syntactic annotation standard. This task has been made possible by the Universal Dependencies initiative (UD, http://universaldependencies.org/), which has developed treebanks for 60+ languages with cross-linguistically consistent annotation and recoverability of the original raw texts. 45 | 46 | Participating systems will have to find labeled syntactic dependencies between words, i.e. a syntactic head for each word, and a label classifying the type of the dependency relation. In addition to syntactic dependencies, prediction of morphology and lemmatization will be evaluated. There will be multiple test sets in various languages but all data sets will adhere to the common annotation style of UD. Participants will be asked to parse raw text where no gold-standard pre-processing (tokenization, lemmas, morphology) is available. Data preprocessed by a baseline system (UDPipe, https://ufal.mff.cuni.cz/udpipe/) was provided so that the participants could focus on improving just one part of the processing pipeline. The organizers believed that this made the task reasonably accessible for everyone. 47 | 48 | | Model | LAS | MLAS | BLEX | Paper / Source | Code | 49 | | ------------------------------------------------------------ | :---: | :---: | :---: | ------------------------------------------------------------ | ------------------------------------------------------------ | 50 | | Stanford (Qi et al.) | 74.16 | 62.08 | 65.28 | [Universal Dependency Parsing from Scratch](https://arxiv.org/pdf/1901.10457.pdf) | [Official](https://github.com/stanfordnlp/stanfordnlp) | 51 | | UDPipe Future (Straka) | 73.11 | 61.25 | 64.49 | [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020) | [Official](https://github.com/CoNLL-UD-2018/UDPipe-Future) | 52 | | HIT-SCIR (Che et al.) | 75.84 | 59.78 | 65.33 | [Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation](https://arxiv.org/abs/1807.03121) | | 53 | | TurkuNLP (Kanerva et al.) 
| 73.28 | 60.99 | 66.09 | [Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task](https://universaldependencies.org/conll18/proceedings/pdf/K18-2013.pdf) | [Official](https://github.com/TurkuNLP/Turku-neural-parser-pipeline) | 54 | 55 | The following results are just for references: 56 | 57 | | Model | UAS | LAS | Note | Paper / Source | 58 | | ------------------------------------------------------------ | :---: | :---: | ------------------------------ | ------------------------------------------------------------ | 59 | | Stack-only RNNG (Kuncoro et al., 2017) | 95.8 | 94.6 | Constituent parser | [What Do Recurrent Neural Network Grammars Learn About Syntax?](https://arxiv.org/abs/1611.05774) | 60 | | Deep Biaffine (Dozat and Manning, 2017) | 95.75 | 94.22 | Stanford conversion **v3.5.0** | [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734) | 61 | | Semi-supervised LSTM-LM (Choe and Charniak, 2016) (Constituent parser) | 95.9 | 94.1 | Constituent parser | [Parsing as Language Modeling](http://www.aclweb.org/anthology/D16-1257) | 62 | 63 | # Cross-lingual zero-shot dependency parsing 64 | 65 | Cross-lingual zero-shot parsing is the task of inferring the dependency parse of sentences from one language without any labeled training trees for that language. 66 | 67 | ## Universal Dependency Treebank 68 | 69 | Models are evaluated against the [Universal Dependency Treebank v2.0](https://github.com/ryanmcd/uni-dep-tb/). For each of the 6 target languages, models can use the trees of all other languages and English and are evaluated by the UAS and LAS on the target. The final score is the average score across the 6 target languages. The most common evaluation setup is to use 70 | gold POS-tags. 71 | 72 | | Model | UAS | LAS | Paper / Source | Code | 73 | | ------------- | :-----:| :-----:| --- | --- | 74 | | Cross-Lingual ELMo (Schuster et al., 2019) | 84.2 | 77.3 | [Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing](https://arxiv.org/abs/1902.09492) | [Official](https://github.com/TalSchuster/CrossLingualELMo) 75 | | MALOPA (Ammar et al., 2016) | | 70.5 | [Many Languages, One Parser](https://www.transacl.org/ojs/index.php/tacl/article/view/892) | [Official](https://github.com/clab/language-universal-parser) 76 | | Guo et al. (2016) | 76.7 | 69.9 | [A representation learning framework for multi-source transfer parsing](https://dl.acm.org/citation.cfm?id=3016100.3016284) | 77 | 78 | # Unsupervised dependency parsing 79 | 80 | Unsupervised dependency parsing is the task of inferring the dependency parse of sentences without any labeled training data. 81 | 82 | ## Penn Treebank 83 | 84 | As with supervised parsing, models are evaluated against the Penn Treebank. The most common evaluation setup is to use 85 | gold POS-tags as input and to evaluate systems using the unlabeled attachment score (also called 'directed dependency 86 | accuracy'). 
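Both attachment scores used on this page reduce to simple token-level counts over non-punctuation tokens: UAS is the fraction of tokens that receive the correct head, and LAS additionally requires the correct relation label. A minimal sketch (the per-token data format is illustrative):

```python
def attachment_scores(gold, pred, exclude_punct=True):
    """gold/pred: one (head_index, relation_label, is_punct) triple per token."""
    correct_heads = correct_labeled = total = 0
    for (g_head, g_rel, punct), (p_head, p_rel, _) in zip(gold, pred):
        if exclude_punct and punct:
            continue
        total += 1
        if g_head == p_head:
            correct_heads += 1
            if g_rel == p_rel:
                correct_labeled += 1
    return correct_heads / total, correct_labeled / total   # UAS, LAS

# "I prefer the morning flight" with heads as 1-based token indices (0 = root)
gold = [(2, "nsubj", False), (0, "root", False), (5, "det", False),
        (5, "nmod", False), (2, "dobj", False)]
pred = [(2, "nsubj", False), (0, "root", False), (5, "det", False),
        (5, "amod", False), (3, "dobj", False)]
print(attachment_scores(gold, pred))   # (0.8, 0.6)
```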
87 | 88 | | Model | UAS | Paper / Source | 89 | | ---------------------------------------------------- | :--: | ------------------------------------------------------------ | 90 | | Iterative reranking (Le & Zuidema, 2015) | 66.2 | [Unsupervised Dependency Parsing - Let’s Use Supervised Parsers](http://www.aclweb.org/anthology/N15-1067) | 91 | | Combined System (Spitkovsky et al., 2013) | 64.4 | [Breaking Out of Local Optima with Count Transforms and Model Recombination - A Study in Grammar Induction](http://www.aclweb.org/anthology/D13-1204) | 92 | | Tree Substitution Grammar DMV (Blunsom & Cohn, 2010) | 55.7 | [Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing](http://www.aclweb.org/anthology/D10-1117) | 93 | | Shared Logistic Normal DMV (Cohen & Smith, 2009) | 41.4 | [Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction](http://www.aclweb.org/anthology/N09-1009) | 94 | | DMV (Klein & Manning, 2004) | 35.9 | [Corpus-Based Induction of Syntactic Structure - Models of Dependency and Constituency](http://www.aclweb.org/anthology/P04-1061) | 95 | 96 | [Go back to the README](../README.md) 97 | -------------------------------------------------------------------------------- /english/dialogue.md: -------------------------------------------------------------------------------- 1 | # Dialogue 2 | 3 | Dialogue is notoriously hard to evaluate. Past approaches have used human evaluation. 4 | 5 | ## Dialogue act classification 6 | 7 | Dialogue act classification is the task of classifying an utterance with respect to the fuction it serves in a dialogue, i.e. the act the speaker is performing. Dialogue acts are a type of speech acts (for Speech Act Theory, see [Austin (1975)](http://www.hup.harvard.edu/catalog.php?isbn=9780674411524) and [Searle (1969)](https://www.cambridge.org/core/books/speech-acts/D2D7B03E472C8A390ED60B86E08640E7)). 8 | 9 | ### Switchboard corpus 10 | The [Switchboard-1 corpus](https://catalog.ldc.upenn.edu/ldc97s62) is a telephone speech corpus, consisting of about 2,400 two-sided telephone conversation among 543 speakers with about 70 provided conversation topics. The dataset includes the audio files and the transcription files, as well as information about the speakers and the calls. 11 | 12 | The Switchboard Dialogue Act Corpus (SwDA) [[download](https://web.stanford.edu/~jurafsky/swb1_dialogact_annot.tar.gz)] extends the Switchboard-1 corpus with tags from the [SWBD-DAMSL tagset](https://web.stanford.edu/~jurafsky/ws97/manual.august1.html), which is an augmentation to the Discourse Annotation and Markup System of Labeling (DAMSL) tagset. The 220 tags were reduced to 42 tags by clustering in order to improve the language model on the Switchboard corpus. A subset of the Switchboard-1 corpus consisting of 1155 conversations was used. The resulting tags include dialogue acts like statement-non-opinion, acknowledge, statement-opinion, agree/accept, etc. 13 | Annotated example: 14 | *Speaker:* A, *Dialogue Act:* Yes-No-Question, *Utterance:* So do you go to college right now? 
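Most systems below classify each utterance in the context of the surrounding conversation. A minimal sketch of such a hierarchical architecture is given here; the hyper-parameters and layer choices are illustrative, and the CRF output layer used by several of the listed models is omitted for brevity.

```python
import torch
import torch.nn as nn

class HierarchicalDialogueActTagger(nn.Module):
    """Sketch of a context-based tagger: a word-level BiLSTM builds one vector per
    utterance, and a conversation-level BiLSTM tags every utterance with one of
    the 42 SwDA dialogue acts. Illustrative only."""

    def __init__(self, vocab_size, num_acts=42, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.utterance_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dialogue_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_acts)

    def forward(self, dialogue):                 # dialogue: (num_utterances, max_words) word ids
        emb = self.embed(dialogue)               # (U, W, emb_dim)
        _, (h, _) = self.utterance_rnn(emb)      # h: (2, U, hidden) final states per direction
        utt_vecs = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)   # (1, U, 2*hidden)
        context, _ = self.dialogue_rnn(utt_vecs)                  # conversation-level context
        return self.classifier(context).squeeze(0)                # (U, num_acts) logits

model = HierarchicalDialogueActTagger(vocab_size=10_000)
fake_dialogue = torch.randint(1, 10_000, (3, 12))   # 3 utterances, 12 word ids each
print(model(fake_dialogue).shape)                    # torch.Size([3, 42])
```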
15 | 16 | | Model | Accuracy | Paper / Source | Code | 17 | | ------------- | :-----:| --- | --- | 18 | | CRF-ASN (Chen et al., 2018) | 81.3 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | | 19 | | Bi-LSTM-CRF (Kumar et al., 2017) | 79.2 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) | 20 | | RNN with 3 utterances in context (Bothe et al., 2018) | 77.34 | [A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks](https://arxiv.org/abs/1805.06280) | | 21 | 22 | 23 | ### ICSI Meeting Recorder Dialog Act (MRDA) corpus 24 | The [MRDA corpus](http://www1.icsi.berkeley.edu/Speech/mr/) [[download](http://www.icsi.berkeley.edu/~ees/dadb/icsi_mrda+hs_corpus_050512.tar.gz)] consists of about 75 hours of speech from 75 naturally-occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. It is annotated with three types of information: marking of the dialogue act segment boundaries, marking of the dialogue acts and marking of correspondences between dialogue acts. 25 | Annotated example: 26 | *Time:* 2804-2810, *Speaker:* c6, *Dialogue Act:* s^bd, *Transcript:* i mean these are just discriminative. 27 | Multiple dialogue acts are separated by "^". 28 | 29 | | Model | Accuracy | Paper / Source | Code | 30 | | ------------- | :-----:| --- | --- | 31 | | CRF-ASN (Chen et al., 2018) | 91.7 | [Dialogue Act Recognition via CRF-Attentive Structured Network](https://arxiv.org/abs/1711.05568) | | 32 | | Bi-LSTM-CRF (Kumar et al., 2017) | 90.9 | [Dialogue Act Sequence Labeling using Hierarchical encoder with CRF](https://arxiv.org/abs/1709.04250) | [Link](https://github.com/YanWenqiang/HBLSTM-CRF) | 33 | 34 | ## Dialogue state tracking 35 | 36 | Dialogue state tacking consists of determining at each turn of a dialogue the 37 | full representation of what the user wants at that point in the dialogue, 38 | which contains a goal constraint, a set of requested slots, and the user's dialogue act. 39 | 40 | ### Second dialogue state tracking challenge 41 | 42 | For goal-oriented dialogue, the dataset of the [second Dialogue Systems Technology Challenges](http://www.aclweb.org/anthology/W14-4337) 43 | (DSTC2) is a common evaluation dataset. The DSTC2 focuses on the restaurant search domain. Models are 44 | evaluated based on accuracy on both individual and joint slot tracking. 45 | 46 | | Model | Request | Area | Food | Price | Joint | Paper / Source | 47 | | ------------- | :-----: | :-----:| :-----:| :-----:| :-----:| --- | 48 | | Zhong et al. (2018) | 97.5 | - | - | - | 74.5| [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) | 49 | | Liu et al. 
(2018) | - | 90 | 84 | 92 | 72 | [Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems](https://arxiv.org/abs/1804.06512) | 50 | | Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 90 | 84 | 94 | 73.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) | 51 | | RNN (Henderson et al., 2014) | 95.7 | 92 | 86 | 86 | 69 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised gate](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) | 52 | 53 | ### Wizard-of-Oz 54 | 55 | The [WoZ 2.0 dataset](https://arxiv.org/pdf/1606.03777.pdf) is a newer dialogue state tracking dataset whose evaluation is detached from the noisy output of speech recognition systems. Similar to DSTC2, it covers the restaurant search domain and has identical evaluation. 56 | 57 | 58 | | Model | Request | Joint | Paper / Source | 59 | | ------------- | :-----:| :-----:| --- | 60 | | Zhong et al. (2018) | 97.1 | 88.1 | [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) | 61 | | Neural belief tracker (Mrkšić et al., 2017) | 96.5 | 84.4 | [Neural Belief Tracker: Data-Driven Dialogue State Tracking](https://arxiv.org/abs/1606.03777) | 62 | | RNN (Henderson et al., 2014) | 87.1 | 70.8 | [Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised gate](http://svr-ftp.eng.cam.ac.uk/~sjy/papers/htyo14.pdf) | 63 | 64 | 65 | ### MultiWOZ 66 | 67 | The [MultiWOZ dataset](https://arxiv.org/abs/1810.00278) is a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The dialogue are set between a tourist and a clerk in the information. It spans over 7 domains. 68 | 69 | 70 | | Model | Joint | Slot | Paper / Source | 71 | | ------------- | :-----:| :-----:| --- | 72 | | Ramadan et al. (2018) | 15.57 | 89.53 | [Large-Scale Multi-Domain Belief Tracking with Knowledge Sharing](https://www.aclweb.org/anthology/P18-2069) | 73 | | Zhong et al. (2018) | 35.57 | 95.44 | [Global-locally Self-attentive Dialogue State Tracker](https://arxiv.org/abs/1805.09655) | 74 | | Nouri and Hosseini-Asl (2019) | 36.27 | 98.42 | [Toward Scalable Neural Dialogue State Tracking Model](https://arxiv.org/pdf/1812.00899.pdf) | 75 | | Wu et al. (2019) |48.62 | 96.92| [Transferable Multi-Domain State Generator for Task-OrientedDialogue System](https://arxiv.org/pdf/1905.08743.pdf) | 76 | 77 | ## Retrieval-based Chatbot 78 | The main task retrieval-based chatbot is response selection, that aims to find correct responses from a pre-defined index. 79 | ### Ubuntu Corpus 80 | The [Ubuntu Corpus](https://arxiv.org/pdf/1506.08909.pdf) contains almost 1 million multi-turn dialogues from the Ubuntu Chat Logs. The task of Ubuntu Corpus is to select the correct response from 10 candidates (others are negatively sampled) by considering previous conversation history. You can find more details at [here](https://github.com/ryan-lowe/Ubuntu-Dialogue-Generationv2). The Evaluation metric is recall at position K in N candidates (Recall_N@K). 81 | 82 | | Model | R_2@1 | R_10@1 | Paper / Source | 83 | | ------------- | :---------: | :---------:|---------------| 84 | | DAM (Zhou et al. 
2018) | 93.8 | 76.7| [Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://www.aclweb.org/anthology/P18-1103) | 85 | | SMN (Wu et al. 2017) | 92.3 | 72.3| [Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots](https://arxiv.org/pdf/1612.01627.pdf) | 86 | | Multi-View (Zhou et al. 2017) | 90.8 | 66.2 | [Multi-view Response Selection for Human-Computer Conversation](https://aclweb.org/anthology/D16-1036) | 87 | | Bi-LSTM (Kadlec et al. 2015) | 89.5 | 63.0 | [Improved Deep Learning Baselines for Ubuntu Corpus Dialogs](https://arxiv.org/pdf/1510.03753.pdf) | 88 | 89 | ### Reddit Corpus 90 | The [Reddit Corpus](https://arxiv.org/abs/1904.06472) contains 726 million multi-turn dialogues from the Reddit board. Reddit is an American social news aggregation website, where users can post links, and take partin discussions on these post. The task of Reddit Corpus is to select the correct response from 100 candidates (others are negatively sampled) by considering previous conversation history. Models are evaluated with the Recall 1 at 100 metric (the 1-of-100 ranking accuracy). You can find more details at [here](https://github.com/PolyAI-LDN/conversational-datasets). 91 | 92 | | Model | R_1@100 | Paper / Source | 93 | | ------------- | :---------:|---------------| 94 | | PolyAI Encoder (Henderson et al. 2019) | 61.3 | [A Repository of Conversational Dataset](https://arxiv.org/pdf/1904.06472.pdf) | 95 | | USE (Cer et al. 2018) | 47.7 | [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) | 96 | | BERT (Devlin et al. 2017) | 24.0 | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | 97 | | ELMO (Peters et al. 2018) | 19.3 | [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) | 98 | 99 | ## Generative-based Chatbot 100 | The main task of generative-based chatbot is to generate consistent and engaging response given the context. 101 | ### Personalized Chit-chat 102 | 103 | The task of persinalized chit-chat dialogue generation is first proposed by [PersonaChat](https://arxiv.org/pdf/1801.07243.pdf). The motivation is to enhance the engagingness and consistency of chit-chat bots via endowing explicit personas to agents. Here the `persona` is defined as several profile natural language sentences like "I weight 300 pounds.". NIPS 2018 has hold a competition [The Conversational Intelligence Challenge 2 (ConvAI2)](http://convai.io/) based on the dataset. The Evaluation metric is F1, Hits@1 and ppl. F1 evaluates on the word-level, and Hits@1 represents the probability of the real next utterance ranking the highest according to the model, while ppl is perplexity for language modeling. The following results are reported on dev set (test set is still hidden), almost of them are borrowed from [ConvAI2 Leaderboard](https://github.com/DeepPavlov/convai/blob/master/leaderboards.md). 104 | 105 | | Model | F1 | Hits@1 | ppl | Paper / Source | Code | 106 | | ------------- | :---------: | :---------:| :--------: | ---------------| ------------- | 107 | | TransferTransfo (Thomas et al. 
2019) | 19.09 | 82.1 | 17.51 | [TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents](https://arxiv.org/pdf/1901.08149.pdf) | [Code](https://github.com/huggingface/transfer-learning-conv-ai) | 108 | | Lost In Conversation | 17.79 | - | 17.3 | [NIPS 2018 Workshop Presentation](http://convai.io/NeurIPSParticipantSlides.pptx) | [Code](https://github.com/atselousov/transformer_chatbot) | 109 | | Seq2Seq + Attention (Bahdanau et al. 2014) | 16.18 | 12.6 | 29.8 | [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/seq2seq) | 110 | | KV Profile Memory (Zhang et al. 2018) | 11.9 | 55.2 | - | [Personalizing Dialogue Agents: I have a dog, do you have pets too?](https://arxiv.org/pdf/1801.07243.pdf) | [Code](https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/kvmemnn) 111 | -------------------------------------------------------------------------------- /english/entity_linking.md: -------------------------------------------------------------------------------- 1 | # Entity Linking 2 | 3 | ## Task 4 | 5 | Entity Linking (EL) is the task of recognizing (cf. [Named Entity Recognition](named_entity_recognition.md)) and disambiguating (Named Entity Disambiguation) named entities to a knowledge base (e.g. Wikidata, DBpedia, or YAGO). It is sometimes also simply known as Named Entity Recognition and Disambiguation. 6 | 7 | EL can be split into two classes of approaches: 8 | * *End-to-End*: processing a piece of text to extract the entities (i.e. Named Entity Recognition) and then disambiguate these extracted entities to the correct entry in a given knowledge base (e.g. Wikidata, DBpedia, YAGO). 9 | * *Disambiguation-Only*: contrary to the first approach, this one directly takes gold standard named entities as input and only disambiguates them to the correct entry in a given knowledge base. 10 | 11 | Example: 12 | 13 | | Barack | Obama | was | born | in | Hawaii | 14 | | --- | ---| --- | --- | --- | --- | 15 | | https://en.wikipedia.org/wiki/Barack_Obama | https://en.wikipedia.org/wiki/Barack_Obama | O | O | O | https://en.wikipedia.org/wiki/Hawaii | 16 | 17 | More details can be found in this [survey](http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf). 18 | 19 | ## Current SOTA 20 | DeepType ([Raiman et al., 2018][Raiman]) is the current SOTA in Cross-lingual Entity Linking. The authors construct a type system and use it to constrain the outputs of a neural network to respect the symbolic structure. They achieve this by reformulating the design problem as a mixed integer problem: create a type system and subsequently train a neural network with it. They propose a 2-step algorithm: 1) heuristic search or stochastic optimization over discrete variables that define a type system 21 | informed by an Oracle and a Learnability heuristic, 2) gradient descent to fit classifier parameters. They apply DeepType to the problem of Entity Linking on three standard datasets (i.e. WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings, while explicitly using symbolic information lets it integrate new entities without retraining.
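To make the core idea more concrete (a symbolic type system restricting which candidates a neural linker may output), here is a minimal, illustrative sketch in Python. It is not the authors' implementation; the candidate list, scores, and type labels below are hypothetical, and DeepType's actual type-system search is considerably more involved.

```python
# Illustrative sketch only (not DeepType itself): prune and re-rank entity
# candidates for a mention using a type classifier's predictions, so the final
# choice respects the symbolic type system.
from typing import Dict, List, Tuple

def link_with_type_constraint(
    candidates: List[Tuple[str, float, str]],  # (entity_id, link_prior, type_label)
    type_probs: Dict[str, float],              # mention-level type classifier output
    min_type_prob: float = 0.05,
) -> str:
    """Return the candidate whose link prior, weighted by the probability of its
    type, is highest; candidates with incompatible types are pruned."""
    best_entity, best_score = None, float("-inf")
    for entity_id, prior, type_label in candidates:
        type_prob = type_probs.get(type_label, 0.0)
        if type_prob < min_type_prob:          # symbolic constraint: drop type-incompatible entities
            continue
        score = prior * type_prob              # combine neural and symbolic evidence
        if score > best_score:
            best_entity, best_score = entity_id, score
    return best_entity

# Hypothetical example: the mention "Washington" in a sentence about geography.
candidates = [("George_Washington", 0.6, "person"), ("Washington_(state)", 0.4, "place")]
print(link_with_type_constraint(candidates, {"person": 0.1, "place": 0.9}))  # Washington_(state)
```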
22 | 23 | ## Evaluation 24 | 25 | ### Metrics 26 | 27 | #### Disambiguation-Only Approach 28 | 29 | * Micro-Precision: Fraction of correctly disambiguated named entities in the full corpus. 30 | * Macro-Precision: Fraction of correctly disambiguated named entities, averaged by document. 31 | 32 | #### End-to-End Approach 33 | 34 | * Gerbil Micro-F1 - strong matching: micro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation. 35 | * Gerbil Macro-F1 - strong matching: macro InKB F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation. 36 | 37 | ### Datasets 38 | 39 | #### AIDA CoNLL-YAGO Dataset 40 | 41 | The [AIDA CoNLL-YAGO][AIDACoNLLYAGO] Dataset by [[Hoffart]](http://www.aclweb.org/anthology/D11-1072) contains assignments of entities to the mentions of named entities annotated for the original [[CoNLL]](http://www.aclweb.org/anthology/W03-0419.pdf) 2003 NER task. The entities are identified by a [YAGO2](http://yago-knowledge.org/) entity identifier, by a [Wikipedia URL](https://en.wikipedia.org/), or by a [Freebase mid](http://wiki.freebase.com/wiki/Machine_ID). 42 | 43 | ##### Disambiguation-Only Models 44 | 45 | | Model | Micro-Precision | Macro-Precision | Paper / Source | Code | 46 | | ------------- | :-----:| :----: | :----: | --- | 47 | | Raiman et al. (2018) | 94.88 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) | 48 | | Sil et al. (2018) | 94.0 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | | 49 | | Radhakrishnan et al. (2018) | 93.0 | 93.7 | [ELDEN: Improved Entity Linking using Densified Knowledge Graphs](http://aclweb.org/anthology/N18-1167) | | 50 | | Le et al. (2018) | 93.07 | - | [Improving Entity Linking by Modeling Latent Relations between Mentions](http://aclweb.org/anthology/P18-1148) | [Official](https://github.com/lephong/mulrel-nel) 51 | | Ganea and Hofmann (2017) | 92.22 | - | [Deep Joint Entity Disambiguation with Local Neural Attention](https://www.aclweb.org/anthology/D17-1277) | [Link](https://github.com/dalab/deep-ed) | 52 | | Hoffart et al. (2011) | 82.29 | 82.02 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) | | 53 | 54 | ##### End-to-End Models 55 | 56 | | Model | Micro-F1-strong | Macro-F1-strong | Paper / Source | Code | 57 | | ------------- | :-----:| :----: | :----: | --- | 58 | | Kolitsas et al. (2018) | 86.6 | 89.4 | [End-to-End Neural Entity Linking](https://arxiv.org/pdf/1808.07699.pdf) | [Official](https://github.com/dalab/end2end_neural_el) | 59 | | Piccinno et al. (2014) | 69.32 | 72.8 | [From TagME to WAT: a new entity annotator](https://dl.acm.org/citation.cfm?id=2634350) | | 60 | | Hoffart et al. (2011) | 68.8 | 72.4 | [Robust Disambiguation of Named Entities in Text](http://www.aclweb.org/anthology/D11-1072) | | 61 | 62 | #### TAC KBP English Entity Linking Comprehensive and Evaluation Data 2010 63 | 64 | The Knowledge Base Population (KBP) Track at [TAC 2010](https://tac.nist.gov/2010) explored the extraction of information about entities with reference to an external knowledge source. 
Using a basic schema for persons, organizations, and locations, nodes in an ontology had to be created and populated using unstructured information found in text. A collection of [Wikipedia Infoboxes](http://en.wikipedia.org/wiki/Help:Infobox) served as a rudimentary initial knowledge representation. You can download the dataset from [LDC](https://www.ldc.upenn.edu/) or [here](https://github.com/ChrisLeeJ/TAC_KBP_English_EL_2010). 65 | 66 | ##### Disambiguation-Only Models 67 | 68 | | Model | Micro-Precision | Macro-Precision | Paper / Source | Code | 69 | | ------------- | :-----:| :----: | :----: | --- | 70 | | Raiman et al. (2018) | 90.85 | - | [DeepType: Multilingual Entity Linking by Neural Type System Evolution](https://arxiv.org/pdf/1802.01021.pdf) | [Official](https://github.com/openai/deeptype) | 71 | | Sil et al. (2018) | 87.4 | - | [Neural Cross-Lingual Entity Linking](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101) | | 72 | | Yamada et al. (2016) | 85.2 | - | [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/pdf/1601.01343.pdf) | | 73 | 74 | ### Platforms 75 | 76 | Evaluating Entity Linking systems in a manner that allows for direct comparison of performance can be difficult. The precise definition of a "correct" annotation can be somewhat subjective and it is easy to make mistakes. To provide a simple example, given the input surface form **"Tom Waits"**, an evaluation dataset might record the DBpedia resource `http://dbpedia.org/resource/Tom_Waits` as the correct referent. Yet an annotation system which returns a reference to `http://dbpedia.org/resource/PEHDTSCKJBMA` has technically provided an appropriate annotation, as this resource is a redirect to `http://dbpedia.org/resource/Tom_Waits`. Alternatively, if evaluating an End-to-End EL system, accuracy with respect to word boundaries must be considered, e.g. if a system only annotates **"Obama"** with the URI `http://dbpedia.org/resource/Barack_Obama` in the surface form **"Barack Obama"**, is the system correct or incorrect in its annotation? 77 | 78 | Furthermore, the performance of an EL system can be strongly affected by the nature of the content on which the evaluation is performed, e.g. news content versus Tweets. Hence comparing the relative performance of two EL systems which have been tested on two different corpora can be fallacious. Rather than allowing these subjective points to creep into the evaluation of EL systems, it is better to make use of a standard evaluation platform where these assumptions are known and made explicit in the configuration of the experiment. 79 | 80 | [GERBIL][GERBIL], developed by [AKSW][AKSW], is an evaluation platform based on the [BAT framework][Cornolti]. It defines a number of standard experiments which may be run for any given EL service. These experiment types determine how strict the evaluation is with respect to measures such as word boundary alignment and also dictate how much responsibility is assigned to the EL service with respect to Entity Recognition, etc. GERBIL hosts 38 evaluation datasets obtained from a variety of different EL challenges. At present it also has hooks for 17 different EL services which may be included in an experiment. 81 | 82 | GERBIL may be used to test your own EL system either by downloading the source code and deploying GERBIL locally, or by making your service available on the web and giving GERBIL a link to your API endpoint. 
The only condition is that your API must accept input and respond with output in [NIF][NIF] format. It is also possible to upload your own evaluation dataset if you would like to test these services on your own content. Note the dataset must also be in NIF format. The [DBpedia Spotlight evaluation dataset][SpotlightEvaluation] is a good example of how to structure your content. 83 | 84 | GERBIL does have a number of shortcomings, the most notable of which are: 85 | 1. There is no way to view the annotations returned by each system you test. These are handled internally by GERBIL and then discarded. This can make it difficult to determine the source of error with an EL system. 86 | 2. There is no way to observe the candidate list considered for each surface form. This is, of course, a standard problem with any third party EL API, but if one is conducting a detailed investigation into the performance of an EL system, it is important to know if the source of error was the EL algorithm itself, or the candidate retrieval process which failed to identify the correct referent as a candidate. This was listed as an important consideration by [Hachey et al][Hachey]. 87 | 88 | Nevertheless, GERBIL is an excellent resource for standardising how EL systems are tested and compared. It is also a good starting point for anyone new to Entity Linking as it contains links to a wide variety of EL resources. For more information, see the research paper by [[Usbeck]](http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf). 89 | 90 | ## References 91 | 92 | [Hoffart] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. Robust Disambiguation of Named Entities in Text. EMNLP 2011. http://www.aclweb.org/anthology/D11-1072 93 | 94 | [CoNLL] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL 2003. http://www.aclweb.org/anthology/W03-0419.pdf 95 | 96 | [Usbeck] Usbeck et al. GERBIL - General Entity Annotator Benchmarking Framework. WWW 2015. 
http://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf 97 | 98 | [Go back to the README](../README.md) 99 | 100 | [Sil]: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16501/16101 "Neural Cross-Lingual Entity Linking" 101 | [Shen]: http://dbgroup.cs.tsinghua.edu.cn/wangjy/papers/TKDE14-entitylinking.pdf "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions" 102 | [AIDACoNLLYAGO]: https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/downloads/ "AIDA CoNLL-YAGO Dataset" 103 | [YAGO2]: http://yago-knowledge.org/ "YAGO2" 104 | [Wikipedia]: https://en.wikipedia.org/ "Wikipedia" 105 | [Freebase]: http://wiki.freebase.com/wiki/Machine_ID "Freebase" 106 | [Radhakrishnan]: http://aclweb.org/anthology/N18-1167 "ELDEN: Improved Entity Linking using Densified Knowledge Graphs" 107 | [Le]: https://arxiv.org/abs/1804.10637 108 | [NIF]: http://persistence.uni-leipzig.org/nlp2rdf/ "NLP Interchange Formt" 109 | [SpotlightEvaluation]: http://apps.yovisto.com/labs/ner-benchmarks/data/dbpedia-spotlight-nif.ttl "GERBIL DBpedia Spotlight Dataset" 110 | [Cornolti]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40749.pdf "A Framework for Benchmarking Entity-Annotation Systems" 111 | [GERBIL]: http://aksw.org/Projects/GERBIL.html "General Entity Annotator Benchmarking framework" 112 | [AKSW]: http://aksw.org/About.html "Agile Knowledge Engineering and Semantic Web" 113 | [Hachey]: http://benhachey.info/pubs/hachey-aij12-evaluating.pdf "Evaluating Entity Linking with Wikipedia" 114 | [Raiman]: https://arxiv.org/pdf/1802.01021.pdf "DeepType: Multilingual Entity Linking by Neural Type System Evolution" 115 | -------------------------------------------------------------------------------- /structured/export.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import pprint 4 | from typing import Dict, Tuple, List 5 | import re 6 | import sys 7 | import json 8 | 9 | 10 | def extract_dataset_desc_links(desc:List[str]) -> List: 11 | """ 12 | Extract all the links from the description of datasets 13 | 14 | :param desc: Lines of the description of the dataset 15 | :return: 16 | """ 17 | 18 | out = [] 19 | md = "".join(desc) 20 | 21 | md_links = re.findall("\\[.*\\]\\(.*\\)", md) 22 | 23 | for md_link in md_links: 24 | title, link = extract_title_and_link(md_link) 25 | out.append({ 26 | "title": title, 27 | "url": link, 28 | }) 29 | 30 | return out 31 | 32 | 33 | def sanitize_subdataset_name(name:str): 34 | """ 35 | Do some sanitization on automatically extracted subdataset name 36 | 37 | :param name: raw subdataset name line 38 | :return: 39 | """ 40 | 41 | name = name.replace("**", "") 42 | if name.endswith(":"): 43 | name = name[:-1] 44 | 45 | return name.strip() 46 | 47 | 48 | def extract_lines_before_tables(lines:List[str]): 49 | """ 50 | Extract the non-empty line before the table 51 | 52 | :param lines: a list of lines 53 | :return: 54 | """ 55 | 56 | out = [] 57 | 58 | before = None 59 | in_table = False 60 | for l in lines: 61 | if l.startswith("|") and not in_table: 62 | if before is not None: 63 | out.append(before) 64 | in_table = True 65 | elif in_table and not l.startswith("|"): 66 | in_table = False 67 | before = None 68 | if l.strip() != "": 69 | before = l.strip() 70 | elif l.strip() != "": 71 | before = l.strip() 72 | 73 | return out 74 | 75 | 76 | def handle_multiple_sota_table_exceptions(section:List[str], 
sota_tables:List[List[str]]): 77 | """ 78 | Manually handle the edge cases with dataset partitions 79 | 80 | These are not captured in a consistent format, so no unified approach is possible atm. 81 | 82 | :param section: The lines in this section 83 | :param sota_tables: The list of sota table lines 84 | :return: 85 | """ 86 | 87 | section_full = "".join(section) 88 | out = [] 89 | 90 | # Use the line before the table 91 | subdatasets = extract_lines_before_tables(section) 92 | subdatasets = [sanitize_subdataset_name(s) for s in subdatasets] 93 | 94 | # exceptions: 95 | if "hypernym discovery evaluation benchmark" in section_full: 96 | subdatasets = subdatasets[1:] 97 | 98 | if len(subdatasets) != len(sota_tables): 99 | print("ERROR parsing the subdataset SOTA tables", file=sys.stderr) 100 | print(sota_tables, file=sys.stderr) 101 | else: 102 | for i in range(len(subdatasets)): 103 | out.append({ 104 | "subdataset": subdatasets[i], 105 | "sota": extract_sota_table(sota_tables[i]) 106 | }) 107 | 108 | return out 109 | 110 | 111 | def extract_title_and_link(md_link:str) -> Tuple: 112 | """ 113 | Extract the anchor text and URL from a markdown link 114 | 115 | :param md_link: a string of ONLY the markdown link, e.g. "[google](http://google.com)" 116 | :return: e.g. the tuple (google, http://google.com) 117 | """ 118 | title = re.findall("^\\[(.*)\\]", md_link)[0].strip() 119 | link = re.findall("\\((.*)\\)$", md_link)[0].strip() 120 | 121 | return title, link 122 | 123 | 124 | def extract_model_name_and_author(md_name:str) -> Tuple: 125 | """ 126 | Extract the model name and author, if provided 127 | 128 | :param md_name: a string with the model name from the sota table 129 | :return: tuple (model_name, author_names) 130 | """ 131 | 132 | if ' (' in md_name and ')' in md_name: 133 | model_name = md_name.split(' (')[0] 134 | model_authors = md_name.split(' (')[1].split(')')[0] 135 | elif '(' in md_name and ')' in md_name: # only has author name 136 | model_name = None 137 | model_authors = md_name 138 | else: 139 | model_name = md_name 140 | model_authors = None 141 | 142 | return model_name, model_authors 143 | 144 | 145 | def extract_paper_title_and_link(paper_md:str) -> Tuple: 146 | """ 147 | Extract the title and link to the paper 148 | 149 | :param paper_md: markdown for the paper link 150 | :return: tuple (paper_title, paper_link) 151 | """ 152 | 153 | md_links = re.findall("\\[.*\\]\\(.*\\)", paper_md) 154 | 155 | if len(md_links) > 1: 156 | print("WARNING: Found multiple paper references: `%s`, using only the first..." 
% paper_md) 157 | if len(md_links) == 0: 158 | return None, None 159 | 160 | md_link = md_links[0] 161 | 162 | paper_title, paper_link = extract_title_and_link(md_link) 163 | return paper_title, paper_link 164 | 165 | 166 | def extract_code_links(code_md:str) -> List[Dict]: 167 | """ 168 | Extract the links to all code implementations 169 | 170 | :param code_md: 171 | :return: 172 | """ 173 | 174 | md_links = re.findall("\\[.*\\]\\(.*\\)", code_md) 175 | 176 | links = [] 177 | for md_link in md_links: 178 | t, l = extract_title_and_link(md_link) 179 | links.append({ 180 | "title": t, 181 | "url": l, 182 | }) 183 | 184 | return links 185 | 186 | 187 | def extract_sota_table(table_lines:List[str]) -> Dict: 188 | """ 189 | Parse a SOTA table out of lines in markdown 190 | 191 | :param table_lines: lines in the SOTA table 192 | :return: 193 | """ 194 | 195 | sota = {} 196 | 197 | header = table_lines[0] 198 | header_cols = [h.strip() for h in header.split("|") if h.strip()] 199 | cols_sanitized = [h.lower() for h in header_cols] 200 | cols_sanitized = [re.sub(" +", "", h).replace("**","") for h in cols_sanitized] 201 | 202 | # find the model name column (usually the first one) 203 | if "model" in cols_sanitized: 204 | model_inx = cols_sanitized.index("model") 205 | else: 206 | print("ERROR: Model name not found in this SOTA table, skipping...\n", file=sys.stderr) 207 | print("".join(table_lines), file=sys.stderr) 208 | return {} 209 | 210 | if "paper/source" in cols_sanitized: 211 | paper_inx = cols_sanitized.index("paper/source") 212 | elif "paper" in cols_sanitized: 213 | paper_inx = cols_sanitized.index("paper") 214 | else: 215 | print("ERROR: Paper reference not found in this SOTA table, skipping...\n", file=sys.stderr) 216 | print("".join(table_lines), file=sys.stderr) 217 | return {} 218 | 219 | if "code" in cols_sanitized: 220 | code_inx = cols_sanitized.index("code") 221 | else: 222 | code_inx = None 223 | 224 | metrics_inx = set(range(len(header_cols))) - set([model_inx, paper_inx, code_inx]) 225 | metrics_inx = sorted(list(metrics_inx)) 226 | 227 | metrics_names = [header_cols[i] for i in metrics_inx] 228 | 229 | sota["metrics"] = metrics_names 230 | sota["rows"] = [] 231 | 232 | min_cols = len(header_cols) 233 | 234 | # now parse the table rows 235 | rows = table_lines[2:] 236 | for row in rows: 237 | row_cols = [h.strip() for h in row.split("|")][1:] 238 | 239 | if len(row_cols) < min_cols: 240 | print("This row doesn't have enough columns, skipping: %s" % row, file=sys.stderr) 241 | continue 242 | 243 | # extract all the metrics 244 | metrics = {} 245 | for i in range(len(metrics_inx)): 246 | metrics[metrics_names[i]] = row_cols[metrics_inx[i]] 247 | 248 | # extract paper references 249 | paper_title, paper_link = extract_paper_title_and_link(row_cols[paper_inx]) 250 | 251 | # extract model_name and author 252 | model_name, model_author = extract_model_name_and_author(row_cols[model_inx]) 253 | 254 | sota_row = { 255 | "model_name": model_name, 256 | "metrics": metrics, 257 | } 258 | 259 | if paper_title is not None and paper_link is not None: 260 | sota_row["paper_title"] = paper_title 261 | sota_row["paper_url"] = paper_link 262 | 263 | # and code links if they exist 264 | if code_inx is not None: 265 | sota_row["code_links"] = extract_code_links(row_cols[code_inx]) 266 | 267 | sota["rows"].append(sota_row) 268 | 269 | return sota 270 | 271 | 272 | def get_line_no(sections:List[str], section_index:int, section_line=0) -> int: 273 | """ 274 | Get the line number for a section 
heading 275 | 276 | :param sections: A list of list of sections 277 | :param section_index: Index of the current section 278 | :param section_line: Index of the line within the section 279 | :return: 280 | """ 281 | if section_index == 0: 282 | return 1+section_line 283 | lens = [len(s) for s in sections[:section_index]] 284 | return sum(lens)+1+section_index 285 | 286 | 287 | def extract_dataset_desc_and_sota_table(md_lines:List[str]) -> Tuple: 288 | """ 289 | Extract the lines that are the description and lines that are the sota table(s) 290 | 291 | :param md_lines: a list of lines in this section 292 | :return: 293 | """ 294 | 295 | # Main assumption is that the Sota table will minimally have a "Model" column 296 | desc = [] 297 | tables = [] 298 | t = None 299 | in_table = False 300 | for l in md_lines: 301 | if l.startswith("|") and "model" in l.lower() and not in_table: 302 | t = [l] 303 | in_table = True 304 | elif in_table and l.startswith("|"): 305 | t.append(l) 306 | elif in_table and not l.startswith("|"): 307 | if t is not None: 308 | tables.append(t) 309 | t = None 310 | desc.append(l) 311 | in_table = False 312 | else: 313 | desc.append(l) 314 | 315 | if t is not None: 316 | tables.append(t) 317 | 318 | return desc, tables 319 | 320 | 321 | def parse_markdown_file(md_file:str) -> List: 322 | """ 323 | Parse a single markdown file 324 | 325 | :param md_file: path to the markdown file 326 | :return: 327 | """ 328 | 329 | with open(md_file, "r") as f: 330 | md_lines = f.readlines() 331 | 332 | # Assumptions: 333 | # 1) H1 are tasks 334 | # 2) Everything until the next heading is the task description 335 | # 3) H2 are subtasks, H3 are datasets, H4 are subdatasets 336 | 337 | # Algorithm: 338 | # 1) Split the document by headings 339 | 340 | sections = [] 341 | cur = [] 342 | for line in md_lines: 343 | if line.startswith("#"): 344 | if cur: 345 | sections.append(cur) 346 | cur = [line] 347 | else: 348 | cur = [line] 349 | else: 350 | cur.append(line) 351 | 352 | if cur: 353 | sections.append(cur) 354 | 355 | # 2) Parse each heading section one-by-one 356 | parsed_out = [] # whole parsed output 357 | t = {} # current task element being parsed 358 | st = None # current subtask being parsed 359 | ds = None # current dataset being parsed 360 | for section_index in range(len(sections)): 361 | section = sections[section_index] 362 | header = section[0] 363 | 364 | # Task definition 365 | if header.startswith("#") and not header.startswith("##"): 366 | if "task" in t: 367 | parsed_out.append(t) 368 | t = {} 369 | t["task"] = header[1:].strip() 370 | t["description"] = "".join(section[1:]).strip() 371 | 372 | # reset subtasks and datasets 373 | st = None 374 | ds = None 375 | 376 | ## Subtask definition 377 | if header.startswith("##") and not header.startswith("###"): 378 | if "task" not in t: 379 | print("ERROR: Unexpected subtask without a parent task at %s:#%d" % 380 | (md_file, get_line_no(sections, section_index)), file=sys.stderr) 381 | 382 | if "subtasks" not in t: 383 | t["subtasks"] = [] 384 | 385 | # new substask 386 | st = {} 387 | t["subtasks"].append(st) 388 | 389 | st["task"] = header[2:].strip() 390 | st["description"] = "".join(section[1:]).strip() 391 | st["source_link"] = { 392 | "title": "NLP-progress", 393 | "url": "https://github.com/sebastianruder/NLP-progress" 394 | } 395 | 396 | # reset the last dataset 397 | ds = None 398 | 399 | ### Dataset definition 400 | if header.startswith("###") and not header.startswith("####") and "Table of content" not in header: 401 | 
if "task" not in t: 402 | print("ERROR: Unexpected dataset without a parent task at %s:#%d" % 403 | (md_file, get_line_no(sections, section_index)), file=sys.stderr) 404 | 405 | if st is not None: 406 | # we are in a subtask, add everything here 407 | if "datasets" not in st: 408 | st["datasets"] = [] 409 | 410 | # new dataset and add 411 | ds = {} 412 | st["datasets"].append(ds) 413 | else: 414 | # we are in a task, add here 415 | if "datasets" not in t: 416 | t["datasets"] = [] 417 | 418 | ds = {} 419 | t["datasets"].append(ds) 420 | 421 | ds["dataset"] = header[3:].strip() 422 | # dataset description is everything that's not a table 423 | desc, tables = extract_dataset_desc_and_sota_table(section[1:]) 424 | ds["description"] = "".join(desc).strip() 425 | 426 | # see if there is an arxiv link in the first paragraph of the description 427 | dataset_links = extract_dataset_desc_links(desc) 428 | if dataset_links: 429 | ds["dataset_links"] = dataset_links 430 | 431 | if tables: 432 | if len(tables) > 1: 433 | ds["subdatasets"] = handle_multiple_sota_table_exceptions(section, tables) 434 | else: 435 | ds["sota"] = extract_sota_table(tables[0]) 436 | 437 | if t: 438 | t["source_link"] = { 439 | "title": "NLP-progress", 440 | "url": "https://github.com/sebastianruder/NLP-progress" 441 | } 442 | parsed_out.append(t) 443 | 444 | return parsed_out 445 | 446 | 447 | def parse_markdown_directory(path:str): 448 | """ 449 | Parse all markdown files in a directory 450 | 451 | :param path: Path to the directory 452 | :return: 453 | """ 454 | all_files = os.listdir(path) 455 | md_files = [f for f in all_files if f.endswith(".md")] 456 | 457 | out = [] 458 | for md_file in md_files: 459 | print("Processing `%s`..." % md_file) 460 | out.extend(parse_markdown_file(os.path.join(path, md_file))) 461 | 462 | return out 463 | 464 | 465 | if __name__ == "__main__": 466 | parser = argparse.ArgumentParser() 467 | parser.add_argument("paths", nargs="+", type=str, help="Files or directories to convert") 468 | parser.add_argument("--output", default="structured.json", type=str, help="Output JSON file name") 469 | 470 | args = parser.parse_args() 471 | 472 | out = [] 473 | for path in args.paths: 474 | if os.path.isdir(path): 475 | out.extend(parse_markdown_directory(path)) 476 | else: 477 | out.extend(parse_markdown_file(path)) 478 | 479 | with open(args.output, "w") as f: 480 | f.write(json.dumps(out, indent=2)) -------------------------------------------------------------------------------- /english/summarization.md: -------------------------------------------------------------------------------- 1 | # Summarization 2 | 3 | Summarization is the task of producing a shorter version of one or several documents that preserves most of the 4 | input's meaning. 5 | 6 | ### Warning: Evaluation Metrics 7 | 8 | For summarization, automatic metrics such as ROUGE and METEOR have serious limitations: 9 | 1. They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc. 10 | 2. To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express they same content as a reference without any lexical overlap. 11 | 3. Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as CNN/DailyMail and Gigaword provide only a single reference. 
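To illustrate the lexical-overlap limitation listed above, the sketch below computes ROUGE-1 F1 from clipped unigram counts. This is a simplified, illustrative reimplementation (the official ROUGE toolkit additionally handles stemming, stopword options, multiple references, and so on), and the example sentences are made up.

```python
# Simplified ROUGE-1 F1 from clipped unigram overlap (illustration only, not the
# official ROUGE toolkit). A faithful paraphrase with little word overlap scores low.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(count, ref[token]) for token, count in cand.items())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "a feline was sitting on a rug"))  # ~0.15
```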
12 | 13 | Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute. 14 | 15 | 16 | ### CNN / Daily Mail 17 | 18 | The [CNN / Daily Mail dataset](https://arxiv.org/abs/1506.03340) as processed by 19 | [Nallapati et al. (2016)](http://www.aclweb.org/anthology/K16-1028) has been used 20 | for evaluating summarization. The dataset contains online news articles (781 tokens 21 | on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). 22 | The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. 23 | Models are evaluated with full-length F1-scores of ROUGE-1, ROUGE-2, ROUGE-L, and METEOR (optional). 24 | 25 | #### Anonymized version 26 | 27 | The following models have been evaluated on the entitiy-anonymized version of the dataset introduced by [Nallapati et al. (2016)](http://www.aclweb.org/anthology/K16-1028). 28 | 29 | | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code | 30 | | --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- | 31 | | RNES w/o coherence (Wu and Hu, 2018) | 41.25 | 18.87 | 37.75 | - | [Learning to Extract Coherent Summary via Deep Reinforcement Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118) | | 32 | | SWAP-NET (Jadhav and Rajan, 2018) | 41.6 | 18.3 | 37.7 | - | [Extractive Summarization with SWAP-NET: Sentences and Words from Alternating Pointer Networks](http://aclweb.org/anthology/P18-1014) | | 33 | | HSASS (Al-Sabahi et al., 2018) | 42.3 | 17.8 | 37.6 | - | [A Hierarchical Structured Self-Attentive Model for Extractive Document Summarization (HSSAS)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8344797) | | 34 | | GAN (Liu et al., 2018) | 39.92 | 17.65 | 36.71 | - | [Generative Adversarial Network for Abstractive Text Summarization](https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16238/16492) | | 35 | | KIGN+Prediction-guide (Li et al., 2018) | 38.95| 17.12 | 35.68 | - | [Guiding Generation for Abstractive Text Summarization based on Key Information Guide Network](http://aclweb.org/anthology/N18-2009) | | 36 | | SummaRuNNer (Nallapati et al., 2017) | 39.6 | 16.2 | 35.3 | - | [SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents](https://arxiv.org/abs/1611.04230) | | 37 | | rnn-ext + abs + RL + rerank (Chen and Bansal, 2018) | 39.66 | 15.85 | 37.34 | - | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1063) | [Official](https://github.com/ChenRocks/fast_abs_rl) | 38 | | ML+RL, with intra-attention (Paulus et al., 2018) | 39.87 | 15.82 | 36.90 | - | [A Deep Reinforced Model for Abstractive Summarization](https://openreview.net/pdf?id=HkAClQgA-) | | 39 | | Lead-3 baseline (Nallapati et al., 2017) | 39.2 | 15.7 | 35.5 | - | [SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents](https://arxiv.org/abs/1611.04230) | | 40 | | ML+RL ROUGE+Novel, with LM (Kryscinski et al., 2018) | 40.02 | 15.53 | 37.44 | - | [Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207) | | 41 | | (Tan et al., 2017) | 38.1 | 13.9 | 34.0 | - | [Abstractive Document Summarization with a 
Graph-Based Attentional Neural Model](http://aclweb.org/anthology/P17-1108) | | 42 | | words-lvt2k-temp-att (Nallapti et al., 2016) | 35.46 | 13.30 | 32.65 | - | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | | 43 | 44 | #### Non-anonymized version 45 | 46 | The following models have been evaluated on the non-anonymized version of the dataset introduced by [See et al. (2017)](http://aclweb.org/anthology/P17-1099). 47 | 48 | | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR | Paper / Source | Code | 49 | | --------------- | :-----: | :-----: | :-----: | :----: | -------------- | ---- | 50 | | DCA (Celikyilmaz et al., 2018) | 41.69 | 19.47 | 37.92 | - | [Deep Communicating Agents for Abstractive Summarization](http://aclweb.org/anthology/N18-1150) | | 51 | | NeuSUM (Zhou et al., 2018) | 41.59 | 19.01 | 37.98 | - | [Neural Document Summarization by Jointly Learning to Score and Select Sentences](http://aclweb.org/anthology/P18-1061) | [Official](https://github.com/magic282/NeuSum) | 52 | | Latent (Zhang et al., 2018) | 41.05 | 18.77 | 37.54 | - | [Neural Latent Extractive Document Summarization](http://aclweb.org/anthology/D18-1088) | | 53 | | rnn-ext + RL (Chen and Bansal, 2018) | 41.47 | 18.72 | 37.76 | 22.35 | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1061) | [Official](https://github.com/chenrocks/fast_abs_rl) | 54 | | BanditSum (Dong et al., 2018) | 41.5 | 18.7 | 37.6 | - | [BANDITSUM: Extractive Summarization as a Contextual Bandit](https://aclweb.org/anthology/D18-1409) | [Official](https://github.com/yuedongP/BanditSum)| 55 | | Bottom-Up Summarization (Gehrmann et al., 2018) | 41.22 | 18.68 | 38.34 | - | [Bottom-Up Abstractive Summarization](https://arxiv.org/abs/1808.10792) | [Official](https://github.com/sebastianGehrmann/bottom-up-summary) | 56 | | REFRESH (Narayan et al., 2018) | 40.0 | 18.2 | 36.6 | - | [Ranking Sentences for Extractive Summarization with Reinforcement Learning](http://aclweb.org/anthology/N18-1158) | [Official](https://github.com/EdinburghNLP/Refresh) | 57 | | (Li et al., 2018a) | 41.54 | 18.18 | 36.47 | - | [Improving Neural Abstractive Document Summarization with Explicit Information Selection Modeling](http://aclweb.org/anthology/D18-1205) | | 58 | | (Li et al., 2018b) | 40.30 | 18.02 | 37.36 | - | [Improving Neural Abstractive Document Summarization with Structural Regularization](http://aclweb.org/anthology/D18-1441) | | 59 | | ROUGESal+Ent RL (Pasunuru and Bansal, 2018) | 40.43 | 18.00 | 37.10 | 20.02 | [Multi-Reward Reinforced Summarization with Saliency and Entailment](http://aclweb.org/anthology/N18-2102) | | 60 | | end2end w/ inconsistency loss (Hsu et al., 2018) | 40.68 | 17.97 | 37.13 | - | [A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss](http://aclweb.org/anthology/P18-1013) | | 61 | | RL + pg + cbdec (Jiang and Bansal, 2018) | 40.66 | 17.87 | 37.06 | 20.51 | [Closed-Book Training to Improve Summarization Encoder Memory](http://aclweb.org/anthology/D18-1440) | | 62 | | rnn-ext + abs + RL + rerank (Chen and Bansal, 2018) | 40.88 | 17.80 | 38.54 | 20.38 | [Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](http://aclweb.org/anthology/P18-1061) | [Official](https://github.com/chenrocks/fast_abs_rl) | 63 | | Lead-3 baseline (See et al., 2017) | 40.34 | 17.70 | 36.57 | 22.21 | [Get To The Point: Summarization with Pointer-Generator 
Networks](http://aclweb.org/anthology/P17-1099) | [Official](https://github.com/abisee/pointer-generator) | 64 | | Pointer + Coverage + EntailmentGen + QuestionGen (Guo et al., 2018) | 39.81 | 17.64 | 36.54 | 18.54 | [Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064) | | 65 | | ML+RL ROUGE+Novel, with LM (Kryscinski et al., 2018) | 40.19 | 17.38 | 37.52 | - | [Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207) | | 66 | | Pointer-generator + coverage (See et al., 2017) | 39.53 | 17.28 | 36.38 | 18.72 | [Get To The Point: Summarization with Pointer-Generator Networks](http://aclweb.org/anthology/P17-1099) | [Official](https://github.com/abisee/pointer-generator) | 67 | 68 | ### Gigaword 69 | 70 | The Gigaword summarization dataset has been first used by [Rush et al., 2015](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) and represents a sentence summarization / headline generation task with very short input documents (31.4 tokens) and summaries (8.3 tokens). It contains 3.8M training, 189k development and 1951 test instances. Models are evaluated with ROUGE-1, ROUGE-2 and ROUGE-L using full-length F1-scores. 71 | 72 | | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper / Source | Code | 73 | | --------------- | :-----: | :-----: | :-----: | -------------- | ---- | 74 | | Re^3 Sum (Cao et al., 2018) | 37.04 | 19.03 | 34.46 | [Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization](http://aclweb.org/anthology/P18-1015) | | 75 | | CGU (Lin et al., 2018) | 36.3 | 18.0 | 33.8 | [Global Encoding for Abstractive Summarization](http://aclweb.org/anthology/P18-2027) | [Official](https://www.github.com/lancopku/Global-Encoding) | 76 | | Pointer + Coverage + EntailmentGen + QuestionGen (Guo et al., 2018) | 35.98 | 17.76 | 33.63 | [Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064) | | 77 | | Struct+2Way+Word (Song et al., 2018) | 35.47 | 17.66 | 33.52 | [Structure-Infused Copy Mechanisms for Abstractive Summarization](http://aclweb.org/anthology/C18-1146) | [Official](https://github.com/KaiQiangSong/struct_infused_summ)| 78 | | FTSum_g (Cao et al., 2018) | 37.27 | 17.65 | 34.24 | [Faithful to the Original: Fact Aware Neural Abstractive Summarization](https://arxiv.org/pdf/1711.04434.pdf) | | 79 | | DRGD (Li et al., 2017) | 36.27 | 17.57 | 33.62 | [Deep Recurrent Generative Decoder for Abstractive Text Summarization](http://aclweb.org/anthology/D17-1222) | | 80 | | SEASS (Zhou et al., 2017) | 36.15 | 17.54 | 33.63 | [Selective Encoding for Abstractive Sentence Summarization](http://aclweb.org/anthology/P17-1101) | [Official](https://github.com/magic282/SEASS) | 81 | | EndDec+WFE (Suzuki and Nagata, 2017) | 36.30 | 17.31 | 33.88 | [Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization](http://aclweb.org/anthology/E17-2047) | | 82 | | Seq2seq + selective + MTL + ERAM (Li et al., 2018) | 35.33 | 17.27 | 33.19 | [Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization](http://aclweb.org/anthology/C18-1121) | | 83 | | Seq2seq + E2T_cnn (Amplayo et al., 2018) | 37.04 | 16.66 | 34.93 | [Entity Commonsense Representation for Neural Abstractive Summarization](http://aclweb.org/anthology/N18-1064) | | 84 | | RAS-Elman (Chopra et al., 2016) | 33.78 | 15.97 | 31.15 | [Abstractive Sentence Summarization with Attentive Recurrent Neural 
Networks](http://www.aclweb.org/anthology/N16-1012) | | 85 | | words-lvt5k-1sent (Nallapti et al., 2016) | 32.67 | 15.59 | 30.64 | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | | 86 | | ABS+ (Rush et al., 2015) | 29.76 | 11.88 | 26.96 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) * | | 87 | | ABS (Rush et al., 2015) | 29.55 | 11.32 | 26.42 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) * | | 88 | 89 | (*) [Rush et al., 2015](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) report ROUGE recall, the table here contains ROUGE F1-scores for Rush's model reported by [Chopra et al., 2016](http://www.aclweb.org/anthology/N16-1012) 90 | 91 | ### DUC 2004 Task 1 92 | 93 | Similar to Gigaword, task 1 of [DUC 2004](https://duc.nist.gov/duc2004/) is a sentence summarization task. The dataset contains 500 documents with on average 35.6 tokens and summaries with 10.4 tokens. Due to its size, neural models are typically trained on other datasets and only tested on DUC 2004. Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L recall @ 75 bytes. 94 | 95 | | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Paper / Source | Code | 96 | | --------------- | :-----: | :-----: | :-----: | -------------- | ---- | 97 | | Transformer + LRPE + PE + Re-ranking (Takase and Okazaki, 2019) | 32.29 | 11.49 | 28.03 | [Positional Encoding to Control Output Sequence Length](https://arxiv.org/abs/1904.07418) | [Official](https://github.com/takase/control-length) | 98 | | DRGD (Li et al., 2017) | 31.79 | 10.75 | 27.48 | [Deep Recurrent Generative Decoder for Abstractive Text Summarization](http://aclweb.org/anthology/D17-1222) | | 99 | | EndDec+WFE (Suzuki and Nagata, 2017) | 32.28 | 10.54 | 27.8 | [Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization](http://aclweb.org/anthology/E17-2047) | | 100 | | Seq2seq + selective + MTL + ERAM (Li et al., 2018) | 29.33 | 10.24 | 25.24 | [Ensure the Correctness of the Summary: Incorporate Entailment Knowledge into Abstractive Sentence Summarization](http://aclweb.org/anthology/C18-1121) | | 101 | | SEASS (Zhou et al., 2017) | 29.21 | 9.56 | 25.51 | [Selective Encoding for Abstractive Sentence Summarization](http://aclweb.org/anthology/P17-1101) | | 102 | | words-lvt5k-1sent (Nallapti et al., 2016) | 28.61 | 9.42 | 25.24 | [Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond](http://www.aclweb.org/anthology/K16-1028) | | 103 | | ABS+ (Rush et al., 2015) | 28.18 | 8.49 | 23.81 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) | | 104 | | RAS-Elman (Chopra et al., 2016) | 28.97 | 8.26 | 24.06 | [Abstractive Sentence Summarization with Attentive Recurrent Neural Networks](http://www.aclweb.org/anthology/N16-1012) | | 105 | | ABS (Rush et al., 2015) | 26.55 | 7.06 | 22.05 | [A Neural Attention Model for Sentence Summarization](https://www.aclweb.org/anthology/D/D15/D15-1044.pdf) | | 106 | 107 | ## Webis-TLDR-17 Corpus 108 | 109 | This [dataset](https://zenodo.org/record/1168855) contains 3 Million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization dataset from the social media domain. 
For more details, refer to [TL;DR: Mining Reddit to Learn Automatic Summarization](https://aclweb.org/anthology/W17-4508). 110 | 111 | ## Sentence Compression 112 | 113 | Sentence compression produces a shorter sentence by removing redundant information, 114 | preserving the grammaticality and the important content of the original sentence. 115 | 116 | ### Google Dataset 117 | 118 | The [Google Dataset](https://github.com/google-research-datasets/sentence-compression) was built by Filippova et al., 2013 ([Overcoming the Lack of Parallel Data in Sentence Compression](https://www.aclweb.org/anthology/D/D13/D13-1155.pdf)). The first release contained only 10,000 sentence-compression pairs, but an additional 200,000 pairs were released later. 119 | 120 | Example of a sentence-compression pair: 121 | > Sentence: Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, according to promoters Golden Boy 122 | 123 | > Compression: Floyd Mayweather is open to fighting Amir Khan in the future. 124 | 125 | In short, this is a deletion-based task where the compression is a subsequence of the original sentence. From the 10,000 pairs of the eval portion ([repository](https://github.com/google-research-datasets/sentence-compression/tree/master/data)), the first 1,000 sentences are used for automatic evaluation, while the 200,000 pairs are used for training. 126 | 127 | Models are evaluated using the following metrics (a minimal computation sketch is included at the end of this page): 128 | * F1 - the harmonic mean of recall and precision in terms of tokens kept in the golden and the generated compressions. 129 | * Compression rate (CR) - the length of the compression in characters divided by the length of the original sentence. 130 | 131 | | Model | F1 | CR | Paper / Source | Code | 132 | | ------------- | :-----:| --- | --- | --- | 133 | | BiRNN + LM Evaluator (Zhao et al. 2018) | 0.851 | 0.39 | [A Language Model based Evaluator for Sentence Compression](https://aclweb.org/anthology/P18-2028) | https://github.com/code4conference/code4sc | 134 | | LSTM (Filippova et al., 2015) | 0.82 | 0.38 | [Sentence Compression by Deletion with LSTMs](https://research.google.com/pubs/archive/43852.pdf) | | 135 | | BiLSTM (Wang et al., 2017) | 0.8 | 0.43 | [Can Syntax Help? Improving an LSTM-based Sentence Compression Model for New Domains](http://www.aclweb.org/anthology/P17-1127) | | 136 | 137 | [Go back to the README](../README.md) 138 | --------------------------------------------------------------------------------
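As referenced in the Sentence Compression section above, here is a minimal sketch of how the two sentence-compression metrics can be computed. It only illustrates the metric definitions given there, not the evaluation scripts used by the cited papers; the example strings come from the sentence-compression pair shown in that section.

```python
# Illustrative computation of the sentence-compression metrics described above:
# token-level F1 against the gold compression, and character-based compression rate.
from collections import Counter

def compression_f1(predicted: str, gold: str) -> float:
    pred, ref = Counter(predicted.split()), Counter(gold.split())
    kept = sum(min(count, ref[token]) for token, count in pred.items())  # tokens kept in both
    if kept == 0:
        return 0.0
    precision, recall = kept / sum(pred.values()), kept / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def compression_rate(compression: str, sentence: str) -> float:
    return len(compression) / len(sentence)  # fraction of characters kept (lower means shorter)

sentence = ("Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing "
            "the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, "
            "according to promoters Golden Boy")
gold = "Floyd Mayweather is open to fighting Amir Khan in the future."
print(compression_f1(gold, gold))                  # 1.0 for an exact match
print(round(compression_rate(gold, sentence), 2))  # ~0.32, about a third of the original length
```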