├── Hulth2003.zip
├── KDD.zip
├── Krapivin2009.zip
├── Marujo2012.zip
├── NLM500.zip
├── README.md
├── SemEval2010.zip
└── WWW.zip

/Hulth2003.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Hulth2003.zip

--------------------------------------------------------------------------------
/KDD.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/KDD.zip

--------------------------------------------------------------------------------
/Krapivin2009.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Krapivin2009.zip

--------------------------------------------------------------------------------
/Marujo2012.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Marujo2012.zip

--------------------------------------------------------------------------------
/NLM500.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/NLM500.zip

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Keyword-Extraction-Datasets

This repository contains seven annotated datasets for the automatic keyword extraction task. In every dataset, each document (.txt or .abstr) is paired with its gold-standard keyword list (.key or .uncontr).
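
The pairing of documents and keyword lists described above can be sketched in Python. This is only an illustration, not part of the original study: the function name `pair_documents_with_keys` is hypothetical, and the code assumes that an archive has already been extracted and that a document and its keyword file sit in the same directory and share a file stem. The layout inside each zip is not documented here, so check it before relying on this.

```python
from pathlib import Path

# File extensions named in this README.
DOC_SUFFIXES = {".txt", ".abstr"}
KEY_SUFFIXES = {".key", ".uncontr"}


def pair_documents_with_keys(root):
    """Return (document_path, key_path) pairs found under an extracted dataset.

    Assumption (not confirmed by the README): a document and its keyword
    file live in the same directory and have the same file stem.
    """
    docs, keys = {}, {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ident = (path.parent, path.stem)
        if path.suffix in DOC_SUFFIXES:
            docs[ident] = path
        elif path.suffix in KEY_SUFFIXES:
            keys[ident] = path
    # Keep only documents that have a matching keyword file.
    return [(docs[i], keys[i]) for i in sorted(docs.keys() & keys.keys())]
```

Calling `pair_documents_with_keys("Hulth2003")` on an extracted archive would then yield the paths to read; how the keyword files are delimited may differ between datasets, so parsing is left out of the sketch.
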
These datasets were used in our study of supervised and unsupervised keyword extraction. Our published works are linked below.

1. **sCAKE: Semantic Connectivity Aware Keyword Extraction**

   [![DOI:10.1016/j.ins.2018.10.034](https://zenodo.org/badge/DOI/10.1016/j.ins.2018.10.034.svg)](https://doi.org/10.1016/j.ins.2018.10.034) [![Generic badge](https://img.shields.io/badge/Full%20Article-ScienceDirect-orange.svg)](http://www.sciencedirect.com/science/article/pii/S0020025518308521) [![Generic badge](https://img.shields.io/badge/Preprint-arXiv-orange.svg)](https://arxiv.org/pdf/1811.10831.pdf)

2. **Complex Network based Supervised Keyword Extractor.**

   [![DOI:10.1016/j.eswa.2019.112876](https://zenodo.org/badge/DOI/10.1016/j.eswa.2019.112876.svg)](https://doi.org/10.1016/j.eswa.2019.112876) [![Generic badge](https://img.shields.io/badge/Full%20Article-ScienceDirect-orange.svg)](https://www.sciencedirect.com/science/article/pii/S095741741930586X) [![Generic badge](https://img.shields.io/badge/Preprint-arXiv-orange.svg)](https://arxiv.org/pdf/1909.12009.pdf)


The following are the datasets and their original sources.

1. **Hulth2003**: Contains abstracts from the *Inspec* dataset. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
2. **WWW** and **KDD**: CS abstracts from the KDD and WWW conferences. We kept only those documents that contain at least two sentences and at least one gold-standard keyword. Originally downloaded from https://www.dropbox.com/s/3c57qar1b0xseob/kpshare.tgz?dl=0 (link no longer available). The full dataset can be downloaded from https://github.com/LIAAD/KeywordExtractor-Datasets/tree/master/datasets.
3. **Marujo2012**: News articles. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
4. **Krapivin2009**: ACM full papers.
Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
5. **SemEval2010**: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
6. **NLM500**: PubMed documents. Originally downloaded from https://github.com/zelandiya/keyword-extraction-datasets. *Created for the abstractive keyword extraction task.*

## Dataset details and collection statistics

| Dataset | \|D\| | Lavg | Navg | Kavg | KPavg | Description |
| :--- | :---: | :---: | :---: | :---: | :---: | :--- |
| Hulth2003 | 1500 | 129 | 23 | 10 | 90.07 | Abstracts from the *Inspec* dataset |
| WWW | 1248 | 174 | 9 | 5 | 64.97 | Abstracts from CS articles published in the WWW conference |
| KDD | 704 | 204 | 8 | 4 | 68.12 | Abstracts from CS articles published in the KDD conference |
| Marujo2012 | 450 | 427 | 69 | 48 | 99.31 | Online news articles |
| Krapivin2009 | 2304 | 7961 | 11 | 5 | 96.91 | Full scientific articles from ACM |
| SemEval2010 | 244 | 8085 | 34 | 16 | 95.89 | Full scientific articles from ACM, created for SemEval2010 Task 5 |
| NLM500 | 500 | 4854 | 27 | 14 | 71.35 | Full papers from the *PubMed* database |

\|D\|: Number of documents.
Lavg: Average document length, in words.
Navg: Average number of gold-standard keywords (unigrams) assigned per document.
Kavg: Average number of gold-standard keyphrases (*n*-grams) assigned per document.
KPavg: Average percentage of gold-standard keyphrases that are present in the document text.

## Citations:
The citations for the original papers are as follows.

### Hulth2003
```tex
@inproceedings{hulth2003improved,
  title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
  author = "Hulth, Anette",
  booktitle = "Proceedings of the 2003 Conference on EMNLP",
  pages = "216--223",
  year = "2003",
  organization = "ACL"
}
```

### Krapivin2009
```tex
@article{krapivin2009large,
  title = "Large Dataset for Keyphrases Extraction",
  author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
  journal = "Technical Report DISI-09-055",
  year = "2009",
  publisher = "University of Trento"
}
```

### NLM500
```tex
@inproceedings{aronson2000nlm,
  title = "The NLM Indexing Initiative",
  author = "Aronson and others",
  booktitle = "Proceedings of the AMIA Symposium",
  pages = "17",
  year = "2000",
  organization = "American Medical Informatics Association"
}
```

### SemEval2010
```tex
@inproceedings{kim2010semeval,
  title = "Semeval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
  author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
  booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
  pages = "21--26",
  year = "2010",
  organization = "Association for Computational Linguistics"
}
```

### Marujo2012
```tex
@inproceedings{marujo2012supervised,
  title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
  author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
  booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
  year = "2012"
}
```

### WWW and KDD
```tex
@inproceedings{gollapalli2014extracting,
  title = "Extracting keyphrases from research papers using citation networks",
  author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
  booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
  year = "2014"
}
```

--------------------------------------------------------------------------------
/SemEval2010.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/SemEval2010.zip

--------------------------------------------------------------------------------
/WWW.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/WWW.zip

--------------------------------------------------------------------------------