├── Hulth2003.zip
├── KDD.zip
├── Krapivin2009.zip
├── Marujo2012.zip
├── NLM500.zip
├── README.md
├── SemEval2010.zip
└── WWW.zip


/Hulth2003.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Hulth2003.zip


--------------------------------------------------------------------------------
/KDD.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/KDD.zip


--------------------------------------------------------------------------------
/Krapivin2009.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Krapivin2009.zip


--------------------------------------------------------------------------------
/Marujo2012.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Marujo2012.zip


--------------------------------------------------------------------------------
/NLM500.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/NLM500.zip


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Keyword-Extraction-Datasets
  2 | 
  3 | This repository contains seven annotated datasets for the automatic keyword extraction task. Each dataset consists of documents (.txt or .abstr) and their corresponding gold-standard keyword lists (.key or .uncontr). We used these datasets in our studies of supervised and unsupervised keyword extraction. Our published works are linked below.
  4 | 
  5 | 1. **sCAKE: Semantic Connectivity Aware Keyword Extraction**
  6 | 
  7 | [![DOI:10.1016/j.ins.2018.10.034](https://zenodo.org/badge/DOI/10.1016/j.ins.2018.10.034.svg)](https://doi.org/10.1016/j.ins.2018.10.034) [![Generic badge](https://img.shields.io/badge/Full%20Article-ScienceDirect-orange.svg)](http://www.sciencedirect.com/science/article/pii/S0020025518308521) [![Generic badge](https://img.shields.io/badge/Preprint-arXiv-orange.svg)](https://arxiv.org/pdf/1811.10831.pdf)
  8 | 
  9 | 2. **Complex Network based Supervised Keyword Extractor.**
 10 | 
 11 | [![DOI:10.1016/j.eswa.2019.112876](https://zenodo.org/badge/DOI/10.1016/j.eswa.2019.112876.svg)](https://doi.org/10.1016/j.eswa.2019.112876) [![Generic badge](https://img.shields.io/badge/Full%20Article-ScienceDirect-orange.svg)](https://www.sciencedirect.com/science/article/pii/S095741741930586X) [![Generic badge](https://img.shields.io/badge/Preprint-arXiv-orange.svg)](https://arxiv.org/pdf/1909.12009.pdf)
 12 | 
 13 | 
 14 | The datasets are listed below, along with the sources from which they were originally downloaded. The papers that proposed them are cited at the end of this README.
 15 | 
 16 | 1. **Hulth2003**: Contains abstracts from *Inspec* dataset. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
 17 | 2. **WWW** and **KDD**: CS abstracts from the KDD and WWW conferences. We kept only those documents that contain at least two sentences and at least one gold-standard keyword. Originally downloaded from https://www.dropbox.com/s/3c57qar1b0xseob/kpshare.tgz?dl=0 (link no longer available). The full dataset can be downloaded from https://github.com/LIAAD/KeywordExtractor-Datasets/tree/master/datasets.
 18 | 3. **Marujo2012**: News articles. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
 19 | 4. **Krapivin2009**: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
 20 | 5. **SemEval2010**: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
 21 | 6. **NLM500**: PubMed documents. Originally downloaded from https://github.com/zelandiya/keyword-extraction-datasets. *Created for the abstractive KE task.*
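
Once an archive is extracted, each document has to be matched with its gold-standard keyword file. The sketch below pairs files by shared stem, using the extensions named in the introduction (.txt/.abstr for documents, .key/.uncontr for keywords). The function name `pair_documents` and the assumption that a document and its keyword file share a file stem are illustrative and not confirmed by this repository, so check the extracted layout before relying on it.

```python
from pathlib import Path

# Extensions named in this README; the stem-matching rule is an assumption.
DOC_EXTS = {".txt", ".abstr"}
KEY_EXTS = {".key", ".uncontr"}


def pair_documents(root):
    """Map each file stem to (document_path, keyword_path) when both exist.

    Hypothetical helper: assumes stems are unique within `root`; if two
    files share a stem and a role, the one visited last wins.
    """
    docs, keys = {}, {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in DOC_EXTS:
            docs[path.stem] = path
        elif suffix in KEY_EXTS:
            keys[path.stem] = path
    # Keep only stems that have both a document and a keyword file.
    shared = sorted(docs.keys() & keys.keys())
    return {stem: (docs[stem], keys[stem]) for stem in shared}
```

Documents without a keyword file (or the reverse) are dropped silently, which makes it easy to compare `len(pair_documents(...))` against the document counts reported in the statistics table.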
 22 | 
 23 | ## Dataset details and collection statistics
 24 | 
 25 | | Dataset | \|D\| | L<sub>avg</sub> | N<sub>avg</sub> | K<sub>avg</sub> | KP<sub>avg</sub>| Description |
 26 | | :---         |     :---:      |     :---:      |     :---:      |     :---:      |     :---:      |          :--- |
 27 | | Hulth2003   | 1500 |  129   | 23 | 10 | 90.07 | Abstracts from *Inspec* dataset
 28 | | WWW     | 1248 |  174 | 9 | 5 | 64.97 | Abstracts from CS articles published in the WWW conference
 29 | | KDD    | 704 | 204 | 8 | 4 | 68.12 | Abstracts from CS articles published in the KDD conference
 30 | | Marujo2012     | 450 | 427 | 69 | 48 | 99.31 | Online news articles
 31 | | Krapivin2009     | 2304 | 7961 | 11 | 5 | 96.91 | Full scientific articles from ACM
 32 | | SemEval2010     | 244 |  8085 | 34 | 16 | 95.89 | Full scientific articles from ACM, created for SemEval2010 Task 5
 33 | | NLM500     | 500 |  4854  | 27 | 14 | 71.35 | Full papers from *PubMed* database
 34 | 
 35 | \|D\|: Number of documents.
 36 | L<sub>avg</sub>: Average document length, in words.
 37 | N<sub>avg</sub>: Average number of gold-standard keywords (unigrams) assigned per document.
 38 | K<sub>avg</sub>: Average number of gold-standard keyphrases (*n*-grams) assigned per document.
 39 | KP<sub>avg</sub>: Average percentage of gold-standard keyphrases present in the document text.
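
A statistic like KP<sub>avg</sub> can be approximated as follows. This is a minimal sketch under stated assumptions, not the procedure used to produce the table: it treats a keyphrase as present when its lowercased tokens occur as a contiguous sequence in the lowercased document, and it averages the per-document percentages. The function names `present_percentage` and `kp_avg` are hypothetical, and the original matching rules (for example, whether stemming was applied) are not specified in this README.

```python
import re


def _normalize(s):
    """Lowercase and reduce to word tokens joined by single spaces."""
    return " ".join(re.findall(r"\w+", s.lower()))


def present_percentage(text, keyphrases):
    """Percentage of keyphrases found in the text as a whole-token sequence.

    Assumption: exact token match after lowercasing, with no stemming.
    """
    phrases = [p for p in map(_normalize, keyphrases) if p]
    if not phrases:
        return 0.0
    # Pad with spaces so matches are aligned to token boundaries.
    padded = f" {_normalize(text)} "
    hits = sum(1 for p in phrases if f" {p} " in padded)
    return 100.0 * hits / len(phrases)


def kp_avg(corpus):
    """Mean per-document percentage; corpus yields (text, keyphrases) pairs."""
    scores = [present_percentage(text, kps) for text, kps in corpus]
    return sum(scores) / len(scores) if scores else 0.0
```

Because of the space padding, a keyphrase such as "graph" matches the token "graph" but not the substring inside "graphs", so results can differ noticeably from a stemmed comparison.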
 40 | 
 41 | ## Citations
 42 | Citations for the original papers are given below.
 43 | 
 44 | ### Hulth2003
 45 | ```tex
 46 | @inproceedings{hulth2003improved,
 47 | title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
 48 | author = "Hulth, Anette",
 49 | booktitle = "Proceedings of the 2003 Conference on EMNLP",
 50 | pages = "216--223",
 51 | year = "2003",
 52 | organization = "ACL"
 53 | }
 54 | ```
 55 | 
 56 | ### Krapivin2009
 57 | ```tex 
 58 | @article{krapivin2009large,
 59 | title = "Large Dataset for Keyphrases Extraction",
 60 | author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
 61 | journal = "Technical Report DISI-09-055",
 62 | year = "2009",
 63 | publisher = "University of Trento"
 64 | }
 65 | ```
 66 | 
 67 | ### NLM500
 68 | ```tex 
 69 | @inproceedings{aronson2000nlm,
 70 | title = "The NLM Indexing Initiative",
 71 | author = "Aronson and others",
 72 | booktitle = "Proceedings of the AMIA Symposium",
 73 | pages = "17",
 74 | year = "2000",
 75 | organization = "American Medical Informatics Association"
 76 | }
 77 | ```
 78 | 
 79 | ### SemEval2010
 80 | ```tex
 81 | @inproceedings{kim2010semeval,
 82 | title = "Semeval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
 83 | author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
 84 | booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
 85 | pages = "21--26",
 86 | year = "2010",
 87 | organization = "Association for Computational Linguistics"
 88 | }
 89 | ```
 90 | 
 91 | ### Marujo2012
 92 | ```tex
 93 | @inproceedings{marujo2012supervised,
 94 | title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
 95 | author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
 96 | booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
 97 | year = "2012"
 98 | }
 99 | ```
100 | 
101 | ### WWW and KDD
102 | ```tex
103 | @inproceedings{gollapalli2014extracting,
104 | title = "Extracting keyphrases from research papers using citation networks",
105 | author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
106 | booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
107 | year = "2014"
108 | }
109 | ```
110 | 


--------------------------------------------------------------------------------
/SemEval2010.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/SemEval2010.zip


--------------------------------------------------------------------------------
/WWW.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/WWW.zip


--------------------------------------------------------------------------------