├── Hulth2003.zip
├── KDD.zip
├── Krapivin2009.zip
├── Marujo2012.zip
├── NLM500.zip
├── README.md
├── SemEval2010.zip
└── WWW.zip
/Hulth2003.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Hulth2003.zip
--------------------------------------------------------------------------------
/KDD.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/KDD.zip
--------------------------------------------------------------------------------
/Krapivin2009.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Krapivin2009.zip
--------------------------------------------------------------------------------
/Marujo2012.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/Marujo2012.zip
--------------------------------------------------------------------------------
/NLM500.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/NLM500.zip
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Keyword-Extraction-Datasets
2 |
3 | This repository contains seven annotated datasets for the automatic keyword extraction task. Each dataset consists of documents (.txt or .abstr) and their corresponding gold-standard keyword lists (.key or .uncontr). We used these datasets in our studies of supervised and unsupervised keyword extraction; our published works are linked below.
4 |
5 | 1. **sCAKE: Semantic Connectivity Aware Keyword Extraction**
6 |
7 | [DOI](https://doi.org/10.1016/j.ins.2018.10.034) [ScienceDirect](http://www.sciencedirect.com/science/article/pii/S0020025518308521) [arXiv](https://arxiv.org/pdf/1811.10831.pdf)
8 |
9 | 2. **Complex Network based Supervised Keyword Extractor.**
10 |
11 | [DOI](https://doi.org/10.1016/j.eswa.2019.112876) [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S095741741930586X) [arXiv](https://arxiv.org/pdf/1909.12009.pdf)
12 |
13 |
14 | Following are the datasets and the original papers which proposed them.
15 |
16 | 1. **Hulth2003**: Contains abstracts from *Inspec* dataset. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
17 | 2. **WWW** and **KDD**: CS abstracts from the WWW and KDD conferences. We kept only documents that contain at least two sentences and at least one gold-standard keyword. Originally downloaded from https://www.dropbox.com/s/3c57qar1b0xseob/kpshare.tgz?dl=0 (link no longer available). The full dataset can be downloaded from https://github.com/LIAAD/KeywordExtractor-Datasets/tree/master/datasets.
18 | 3. **Marujo2012**: News articles. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
19 | 4. **Krapivin2009**: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
20 | 5. **SemEval2010**: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
21 | 6. **NLM500**: PubMed documents. Originally downloaded from https://github.com/zelandiya/keyword-extraction-datasets. *Created for the abstractive KE task.*
22 |
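The pairing of document files (.txt or .abstr) with keyword files (.key or .uncontr) described at the top of this README can be sketched as a small loader. This is only a sketch under stated assumptions, not the authors' tooling: it assumes an archive has already been extracted, that a document and its keyword file share a base name and sit in the same directory, and that keyphrases are separated by semicolons or, failing that, by newlines. The names `load_pairs` and `read_keywords` are hypothetical.

```python
from pathlib import Path

DOC_EXTS = (".txt", ".abstr")    # document extensions named in this README
KEY_EXTS = (".key", ".uncontr")  # keyword-list extensions named in this README


def read_keywords(path):
    """Split a keyword file into a list of phrases.

    Assumption: phrases are separated by semicolons if any are present,
    otherwise by newlines. Internal whitespace is collapsed to one space.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    sep = ";" if ";" in text else "\n"
    phrases = (" ".join(part.split()) for part in text.split(sep))
    return [p for p in phrases if p]


def load_pairs(root):
    """Yield (doc_id, document_text, keywords) for each document that has a keyword file.

    Assumption: the keyword file has the same base name as the document
    and lives in the same directory.
    """
    for doc_path in sorted(Path(root).rglob("*")):
        if not doc_path.is_file() or doc_path.suffix not in DOC_EXTS:
            continue
        for ext in KEY_EXTS:
            key_path = doc_path.with_suffix(ext)
            if key_path.exists():
                text = doc_path.read_text(encoding="utf-8", errors="replace")
                yield doc_path.stem, text, read_keywords(key_path)
                break
```

For example, `for doc_id, text, keywords in load_pairs("Hulth2003"): ...` would iterate over one extracted dataset, provided the directory layout matches the assumptions above. Documents without a matching keyword file are skipped silently.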
23 | ## Dataset details and collection statistics
24 |
25 | | Dataset | \|D\| | Lavg | Navg | Kavg | KPavg| Description |
26 | | :--- | :---: | :---: | :---: | :---: | :---: | :--- |
27 | | Hulth2003 | 1500 | 129 | 23 | 10 | 90.07 | Abstracts from *Inspec* dataset |
28 | | WWW | 1248 | 174 | 9 | 5 | 64.97 | Abstracts from CS articles published in WWW conference |
29 | | KDD | 704 | 204 | 8 | 4 | 68.12 | Abstracts from CS articles published in KDD conference |
30 | | Marujo2012 | 450 | 427 | 69 | 48 | 99.31 | Online news articles |
31 | | Krapivin2009 | 2304 | 7961 | 11 | 5 | 96.91 | Full scientific articles from ACM |
32 | | SemEval2010 | 244 | 8085 | 34 | 16 | 95.89 | Full scientific articles from ACM, created for SemEval2010 Task 5 |
33 | | NLM500 | 500 | 4854 | 27 | 14 | 71.35 | Full papers from *PubMed* database |
34 |
35 | \|D\|: Number of documents.
36 | Lavg: Average document length, in words.
37 | Navg: Average gold-standard keywords (unigrams) assigned per document.
38 | Kavg: Average gold-standard keyphrases (*n*-grams) assigned per document.
39 | KPavg: Average percentage of keyphrases present in the text.
40 |
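The column definitions above can be illustrated with a short computation. This is a sketch, not the procedure used to produce the table: it assumes a simple lowercase alphanumeric tokenizer, counts a keyphrase as present when its token sequence occurs contiguously in the document (no stemming), and takes KPavg as the mean of per-document percentages. Navg is left out because it depends on how unigrams were counted. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens (assumption: the authors' tokenizer may differ)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def percent_present(text, keyphrases):
    """Percentage of keyphrases whose token sequence occurs contiguously in the text."""
    if not keyphrases:
        return 0.0
    # Pad with spaces so that matches align on token boundaries.
    padded = " " + " ".join(tokenize(text)) + " "
    hits = 0
    for phrase in keyphrases:
        phrase_tokens = tokenize(phrase)
        if phrase_tokens and " " + " ".join(phrase_tokens) + " " in padded:
            hits += 1
    return 100.0 * hits / len(keyphrases)


def collection_stats(docs):
    """Compute Lavg, Kavg and KPavg for a list of (text, keyphrases) pairs."""
    n = len(docs)
    if n == 0:
        raise ValueError("docs must not be empty")
    return {
        "Lavg": sum(len(tokenize(text)) for text, _ in docs) / n,
        "Kavg": sum(len(keys) for _, keys in docs) / n,
        # Assumption: average of per-document percentages, not a pooled ratio.
        "KPavg": sum(percent_present(text, keys) for text, keys in docs) / n,
    }
```

Because matching here is exact on tokens, values computed this way may differ from the table if the original statistics used stemming or a different tokenizer.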
41 | ## Citations
42 | The original papers for each dataset are cited below.
43 |
44 | ### Hulth2003
45 | ```tex
46 | @inproceedings{hulth2003improved,
47 | title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
48 | author = "Hulth, Anette",
49 | booktitle = "Proceedings of the 2003 Conference on EMNLP",
50 | pages = "216--223",
51 | year = "2003",
52 | organization = "ACL"
53 | }
54 | ```
55 |
56 | ### Krapivin2009
57 | ```tex
58 | @article{krapivin2009large,
59 | title = "Large Dataset for Keyphrases Extraction",
60 | author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
61 | journal = "Technical Report DISI-09-055",
62 | year = "2009",
63 | publisher = "University of Trento"
64 | }
65 | ```
66 |
67 | ### NLM500
68 | ```tex
69 | @inproceedings{aronson2000nlm,
70 | title = "The NLM Indexing Initiative",
71 | author = "Aronson and others",
72 | booktitle = "Proceedings of the AMIA Symposium",
73 | pages = "17",
74 | year = "2000",
75 | organization = "American Medical Informatics Association"
76 | }
77 | ```
78 |
79 | ### SemEval2010
80 | ```tex
81 | @inproceedings{kim2010semeval,
82 | title = "Semeval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
83 | author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
84 | booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
85 | pages = "21--26",
86 | year = "2010",
87 | organization = "Association for Computational Linguistics"
88 | }
89 | ```
90 |
91 | ### Marujo2012
92 | ```tex
93 | @inproceedings{marujo2012supervised,
94 | title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
95 |   author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
96 | booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
97 | year = "2012"
98 | }
99 | ```
100 |
101 | ### WWW and KDD
102 | ```tex
103 | @inproceedings{gollapalli2014extracting,
104 | title = "Extracting keyphrases from research papers using citation networks",
105 | author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
106 | booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
107 | year = "2014"
108 | }
109 | ```
110 |
--------------------------------------------------------------------------------
/SemEval2010.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/SemEval2010.zip
--------------------------------------------------------------------------------
/WWW.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SDuari/Keyword-Extraction-Datasets/5fa62602ac1b9dd40eceaa1003822f363ca22dab/WWW.zip
--------------------------------------------------------------------------------