├── README.md
└── datasets
    ├── 110-PT-BN-KP.zip
    ├── 500N-KPCrowd-v1.1.zip
    ├── Inspec.zip
    ├── Krapivin2009.zip
    ├── Nguyen2007.zip
    ├── PubMed.zip
    ├── Schutz2008.zip
    ├── SemEval2010.zip
    ├── SemEval2017.zip
    ├── WikiNews.zip
    ├── cacic.zip
    ├── citeulike180.zip
    ├── fao30.zip
    ├── fao780.zip
    ├── kdd.zip
    ├── pak2018.zip
    ├── theses100.zip
    ├── wicc.zip
    ├── wiki20.zip
    └── www.zip

/README.md:
--------------------------------------------------------------------------------
Datasets of Automatic Keyphrase Extraction
============================================

This repository contains 20 annotated datasets for Automatic Keyphrase Extraction made available by the research community. The datasets and the original papers that proposed them are listed below. If you know of other datasets and want to contribute, please notify me. You might also want to have a look at [Florian Boudin](https://github.com/boudinfl/ake-datasets)'s keyphrase extraction repository.


| Dataset                         | Language | Type of Doc     | Domain        | #Docs | #Gold Keys (per doc) | #Tokens per doc | Absent GoldKey |
| ------------------------------- | -------- | --------------- | ------------- | ----- | -------------------- | --------------- | -------------- |
| [__110-PT-BN-KP__](#110)        | PT       | News            | Misc.         | 110   | 2610 (23.73)         | 304.00          | 2.5%           |
| [__500N-KPCrowd-v1.1__](#500)   | EN       | News            | Misc.         | 500   | 24459 (48.92)        | 408.33          | 13.5%          |
| [__Inspec__](#Inspec)           | EN       | Abstract        | Comp. Science | 2000  | 29230 (14.62)        | 128.20          | 37.7%          |
| [__Krapivin2009__](#Krapivin)   | EN       | Paper           | Comp. Science | 2304  | 14599 (6.34)         | 8040.74         | 15.3%          |
| [__Nguyen2007__](#Nguyen)       | EN       | Paper           | Comp. Science | 209   | 2369 (11.33)         | 5201.09         | 17.8%          |
| [__PubMed__](#PubMed)           | EN       | Paper           | Comp. Science | 500   | 7620 (15.24)         | 3992.78         | 60.2%          |
| [__Schutz2008__](#Schutz)       | EN       | Paper           | Comp. Science | 1231  | 55013 (44.69)        | 3901.31         | 13.6%          |
| [__SemEval2010__](#SemEval2010) | EN       | Paper           | Comp. Science | 243   | 4002 (16.47)         | 8332.34         | 11.3%          |
| [__SemEval2017__](#SemEval2017) | EN       | Paragraph       | Misc.         | 493   | 8969 (18.19)         | 178.22          | 0.0%           |
| [__WikiNews__](#WikiNews)       | FR       | News            | Misc.         | 100   | 1177 (11.77)         | 293.52          | 5.0%           |
| [__cacic__](#cacic)             | ES       | Paper           | Comp. Science | 888   | 4282 (4.82)          | 3985.84         | 2.2%           |
| [__citeulike180__](#citeulike)  | EN       | Paper           | Misc.         | 183   | 3370 (18.42)         | 4796.08         | 32.2%          |
| [__fao30__](#fao30)             | EN       | Paper           | Agriculture   | 30    | 997 (33.23)          | 4777.70         | 41.7%          |
| [__fao780__](#fao780)           | EN       | Paper           | Agriculture   | 779   | 6990 (8.97)          | 4971.79         | 36.1%          |
| [__kdd__](#kdd)                 | EN       | Paper           | Comp. Science | 755   | 3831 (5.07)          | 75.97           | 53.2%          |
| [__pak2018__](#pak)             | PL       | Abstract        | Misc.         | 50    | 232 (4.64)           | 97.36           | 64.7%          |
| [__theses100__](#theses)        | EN       | MSc/PhD Thesis  | Misc.         | 100   | 767 (7.67)           | 4728.86         | 47.6%          |
| [__wicc__](#wicc)               | ES       | Paper           | Comp. Science | 1640  | 7498 (4.57)          | 1955.56         | 2.7%           |
| [__wiki20__](#wiki20)           | EN       | Research Report | Comp. Science | 20    | 730 (36.50)          | 6177.65         | 51.8%          |
| [__www__](#www)                 | EN       | Paper           | Comp. Science | 1330  | 7711 (5.80)          | 84.08           | 55.0%          |

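
The figure in parentheses in the `#Gold Keys (per doc)` column appears to be the total number of gold keys divided by `#Docs`. A minimal Python sketch that checks this on a few rows copied from the table (the `rows` layout and the `keys_per_doc` helper are illustrative names, not part of the repository):

```python
# Rows copied from the table: (dataset, #docs, total gold keys, reported per-doc value).
rows = [
    ("110-PT-BN-KP", 110, 2610, 23.73),
    ("Krapivin2009", 2304, 14599, 6.34),
    ("SemEval2010", 243, 4002, 16.47),
    ("www", 1330, 7711, 5.80),
]


def keys_per_doc(total_keys: int, n_docs: int) -> float:
    """Average number of gold keys per document, rounded to two decimals."""
    return round(total_keys / n_docs, 2)


for name, n_docs, total_keys, reported in rows:
    computed = keys_per_doc(total_keys, n_docs)
    print(f"{name}: computed={computed:.2f} reported={reported:.2f}")
```

For these four rows the computed value matches the reported one, which supports reading the column as "total (average per document)".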


### 110-PT-BN-KP

**Dataset**: [110-PT-BN-KP](datasets/110-PT-BN-KP.zip)

**Cite**: [Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization](https://arxiv.org/abs/1306.4886)

**Description**: 110-PT-BN-KP is a TV Broadcast News (BN) dataset that contains 110 transcribed text documents from 8 broadcast news programs in the European Portuguese ALERT BN database, covering politics, sports, finance, and other broadcast news. After speech-to-text transcription, each news story was manually re-examined to fix any segmentation errors, and the gold keywords were created by asking one tagger to extract all keywords that summarize the document content.

---


### 500N-KPCrowd-v1.1

**Dataset**: [500N-KPCrowd-v1.1](datasets/500N-KPCrowd-v1.1.zip)

**Cite**: [Keyphrase cloud generation of broadcast news](https://arxiv.org/abs/1306.4606)

**Description**: 500N-KPCrowd-v1.1 is a broadcast news transcription dataset. It consists of 500 English broadcast news stories from 10 different categories (art and culture; business; crime; fashion; health; politics us; politics world; science; sports; technology), with 50 docs per category. The ground truth was built using Amazon’s Mechanical Turk service to recruit and manage taggers. Multiple annotators were required to look at the same news story and assign a set of keywords from the text itself. The final ground truth consists of the keywords selected by at least 90% of the taggers.
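
The 90% agreement rule described for 500N-KPCrowd-v1.1 can be sketched as follows. This is an illustration only, not the authors' pipeline: the function name `consensus_keywords`, the input layout (one set of keywords per tagger), and the toy data are all assumptions.

```python
from collections import Counter


def consensus_keywords(annotations, min_agreement=0.9):
    """Return the keywords chosen by at least `min_agreement` of the taggers.

    `annotations` holds one collection of keywords per tagger (assumed layout).
    """
    if not annotations:
        return set()
    # Count each keyword at most once per tagger.
    counts = Counter(kw for tagger_keywords in annotations for kw in set(tagger_keywords))
    n_taggers = len(annotations)
    return {kw for kw, count in counts.items() if count / n_taggers >= min_agreement}


# Toy example with 10 hypothetical taggers.
annotations = [{"election", "senate"}] * 9 + [{"election", "budget"}]
print(sorted(consensus_keywords(annotations)))
```

In the toy data, "election" is picked by all ten taggers and "senate" by nine, so both pass the threshold, while "budget" (one tagger) is dropped.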

---


### cacic

**Dataset**: [cacic](datasets/cacic.zip)

**Cite**: [Keyword Identification in Spanish Documents using Neural Networks](http://sedici.unlp.edu.ar/handle/10915/50087)

**Description**: The cacic collection is a Spanish dataset of 888 scientific papers published between 2005 and 2013 at the Argentine Congress of Computer Science [CACIC](http://redunci.info.unlp.edu.ar/cacic.html).

---


### citeulike180

**Dataset**: [citeulike180](datasets/citeulike180.zip)

**Cite**: [Human-competitive Tagging Using Automatic Keyphrase Extraction](https://dl.acm.org/citation.cfm?id=1699678)

**Description**: The citeulike180 dataset is based on the CiteULike.org platform, which organizes academic citations. The CiteULike.org corpus is a freely available full-text paper collection that contains information about the users who tagged the documents. The dataset is based on a subset of CiteULike.org containing documents that have been indexed with at least three keywords on which at least two users have agreed. In addition to filtering the document set, the dataset only considers annotators who have at least two co-annotators tagging the same document. The result is a set of 180 documents indexed by 332 taggers, where most documents are related to bioinformatics.

---


### fao30

**Dataset**: [fao30](datasets/fao30.zip)

**Cite**: [Domain‐independent automatic keyphrase indexing with small training sets](https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.20790)

**Description**: fao30 is one of two datasets of agricultural documents from the Food and Agriculture Organization (FAO) of the United Nations, and contains 30 documents. The full-text documents were randomly selected from the FAO’s repository, and the keywords were manually assigned by six professional annotators at FAO.

---


### fao780

**Dataset**: [fao780](datasets/fao780.zip)

**Cite**: [Domain‐independent automatic keyphrase indexing with small training sets](https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.20790)

**Description**: fao780 is one of two datasets of agricultural documents from the Food and Agriculture Organization (FAO) of the United Nations, and contains 780 documents. The full-text documents were randomly selected from the FAO’s repository, and the keywords were manually tagged by professional FAO staff with terms from the Agrovoc thesaurus.

---


### Inspec

**Dataset**: [Inspec](datasets/Inspec.zip)

**Cite**: [Improved automatic keyword extraction given more linguistic knowledge](https://dl.acm.org/citation.cfm?id=1119383)

**Description**: Inspec consists of 2,000 abstracts of scientific journal papers in Computer Science collected between 1998 and 2002. Each document has two sets of keywords assigned: the controlled keywords, which are manually assigned keywords that appear in the Inspec thesaurus but may not appear in the document, and the uncontrolled keywords, which are freely assigned by the editors, i.e., not restricted to the thesaurus or to the document. In our repository, we use the union of both sets as the ground truth.
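
Merging the two Inspec keyword sets into one gold set amounts to a set union. A minimal sketch, assuming the keywords are available as lists of strings and that light normalization (lowercasing, collapsing whitespace) is wanted; the function name and the example keywords are hypothetical, and the repository's actual preprocessing may differ:

```python
def normalize(keyword: str) -> str:
    """Lowercase a keyword and collapse runs of whitespace (assumed normalization)."""
    return " ".join(keyword.lower().split())


def merge_gold_keywords(controlled, uncontrolled):
    """Union of controlled and uncontrolled keywords, returned as a sorted list."""
    merged = {normalize(kw) for kw in controlled} | {normalize(kw) for kw in uncontrolled}
    return sorted(merged)


# Hypothetical keywords for a single abstract.
controlled = ["Information retrieval", "text analysis"]
uncontrolled = ["keyword extraction", "Text  analysis"]
print(merge_gold_keywords(controlled, uncontrolled))
```

Because the result is a set union, a keyword that appears in both lists is counted once.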

---


### kdd

**Dataset**: [kdd](datasets/kdd.zip)

**Cite**: [Extracting Keyphrases from Research Papers using Citation Networks](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8662/8618)

**Description**: The KDD collection consists of 755 abstracts of papers from the ACM Conference on Knowledge Discovery and Data Mining (KDD) published between 2004 and 2014. The gold keywords of these papers are the author-labeled terms.

---


### Krapivin2009

**Dataset**: [Krapivin2009](datasets/Krapivin2009.zip)

**Cite**: [Large dataset for keyphrases extraction](http://eprints.biblio.unitn.it/1671/)

**Description**: Krapivin2009 is the largest dataset in terms of number of documents, with 2,304 full papers from the Computer Science domain, published by ACM between 2003 and 2005. The papers were downloaded from the CiteSeerX Autonomous Digital Library, and each one has keywords assigned by its authors and verified by the reviewers.

---


### Nguyen2007

**Dataset**: [Nguyen2007](datasets/Nguyen2007.zip)

**Cite**: [Keyphrase Extraction in Scientific Publications](https://link.springer.com/chapter/10.1007%2F978-3-540-77094-7_41)

**Description**: Nguyen2007 is a dataset composed of 211 scientific conference papers. The gold keywords were manually assigned by student volunteers, who were given three papers to read. The keywords assigned by the authors of the papers were hidden to avoid bias.

---


### pak2018

**Dataset**: [pak2018](datasets/pak2018.zip)

**Cite**: [YAKE! Keyword Extraction from Single Documents using Multiple Local Features](https://www.sciencedirect.com/science/article/pii/S0020025519308588?via%3Dihub)

**Description**: pak2018 is a Polish dataset of 50 journal abstracts on technical topics collected from [Measurement Automation and Monitoring](http://pak.info.pl/) (in Polish, “Pomiary, Automatyka, Kontrola”). The gold keywords are the author-assigned ones, with 2-6 keywords per document.

---


### PubMed

**Dataset**: [PubMed](datasets/PubMed.zip)

**Cite**: [The NLM Indexing Initiative](https://pubmed.ncbi.nlm.nih.gov/11079836/)

**Description**: The PubMed dataset is based on full-text papers collected from PubMed Central, which comprises over 26 million citations for biomedical literature from MEDLINE, life science journals, and online books. It consists of 500 papers selected from the same source. PubMed uses the Medical Subject Headings ([MeSH](https://www.ncbi.nlm.nih.gov/mesh)), a controlled vocabulary thesaurus used for indexing articles for PubMed, as the gold keywords for the documents.

---


### Schutz2008

**Dataset**: [Schutz2008](datasets/Schutz2008.zip)

**Cite**: [Keyphrase Extraction from Single Documents in the Open Domain Exploiting Linguistic and Statistical Methods](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.394.5372&rep=rep1&type=pdf)

**Description**: The Schutz2008 dataset is based on full-text papers collected from PubMed Central, which comprises over 26 million citations for biomedical literature from MEDLINE, life science journals, and online books. It consists of 1,231 papers selected from PubMed Central, distributed across 254 different journals ranging from Abdominal Imaging to World Journal of Urology. The keywords assigned by the authors are hidden in the article and used as gold keywords.

---


### SemEval2010

**Dataset**: [SemEval2010](datasets/SemEval2010.zip)

**Cite**: [Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles](https://dl.acm.org/citation.cfm?id=1859668)

**Description**: SemEval2010 is one of the most popular datasets used for keyword extraction evaluation. It consists of 244 full scientific papers extracted from the ACM Digital Library, each one ranging from 6 to 8 pages and belonging to one of four computer science research areas (distributed systems; information search and retrieval; distributed artificial intelligence – multiagent systems; social and behavioral sciences – economics). Each paper has an author-assigned set of keywords (which are part of the original PDF file) and a set of keywords assigned by professional editors, both of which may or may not appear explicitly in the text.

---


### SemEval2017

**Dataset**: [SemEval2017](datasets/SemEval2017.zip)

**Cite**: [Semeval 2017 task 10: Scienceie-extracting keyphrases and relations from scientific publications](https://arxiv.org/abs/1704.02853)

**Description**: SemEval2017 consists of 500 paragraphs selected from 500 ScienceDirect journal articles, evenly distributed among the domains of Computer Science, Material Sciences and Physics. Each text has a number of keywords selected by one undergraduate student and one expert annotator. The expert's annotation is prioritized whenever the two annotators disagree. The original purpose of the dataset is extracting keywords and relations from scientific publications.

---


### theses100

**Dataset**: [theses100](datasets/theses100.zip)

**Cite**: [Originally downloaded from zelandiya github account](https://github.com/zelandiya/keyword-extraction-datasets/blob/ba4966ccceafb1c159cdc42f8e8dc630eff126d4/theses100.zip)

**Description**: The theses100 dataset consists of 100 full master's and Ph.D. theses from the University of Waikato, New Zealand. The theses cover a wide range of domains, from chemistry, computer science, and economics to psychology, philosophy, history, and others.

---


### wicc

**Dataset**: [wicc](datasets/wicc.zip)

**Cite**: [Keyword Identification in Spanish Documents using Neural Networks](http://sedici.unlp.edu.ar/handle/10915/50087)

**Description**: The wicc dataset is composed of 1640 scientific articles published between 1999 and 2012 at the Workshop of Researchers in Computer Science [WICC](http://redunci.info.unlp.edu.ar/wicc.html).

---


### wiki20

**Dataset**: [wiki20](datasets/wiki20.zip)

**Cite**: [Topic indexing with Wikipedia](http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-004.pdf)

**Description**: wiki20 consists of 20 English technical research reports covering different aspects of computer science. Fifteen teams, each consisting of two senior computer science undergraduates, assigned keywords to each report using Wikipedia article titles as the candidate vocabulary. The teams were instructed to assign around 5 keywords to each document. Each team assigned 5.7 keywords on average.

---


### WikiNews

**Dataset**: [WikiNews](datasets/WikiNews.zip)

**Cite**: [TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction](https://www.aclweb.org/anthology/I13-1062.pdf)

**Description**: WikiNews is a French corpus created from the French version of WikiNews. It contains 100 news articles published between May 2012 and December 2012, each manually annotated by at least three students.

---


### www

**Dataset**: [www](datasets/www.zip)

**Cite**: [Extracting Keyphrases from Research Papers using Citation Networks](https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/viewFile/8662/8618)

**Description**: The WWW collection consists of 1330 abstracts of papers from the World Wide Web Conference (WWW) published between 2004 and 2014. The gold keywords of these papers are the author-labeled terms.

---

--------------------------------------------------------------------------------
/datasets/110-PT-BN-KP.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/110-PT-BN-KP.zip

--------------------------------------------------------------------------------
/datasets/500N-KPCrowd-v1.1.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/500N-KPCrowd-v1.1.zip

--------------------------------------------------------------------------------
/datasets/Inspec.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/Inspec.zip

--------------------------------------------------------------------------------
/datasets/Krapivin2009.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/Krapivin2009.zip

--------------------------------------------------------------------------------
/datasets/Nguyen2007.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/Nguyen2007.zip

--------------------------------------------------------------------------------
/datasets/PubMed.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/PubMed.zip

--------------------------------------------------------------------------------
/datasets/Schutz2008.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/Schutz2008.zip

--------------------------------------------------------------------------------
/datasets/SemEval2010.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/SemEval2010.zip

--------------------------------------------------------------------------------
/datasets/SemEval2017.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/SemEval2017.zip

--------------------------------------------------------------------------------
/datasets/WikiNews.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/WikiNews.zip

--------------------------------------------------------------------------------
/datasets/cacic.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/cacic.zip

--------------------------------------------------------------------------------
/datasets/citeulike180.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/citeulike180.zip

--------------------------------------------------------------------------------
/datasets/fao30.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/fao30.zip

--------------------------------------------------------------------------------
/datasets/fao780.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/fao780.zip

--------------------------------------------------------------------------------
/datasets/kdd.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/kdd.zip

--------------------------------------------------------------------------------
/datasets/pak2018.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/pak2018.zip

--------------------------------------------------------------------------------
/datasets/theses100.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/theses100.zip

--------------------------------------------------------------------------------
/datasets/wicc.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/wicc.zip

--------------------------------------------------------------------------------
/datasets/wiki20.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/wiki20.zip

--------------------------------------------------------------------------------
/datasets/www.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/8cd1e18ab750143e2075457ef5e0481754e3e966/datasets/www.zip

--------------------------------------------------------------------------------
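
The archives in this listing can be fetched directly from their raw URLs. A minimal sketch using only the Python standard library; the helper names `dataset_url` and `download_dataset` and the `dest_dir` parameter are illustrative, and the URL pattern simply mirrors the listing (pinned to the same commit). The internal layout of each archive is not described here, so the sketch only extracts the files without assuming any folder structure.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL copied from the file listing, pinned to the same commit.
BASE_URL = (
    "https://raw.githubusercontent.com/INESCTEC/KeywordExtractor-Datasets/"
    "8cd1e18ab750143e2075457ef5e0481754e3e966/datasets"
)


def dataset_url(name: str) -> str:
    """Build the raw download URL for a dataset name such as 'Inspec'."""
    return f"{BASE_URL}/{name}.zip"


def download_dataset(name: str, dest_dir: str = "data") -> Path:
    """Download one archive into `dest_dir`, extract it there, and return `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"{name}.zip"
    urllib.request.urlretrieve(dataset_url(name), archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

Calling `download_dataset("Inspec")` would then place the extracted files under `data/`; network access is required, and nothing is downloaded until the function is called.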