├── README.md
└── data
    ├── cyberleninka_0.jsonlines.zip
    ├── cyberleninka_1.jsonlines.zip
    ├── cyberleninka_2.jsonlines.zip
    ├── cyberleninka_3.jsonlines.zip
    ├── cyberleninka_4.jsonlines.zip
    ├── habrahabr_0.jsonlines.zip
    ├── habrahabr_1.jsonlines.zip
    ├── habrahabr_2.jsonlines.zip
    ├── habrahabr_3.jsonlines.zip
    ├── ng_0.jsonlines.zip
    ├── ng_1.jsonlines.zip
    ├── russia_today_0.jsonlines.zip
    ├── russia_today_1.jsonlines.zip
    ├── russia_today_2.jsonlines.zip
    ├── russia_today_3.jsonlines.zip
    ├── russia_today_4.jsonlines.zip
    ├── russia_today_5.jsonlines.zip
    ├── russia_today_6.jsonlines.zip
    └── russia_today_7.jsonlines.zip

/README.md:
--------------------------------------------------------------------------------
# Ru_kw_eval_datasets
### Datasets for evaluation of keyword extraction in Russian

You can find all the datasets in the /data directory. The datasets are stored in .jsonlines format (every line in a file is a JSON object). The datasets are split into parts due to GitHub file size limitations.

Sources of data:

* https://russian.rt.com/

* https://habr.com/

* http://www.ng.ru/

* https://cyberleninka.ru/

Every line in a file represents one document. For **RussiaToday**, **NG** and **Habrahabr**, each JSON line has the following structure:
```python
{'url': 'https://url.here', 'content': 'Text of the document here', 'title': 'Title of the document here',

'summary': 'short summary of the document here', 'keywords': ['key', 'words', 'here']}
```

For **Cyberleninka** files the structure of the JSON is:
```python
{'url': 'https://url.here', 'content': 'Text of the document here', 'title': 'Title of the document here',

'abstract': 'abstract of the document here', 'keywords': ['key', 'words', 'here']}
```

Cyberleninka documents are PDFs converted to raw text with pdf2text, so they may contain conversion mistakes and random line breaks.
Also note that the keywords were extracted from the documents **manually** (hell, that was boring!) after conversion, and I could easily have missed something. Please let me know if you find undeleted keywords inside the content field.

My e-mail: manefedov26@gmail.com


## Dataset Metadata
The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.

| property | value |
| --- | --- |
| name | Datasets for evaluation of keyword extraction in Russian |
| url | |
| sameAs | https://github.com/mannefedov/ru_kw_eval_datasets |
| description | Datasets for evaluation of keyword extraction in Russian |
| author | |

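The file layout described above (zipped `.jsonlines` parts, one JSON document per line) can be read with the Python standard library alone. The following is a minimal sketch, not part of the original repository: the helper names `read_jsonlines_zip` and `summary_of` are hypothetical, and the code assumes that each archive holds UTF-8 text files whose lines are valid JSON objects with the keys listed in this README.

```python
import io
import json
import zipfile
from typing import Iterator


def read_jsonlines_zip(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of every file in the archive.

    Assumption: archive members are UTF-8 encoded JSON-lines text files.
    """
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):  # skip directory entries, if any
                continue
            with archive.open(member) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)


def summary_of(doc: dict) -> str:
    """Return the short description of a document.

    News and Habrahabr records carry it under 'summary',
    Cyberleninka records under 'abstract'.
    """
    return doc.get("summary") or doc.get("abstract") or ""
```

For example, `for doc in read_jsonlines_zip("data/ng_0.jsonlines.zip"): print(doc["title"], doc["keywords"])` would list the title and reference keywords of each document in the first NG part. Reading straight from the archive avoids unpacking the parts to disk first.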
67 | -------------------------------------------------------------------------------- /data/cyberleninka_0.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_0.jsonlines.zip -------------------------------------------------------------------------------- /data/cyberleninka_1.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_1.jsonlines.zip -------------------------------------------------------------------------------- /data/cyberleninka_2.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_2.jsonlines.zip -------------------------------------------------------------------------------- /data/cyberleninka_3.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_3.jsonlines.zip -------------------------------------------------------------------------------- /data/cyberleninka_4.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_4.jsonlines.zip -------------------------------------------------------------------------------- /data/habrahabr_0.jsonlines.zip: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_0.jsonlines.zip -------------------------------------------------------------------------------- /data/habrahabr_1.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_1.jsonlines.zip -------------------------------------------------------------------------------- /data/habrahabr_2.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_2.jsonlines.zip -------------------------------------------------------------------------------- /data/habrahabr_3.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_3.jsonlines.zip -------------------------------------------------------------------------------- /data/ng_0.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/ng_0.jsonlines.zip -------------------------------------------------------------------------------- /data/ng_1.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/ng_1.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_0.jsonlines.zip: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_0.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_1.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_1.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_2.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_2.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_3.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_3.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_4.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_4.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_5.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_5.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_6.jsonlines.zip: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_6.jsonlines.zip -------------------------------------------------------------------------------- /data/russia_today_7.jsonlines.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_7.jsonlines.zip --------------------------------------------------------------------------------