├── README.md
└── data
    ├── cyberleninka_0.jsonlines.zip
    ├── cyberleninka_1.jsonlines.zip
    ├── cyberleninka_2.jsonlines.zip
    ├── cyberleninka_3.jsonlines.zip
    ├── cyberleninka_4.jsonlines.zip
    ├── habrahabr_0.jsonlines.zip
    ├── habrahabr_1.jsonlines.zip
    ├── habrahabr_2.jsonlines.zip
    ├── habrahabr_3.jsonlines.zip
    ├── ng_0.jsonlines.zip
    ├── ng_1.jsonlines.zip
    ├── russia_today_0.jsonlines.zip
    ├── russia_today_1.jsonlines.zip
    ├── russia_today_2.jsonlines.zip
    ├── russia_today_3.jsonlines.zip
    ├── russia_today_4.jsonlines.zip
    ├── russia_today_5.jsonlines.zip
    ├── russia_today_6.jsonlines.zip
    └── russia_today_7.jsonlines.zip
/README.md:
--------------------------------------------------------------------------------
1 | # Ru_kw_eval_datasets
2 | ### Datasets for evaluation of keyword extraction in Russian
3 |
4 | You can find all the datasets in the /data directory. The datasets are stored in .jsonlines format (every line in a file is a JSON object), and each file is zipped. The datasets are split into parts because of GitHub file size limits.
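
A minimal sketch of reading one part, using only the standard library. It assumes that each archive holds one or more UTF-8 text files and that every non-empty line parses with `json.loads`. Neither point is confirmed here, and the helper name `iter_documents` is hypothetical.

```python
import io
import json
import zipfile


def iter_documents(zip_path):
    """Yield one dict per non-empty line from every file inside a .jsonlines.zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # Assumption: the text inside the archive is UTF-8 encoded.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

For example, `for doc in iter_documents("data/ng_0.jsonlines.zip"): ...` walks the documents of one part without unpacking the archive to disk.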
5 |
6 | Sources of data:
7 |
8 | * https://russian.rt.com/
9 |
10 | * https://habr.com/
11 |
12 | * http://www.ng.ru/
13 |
14 | * https://cyberleninka.ru/
15 |
16 | Every line in a file represents one document. For the **RussiaToday**, **NG** and **Habrahabr** files, each JSON line has the following structure:
17 | ```python
18 | {'url': 'https://url.here', 'content': 'Text of the document here', 'title': 'Title of the document here',
19 |
20 | 'summary': 'short summary of the document here', 'keywords': ['key', 'words', 'here']}
21 | ```
22 |
23 | For the **Cyberleninka** files, each JSON line has the following structure:
24 | ```python
25 | {'url':'https://url.here', 'content': 'Text of the document here', 'title': 'Title of the document here',
26 |
27 | 'abstract': 'abstract of the document here', 'keywords': ['key', 'words', 'here']}
28 | ```
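
The two layouts differ only in one field: `summary` for the news and Habrahabr files, `abstract` for Cyberleninka. If you want to process all sources with the same code, one option is to map both onto a common key. The sketch below does that; the function name `normalize` and the target key `description` are hypothetical choices for illustration, not part of the datasets.

```python
def normalize(doc):
    """Return a new dict where 'summary' or 'abstract' is stored under 'description'."""
    record = {key: doc.get(key) for key in ("url", "title", "content")}
    # Copy the keyword list so later edits do not touch the source record.
    record["keywords"] = list(doc.get("keywords") or [])
    # News and Habrahabr records carry 'summary'; Cyberleninka records carry 'abstract'.
    record["description"] = doc.get("summary", doc.get("abstract"))
    return record
```

After this step every record has the same five keys regardless of which file it came from.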
29 |
30 | Cyberleninka documents are PDFs converted to raw text with pdf2text, so they may contain conversion errors and stray line breaks. Also note that the keywords were extracted from the documents **manually** (hell, that was boring!) after conversion, and I could easily have missed something. Please let me know if you find undeleted keywords inside the content field.
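
If you want to look for leftover keyword lists yourself, a rough heuristic is to search the `content` field for a keyword-list marker. The sketch below assumes such lists are introduced by "Ключевые слова" or "Keywords", which is a guess about the source papers and not something checked against the data. The names `MARKER` and `find_keyword_markers` are hypothetical.

```python
import re

# Assumed markers for a keyword list; \s matches the stray line breaks left by PDF conversion.
MARKER = re.compile(r"ключевые\s+слова|key\s*words", re.IGNORECASE)


def find_keyword_markers(content, context=80):
    """Return a snippet for each marker: the marker plus up to `context` following characters."""
    return [content[m.start(): m.end() + context] for m in MARKER.finditer(content)]
```

A non-empty result only flags a document for manual review. A match can also come from ordinary running text that happens to mention keywords.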
31 |
32 | My e-mail: manefedov26@gmail.com
33 |
34 |
35 | ## Dataset Metadata
36 | The following table is necessary for this dataset to be indexed by search
37 | engines such as Google Dataset Search.
38 |
39 | | property | value |
40 | | --- | --- |
41 | | name | Datasets for evaluation of keyword extraction in Russian |
42 | | url | https://github.com/mannefedov/ru_kw_eval_datasets |
43 | | sameAs | https://github.com/mannefedov/ru_kw_eval_datasets |
44 | | description | Datasets for evaluation of keyword extraction in Russian |
45 | | author | Mikhail Nefedov |
--------------------------------------------------------------------------------
/data/cyberleninka_0.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_0.jsonlines.zip
--------------------------------------------------------------------------------
/data/cyberleninka_1.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_1.jsonlines.zip
--------------------------------------------------------------------------------
/data/cyberleninka_2.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_2.jsonlines.zip
--------------------------------------------------------------------------------
/data/cyberleninka_3.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_3.jsonlines.zip
--------------------------------------------------------------------------------
/data/cyberleninka_4.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/cyberleninka_4.jsonlines.zip
--------------------------------------------------------------------------------
/data/habrahabr_0.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_0.jsonlines.zip
--------------------------------------------------------------------------------
/data/habrahabr_1.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_1.jsonlines.zip
--------------------------------------------------------------------------------
/data/habrahabr_2.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_2.jsonlines.zip
--------------------------------------------------------------------------------
/data/habrahabr_3.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/habrahabr_3.jsonlines.zip
--------------------------------------------------------------------------------
/data/ng_0.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/ng_0.jsonlines.zip
--------------------------------------------------------------------------------
/data/ng_1.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/ng_1.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_0.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_0.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_1.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_1.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_2.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_2.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_3.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_3.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_4.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_4.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_5.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_5.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_6.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_6.jsonlines.zip
--------------------------------------------------------------------------------
/data/russia_today_7.jsonlines.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mannefedov/ru_kw_eval_datasets/f0492949945f669d8fc064d703eb6f02c2e9f553/data/russia_today_7.jsonlines.zip
--------------------------------------------------------------------------------