├── README.md
├── binary-classification
│   ├── Blue Birds
│   │   └── README.md
│   ├── Crowdsourced Amazon Sentiment
│   │   └── README.md
│   ├── Crowdsourced loneliness-slr
│   │   └── README.md
│   ├── HITspam-UsingCrowdflower
│   │   └── README.md
│   ├── HITspam-UsingMTurk
│   │   └── README.md
│   ├── Recognizing Textual Entailment
│   │   └── README.md
│   ├── Sentiment popularity - AMT
│   │   └── README.md
│   ├── Temporal Ordering
│   │   └── README.md
│   ├── Text Highlighting
│   │   └── README.md
│   ├── Toloka Aggregation Relevance 2
│   │   └── README.md
│   └── readme.md
├── download_datasets.py
├── multi-class-classification
│   ├── 2010 Crowdsourced Web Relevance Judgments
│   │   └── README.md
│   ├── AdultContent2
│   │   └── README.md
│   ├── AdultContent3
│   │   └── README.md
│   ├── Emotion
│   │   └── README.md
│   ├── Toloka Aggregation Relevance 5
│   │   └── readme.md
│   ├── Weather Sentiment - AMT
│   │   └── README.md
│   ├── Word Pair Similarity
│   │   └── README.md
│   └── readme.md
└── transform_datasets.py
/README.md:
--------------------------------------------------------------------------------
1 | # CrowdData
2 |
3 | CrowdData is an open repository that aggregates crowdsourced datasets containing individual crowd votes. We aim to provide these datasets in a standard format (explained in the `Download` section below) so that they can be used directly in experiments, without any preprocessing effort. The datasets included in this repo serve classification tasks (mainly text classification, except the Emotion dataset). CrowdData can benefit researchers investigating the hybrid use of machines and humans-in-the-loop in classification tasks (the repo includes 5 datasets with the actual content of the tasks), human performance in classification and ranking tasks, truth discovery based on crowdsourced data, estimation of crowd bias, and active learning. **If you use any of the datasets in this repository, please make sure that you've read and followed the usage consent we explain at the bottom of this page.**
4 |
5 | ## Datasets
6 |
7 | We categorized the datasets into two folders: `binary-classification` and `multi-class-classification`. Within each folder, each dataset is kept in a separate folder that contains a link to the original source. The table below gives an overview of the datasets; its columns are as follows:
8 |
9 | * `Dataset`: Name of the dataset including a link to the original source.
10 | * `Description`: A brief description of the dataset.
11 | * `Number of tasks`: The number of tasks asked to the crowd.
12 | * `Number of workers`: Number of crowd workers completing the tasks.
13 | * `Number of total votes`: Number of votes collected for all tasks.
14 | * `Ground Truth`: Whether the ground truths of the corresponding tasks are available in the dataset (yes, no, or partially).
15 | * `Task Type`: Type of the task asked to the crowd. It is either a binary or a multi-class question. If it is a multi-class question, we specify whether it is categorical (how many categories?), interval (what range?), or ordinal (how many classes?).
16 | * `Task Content`: Content of the task asked to the crowd (text, image, etc.), and whether the content is available in the dataset (available, unavailable, or partially available).
17 | * `I don't know option`: Whether the crowd workers have an "I don't know" option while completing the tasks.
18 | * `Time spent on the task`: Whether the dataset includes any information about the time spent on the tasks.
19 |
20 |
21 | | Dataset | Description | Number of tasks | Number of workers | Number of total votes | Ground Truth | Task Type | Task Content | I don't know option | Time spent on the task |
22 | |---|---|---|---|---|---|---|---|---|---|
23 | | [Blue Birds](https://github.com/welinder/cubam/tree/public/demo/bluebirds) | The task is to identify whether the image contains a blue bird or not. The dataset contains both the individual votes and the ground truths. | 108 | 39 | 4212 | Yes | binary | image, unavailable | No | No |
24 | | [Crowdsourced Amazon Sentiment](https://github.com/Evgeneus/screening-classification-datasets/tree/master/crowdsourced-amazon-sentiment-dataset/data) | The task is to make sentiment analysis on Amazon product reviews. There are two predicates: "is_book", "is_negative". | 1011 | 284 | 7803 | Yes | binary | text, available | No | Unavailable |
25 | | [Crowdsourced loneliness-slr](https://github.com/Evgeneus/crowd-machine-collaboration-for-item-screening/tree/master/data/amt_real_data) | Each paper is assessed by three questions: (i) Is it related to the use of technology? (ii) Is it related to older adults? (iii) Is it related to the intervention? | 319 | 34 | 797 | Yes | binary | text, unavailable | Yes | Unavailable |
26 | | [HITspam-UsingCrowdflower](https://github.com/ipeirotis/Get-Another-Label/tree/master/data/HITspam-UsingCrowdflower) | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from Crowdflower data) should be considered as a "spam" task. | 5380 | 153 | 42762 | Partially | binary | text, unavailable | No | Unavailable |
27 | | [HITspam-UsingMTurk](https://github.com/ipeirotis/Get-Another-Label/tree/master/data/HITspam-UsingMTurk) | The dataset contains individual worker judgments and the related ground truths about whether a HIT (from MTurk data) should be considered as a "spam" task. | 5840 | 135 | 28354 | Partially | binary | text, unavailable | No | Unavailable |
28 | | [Recognizing Textual Entailment](https://sites.google.com/site/nlpannotations/) | Recognizing Textual Entailment dataset contains the individual worker judgments and the related ground truths about identifying whether a given Hypothesis sentence is implied by the information in the given text. | 800 | 164 | 8000 | Yes | binary | text, available | No | Unavailable |
29 | | [Sentiment popularity - AMT](https://eprints.soton.ac.uk/376544/) | This dataset contains positive or negative judgments of workers for 500 sentences extracted from movie reviews, with gold labels assigned by the website. | 500 | 143 | 10000 | Yes | binary | text, unavailable | No | Yes |
30 | | [Temporal Ordering](https://sites.google.com/site/nlpannotations/) | Temporal Ordering dataset contains the individual worker votes and the corresponding ground truths for the task of identifying whether one event happens before another event in a given context. | 462 | 76 | 4620 | Yes | binary | text, partially available | No | Unavailable |
31 | | [Text Highlighting](https://figshare.com/articles/Crowdsourced_dataset_to_study_the_generation_and_impact_of_text_highlighting_in_classification_tasks/9917162) | This dataset contains two kinds of tasks: (i) classification tasks with highlighting support, and (ii) highlighting tasks, where the workers highlight evidence. | 685 | 1851 | 27711 | Yes | binary | text, available | Maybe option | Available |
32 | | [Toloka Aggregation Relevance 2](https://research.yandex.com/datasets/toloka) | This dataset contains approximately 0.5 million anonymized individual votes that were collected in the "Relevance 2 Gradations" project in 2016. | 99319 | 7139 | 475536 | Partially | binary | text, unavailable | No | Unavailable |
33 | | [2010 Crowdsourced Web Relevance Judgments Data](https://www.ischool.utexas.edu/~ml/data/trec-rf10-crowd.tgz) | The dataset contains judgments about the relevance of English Web pages from the ClueWeb09 collection (http://lemurproject.org/clueweb09/). The judgments are given on a 3-level scale: highly relevant, relevant, and non-relevant. A fourth judgment option indicated a broken link that could not be judged. | 20232 | 766 | 98453 | Yes | multi, 3 classes | text, unavailable | No | Unavailable |
34 | | [AdultContent2](https://github.com/ipeirotis/Get-Another-Label/tree/master/data/AdultContent2) | This dataset contains approximately 100K individual worker judgments and the related ground truths for classification of websites into 5 categories. | 11040 | 269 | 92721 | Partially | multi, 5 categories | text, unavailable | No | Unavailable |
35 | | [AdultContent3](https://github.com/ipeirotis/Get-Another-Label/tree/master/data/AdultContent3-HCOMP2010) | This dataset contains approximately 50K individual worker judgments and the related ground truths for classification of websites into 4 categories. | 500 | 100 | 50000 | No | multi, 4 categories | text, unavailable | No | Unavailable |
36 | | [Emotion](https://sites.google.com/site/nlpannotations/) | This dataset contains individual worker votes that rate the emotion of a given text along the following dimensions: anger, disgust, fear, joy, sadness, surprise, and valence. Each rating is a value from -100 to 100 for each emotion. | 700 | 10 | 7000 | Yes | multi, interval (-100,100) | text, available | No | Unavailable |
37 | | [Toloka Aggregation Relevance 5](https://research.yandex.com/datasets/toloka) | This dataset contains the judgments on the relevance of a document for a query on a 5-graded scale. | 363814 | 1274 | 1091918 | Partially | multi, 5 classes | text, unavailable | No | Unavailable |
38 | | [Weather Sentiment - AMT](https://eprints.soton.ac.uk/376543/) | This dataset contains the sentiment judgments of 300 tweets. The classification task is based on the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3) and can't tell (4). | 300 | 110 | 6000 | Yes | multi, 5 classes | text, unavailable | Yes | Yes |
39 | | [Word Pair Similarity](https://sites.google.com/site/nlpannotations/) | This dataset contains the individual worker votes that assign a numerical similarity score between 0 and 10 to a given word pair. | 30 | 10 | 300 | Yes | multi, interval (0,10) | text, unavailable | No | Unavailable |
40 |
41 | ## Download
42 |
43 | We provide two Python scripts that download all the datasets and then transform them into a standard format. To do so, first run `download_datasets.py`, and then `transform_datasets.py`. The required Python version is 3.7, and the `pandas` and `wget` packages must be installed on your system (the other imports, namely `os`, `zipfile`, `tarfile`, `re`, `csv`, `platform`, and `shutil`, are part of the standard library).
44 |
45 | Running the two scripts in the given order will create one csv file within each dataset folder. These csv files follow a standard format with the following columns:
46 |
47 | * `workerID`: ID of the crowd worker.
48 | * `taskID`: ID of the task answered by the corresponding worker.
49 | * `response`: Response of the corresponding worker on the task identified by `taskID`.
50 | * `goldLabel`: Gold label of the corresponding task (if available).
51 | * `taskContent`: Content of the task answered by the worker (if available).
52 |
53 | Only the `Sentiment popularity - AMT` and `Weather Sentiment - AMT` datasets have an additional column:
54 |
55 | * `timeSpent`: The time the corresponding worker spent on the task.
56 |
57 | **P.S.** If the original dataset includes multiple predicates per task, we create one csv file per predicate in the transformed version of the dataset.
58 |
59 | (**Do not modify any of the directory names or dataset files you downloaded from this repo; otherwise, the resulting csv files cannot be generated accurately.**)
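
For example, after both scripts have run, a transformed csv can be sanity-checked with a simple majority vote. The sketch below is purely illustrative (it assumes `pandas`; the Blue Birds path is just an example, and the majority-vote baseline is ours, not part of the repository):

```python
import pandas as pd

# any transformed_dataset.csv produced by the scripts has the same column layout
df = pd.read_csv('binary-classification/Blue Birds/transformed_dataset.csv')

# majority vote: the most frequent response per task
majority = df.groupby('taskID')['response'].agg(lambda votes: votes.mode().iloc[0])

# accuracy of the majority vote against the gold labels, where available
gold = df.groupby('taskID')['goldLabel'].first().dropna()
accuracy = (majority.loc[gold.index].astype(str) == gold.astype(str)).mean()
print('majority-vote accuracy: %.3f' % accuracy)
```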
60 |
61 | ## Usage consent
62 |
63 | By using this tool you agree to acknowledge the original datasets and to check their terms and conditions. Some data providers may require authentication, filling in forms, etc. For convenience, we include a link to the original source both in the table above and in the individual dataset folders.
64 |
--------------------------------------------------------------------------------
/binary-classification/Blue Birds/README.md:
--------------------------------------------------------------------------------
1 | # Blue Birds Dataset
2 |
3 | Link to the original source: https://github.com/welinder/cubam/tree/public/demo/bluebirds
4 |
5 | The task is to identify whether the image contains a blue bird or not. The dataset contains both the individual votes and the ground truths.
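
The raw files are two YAML dictionaries: `gt.yaml` maps each image (task) ID to its ground-truth label, and `labels.yaml` maps each worker ID to a dictionary of image ID/vote pairs. For a quick look at the raw data, here is a minimal sketch assuming the `pyyaml` package (the repository's own transformation script parses these files without it):

```python
import yaml  # pyyaml; only needed for this sketch

with open('data-raw/gt.yaml') as f:
    gt = yaml.safe_load(f)        # {taskID: ground-truth label}

with open('data-raw/labels.yaml') as f:
    labels = yaml.safe_load(f)    # {workerID: {taskID: vote}}

rows = [(worker, task, vote, gt.get(task))
        for worker, votes in labels.items()
        for task, vote in votes.items()]
print(len(rows), 'votes')  # 39 workers x 108 tasks = 4212 votes
```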
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 |
11 | @inproceedings{WelinderEtal10b,
12 | Author = {Peter Welinder and Steve Branson and Serge Belongie and Pietro Perona},
13 | Booktitle = {NIPS},
14 | Title = {{The Multidimensional Wisdom of Crowds}},
15 | Year = {2010}}
16 | ```
--------------------------------------------------------------------------------
/binary-classification/Crowdsourced Amazon Sentiment/README.md:
--------------------------------------------------------------------------------
1 | # Crowdsourced Amazon Sentiment Dataset
2 |
3 | Link to the original source: https://github.com/Evgeneus/screening-classification-datasets/tree/master/crowdsourced-amazon-sentiment-dataset/data
4 |
5 | The task is to make sentiment analysis on Amazon product reviews. There are two predicates: "is_book", "is_negative".
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @article{Krivosheev:2018:CCM:3290265.3274366,
11 | author = {Krivosheev, Evgeny and Casati, Fabio and Baez, Marcos and Benatallah, Boualem},
12 | title = {Combining Crowd and Machines for Multi-predicate Item Screening},
13 | journal = {Proc. ACM Hum.-Comput. Interact.},
14 | issue_date = {November 2018},
15 | volume = {2},
16 | number = {CSCW},
17 | month = nov,
18 | year = {2018},
19 | issn = {2573-0142},
20 | pages = {97:1--97:18},
21 | articleno = {97},
22 | numpages = {18},
23 | url = {http://doi.acm.org/10.1145/3274366},
24 | doi = {10.1145/3274366},
25 | acmid = {3274366},
26 | publisher = {ACM},
27 | address = {New York, NY, USA},
28 | keywords = {classification, crowd-machine system, crowdsourcing},
29 | }
30 | ```
--------------------------------------------------------------------------------
/binary-classification/Crowdsourced loneliness-slr/README.md:
--------------------------------------------------------------------------------
1 | # Crowdsourced loneliness-slr Dataset
2 |
3 | Link to the original source: https://github.com/Evgeneus/crowd-machine-collaboration-for-item-screening/tree/master/data/amt_real_data
4 |
5 | Each paper is assessed by three questions: (i) Is it related to the use of technology? (ii) Is it related to older adults? (iii) Is it related to the intervention?
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @article{Krivosheev:2018:CCM:3290265.3274366,
11 | author = {Krivosheev, Evgeny and Casati, Fabio and Baez, Marcos and Benatallah, Boualem},
12 | title = {Combining Crowd and Machines for Multi-predicate Item Screening},
13 | journal = {Proc. ACM Hum.-Comput. Interact.},
14 | issue_date = {November 2018},
15 | volume = {2},
16 | number = {CSCW},
17 | month = nov,
18 | year = {2018},
19 | issn = {2573-0142},
20 | pages = {97:1--97:18},
21 | articleno = {97},
22 | numpages = {18},
23 | url = {http://doi.acm.org/10.1145/3274366},
24 | doi = {10.1145/3274366},
25 | acmid = {3274366},
26 | publisher = {ACM},
27 | address = {New York, NY, USA},
28 | keywords = {classification, crowd-machine system, crowdsourcing},
29 | }
30 | ```
--------------------------------------------------------------------------------
/binary-classification/HITspam-UsingCrowdflower/README.md:
--------------------------------------------------------------------------------
1 | # HITspam-UsingCrowdflower
2 |
3 | Link to the original source: https://github.com/ipeirotis/Get-Another-Label/tree/master/data/HITspam-UsingCrowdflower
4 |
5 | The dataset contains individual worker judgments and the related ground truths about whether an AMT HIT (from Crowdflower data) should be considered as a "spam" task.
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @inproceedings{Sheng:2008:GLI:1401890.1401965,
11 | author = {Sheng, Victor S. and Provost, Foster and Ipeirotis, Panagiotis G.},
12 | title = {Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers},
13 | booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
14 | series = {KDD '08},
15 | year = {2008},
16 | isbn = {978-1-60558-193-4},
17 | location = {Las Vegas, Nevada, USA},
18 | pages = {614--622},
19 | numpages = {9},
20 | url = {http://doi.acm.org/10.1145/1401890.1401965},
21 | doi = {10.1145/1401890.1401965},
22 | acmid = {1401965},
23 | publisher = {ACM},
24 | address = {New York, NY, USA},
25 | keywords = {data preprocessing, data selection},
26 | }
27 | ```
--------------------------------------------------------------------------------
/binary-classification/HITspam-UsingMTurk/README.md:
--------------------------------------------------------------------------------
1 | # HITspam-UsingMTurk
2 |
3 | Link to the original source: https://github.com/ipeirotis/Get-Another-Label/tree/master/data/HITspam-UsingMTurk
4 |
5 | The dataset contains individual worker judgments and the related ground truths about whether an AMT HIT (from MTurk data) should be considered as a "spam" task.
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @inproceedings{Sheng:2008:GLI:1401890.1401965,
11 | author = {Sheng, Victor S. and Provost, Foster and Ipeirotis, Panagiotis G.},
12 | title = {Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers},
13 | booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
14 | series = {KDD '08},
15 | year = {2008},
16 | isbn = {978-1-60558-193-4},
17 | location = {Las Vegas, Nevada, USA},
18 | pages = {614--622},
19 | numpages = {9},
20 | url = {http://doi.acm.org/10.1145/1401890.1401965},
21 | doi = {10.1145/1401890.1401965},
22 | acmid = {1401965},
23 | publisher = {ACM},
24 | address = {New York, NY, USA},
25 | keywords = {data preprocessing, data selection},
26 | }
27 | ```
--------------------------------------------------------------------------------
/binary-classification/Recognizing Textual Entailment/README.md:
--------------------------------------------------------------------------------
1 | # Recognizing Textual Entailment Dataset
2 |
3 | Link to the original source: https://sites.google.com/site/nlpannotations/
4 |
5 | This dataset contains the individual worker judgments and the related ground truths about identifying whether a given Hypothesis sentence is implied by the information in the given text.
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @inproceedings{Snow:2008:CFG:1613715.1613751,
11 | author = {Snow, Rion and O'Connor, Brendan and Jurafsky, Daniel and Ng, Andrew Y.},
12 | title = {Cheap and Fast---but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks},
13 | booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
14 | series = {EMNLP '08},
15 | year = {2008},
16 | location = {Honolulu, Hawaii},
17 | pages = {254--263},
18 | numpages = {10},
19 | url = {http://dl.acm.org/citation.cfm?id=1613715.1613751},
20 | acmid = {1613751},
21 | publisher = {Association for Computational Linguistics},
22 | address = {Stroudsburg, PA, USA},
23 | }
24 | ```
--------------------------------------------------------------------------------
/binary-classification/Sentiment popularity - AMT/README.md:
--------------------------------------------------------------------------------
1 | # Sentiment popularity - AMT Dataset
2 |
3 | Link to the original source: https://eprints.soton.ac.uk/376544/
4 |
5 | This dataset contains positive or negative judgments of workers for 500 sentences extracted from movie reviews, with gold labels assigned by the website. In total, the dataset includes 10,000 sentiment judgments collected from the AMT platform.
6 |
7 | **Cite this work as:**
8 |
9 | ```
10 | @inproceedings{soton376365,
11 | booktitle = {International Joint Conference on Artificial Intelligence (IJCAI-15) (31/07/15)},
12 | month = {July},
13 | title = {Bayesian modelling of community-based multidimensional trust in participatory sensing under data sparsity},
14 | author = {Matteo Venanzi and W.T.L. Teacy and Alex Rogers and Nicholas R. Jennings},
15 | year = {2015},
16 | pages = {717--724},
17 | url = {https://eprints.soton.ac.uk/376365/},
18 | abstract = {We propose a new Bayesian model for reliable aggregation of crowdsourced estimates of real-valued quantities in participatory sensing applications. Existing approaches focus on probabilistic modelling of user's reliability as the key to accurate aggregation. However, these are either limited to estimating discrete quantities, or require a significant number of reports from each user to accurately model their reliability. To mitigate these issues, we adopt a community-based approach, which reduces the data required to reliably aggregate real-valued estimates, by leveraging correlations between the reporting behaviour of users belonging to different communities. As a result, our method is up to 16.6\% more accurate than existing state-of-the-art methods and is up to 49\% more effective under data sparsity when used to estimate Wi-Fi hotspot locations in a real-world crowdsourcing application.}
19 | }
20 | ```
--------------------------------------------------------------------------------
/binary-classification/Temporal Ordering/README.md:
--------------------------------------------------------------------------------
1 | # Temporal Ordering Dataset
2 |
3 | Link to the original source: https://sites.google.com/site/nlpannotations/
4 |
5 | This dataset contains the individual worker votes and the corresponding ground truths for the task of identifying whether one event happens before another event in a given context.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Snow:2008:CFG:1613715.1613751,
10 | author = {Snow, Rion and O'Connor, Brendan and Jurafsky, Daniel and Ng, Andrew Y.},
11 | title = {Cheap and Fast---but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks},
12 | booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
13 | series = {EMNLP '08},
14 | year = {2008},
15 | location = {Honolulu, Hawaii},
16 | pages = {254--263},
17 | numpages = {10},
18 | url = {http://dl.acm.org/citation.cfm?id=1613715.1613751},
19 | acmid = {1613751},
20 | publisher = {Association for Computational Linguistics},
21 | address = {Stroudsburg, PA, USA},
22 | }
23 | ```
--------------------------------------------------------------------------------
/binary-classification/Text Highlighting/README.md:
--------------------------------------------------------------------------------
1 | # Text highlighting
2 |
3 | Link to the original source: https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-019-4858-z
4 |
5 | In a nutshell, the dataset contains two kinds of tasks:
6 |
7 | - classification tasks with highlighting support.
8 | - highlighting tasks, where the workers highlight evidence.
9 |
10 |
11 |
12 | **Cite this work as:**
13 |
14 | ```
15 | @Article{Ramirez2019,
16 | author={Ram{\'i}rez, Jorge
17 | and Baez, Marcos
18 | and Casati, Fabio
19 | and Benatallah, Boualem},
20 | title={Crowdsourced dataset to study the generation and impact of text highlighting in classification tasks},
21 | journal={BMC Research Notes},
22 | year={2019},
23 | volume={12},
24 | number={1},
25 | pages={820},
26 | issn={1756-0500},
27 | doi={10.1186/s13104-019-4858-z},
28 | url={https://doi.org/10.1186/s13104-019-4858-z}
29 | }
30 | ```
31 |
--------------------------------------------------------------------------------
/binary-classification/Toloka Aggregation Relevance 2/README.md:
--------------------------------------------------------------------------------
1 | # Toloka Aggregation Relevance 2
2 |
3 | Link to the original source: https://research.yandex.com/datasets/toloka
4 |
5 | This dataset contains approximately 0.5 million anonymized individual votes that were collected in the "Relevance 2 Gradations" project in 2016. From the source given above, the `crowd_labels.tsv` and `golden_labels.tsv` files should be copied into the `data-raw` folder.
6 |
7 |
--------------------------------------------------------------------------------
/binary-classification/readme.md:
--------------------------------------------------------------------------------
1 | # binary-classification
2 |
3 | The following datasets are included in this folder:
4 |
5 | - `Blue Birds`
6 | - `Crowdsourced Amazon Sentiment`
7 | - `Crowdsourced loneliness-slr`
8 | - `HITspam-UsingCrowdflower`
9 | - `HITspam-UsingMTurk`
10 | - `Recognizing Textual Entailment`
11 | - `Sentiment popularity - AMT`
12 | - `Temporal Ordering`
13 | - `Text Highlighting`
14 | - `Toloka Aggregation Relevance 2`
15 |
--------------------------------------------------------------------------------
/download_datasets.py:
--------------------------------------------------------------------------------
1 | '''
2 | By using this tool you agree to acknowledge the original datasets and to check their terms and conditions.
3 | Some data providers may require authentication, filling forms, etc.
4 | We include a link to the original source of each dataset in our repository; please cite the appropriate sources in your work.
5 | '''
6 |
7 | import os
8 | import wget
9 | import zipfile
10 | import tarfile
11 | import re
12 | import platform
13 | from shutil import copyfile, rmtree
14 |
15 | def download(folderName, urlDict, key):
16 | print("downloading ", key)
17 | directoryPath = os.path.join(folderName, 'data-raw')
18 | try:
19 | os.mkdir(directoryPath)
20 |     except OSError:
21 |         print("Directory %s already exists or could not be created" % directoryPath)
22 | url = urlDict.get(key)
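    # datasets without their own entry (Recognizing Textual Entailment, Temporal Ordering, Emotion, Word Pair Similarity) fall back to the shared Snow et al. (2008) 'NLP Annotations' archive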
23 | if url is None:
24 | url = urlDict.get('NLP Annotations')
25 | for file in url:
26 | wget.download(file, directoryPath)
27 |
28 | def download_folders(folder, urlDict):
29 | for folderName, subfolders, filenames in os.walk(folder):
30 | for subfolder in subfolders:
31 | if subfolder == 'binary-classification' or subfolder == 'multi-class-classification':
32 | for currentFolder, currentSubfolders, currentFiles in os.walk(os.path.join(folderName, subfolder)):
33 | for dataset in currentSubfolders:
34 | download(os.path.join(currentFolder, dataset), urlDict, dataset)
35 | break
36 | break
37 |
38 | def extract_nested_archives(archivedFile, toFolder):
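    # extract the given archive, then rescan the destination and recursively extract any archives nested inside it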
39 | if archivedFile.endswith('.zip'):
40 | with zipfile.ZipFile(archivedFile, 'r') as zfile:
41 | zfile.extractall(path=toFolder)
42 | elif archivedFile.endswith('.tgz'):
43 | tar = tarfile.open(archivedFile, "r:gz")
44 | tar.extractall(path=toFolder)
45 | tar.close()
46 | os.remove(archivedFile)
47 |
48 | for root, dirs, files in os.walk(toFolder):
49 | if '__MACOSX' not in root:
50 | for filename in files:
51 | if re.search(r'\.zip$', filename) or re.search(r'\.tgz$', filename):
52 | fileSpec = os.path.join(root, filename)
53 | extract_nested_archives(fileSpec, root)
54 |
55 | def recursive_walk(folder, delimeter, requiredFilesList, dest):
56 | for folderName, subfolders, filenames in os.walk(folder):
57 | for file in filenames:
58 | if file in requiredFilesList:
59 | copyfile(os.path.join(folderName, file), os.path.join(dest, file))
60 | if subfolders:
61 | for subfolder in subfolders:
62 | recursive_walk(subfolder, delimeter, requiredFilesList, dest)
63 |
64 | def delete_unnecessary_files(path):
65 | for folderName, subfolders, filenames in os.walk(path):
66 | for subfolder in subfolders:
67 | rmtree(os.path.join(path, subfolder))
68 |
69 | if __name__ == "__main__":
70 |
71 | # get the current path
72 | path = '.'
73 |
74 | # define the delimeter based on the operating system
75 | delimeter = '/'
76 | if platform.system() == 'Windows':
77 | delimeter = '\\'
78 |
79 | # define the links to download datasets
80 | urlDict = {
81 | 'Blue Birds': ['https://raw.githubusercontent.com/welinder/cubam/public/demo/bluebirds/gt.yaml', 'https://raw.githubusercontent.com/welinder/cubam/public/demo/bluebirds/labels.yaml'],
82 | 'HITspam-UsingCrowdflower': ['https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/HITspam-UsingCrowdflower/gold.txt', 'https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/HITspam-UsingCrowdflower/labels.txt'],
83 | 'HITspam-UsingMTurk': ['https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/HITspam-UsingMTurk/gold.txt', 'https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/HITspam-UsingMTurk/labels.txt'],
84 | 'NLP Annotations': ['https://sites.google.com/site/nlpannotations/snow2008_mturk_data_with_orig_files_assembled_201904.zip'],
85 | 'Sentiment popularity - AMT': ['https://eprints.soton.ac.uk/376544/1/SP_amt.csv'],
86 | '2010 Crowdsourced Web Relevance Judgments': ['https://www.ischool.utexas.edu/~ml/data/trec-rf10-crowd.tgz'],
87 | 'AdultContent2': ['https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/AdultContent2/gold.txt', 'https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/AdultContent2/labels.txt'],
88 | 'AdultContent3': ['https://raw.githubusercontent.com/ipeirotis/Get-Another-Label/master/data/AdultContent3-HCOMP2010/labels.txt'],
89 | 'Weather Sentiment - AMT': ['https://eprints.soton.ac.uk/376543/1/WeatherSentiment_amt.csv'],
90 | 'Toloka Aggregation Relevance 2': ['https://tlk.s3.yandex.net/dataset/TlkAgg2.zip'],
91 | 'Toloka Aggregation Relevance 5': ['https://tlk.s3.yandex.net/dataset/TlkAgg5.zip'],
92 | 'Text Highlighting': ['https://ndownloader.figshare.com/articles/9917162/versions/2'],
93 | 'Crowdsourced Amazon Sentiment': ['https://raw.githubusercontent.com/Evgeneus/screening-classification-datasets/master/crowdsourced-amazon-sentiment-dataset/data/1k_amazon_reviews_crowdsourced.csv'],
94 | 'Crowdsourced loneliness-slr': ['https://raw.githubusercontent.com/Evgeneus/crowd-machine-collaboration-for-item-screening/master/data/amt_real_data/crowd-data.csv']
95 | }
96 |
97 | # download the datasets
98 | download_folders(path, urlDict)
99 |
100 |     # define, per dataset, the directory that stores the downloaded archive, the archive name, and the files required from it
101 | pathDict = {
102 | 'Recognizing Textual Entailment' : ['.' + delimeter + 'binary-classification' + delimeter + 'Recognizing Textual Entailment' + delimeter + 'data-raw', 'snow2008_mturk_data_with_orig_files_assembled_201904.zip', ['rte.standardized.tsv','rte1.tsv']],
103 | 'Temporal Ordering' : ['.' + delimeter + 'binary-classification' + delimeter + 'Temporal Ordering' + delimeter + 'data-raw', 'snow2008_mturk_data_with_orig_files_assembled_201904.zip', ['all.tsv', 'temp.standardized.tsv']],
104 | '2010 Crowdsourced Web Relevance Judgments' : ['.' + delimeter + 'multi-class-classification' + delimeter + '2010 Crowdsourced Web Relevance Judgments' + delimeter + 'data-raw', 'trec-rf10-crowd.tgz', ['trec-rf10-data.txt']],
105 | 'Emotion' : ['.' + delimeter + 'multi-class-classification' + delimeter + 'Emotion' + delimeter + 'data-raw', 'snow2008_mturk_data_with_orig_files_assembled_201904.zip', ['affect.tsv', 'anger.standardized.tsv', 'disgust.standardized.tsv', 'fear.standardized.tsv', 'joy.standardized.tsv', 'sadness.standardized.tsv', 'surprise.standardized.tsv', 'valence.standardized.tsv']],
106 | 'Toloka Aggregation Relevance 2' : ['.' + delimeter + 'binary-classification' + delimeter + 'Toloka Aggregation Relevance 2' + delimeter + 'data-raw', 'TlkAgg2.zip', ['crowd_labels.tsv', 'golden_labels.tsv']],
107 | 'Toloka Aggregation Relevance 5' : ['.' + delimeter + 'multi-class-classification' + delimeter + 'Toloka Aggregation Relevance 5' + delimeter + 'data-raw', 'TlkAgg5.zip', ['crowd_labels.tsv', 'golden_labels.tsv']],
108 | 'Word Pair Similarity' : ['.' + delimeter + 'multi-class-classification' + delimeter + 'Word Pair Similarity' + delimeter + 'data-raw', 'snow2008_mturk_data_with_orig_files_assembled_201904.zip', ['wordsim.standardized.tsv']],
109 | 'Text Highlighting' : ['.' + delimeter + 'binary-classification' + delimeter + 'Text Highlighting' + delimeter + 'data-raw', '9917162.zip', ['crowdsourced_highlights.csv', 'classification_tech-ML-highlights.csv', 'classification_tech-crowd-highlights.csv', 'classification_tech-6x6-crowd-highlights.csv', 'classification_tech-3x12-crowd-highlights.csv', 'classification_oa-ML-highlights.csv', 'classification_oa-crowd-highlights.csv', 'classification_amazon-ML-highlights.csv', 'classification_amazon-crowd-highlights.csv']]
110 | }
111 |
112 | # extract the required files from the archived datasets
113 | for key, value in pathDict.items():
114 | extract_nested_archives(value[0] + delimeter + value[1], value[0])
115 | if key != 'Text Highlighting':
116 | recursive_walk(value[0], delimeter, value[2], value[0])
117 | delete_unnecessary_files(value[0])
118 | else:
119 | for folderName, subfolders, filenames in os.walk(value[0]):
120 | for file in filenames:
121 | if file not in value[2]:
122 | os.remove(os.path.join(os.path.join(os.getcwd(), value[0]), file))
--------------------------------------------------------------------------------
/multi-class-classification/2010 Crowdsourced Web Relevance Judgments/README.md:
--------------------------------------------------------------------------------
1 | # 2010 Crowdsourced Web Relevance Judgments
2 |
3 | Link to the original source: https://www.ischool.utexas.edu/~ml/data/trec-rf10-crowd.tgz
4 |
5 | The dataset contains the judgments of 766 anonymized AMT workers about the relevance of English Web pages from the ClueWeb09 collection (http://lemurproject.org/clueweb09/) for English search queries drawn from the TREC 2009 Million Query track (http://ir.cis.udel.edu/million). The judgments are given on a 3-level scale: highly relevant, relevant, and non-relevant; a fourth judgment option indicated a broken link that could not be judged. The workers produced 98,453 judgments, 3,277 of which have ground truths provided by NIST. From the source given above, the `trec-rf10-data.txt` file should be copied into the `data-raw` folder.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Buckley10-notebook,
10 | author={Chris Buckley and Matthew Lease and Mark D. Smucker},
11 | title={{Overview of the TREC 2010 Relevance Feedback Track (Notebook)}},
12 | booktitle={{The Nineteenth Text Retrieval Conference (TREC) Notebook}},
13 | institute = {{National Institute of Standards and Technology (NIST)}},
14 | year={2010}
15 | }
16 | ```
17 |
--------------------------------------------------------------------------------
/multi-class-classification/AdultContent2/README.md:
--------------------------------------------------------------------------------
1 | # AdultContent2 Dataset
2 |
3 | Link to the original source: https://github.com/ipeirotis/Get-Another-Label/tree/master/data/AdultContent2
4 |
5 | This dataset contains approximately 100K individual worker judgments and the related ground truths for classification of websites into 5 categories.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Sheng:2008:GLI:1401890.1401965,
10 | author = {Sheng, Victor S. and Provost, Foster and Ipeirotis, Panagiotis G.},
11 | title = {Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers},
12 | booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
13 | series = {KDD '08},
14 | year = {2008},
15 | isbn = {978-1-60558-193-4},
16 | location = {Las Vegas, Nevada, USA},
17 | pages = {614--622},
18 | numpages = {9},
19 | url = {http://doi.acm.org/10.1145/1401890.1401965},
20 | doi = {10.1145/1401890.1401965},
21 | acmid = {1401965},
22 | publisher = {ACM},
23 | address = {New York, NY, USA},
24 | keywords = {data preprocessing, data selection},
25 | }
26 | ```
--------------------------------------------------------------------------------
/multi-class-classification/AdultContent3/README.md:
--------------------------------------------------------------------------------
1 | # AdultContent3
2 |
3 | Link to the original source: https://github.com/ipeirotis/Get-Another-Label/tree/master/data/AdultContent3-HCOMP2010
4 |
5 | This dataset contains approximately 50K individual worker judgments and the related ground truths for classification of websites into 4 categories.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Sheng:2008:GLI:1401890.1401965,
10 | author = {Sheng, Victor S. and Provost, Foster and Ipeirotis, Panagiotis G.},
11 | title = {Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers},
12 | booktitle = {Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
13 | series = {KDD '08},
14 | year = {2008},
15 | isbn = {978-1-60558-193-4},
16 | location = {Las Vegas, Nevada, USA},
17 | pages = {614--622},
18 | numpages = {9},
19 | url = {http://doi.acm.org/10.1145/1401890.1401965},
20 | doi = {10.1145/1401890.1401965},
21 | acmid = {1401965},
22 | publisher = {ACM},
23 | address = {New York, NY, USA},
24 | keywords = {data preprocessing, data selection},
25 | }
26 | ```
--------------------------------------------------------------------------------
/multi-class-classification/Emotion/README.md:
--------------------------------------------------------------------------------
1 | # Emotion Dataset
2 |
3 | Link to the original source: https://sites.google.com/site/nlpannotations/
4 |
5 | This dataset contains individual worker votes that rate the emotion of a given text along the following dimensions: anger, disgust, fear, joy, sadness, surprise, and valence. Each rating is a value from -100 to 100 for each emotion.
6 |
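Because each emotion is treated as a separate predicate, the transformation script writes one csv per emotion (e.g. `transformed_dataset_anger.csv`). A minimal sketch of reading them back, assuming `pandas` and that both repository scripts have already been run (run the snippet from this dataset's folder):

```python
import glob
import pandas as pd

# one transformed csv per emotion, e.g. transformed_dataset_anger.csv
for path in sorted(glob.glob('transformed_dataset_*.csv')):
    df = pd.read_csv(path)
    # responses are numeric ratings in [-100, 100]
    print(path, 'mean rating: %.1f' % df['response'].astype(float).mean())
```
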
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Snow:2008:CFG:1613715.1613751,
10 | author = {Snow, Rion and O'Connor, Brendan and Jurafsky, Daniel and Ng, Andrew Y.},
11 | title = {Cheap and Fast---but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks},
12 | booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
13 | series = {EMNLP '08},
14 | year = {2008},
15 | location = {Honolulu, Hawaii},
16 | pages = {254--263},
17 | numpages = {10},
18 | url = {http://dl.acm.org/citation.cfm?id=1613715.1613751},
19 | acmid = {1613751},
20 | publisher = {Association for Computational Linguistics},
21 | address = {Stroudsburg, PA, USA},
22 | }
23 | ```
--------------------------------------------------------------------------------
/multi-class-classification/Toloka Aggregation Relevance 5/readme.md:
--------------------------------------------------------------------------------
1 | # Toloka Aggregation Relevance 5
2 |
3 | Link to the original source: https://research.yandex.com/datasets/toloka
4 |
5 | This dataset contains the judgments on the relevance of a document for a query, given on a 5-grade scale. From the source given above,
6 | the `crowd_labels.tsv` and `golden_labels.tsv` files should be copied into the `data-raw` folder.
--------------------------------------------------------------------------------
/multi-class-classification/Weather Sentiment - AMT/README.md:
--------------------------------------------------------------------------------
1 | # Weather Sentiment - AMT Dataset
2 |
3 | Link to the original source: https://eprints.soton.ac.uk/376543/
4 |
5 | This dataset contains the ground truths and individual judgments of 110 workers for the sentiment of 300 tweets. The classification uses the following categories: negative (0), neutral (1), positive (2), tweet not related to weather (3), and can't tell (4). From the source given above, the `WeatherSentiment_amt.csv` file should be copied into the `data-raw` folder.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{soton376365,
10 | booktitle = {International Joint Conference on Artificial Intelligence (IJCAI-15) (31/07/15)},
11 | month = {July},
12 | title = {Bayesian modelling of community-based multidimensional trust in participatory sensing under data sparsity},
13 | author = {Matteo Venanzi and W.T.L. Teacy and Alex Rogers and Nicholas R. Jennings},
14 | year = {2015},
15 | pages = {717--724},
16 | url = {https://eprints.soton.ac.uk/376365/},
17 | abstract = {We propose a new Bayesian model for reliable aggregation of crowdsourced estimates of real-valued quantities in participatory sensing applications. Existing approaches focus on probabilistic modelling of user's reliability as the key to accurate aggregation. However, these are either limited to estimating discrete quantities, or require a significant number of reports from each user to accurately model their reliability. To mitigate these issues, we adopt a community-based approach, which reduces the data required to reliably aggregate real-valued estimates, by leveraging correlations between the reporting behaviour of users belonging to different communities. As a result, our method is up to 16.6\% more accurate than existing state-of-the-art methods and is up to 49\% more effective under data sparsity when used to estimate Wi-Fi hotspot locations in a real-world crowdsourcing application.}
18 | }
19 | ```
20 |
--------------------------------------------------------------------------------
/multi-class-classification/Word Pair Similarity/README.md:
--------------------------------------------------------------------------------
1 | # Word Pair Similarity
2 |
3 | Link to the original source: https://sites.google.com/site/nlpannotations/
4 |
5 | This dataset contains the individual worker votes that assign a numerical similarity score between 0 and 10 to a given word pair.
6 |
7 | **Cite this work as:**
8 | ```
9 | @inproceedings{Snow:2008:CFG:1613715.1613751,
10 | author = {Snow, Rion and O'Connor, Brendan and Jurafsky, Daniel and Ng, Andrew Y.},
11 | title = {Cheap and Fast---but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks},
12 | booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
13 | series = {EMNLP '08},
14 | year = {2008},
15 | location = {Honolulu, Hawaii},
16 | pages = {254--263},
17 | numpages = {10},
18 | url = {http://dl.acm.org/citation.cfm?id=1613715.1613751},
19 | acmid = {1613751},
20 | publisher = {Association for Computational Linguistics},
21 | address = {Stroudsburg, PA, USA},
22 | }
23 | ```
--------------------------------------------------------------------------------
/multi-class-classification/readme.md:
--------------------------------------------------------------------------------
1 | # multi-class-classification
2 |
3 | The following datasets are included in this folder:
4 |
5 | - `2010 Crowdsourced Web Relevance Judgments`
6 | - `AdultContent2`
7 | - `AdultContent3`
8 | - `Emotion`
9 | - `Toloka Aggregation Relevance 5`
10 | - `Weather Sentiment - AMT`
11 | - `Word Pair Similarity`
12 |
--------------------------------------------------------------------------------
/transform_datasets.py:
--------------------------------------------------------------------------------
1 | '''
2 | By using this tool you agree to acknowledge the original datasets and to check their terms and conditions.
3 | Some data providers may require authentication, filling forms, etc.
4 | We include a link to the original source of each dataset in our repository; please cite the appropriate sources in your work.
5 | '''
6 |
7 | import os
8 | import pandas as pd
9 | import re
10 | import platform
11 | import csv
12 | from itertools import islice
13 |
14 | def recursive_walk(folder, delimeter):
15 | for folderName, subfolders, filenames in os.walk(folder):
16 | dest = os.path.join(os.getcwd(), folderName.split('data-raw')[0]) + 'transformed_dataset.csv'
17 | if folderName == 'binary-classification' + delimeter + 'Blue Birds' + delimeter + 'data-raw':
18 | processBlueBirds(filenames, folderName)
19 |         elif folderName == 'binary-classification' + delimeter + 'Crowdsourced Amazon Sentiment' + delimeter + 'data-raw':
20 | processCrowdsourcedAmazonSentimentDataset(filenames, folderName)
21 |         elif folderName == 'binary-classification' + delimeter + 'Crowdsourced loneliness-slr' + delimeter + 'data-raw':
22 | processCrowdsourcedLonelinessDataset(filenames, folderName)
23 | elif folderName == 'binary-classification' + delimeter + 'HITspam-UsingCrowdflower' + delimeter + 'data-raw':
24 | processGoldAndLabelFiles(filenames, folderName, dest, None)
25 | elif folderName == 'binary-classification' + delimeter + 'HITspam-UsingMTurk' + delimeter + 'data-raw':
26 | processGoldAndLabelFiles(filenames, folderName, dest, None)
27 | elif folderName == 'binary-classification' + delimeter + 'Recognizing Textual Entailment' + delimeter + 'data-raw':
28 | processWithSeperateText(filenames, folderName, 'rte.standardized.tsv', 'rte1.tsv', dest, True)
29 | elif folderName == 'binary-classification' + delimeter + 'Sentiment popularity - AMT' + delimeter + 'data-raw':
30 | processSentiment(filenames, folderName, dest)
31 | elif folderName == 'binary-classification' + delimeter + 'Temporal Ordering' + delimeter + 'data-raw':
32 | processWithSeperateText(filenames, folderName, 'temp.standardized.tsv', 'all.tsv', dest, True)
33 | elif folderName == 'binary-classification' + delimeter + 'Text Highlighting' + delimeter + 'data-raw':
34 | processTextHighlightingDataset(filenames, folderName)
35 | elif folderName == 'multi-class-classification' + delimeter + '2010 Crowdsourced Web Relevance Judgments' + delimeter + 'data-raw':
36 | processTopicDocument(filenames, folderName, dest)
37 | elif folderName == 'multi-class-classification' + delimeter + 'AdultContent2' + delimeter + 'data-raw':
38 | processGoldAndLabelFiles(filenames, folderName, dest, None)
39 | elif folderName == 'multi-class-classification' + delimeter + 'AdultContent3' + delimeter + 'data-raw':
40 | processGoldAndLabelFiles(filenames, folderName, dest, None)
41 | elif folderName == 'multi-class-classification' + delimeter + 'Weather Sentiment - AMT' + delimeter + 'data-raw':
42 | processSentiment(filenames, folderName, dest)
43 | elif folderName == 'multi-class-classification' + delimeter + 'Emotion' + delimeter + 'data-raw':
44 | processEmotionDataset(filenames, folderName)
45 | elif folderName == 'binary-classification' + delimeter + 'Toloka Aggregation Relevance 2' + delimeter + 'data-raw':
46 | processGoldAndLabelFiles(filenames, folderName, dest, 'Toloka')
47 | elif folderName == 'multi-class-classification' + delimeter + 'Toloka Aggregation Relevance 5' + delimeter + 'data-raw':
48 | processGoldAndLabelFiles(filenames, folderName, dest, 'Toloka')
49 | elif folderName == 'multi-class-classification' + delimeter + 'Word Pair Similarity' + delimeter + 'data-raw':
50 | processWithSeperateText(filenames, folderName, 'wordsim.standardized.tsv', None, dest, False)
51 | else:
52 | for subfolder in subfolders:
53 | recursive_walk(subfolder, delimeter)
54 |
55 | def processBlueBirds(filenames, folderName):
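    # gt.yaml maps taskID -> gold label; labels.yaml maps workerID -> {taskID: vote}; both are parsed manually, block by block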
56 | gt_dict = {}
57 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
58 | df = pd.DataFrame([], columns=columns)
59 | for file in sorted(filenames):
60 | if file == 'gt.yaml':
61 | for title, block in blueBirdBlocks(os.path.join(folderName, file)):
62 | block = block.replace('{', '')
63 | block = block.replace('}', '')
64 | block = block.replace('\n', '')
65 | block = block.replace(' ', '')
66 | gt_dict = dict(item.split(":") for item in block.split(","))
67 | if file == 'labels.yaml':
68 | for title, block in blueBirdBlocks(os.path.join(folderName, file)):
69 | block = block.replace('\n', '')
70 | splitted_block = block.split(": {")
71 | splitted_block[1] = splitted_block[1].replace('}', '')
72 | new_dict = dict(item.split(": ") for item in splitted_block[1].split(","))
73 | for key in new_dict.keys():
74 | row = [splitted_block[0], key.strip(), new_dict.get(key), gt_dict.get(key.strip()), None]
75 | dfRow = pd.DataFrame([row], columns=columns)
76 | df = df.append(dfRow)
77 |
78 | df.to_csv(os.path.join(os.getcwd(), folderName.split('data-raw')[0]) + 'transformed_dataset.csv', index=None, header=True)
79 |
80 |
81 | def blueBirdBlocks(filename):
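    # yield (title, block) pairs, where block is the "{...}" body of one top-level YAML entry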
82 | title, block = '', None
83 | with open(filename) as fp:
84 | for line in fp:
85 | if '{' in line:
86 | block = line
87 | elif block is not None:
88 | block += line
89 | else:
90 | title = line
91 | if '}' in line:
92 | yield title, block
93 | title, block = '', None
94 |
95 | def processGoldAndLabelFiles(filenames, folderName, dest, dataset):
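    # gold.txt / golden_labels.tsv hold "taskID<TAB>goldLabel" rows; labels.txt / crowd_labels.tsv hold "workerID<TAB>taskID<TAB>response" rows
    # the gold file must be read first, hence the reverse sort for the Toloka datasets (golden_labels.tsv before crowd_labels.tsv)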
96 | if dataset == 'Toloka':
97 | filenames = sorted(filenames, reverse=True)
98 | else:
99 | filenames = sorted(filenames)
100 | gt_dict = {}
101 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
102 | df = pd.DataFrame([], columns=columns)
103 | for file in filenames:
104 | if file == 'gold.txt' or file == 'golden_labels.tsv':
105 | with open(os.path.join(folderName, file)) as fp:
106 | rows = ( line.split('\t') for line in fp )
107 | gt_dict = { row[0]:row[1] for row in rows }
108 | if file == 'labels.txt' or file == 'crowd_labels.tsv':
109 | with open(os.path.join(folderName, file)) as fp:
110 | for line in fp:
111 | label_list = re.split(r'\t+', line)
112 | goldLabel = None
113 | if gt_dict.get(label_list[1]) != None:
114 | goldLabel = gt_dict.get(label_list[1]).strip()
115 | row = [label_list[0], label_list[1], label_list[2].strip(), goldLabel, None]
116 | dfRow = pd.DataFrame([row], columns=columns)
117 | df = df.append(dfRow)
118 | df.to_csv(dest, index=None, header=True)
119 |
120 | def processWithSeperateText(filenames, folderName, file1, file2, dest, isTextExist):
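    # file1 holds one vote per line (workerID, taskID, response, gold label); file2, when present, maps taskID to the task text and is used to fill taskContent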
121 | if file2 == 'rte1.tsv':
122 | filenames = sorted(filenames)
123 | else:
124 | filenames = sorted(filenames, reverse=True)
125 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
126 | df = pd.DataFrame([], columns=columns)
127 | for file in filenames:
128 | with open(os.path.join(folderName, file)) as fp:
129 | next(fp)
130 | for line in fp:
131 | label_list = re.split(r'\t+', line)
132 | if file == file1:
133 | row = [label_list[1], label_list[2], label_list[3], label_list[4].strip(), None]
134 | dfRow = pd.DataFrame([row], columns=columns)
135 | df = df.append(dfRow)
136 | if file == file2 and isTextExist:
137 | if file2 == 'rte1.tsv':
138 | df.loc[df['taskID'] == label_list[0], ['taskContent']] = label_list[3]
139 | else:
140 | df.loc[df['taskID'] == label_list[0], ['taskContent']] = label_list[4]
141 | df.to_csv(dest, index=None, header=True)
165 |
166 | def processSentiment(filenames, folderName, dest):
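    # each line of the raw csv: workerID,taskID,response,goldLabel,timeSpent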
167 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent', 'timeSpent']
168 | df = pd.DataFrame([], columns=columns)
169 | for file in sorted(filenames):
170 | with open(os.path.join(folderName, file)) as fp:
171 | for line in fp:
172 | label_list = re.split(r',', line)
173 | row = [label_list[0], label_list[1], label_list[2], label_list[3], None, label_list[4]]
174 | dfRow = pd.DataFrame([row], columns=columns)
175 | df = df.append(dfRow)
176 | df.to_csv(dest, index=None, header=True)
177 |
178 | def processTopicDocument(filenames, folderName, dest):
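    # trec-rf10-data.txt rows: topicID, workerID, docID, gold, label; a task is the (topic, document) pair, keyed as "topicID_docID"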
179 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
180 | df = pd.DataFrame([], columns=columns)
181 | for file in sorted(filenames):
182 | if file == 'trec-rf10-data.txt':
183 | with open(os.path.join(folderName, file)) as fp:
184 | next(fp)
185 | for line in fp:
186 | label_list = re.split(r'\t+', line)
187 | task = [label_list[0], label_list[2]]
188 | row = [label_list[1], '_'.join(task), label_list[4].strip(), label_list[3], None]
189 | dfRow = pd.DataFrame([row], columns=columns)
190 | df = df.append(dfRow)
191 | df.to_csv(dest, index=None, header=True)
192 |
193 | def processEmotionDataset(filenames, folderName):
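    # affect.tsv maps taskID -> task text; each emotion has its own *.standardized.tsv, written to its own transformed csv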
194 | text_dict = {}
195 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
196 | df = pd.DataFrame([], columns=columns)
197 | for file in sorted(filenames):
198 | if file == "affect.tsv":
199 | with open(os.path.join(folderName, file)) as fp:
200 | rows = ( line.split('\t') for line in fp )
201 | text_dict = { row[0]:row[1] for row in rows }
202 |         else:
203 |             # start a fresh frame for each emotion file, so each csv contains only that emotion's votes
204 |             df = pd.DataFrame([], columns=columns)
205 |             dest = os.path.join(os.getcwd(), folderName.split('data-raw')[0]) + 'transformed_dataset_' + file.split('.')[0] + '.csv'
204 | with open(os.path.join(folderName, file)) as fp:
205 | next(fp)
206 | for line in fp:
207 | label_list = re.split(r'\t+', line)
208 | row = [label_list[1], label_list[2], label_list[3], label_list[4].strip(), text_dict.get(label_list[2])]
209 | dfRow = pd.DataFrame([row], columns=columns)
210 | df = df.append(dfRow)
211 | df.to_csv(dest, index=None, header=True)
212 |
213 | def processTextHighlightingDataset(filenames, folderName):
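    # crowdsourced_highlights.csv is missing one of the middle columns, so its later column indices shift down by one ("minus")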
214 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
215 |     for file in sorted(filenames):
216 |         df = pd.DataFrame([], columns=columns)  # reset per file so each csv contains only its own rows
217 |         minus = 0
218 | if file == 'crowdsourced_highlights.csv':
219 | minus = 1
220 | with open(os.path.join(folderName, file), encoding="utf8") as csv_file:
221 | dest = os.path.join(os.getcwd(), folderName.split('data-raw')[0]) + 'transformed_dataset_' + file.split('.')[0] + '.csv'
222 | next(csv_file)
223 | csv_reader = csv.reader(csv_file, delimiter=',')
224 | for line in csv_reader:
225 | if line[11 - minus] == 'True':
226 | row = [line[12 - minus], line[0], line[15 - minus], line[15 - minus], line[2]]
227 | else:
228 | row = [line[12 - minus], line[0], line[15 - minus], None, line[2]]
229 | dfRow = pd.DataFrame([row], columns=columns)
230 | df = df.append(dfRow)
231 | df.to_csv(dest, index=None, header=True)
232 |
233 |
234 | def processCrowdsourcedAmazonSentimentDataset(filenames, folderName):
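    # one raw csv, two predicates: is_book (response col 14, gold col 23) and is_negative (response col 15, gold col 25); worker is col 9, task id col 0, review text col 27; one transformed csv per predicate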
235 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
236 | for file in sorted(filenames):
237 | for i in range(2):
238 | df = pd.DataFrame([], columns=columns)
239 | with open(os.path.join(folderName, file), encoding="utf8") as csv_file:
240 | filePostfix = 'is_book'
241 | responseIndex = 14
242 | goldenIndex = 23
243 | if (i == 1):
244 | filePostfix = 'is_negative'
245 | responseIndex = 15
246 | goldenIndex = 25
247 | dest = os.path.join(os.getcwd(),
248 | folderName.split('data-raw')[0]) + 'transformed_dataset_' + filePostfix + '.csv'
249 | next(csv_file)
250 | csv_reader = csv.reader(csv_file, delimiter=',')
251 | for line in csv_reader:
252 | row = [line[9], line[0], line[responseIndex], line[goldenIndex], line[27]]
253 | dfRow = pd.DataFrame([row], columns=columns)
254 | df = df.append(dfRow)
255 | df.to_csv(dest, index=None, header=True)
256 |
257 |
258 | def processCrowdsourcedLonelinessDataset(filenames, folderName):
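    # three predicates, each stored as its own (taskID, workerID, response) column triple: intervention (cols 0-2), use_of_tech (cols 4-6), older_adult (cols 8-10), with gold labels in cols 14-16; one transformed csv per predicate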
259 | gt_dict = {}
260 | for file in sorted(filenames):
261 | with open(os.path.join(folderName, file), encoding="utf8") as csv_file:
262 | next(csv_file)
263 | csv_reader = csv.reader(csv_file, delimiter=',')
264 | gt_dict = {row[0] + 'intervention': row[14] for row in islice(csv_reader, 500)}
265 |
266 | columns = ['workerID', 'taskID', 'response', 'goldLabel', 'taskContent']
267 | for file in sorted(filenames):
268 | for i in range(0, 9, 4):
269 | df = pd.DataFrame([], columns=columns)
270 | with open(os.path.join(folderName, file), encoding="utf8") as csv_file:
271 | filePostfix = 'intervention'
272 | goldenIndex = 14
273 | if (i == 4):
274 | filePostfix = 'use_of_tech'
275 | goldenIndex = 15
276 | elif (i == 8):
277 | filePostfix = 'older_adult'
278 | goldenIndex = 16
279 | dest = os.path.join(os.getcwd(),
280 | folderName.split('data-raw')[0]) + 'transformed_dataset_' + filePostfix + '.csv'
281 | next(csv_file)
282 | csv_reader = csv.reader(csv_file, delimiter=',')
283 | for line in csv_reader:
284 | row = []
285 | if (i == 0):
286 | row = [line[i + 1], line[i], line[i + 2], gt_dict.get(line[i] + 'intervention'), None]
287 | else:
288 | row = [line[i + 1], line[i], line[i + 2], line[goldenIndex], None]
289 | dfRow = pd.DataFrame([row], columns=columns)
290 | df = df.append(dfRow)
291 | df.to_csv(dest, index=None, header=True)
292 |
293 |
294 | if __name__ == '__main__':
295 | # get the current path
296 | path = '.'
297 |
298 | # define the delimeter based on the operating system
299 | delimeter = '/'
300 | if platform.system() == 'Windows':
301 | delimeter = '\\'
302 |
303 | # walk through the directories and transform all the datasets
304 | recursive_walk(path, delimeter)
305 |
306 |
307 |
308 |
--------------------------------------------------------------------------------