├── .gitmodules
├── README.md
└── datasets
    └── README.md

/.gitmodules:
--------------------------------------------------------------------------------
[submodule "models"]
	path = models
	url = git@github.com:bit-ml/Dupin.git

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# VeriDark Authorship Benchmark

The VeriDark benchmark contains several large-scale authorship verification and identification datasets, intended to facilitate research into authorship analysis in general and to enable building tools for the cybersecurity domain in particular.

The datasets are detailed [here](https://github.com/bit-ml/VeriDark/tree/master/datasets), while the BERT-based baseline code can be found [here](https://github.com/bit-ml/Dupin/tree/434eee096324f82adf6739d4db8e147428693b92). The same baseline repository is included as the `models` git submodule defined in `.gitmodules`, so cloning with `git clone --recurse-submodules` fetches it together with this repository.

--------------------------------------------------------------------------------
/datasets/README.md:
--------------------------------------------------------------------------------
# VeriDark Authorship Analysis Datasets

### Requesting the datasets
Due to ethical concerns regarding the potential misuse of our benchmark, we require additional information before granting permission to use our datasets, which you can submit in the Zenodo request-access forms (see the Zenodo links below). We request the following information:

1. The name of the person requesting access, together with their affiliation, job title, and e-mail address. If the person holds an institutional e-mail address, we strongly recommend using it instead of a personal one.

2. The intended usage of the dataset.

3. An acknowledgement that the dataset will be used strictly in an ethical manner. Non-ethical uses of the dataset include, but are not limited to:
   * using the datasets for language modeling or similar generative algorithms.
   * building algorithms that could aid criminals in evading law enforcement organizations.
   * building algorithms that aim to unmask undercover law enforcement agents.
   * building algorithms that could interfere with the activity of law enforcement agencies.
   * building algorithms that could lead to violating any article of the United Nations Universal Declaration of Human Rights.
   * building algorithms for exposing the identity of reporters, political figures, leakers, whistleblowers, dissidents, or other persons seeking to express an opinion about what they perceive to be an injustice, regardless of what that injustice may be.
   * building algorithms that can help entities discriminate against, or exacerbate bias against, other persons on the basis of race, color, religion, gender, gender expression, age, national origin, familial status, ancestry, culture, disability, political views, sexual orientation, marital status, military status, social status, or other protected characteristics.

We strongly encourage the inclusion of an ethical statement and discussion in any work based on this dataset.
We do not encourage distribution of the dataset in its current form to any other parties without our consent.

DISCLAIMER: Any personal information provided when requesting access to the dataset will be used solely to decide whether access should be granted. We will not disclose your personal data.

| dataset | train | val | test | task | link |
|---------|-------|-----|------|------|------|
| MiniDarkReddit | 204 | 412 | 412 | Authorship Verification | [Google Drive link](https://drive.google.com/file/d/1ok_CY59RhD0GgJqF1OOZMN592Zp9fgOY/view?usp=sharing) |
| DarkReddit+ | 6817 | 2275 | 2276 | Authorship Identification | [Zenodo link](https://zenodo.org/record/6998363) |
| DarkReddit+ | 106252 | 6124 | 6633 | Authorship Verification | [Zenodo link](https://zenodo.org/record/6998375) |
| SilkRoad1 | 614656 | 34300 | 32255 | Authorship Verification | [Zenodo link](https://zenodo.org/record/6998371) |
| Agora | 4195381 | 216570 | 219171 | Authorship Verification | [Zenodo link](https://zenodo.org/record/7018853) |

## DarkReddit dataset for Authorship Verification (AV)
This dataset was created by crawling comments from the `/r/darknet` subreddit. It is small and was introduced in this [paper](https://arxiv.org/abs/2112.05125) to assess how well training on the PAN authorship verification datasets transfers to a smaller dataset.

## DarkReddit+ dataset for Authorship Verification (AV)
This dataset contains same-author (SA) and different-author (DA) pairs of comments from the defunct `/r/darknetmarkets` subreddit. Specifically, the comments were retrieved from the large Reddit comment [dataset](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/). We removed comments with fewer than 200 characters. We split the authors into three disjoint sets of train, validation, and test authors, making authorship verification an open-set task. The SA and DA classes are balanced in all splits; the sketch below illustrates the construction.
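The following is a minimal sketch of such a construction, assuming comments have already been grouped per author. It is illustrative only: the helper name `build_av_pairs`, the 80/10/10 author split, and the one-SA-pair-per-author pairing strategy are assumptions, not the released pipeline.

```python
# Illustrative sketch (NOT the released pipeline): filter short comments,
# split AUTHORS (not comments) into disjoint train/val/test sets, then
# build balanced same-author (SA) / different-author (DA) pairs per split.
import random
from itertools import combinations

MIN_CHARS = 200  # comments shorter than this are removed (per the text)

def build_av_pairs(comments_by_author, seed=0):
    """comments_by_author: dict mapping author name -> list of comments."""
    rng = random.Random(seed)
    by_author = {a: [c for c in cs if len(c) >= MIN_CHARS]
                 for a, cs in comments_by_author.items()}
    # An author needs at least two comments to form an SA pair.
    by_author = {a: cs for a, cs in by_author.items() if len(cs) >= 2}

    # Author-disjoint splits: each author appears in exactly one split,
    # so test-time authors are unseen (open-set verification).
    authors = sorted(by_author)
    rng.shuffle(authors)
    n = len(authors)
    split_authors = {"train": authors[:int(0.8 * n)],  # ratios assumed
                     "val":   authors[int(0.8 * n):int(0.9 * n)],
                     "test":  authors[int(0.9 * n):]}

    splits = {}
    for name, members in split_authors.items():
        sa = []
        for a in members:
            c1, c2 = rng.sample(by_author[a], 2)
            sa.append((c1, c2, 1))                    # label 1 = same author
        da = []
        for a, b in combinations(members, 2):
            if len(da) >= len(sa):                    # keep SA/DA balanced
                break
            da.append((rng.choice(by_author[a]),
                       rng.choice(by_author[b]), 0))  # label 0 = different
        pairs = sa + da
        rng.shuffle(pairs)
        splits[name] = pairs
    return splits
```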
## DarkReddit+ dataset for Authorship Identification (AI)
This dataset is drawn from the same Reddit comment dataset as above. Specifically, we retrieved users who wrote both in the `/r/darknetmarkets` subreddit and in other subreddits (`clearReddit`). We then removed comments with fewer than 200 characters, as well as users with fewer than 5 comments in either `/r/darknetmarkets` or `clearReddit`. From the remaining users, we selected the 10 most active (most comments) in `/r/darknetmarkets`. The total dataset contains ~11k samples across the train, validation, and test splits. The task is 10-way classification: given a user comment, the model predicts which of the 10 users wrote it. A sketch of this selection procedure appears at the end of this README.

## Agora dataset for Authorship Verification (AV)
The Agora dataset was collected from the [Agora](https://archive.org/download/dnmarchives/agora-forums.tar.xz) marketplace forum data, which was obtained from the [Darknet market archives](https://www.gwern.net/DNM-archives).

## SilkRoad1 dataset for Authorship Verification (AV)
The SilkRoad1 dataset was collected from the [SilkRoad1](https://archive.org/download/dnmarchives/silkroad1-forums.tar.xz) marketplace forum data, which was obtained from the [Darknet market archives](https://www.gwern.net/DNM-archives).
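As referenced in the DarkReddit+ identification section above, here is a minimal sketch of the selection steps. Again, this is illustrative only: the helper name `select_ai_samples` and the `(author, subreddit, text)` tuple layout are assumptions, not the released pipeline.

```python
# Illustrative sketch (NOT the released pipeline): drop comments under 200
# characters, require at least 5 comments on both the darknet and clearReddit
# sides, then keep the 10 most active /r/darknetmarkets users as classes.
from collections import Counter

MIN_CHARS = 200
MIN_COMMENTS = 5
NUM_CLASSES = 10

def select_ai_samples(comments):
    """comments: iterable of (author, subreddit, text) tuples (assumed)."""
    kept = [(a, s, t) for a, s, t in comments if len(t) >= MIN_CHARS]

    # Per-author comment counts inside and outside /r/darknetmarkets.
    dark = Counter(a for a, s, _ in kept if s == "darknetmarkets")
    clear = Counter(a for a, s, _ in kept if s != "darknetmarkets")

    # Users need at least 5 comments in BOTH darknetmarkets and clearReddit;
    # the 10 most active in /r/darknetmarkets become the class labels.
    eligible = [a for a in dark
                if dark[a] >= MIN_COMMENTS and clear[a] >= MIN_COMMENTS]
    top = sorted(eligible, key=lambda a: dark[a], reverse=True)[:NUM_CLASSES]
    label = {a: i for i, a in enumerate(top)}

    # 10-way classification samples: (comment text, author class id).
    # Whether both darknet and clearReddit comments enter the final dataset
    # is an assumption here; the text does not pin this down.
    return [(t, label[a]) for a, _, t in kept if a in label]
```

--------------------------------------------------------------------------------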