├── .github
│   └── ISSUE_TEMPLATE
│       └── dataset-request.md
├── README.md
└── doc
    └── sources
        └── TEMPLATE.md

/.github/ISSUE_TEMPLATE/dataset-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Dataset request
3 | about: Suggest a source of data to include
4 | title: "[Data Source] Add ..."
5 | labels: enhancement
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | **Name of source?**
11 | The name of the data source, e.g. Wikipedia, Syosetu ni Narou, Twitter, or userABC on Reddit.
12 | 
13 | **URL of source?**
14 | Where the data can be found; multiple links are acceptable if necessary.
15 | 
16 | **Source language?**
17 | The language the data is in, e.g. English, Japanese, Spanish. If it's code, specify the programming language (C, Java, etc.). If it's a known mixture of languages, list all of them and, if possible, how they're separated. If it's an unknown mix, just put "mixed."
18 | 
19 | **Data format?**
20 | The format the data is already in, e.g. wikitext, HTML, Markdown, PDF.
21 | 
22 | **Should the data be converted to another format?**
23 | Whether it's appropriate to convert the data to plain text or Markdown. Usually "yes," but if you're submitting a dataset of code, for example, it would be inappropriate to strip HTML tags from it.
24 | 
25 | **Notes for scraping?**
26 | Any notes or useful information for the scraping process, if applicable, e.g. "there's an API that returns a list of pages at /api/v1/list" or "the site rate limits heavily, so you may need to use multiple IPs." If the data is already available as a convenient archive or download, mention that here.
27 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BigKnow2022
2 | 
3 | BigKnow2022 is an up-to-date dataset, currently in development, intended for
4 | training language models. Its contents are primarily derived from various websites.
5 | 
6 | **We are currently seeking contributors to assist in dataset collection and
7 | filtering.**
8 | 
9 | ## Rationale
10 | 
11 | Existing large-scale datasets such as the Pile and vanilla Common Crawl
12 | have notable downsides:
13 | 
14 | * Insufficient attention given to non-English and non-European languages.
15 |   In particular, CJK languages tend to be neglected the most.
16 | 
17 | * Information that is now two or more years out of date.
18 | 
19 | ## Sources
20 | 
21 | We are currently planning to include the following corpora:
22 | 
23 | * English
24 |   - [x] Wikia/Fandom
25 |   - [ ] AO3
26 |   - [ ] Wikipedia (March 2023 dump)
27 | * Japanese
28 |   - [x] Pixiv Encyclopedia (dic.pixiv.net)
29 |   - [x] Wikipedia (March 2023 dump)
30 |   - [ ] NicoNico Encyclopedia (dic.nicovideo.jp)
31 |   - [ ] Syosetu ni Narou (syosetu.com)
32 |   - [ ] FC2 (Japanese blog hosting)
33 |   - [ ] atwiki/@WIKI (Japanese wiki hosting)
34 | * Korean
35 |   - [x] Namu News
36 |   - [ ] Namu Wiki
37 |   - [x] Wikipedia (March 2023 dump)
38 |   - [ ] KAIST Corpus? (http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)
39 | * German
40 |   - [ ] Transcripts of the German Bundestag (federal parliament)
41 |   - [ ] Court decisions and other legal documents
42 |   - [ ] Wikipedia?
43 | * Mixed
44 |   - [ ] Wikis from WikiApiary
45 |   - [ ] Wiktionary?
46 | 
47 | *This list is subject to later expansion.*
48 | 
--------------------------------------------------------------------------------
/doc/sources/TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Remove this line: template for information on a data source.*
2 | 
3 | # [Source Name]
4 | 
5 | * URL: [url]
6 | * Language: [language or "mixed"]
7 | * Status: [waiting|downloading|processing|ready]
8 | 
9 | [Brief description]
10 | 
11 | ## Inclusion Rationale
12 | 
13 | [Why should it be added?]
14 | 
15 | ## Format
16 | 
17 | [Information on source format and possible scraping methodology]
--------------------------------------------------------------------------------