├── .github
│   └── ISSUE_TEMPLATE
│       └── dataset-request.md
├── README.md
└── doc
    └── sources
        └── TEMPLATE.md

/.github/ISSUE_TEMPLATE/dataset-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Dataset request
3 | about: Suggest a source of data to include
4 | title: "[Data Source] Add ..."
5 | labels: enhancement
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | **Name of source?**
11 | The name of the data source, e.g. Wikipedia, Syosetu ni Narou, Twitter, or userABC on Reddit.
12 | 
13 | **URL of source?**
14 | Where the data can be found; multiple links are acceptable if necessary.
15 | 
16 | **Source language?**
17 | The language the data is in, e.g. English, Japanese, Spanish. If it's code, specify the programming language (C, Java, etc.). If it's a known mixture of languages, list all of them and, if possible, how they're separated. If it's an unknown mix, just put "mixed."
18 | 
19 | **Data format?**
20 | The format the data is already in, e.g. wikitext, HTML, Markdown, PDF.
21 | 
22 | **Should the data be converted to another format?**
23 | Whether it's appropriate to convert the data to plain text or Markdown. Usually "yes," but if you're submitting a dataset of code, for example, it would be inappropriate to strip HTML tags from it.
24 | 
25 | **Notes for scraping?**
26 | Any notes or useful information for the scraping process, if applicable, e.g. "there's an API that returns a list of pages at /api/v1/list" or "the site rate limits heavily, so you may need to use multiple IPs." If the data is already available as a convenient archive or download, mention that here.
27 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BigKnow2022
2 | 
3 | BigKnow2022 is an up-to-date dataset, currently in development, intended for
4 | training language models. Its contents are primarily derived from various websites.
5 | 
6 | **We are currently seeking contributors to assist in dataset collection and
7 | filtering.**
8 | 
9 | ## Rationale
10 | 
11 | Existing large-scale datasets such as the Pile and vanilla Common Crawl
12 | have notable downsides:
13 | 
14 | * Insufficient attention given to non-English and non-European languages.
15 |   In particular, CJK languages tend to be neglected the most.
16 | 
17 | * Information that is now two or more years out of date.
18 | 
19 | ## Sources
20 | 
21 | We are currently planning to include the following corpora:
22 | 
23 | * English
24 |   - [x] Wikia/Fandom
25 |   - [ ] AO3
26 |   - [ ] Wikipedia (March 2023 dump)
27 | * Japanese
28 |   - [x] Pixiv Encyclopedia (dic.pixiv.net)
29 |   - [x] Wikipedia (March 2023 dump)
30 |   - [ ] NicoNico Encyclopedia (dic.nicovideo.jp)
31 |   - [ ] Syosetu ni Narou (syosetu.com)
32 |   - [ ] FC2 (Japanese blog hosting)
33 |   - [ ] atwiki/@WIKI (Japanese wiki hosting)
34 | * Korean
35 |   - [x] Namu News
36 |   - [ ] Namu Wiki
37 |   - [x] Wikipedia (March 2023 dump)
38 |   - [ ] KAIST Corpus? (http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus)
39 | * German
40 |   - [ ] Transcripts of the German Bundestag (federal parliament)
41 |   - [ ] Court decisions and other legal documents
42 |   - [ ] Wikipedia?
43 | * Mixed
44 |   - [ ] Wikis from WikiApiary
45 |   - [ ] Wiktionary?
46 | 
47 | *This list is subject to later expansion.*
48 | 
--------------------------------------------------------------------------------
/doc/sources/TEMPLATE.md:
--------------------------------------------------------------------------------
1 | *Remove this line: template for information on a data source.*
2 | 
3 | # [Source Name]
4 | 
5 | * URL: [url]
6 | * Language: [language or "mixed"]
7 | * Status: [waiting|downloading|processing|ready]
8 | 
9 | [Brief description]
10 | 
11 | ## Inclusion Rationale
12 | 
13 | [Why should it be added?]
14 | 
15 | ## Format
16 | 
17 | [Information on source format and possible scraping methodology]
--------------------------------------------------------------------------------