└── readme.md


/readme.md:
--------------------------------------------------------------------------------
  1 | 
  2 | Books & Documents:
  3 | 
  4 | 
  5 | https://huggingface.co/datasets/the_pile_books3
  6 | 
  7 | Description:
  8 | This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.
  9 | 
 10 | This dataset contains all of Bibliotik in plain .txt form: about 197,000 books, processed in exactly the same way as bookcorpusopen (a.k.a. books1).
 11 | 
 12 | On s3: Not yet.
 13 | 
 14 | Converted to training format: not yet
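
Since the books are described as plain .txt files, the conversion step could look roughly like the sketch below. The target "training format" is not specified in this list, so JSON Lines with a `text` field is an assumption, and the function name `txt_dir_to_jsonl` and the `source` field are hypothetical.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(src_dir: str, out_path: str) -> int:
    """Write one JSON object per non-empty .txt file; return the record count.

    Assumed output format (not defined by this list): JSON Lines,
    one {"source": ..., "text": ...} object per book.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Walk the directory tree in a stable order so runs are reproducible.
        for path in sorted(Path(src_dir).rglob("*.txt")):
            # Undecodable bytes are replaced rather than aborting the whole run.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            record = {"source": path.name, "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `txt_dir_to_jsonl("books3/", "books3.jsonl")` (both paths are placeholders) would produce a single file that is easy to upload to s3 and to stream line by line during training.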
 15 | 
 16 | 
 17 | 
 18 | https://the-eye.eu/libraries.html
 19 | 
 20 | Description:
 21 | libgen & zlib
 22 | 
 23 | On s3: Yes
 24 | Converted to training format: not yet
 25 | 
 26 | 
 27 | https://archive.org/details/fanfictiondotnet_repack
 28 | 
 29 | https://archive.org/details/Fanfictiondotnet1011dump
 30 | 
 31 | Stories on fanfiction.net with IDs of 11M and above still need to be scraped.
 32 | 
 33 | Description:
 34 | Dump of fanfiction.net.
 35 | Many short stories, books, ...
 36 | 
 37 | On s3: Yes
 38 | 
 39 | Converted to training format: not yet
 40 | 
 41 | 
 42 | 
 43 | 
 44 | https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent
 45 | 
 46 | Description:
 47 | 16M ebooks from the Internet Archive (IA)
 48 | 
 49 | On s3: Not Yet
 50 | 
 51 | Converted to training format: not yet
 52 | 
 53 | 
 54 | 
 55 | 
 56 | https://the-eye.eu/public/Books/
 57 | Description:
 58 | 5M+ ebooks from different domains
 59 | 
 60 | On s3: Not Yet
 61 | 
 62 | Converted to training format: not yet
 63 | 
 64 | 
 65 | All ebook torrents from The Pirate Bay: https://pirate-bays.net/search?q=ebooks
 66 | Description:
 67 | Many different ebook torrents
 68 | 
 69 | On s3: Not Yet
 70 | 
 71 | Converted to training format: not yet
 72 | 
 73 | 
 74 | 
 75 | https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/
 76 | 
 77 | https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/
 78 | 
 79 | 
 80 | Description:
 81 | many TV captions / subtitles - need to be checked
 82 | 
 83 | On s3: Yes.
 84 | 
 85 | Converted to training format: not yet
 86 | 
 87 | 
 88 | 
 89 | https://huggingface.co/datasets/bookcorpusopen
 90 | 
 91 | https://huggingface.co/datasets/demelin/moral_stories
 92 | 
 93 | Subs:
 94 | https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst
 95 | 
 96 | Description:
 97 | many TV captions / subtitles - need to be checked
 98 | 
 99 | On s3: Yes.
100 | 
101 | Converted to training format: not yet
102 | 
103 | 
104 | 
105 | Large-scale Webtext:
106 | 
107 | https://huggingface.co/datasets/oscar
108 | 
109 | https://huggingface.co/datasets/mc4
110 | 
111 | https://huggingface.co/datasets/the_pile
112 | 
113 | https://huggingface.co/datasets/spanish_billion_words
114 | 
115 | https://huggingface.co/datasets/arabic_billion_words
116 | 
117 | https://huggingface.co/datasets/olm/wikipedia
118 | 
119 | https://huggingface.co/datasets/cc100
120 | 
121 | 
122 | https://files.pushshift.io/reddit/comments/ 
123 | https://arxiv.org/abs/2001.08435
124 | 
125 | Description:
126 | Reddit comments dumps
127 | 
128 | On s3: Not yet
129 | 
130 | Converted to training format: not yet
131 | 
132 | 
133 | 
134 | https://the-eye.eu/public/social/twitter/
135 | 
136 | 
137 | 
138 | Code:
139 | https://huggingface.co/datasets/bigcode/the-stack-dedup
140 | 
141 | https://huggingface.co/datasets/code_search_net
142 | 
143 | https://huggingface.co/datasets/codeparrot/github-code
144 | 
145 | Law:
146 | https://openreview.net/forum?id=3HCT3xfNm9r
147 | https://huggingface.co/datasets/pile-of-law/pile-of-law
148 | 
149 | 
150 | 
151 | Scientific papers:
152 | 
153 | Translation:
154 | https://huggingface.co/datasets/opus100
155 | 
156 | 


--------------------------------------------------------------------------------