└── readme.md

/readme.md:
--------------------------------------------------------------------------------

Books & Documents:

https://huggingface.co/datasets/the_pile_books3

Description:
This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.

It contains all of Bibliotik in plain .txt form, i.e. 197,000 books processed in exactly the same way as bookcorpusopen (a.k.a. books1).

On s3: Not yet

Converted to training format: not yet


https://the-eye.eu/libraries.html

Description:
libgen & zlib

On s3: Yes

Converted to training format: not yet


https://archive.org/details/fanfictiondotnet_repack

https://archive.org/details/Fanfictiondotnet1011dump

fanfiction.net IDs 11M+ should get scraped

Description:
dump of fanfiction.net
Many short stories, books, ...

On s3: Yes

Converted to training format: not yet


https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent

Description:
16M ebooks from the Internet Archive (IA)

On s3: Not yet

Converted to training format: not yet


https://the-eye.eu/public/Books/

Description:
5M+ ebooks from different domains

On s3: Not yet

Converted to training format: not yet


all ebook torrents from piratebay: https://pirate-bays.net/search?q=ebooks

Description:
many different ebook torrents

On s3: Not yet

Converted to training format: not yet


https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/

Description:
many TV captions / subtitles - need to be checked

On s3: Yes

Converted to training format: not yet


https://huggingface.co/datasets/bookcorpusopen

https://huggingface.co/datasets/demelin/moral_stories


Subs:

https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst

Description:
many TV captions / subtitles - need to be checked

On s3: Yes

Converted to training format: not yet


Large-scale Webtext:

https://huggingface.co/datasets/oscar

https://huggingface.co/datasets/mc4

https://huggingface.co/datasets/the_pile

https://huggingface.co/datasets/spanish_billion_words

https://huggingface.co/datasets/arabic_billion_words

https://huggingface.co/datasets/olm/wikipedia

https://huggingface.co/datasets/cc100


https://files.pushshift.io/reddit/comments/
https://arxiv.org/abs/2001.08435

Description:
Reddit comment dumps

On s3: Not yet

Converted to training format: not yet


https://the-eye.eu/public/social/twitter/


Code:

https://huggingface.co/datasets/bigcode/the-stack-dedup

https://huggingface.co/datasets/code_search_net

https://huggingface.co/datasets/codeparrot/github-code


Law:

https://openreview.net/forum?id=3HCT3xfNm9r
https://huggingface.co/datasets/pile-of-law/pile-of-law


Scientific papers:


Translation:

https://huggingface.co/datasets/opus100

--------------------------------------------------------------------------------