├── .gitmodules ├── LICENSE ├── README.md ├── analysis_scripts ├── README.md ├── duplicates.py ├── term_counts.py ├── timestamp_dist.py └── url_dist.py ├── pipeline_scripts ├── common_crawl │ ├── README.md │ ├── apply_bigscience_filters.py │ ├── combine_last_modified_with_text_dataset.py │ ├── deduplicate.py │ ├── download_common_crawl.py │ ├── download_pipeline_processing_models.sh │ ├── experimental │ │ ├── add_perplexity.py │ │ ├── filter_for_only_updated_websites.py │ │ └── kenlm │ │ │ ├── LICENSE │ │ │ ├── README.md │ │ │ ├── model.py │ │ │ └── wikipedia │ │ │ ├── en.arpa.bin │ │ │ ├── en.sp.model │ │ │ └── en.sp.vocab │ ├── get_last_modified_dataset_from_wat_downloads.py │ ├── get_text_dataset_from_wet_downloads.py │ └── remove_wikipedia_urls.py └── wikipedia │ └── README.md └── requirements.txt /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "data-preparation"] 2 | path = pipeline_scripts/common_crawl/data-preparation 3 | url = https://github.com/bigscience-workshop/data-preparation.git 4 | [submodule "deduplicate-text-datasets"] 5 | path = pipeline_scripts/common_crawl/deduplicate-text-datasets 6 | url = https://github.com/TristanThrush/deduplicate-text-datasets.git 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2022- The Hugging Face team. All rights reserved. 2 | 3 | Apache License 4 | Version 2.0, January 2004 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, 12 | and distribution as defined by Sections 1 through 9 of this document. 13 | 14 | "Licensor" shall mean the copyright owner or entity authorized by 15 | the copyright owner that is granting the License. 16 | 17 | "Legal Entity" shall mean the union of the acting entity and all 18 | other entities that control, are controlled by, or are under common 19 | control with that entity. For the purposes of this definition, 20 | "control" means (i) the power, direct or indirect, to cause the 21 | direction or management of such entity, whether by contract or 22 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 23 | outstanding shares, or (iii) beneficial ownership of such entity. 24 | 25 | "You" (or "Your") shall mean an individual or Legal Entity 26 | exercising permissions granted by this License. 27 | 28 | "Source" form shall mean the preferred form for making modifications, 29 | including but not limited to software source code, documentation 30 | source, and configuration files. 31 | 32 | "Object" form shall mean any form resulting from mechanical 33 | transformation or translation of a Source form, including but 34 | not limited to compiled object code, generated documentation, 35 | and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or 38 | Object form, made available under the License, as indicated by a 39 | copyright notice that is included in or attached to the work 40 | (an example is provided in the Appendix below). 
41 | 42 | "Derivative Works" shall mean any work, whether in Source or Object 43 | form, that is based on (or derived from) the Work and for which the 44 | editorial revisions, annotations, elaborations, or other modifications 45 | represent, as a whole, an original work of authorship. For the purposes 46 | of this License, Derivative Works shall not include works that remain 47 | separable from, or merely link (or bind by name) to the interfaces of, 48 | the Work and Derivative Works thereof. 49 | 50 | "Contribution" shall mean any work of authorship, including 51 | the original version of the Work and any modifications or additions 52 | to that Work or Derivative Works thereof, that is intentionally 53 | submitted to Licensor for inclusion in the Work by the copyright owner 54 | or by an individual or Legal Entity authorized to submit on behalf of 55 | the copyright owner. For the purposes of this definition, "submitted" 56 | means any form of electronic, verbal, or written communication sent 57 | to the Licensor or its representatives, including but not limited to 58 | communication on electronic mailing lists, source code control systems, 59 | and issue tracking systems that are managed by, or on behalf of, the 60 | Licensor for the purpose of discussing and improving the Work, but 61 | excluding communication that is conspicuously marked or otherwise 62 | designated in writing by the copyright owner as "Not a Contribution." 63 | 64 | "Contributor" shall mean Licensor and any individual or Legal Entity 65 | on behalf of whom a Contribution has been received by Licensor and 66 | subsequently incorporated within the Work. 67 | 68 | 2. Grant of Copyright License. Subject to the terms and conditions of 69 | this License, each Contributor hereby grants to You a perpetual, 70 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 71 | copyright license to reproduce, prepare Derivative Works of, 72 | publicly display, publicly perform, sublicense, and distribute the 73 | Work and such Derivative Works in Source or Object form. 74 | 75 | 3. Grant of Patent License. Subject to the terms and conditions of 76 | this License, each Contributor hereby grants to You a perpetual, 77 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 78 | (except as stated in this section) patent license to make, have made, 79 | use, offer to sell, sell, import, and otherwise transfer the Work, 80 | where such license applies only to those patent claims licensable 81 | by such Contributor that are necessarily infringed by their 82 | Contribution(s) alone or by combination of their Contribution(s) 83 | with the Work to which such Contribution(s) was submitted. If You 84 | institute patent litigation against any entity (including a 85 | cross-claim or counterclaim in a lawsuit) alleging that the Work 86 | or a Contribution incorporated within the Work constitutes direct 87 | or contributory patent infringement, then any patent licenses 88 | granted to You under this License for that Work shall terminate 89 | as of the date such litigation is filed. 90 | 91 | 4. Redistribution. 
You may reproduce and distribute copies of the 92 | Work or Derivative Works thereof in any medium, with or without 93 | modifications, and in Source or Object form, provided that You 94 | meet the following conditions: 95 | 96 | (a) You must give any other recipients of the Work or 97 | Derivative Works a copy of this License; and 98 | 99 | (b) You must cause any modified files to carry prominent notices 100 | stating that You changed the files; and 101 | 102 | (c) You must retain, in the Source form of any Derivative Works 103 | that You distribute, all copyright, patent, trademark, and 104 | attribution notices from the Source form of the Work, 105 | excluding those notices that do not pertain to any part of 106 | the Derivative Works; and 107 | 108 | (d) If the Work includes a "NOTICE" text file as part of its 109 | distribution, then any Derivative Works that You distribute must 110 | include a readable copy of the attribution notices contained 111 | within such NOTICE file, excluding those notices that do not 112 | pertain to any part of the Derivative Works, in at least one 113 | of the following places: within a NOTICE text file distributed 114 | as part of the Derivative Works; within the Source form or 115 | documentation, if provided along with the Derivative Works; or, 116 | within a display generated by the Derivative Works, if and 117 | wherever such third-party notices normally appear. The contents 118 | of the NOTICE file are for informational purposes only and 119 | do not modify the License. You may add Your own attribution 120 | notices within Derivative Works that You distribute, alongside 121 | or as an addendum to the NOTICE text from the Work, provided 122 | that such additional attribution notices cannot be construed 123 | as modifying the License. 124 | 125 | You may add Your own copyright statement to Your modifications and 126 | may provide additional or different license terms and conditions 127 | for use, reproduction, or distribution of Your modifications, or 128 | for any such Derivative Works as a whole, provided Your use, 129 | reproduction, and distribution of the Work otherwise complies with 130 | the conditions stated in this License. 131 | 132 | 5. Submission of Contributions. Unless You explicitly state otherwise, 133 | any Contribution intentionally submitted for inclusion in the Work 134 | by You to the Licensor shall be under the terms and conditions of 135 | this License, without any additional terms or conditions. 136 | Notwithstanding the above, nothing herein shall supersede or modify 137 | the terms of any separate license agreement you may have executed 138 | with Licensor regarding such Contributions. 139 | 140 | 6. Trademarks. This License does not grant permission to use the trade 141 | names, trademarks, service marks, or product names of the Licensor, 142 | except as required for reasonable and customary use in describing the 143 | origin of the Work and reproducing the content of the NOTICE file. 144 | 145 | 7. Disclaimer of Warranty. Unless required by applicable law or 146 | agreed to in writing, Licensor provides the Work (and each 147 | Contributor provides its Contributions) on an "AS IS" BASIS, 148 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 149 | implied, including, without limitation, any warranties or conditions 150 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 151 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 152 | appropriateness of using or redistributing the Work and assume any 153 | risks associated with Your exercise of permissions under this License. 154 | 155 | 8. Limitation of Liability. In no event and under no legal theory, 156 | whether in tort (including negligence), contract, or otherwise, 157 | unless required by applicable law (such as deliberate and grossly 158 | negligent acts) or agreed to in writing, shall any Contributor be 159 | liable to You for damages, including any direct, indirect, special, 160 | incidental, or consequential damages of any character arising as a 161 | result of this License or out of the use or inability to use the 162 | Work (including but not limited to damages for loss of goodwill, 163 | work stoppage, computer failure or malfunction, or any and all 164 | other commercial damages or losses), even if such Contributor 165 | has been advised of the possibility of such damages. 166 | 167 | 9. Accepting Warranty or Additional Liability. While redistributing 168 | the Work or Derivative Works thereof, You may choose to offer, 169 | and charge a fee for, acceptance of support, warranty, indemnity, 170 | or other liability obligations and/or rights consistent with this 171 | License. However, in accepting such obligations, You may act only 172 | on Your own behalf and on Your sole responsibility, not on behalf 173 | of any other Contributor, and only if You agree to indemnify, 174 | defend, and hold each Contributor harmless for any liability 175 | incurred by, or claims asserted against, such Contributor by reason 176 | of your accepting any such warranty or additional liability. 177 | 178 | END OF TERMS AND CONDITIONS 179 | 180 | APPENDIX: How to apply the Apache License to your work. 181 | 182 | To apply the Apache License to your work, attach the following 183 | boilerplate notice, with the fields enclosed by brackets "[]" 184 | replaced with your own identifying information. (Don't include 185 | the brackets!) The text should be enclosed in the appropriate 186 | comment syntax for the file format. We also recommend that a 187 | file or class name and description of purpose be included on the 188 | same "printed page" as the copyright notice for easier 189 | identification within third-party archives. 190 | 191 | Copyright [yyyy] [name of copyright owner] 192 | 193 | Licensed under the Apache License, Version 2.0 (the "License"); 194 | you may not use this file except in compliance with the License. 195 | You may obtain a copy of the License at 196 | 197 | http://www.apache.org/licenses/LICENSE-2.0 198 | 199 | Unless required by applicable law or agreed to in writing, software 200 | distributed under the License is distributed on an "AS IS" BASIS, 201 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 202 | See the License for the specific language governing permissions and 203 | limitations under the License. 204 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Online Language Modelling Dataset Pipeline 2 | 3 | This repo enables you to pull a large and up-to-date text corpus from the web. It uses state-of-the-art processing methods to produce a clean text dataset that you can immediately use to pretrain a large language model, like BERT, GPT, or BLOOM. 
The main use case for this repo is the Online Language Modelling Project, where we want to keep a language model up-to-date by pretraining it on the latest Common Crawl and Wikipedia dumps every month or so. You can see the models for the OLM project here: https://huggingface.co/olm. They actually get better performance than their original static counterparts. 4 | 5 | Specifically, this repo has modular Python commands that enable you to: 6 | * Specify Common Crawl web snapshots, or just Wikipedia snapshots. Then pull the data. 7 | * Filter the data for a particular language, like English or French. 8 | * Run the OSCAR filters used by BigScience for the BLOOM language model. These filters ensure some level of text quality and reduce pornographic content. 9 | * Deduplicate the data. 10 | 11 | This code is also fairly parallelized, although it can certainly be improved further. It can process over a terabyte from Common Crawl in a day or two, and all of English Wikipedia in less than an hour if you have: 12 | * A machine with a lot of CPUs and memory. 13 | * A fast internet connection. 14 | 15 | ## Setup 16 | 1. If you want to use this repo to generate a decent amount of data, get a machine with lots of CPUs and memory. We use an `n2d-standard-224` running `Ubuntu 20.04 LTS` on GCP. Add terabytes of disk space too. You may need an even larger machine if you want to process close to 100% of a Common Crawl snapshot or several snapshots, particularly due to how much memory the deduplication process uses. Alternatively, you can specify in the deduplication arguments that you want to deduplicate the dataset in chunks so your memory doesn't explode. 17 | 2. Clone with submodules: `git clone --recursive git@github.com:huggingface/olm-datasets.git` 18 | 3. Install cargo (the Rust package manager) with `curl https://sh.rustup.rs -sSf | sh`. Then install Ungoliant with `cargo install ungoliant@1.2.3`. You may need to install gcc and cmake first. 19 | 4. Set up a Python 3.9 environment, and run `pip install -r requirements.txt` 20 | 5. Run `huggingface-cli login`. This CLI should have been installed from `requirements.txt`. To log in, you need to paste a token from your account at [https://huggingface.co](https://huggingface.co). This step is necessary for the pipeline to push the generated datasets to your Hugging Face account. 21 | 22 | ## Getting a clean and up-to-date Common Crawl corpus 23 | 24 | Follow the instructions at [pipeline_scripts/common_crawl](pipeline_scripts/common_crawl). 25 | 26 | Here is the output dataset to expect from a 20% random segment sample of the August 2022 Common Crawl snapshot: [https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20](https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20) 27 | 28 | ## Getting a clean and up-to-date Wikipedia corpus 29 | 30 | Follow the instructions at [pipeline_scripts/wikipedia](pipeline_scripts/wikipedia). 31 | 32 | Here is the output dataset to expect from a September 2022 snapshot of Wikipedia: [https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920](https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920) 33 | 34 | ## Analyzing the corpora 35 | 36 | Follow the instructions at [analysis_scripts](analysis_scripts).
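If you just want to peek at one of these corpora before running the pipeline or the analysis scripts yourself, here is a minimal sketch (it assumes the `datasets` library installed from `requirements.txt` and uses the public August 2022 dataset linked above, which has `text` and `url` columns among others):

```
from datasets import load_dataset

# Load the published August 2022 OLM Common Crawl corpus from the Hugging Face Hub.
ds = load_dataset("Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20", split="train")

print(ds)                   # column names and number of rows
print(ds[0]["text"][:500])  # first 500 characters of one cleaned webpage
```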
37 | 38 | Here is a tweet thread which utilizes these scripts: [https://twitter.com/TristanThrush/status/1582356055794733057](https://twitter.com/TristanThrush/status/1582356055794733057) 39 | 40 | Here is another tweet thread that dives a little deeper: 41 | [https://twitter.com/TristanThrush/status/1588156731909029889](https://twitter.com/TristanThrush/status/1588156731909029889) 42 | 43 | And here is a Colab notebook where you can quickly run some of the analysis yourself! [https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing](https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing) 44 | 45 | ## Citation 46 | 47 | ``` 48 | @misc{thrush2022pipeline, 49 | title={Online Language Modelling Data Pipeline}, 50 | author={Tristan Thrush and Helen Ngo and Nathan Lambert and Douwe Kiela}, 51 | year={2022}, 52 | howpublished={\url{https://github.com/huggingface/olm-datasets}} 53 | } 54 | ``` 55 | -------------------------------------------------------------------------------- /analysis_scripts/README.md: -------------------------------------------------------------------------------- 1 | # OLM Analysis 2 | 3 | ## To analyze for term counts across various datasets 4 | 5 | This command reports the count of terms associated with events that happened over summer 2022, across chronologically ordered summer 2022 OLM datasets. We would expect the counts to go up over the summer: 6 | 7 | ``` 8 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 --input_dataset_pretty_names "May" "June/July" "August" --terms "gentleminion" "monkeypox outbreak" "inflation reduction act of 2022" "quiet quitting" "jonas vingegaard" --plot_title="Count of Terms in 2022 Summer CC OLM Datasets" --analysis_column=text --split=train --num_proc=224 --output_filename=summer_2022_term_counts.png --load_from_hub_instead_of_disk --ylabel Count 9 | ``` 10 | 11 | Here is the resulting figure: 12 | 13 | ![summer_2022_term_counts](https://user-images.githubusercontent.com/20826878/200715141-6ce73388-7d6a-4d05-bbf4-88e1f2a3c62c.png) 14 | 15 | This command reports the count of the words with the highest usage increase between the start of summer 2022 and the fall of 2022, out of all of the frequent (count > mean + std) words in the dataset, where words are lowercased, split on spaces, and restricted to alphabetic characters only: 16 | 17 | ``` 18 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --num_terms_to_find 5 --plot_title="Top 5 Words with Highest Usage Increase" --analysis_column=text --split=train --num_proc=224 --output_filename=top_5_term_counts_heatmap.png --load_from_hub_instead_of_disk --ylabel "Word" --as_heatmap --heatmap_bar_label "Percent Increase" --xlabel "Internet Snapshot" --normalize_axis=1 --cache_dir=term_counts_cache_top_5 --percent_increase --annotation "To avoid spurious results from words with small counts, we only considered frequent words. A word is considered frequent if the count is greater than a standard deviation above the mean count. Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets."
19 | ``` 20 | 21 | Here is the resulting figure: 22 | 23 | ![top_5_term_counts_heatmap](https://user-images.githubusercontent.com/20826878/200715219-ce3b6fa4-e9f6-4dac-b594-caa052e759a0.png) 24 | 25 | This command reports the count of date mentions in the text between summer 2022 and fall 2022: 26 | 27 | ``` 28 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title="Relative Freq of Dates in Webpage Text" --analysis_column=text --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_text.png --load_from_hub_instead_of_disk --as_heatmap --ylabel "Date (YYYY/MM)" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_text --xlabel "Internet Snapshot" --annotation "Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets." --normalize 29 | ``` 30 | 31 | Here is the resulting figure: 32 | 33 | ![date_term_counts_heatmap_text](https://user-images.githubusercontent.com/20826878/200715272-e5dab35b-211c-4344-b685-881e0ce46bb0.png) 34 | 35 | This command reports the count of date mentions in the URLs between summer 2022 and fall 2022: 36 | 37 | ``` 38 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title="Relative Freq of Dates in Webpage URLs" --analysis_column=url --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_url.png --load_from_hub_instead_of_disk --as_heatmap --ylabel "Date (YYYY/MM)" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_urls --xlabel "Internet Snapshot" --annotation "Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets." 
--normalize 39 | ``` 40 | 41 | Here is the resulting figure: 42 | 43 | ![date_term_counts_heatmap_url](https://user-images.githubusercontent.com/20826878/200715307-b3110b88-191b-419f-91ff-1e45ecfc6361.png) 44 | 45 | ## To analyze the timestamp distribution across and within various datasets 46 | 47 | This command reports the last-modified timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets: 48 | 49 | ``` 50 | python timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column last_modified_timestamp --plot_title "Last-Modified Timestamp Distributions from Webpages" --num_proc=224 --output_filename last_modified_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_last_modified --split=train 51 | ``` 52 | 53 | Here is the resulting figure: 54 | 55 | ![last_modified_dist](https://user-images.githubusercontent.com/20826878/200715332-203f5950-6d4d-4e3a-bfaa-ebbcf7603242.png) 56 | 57 | This command reports the crawl timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets: 58 | 59 | ``` 60 | python timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column crawl_timestamp --plot_title "Crawl Timestamp Distributions from Webpages" --num_proc=224 --output_filename crawl_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_crawl --split=train 61 | ``` 62 | 63 | Here is the resulting figure: 64 | 65 | ![crawl_dist](https://user-images.githubusercontent.com/20826878/200715349-562af902-8863-428a-8417-0975738164bf.png) 66 | 67 | ## To analyze the URL domain distribution across and within various datasets 68 | 69 | This command reports the domain distribution within the May 2022 OLM CC dataset: 70 | 71 | ``` 72 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names May --url_column url --hist_plot_title "URL Domain Distribution for May Internet Snapshot" --corr_plot_title "URL Domain Distribution Corr for May Internet Snapshot" --num_proc=224 --output_corr_filename url_corr_may.png --output_hist_filename url_hist_may.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_may --no_hist_legend --annotation "Only the top 25 domains are shown."
73 | ``` 74 | 75 | Here is the resulting figure: 76 | 77 | ![url_hist_may](https://user-images.githubusercontent.com/20826878/200715359-7c7bc37a-5749-454a-9e38-77b1116de7f0.png) 78 | 79 | This command reports the domain correlations between the summer 2022 through fall 2022 OLM CC datasets: 80 | 81 | ``` 82 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names May Jun/Jul Aug Sep/Oct --url_column url --hist_plot_title "URL Domain Distribution for Internet Snapshots" --corr_plot_title "URL Domain Distribution Corr for Internet Snapshots" --num_proc=224 --output_corr_filename url_corr.png --output_hist_filename url_hist.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all 83 | ``` 84 | 85 | Here is the resulting figure: 86 | 87 | ![url_corr](https://user-images.githubusercontent.com/20826878/200715384-d4793781-9775-4884-bffe-698b16677284.png) 88 | 89 | Does sampling about 15-20% of a Common Crawl Snapshot do anything surprising? How much correlation is there between the resulting OLM dataset from a Common Crawl sample from a random seed versus another random seed? This command reports the domain correlation between two Sep/Oct datasets where the only difference is the sampled segments based on different random seeds: 90 | 91 | ``` 92 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --input_dataset_pretty_names "Sep/Oct Seed 1" "Sep/Oct Seed 2" --url_column url --hist_plot_title "URL Domain Distribution for Sep/Oct Snapshots" --corr_plot_title "URL Domain Distribution Corr for Sep/Oct Snapshots" --num_proc=224 --output_corr_filename url_corr_sep_oct_different_seeds.png --output_hist_filename url_hist_sep_oct_different_seeds.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all --annotation="This plot shows two different OLM datasets. They were created with the same code from a 16% random sample of Sep/Oct Common Crawl WET files, but with different random seeds for the sampling." 
93 | ``` 94 | 95 | Here is the resulting figure: 96 | 97 | ![url_corr_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715404-5ccb3a1e-9e82-41be-82db-9e54e73785fe.png) 98 | 99 | ## To analyze for duplicates across various datasets 100 | 101 | This command reports the ratio of shared URLs between the August and June/July Common Crawl OLM Datasets: 102 | 103 | ``` 104 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=url --split=train --num_proc=224 --plot_title="URLs in the June/July plus the August CC OLM Datasets" --output_filename=duplicate_urls_aug_jun_jul.png --load_from_hub_instead_of_disk 105 | ``` 106 | 107 | Here is the resulting figure: 108 | 109 | ![duplicate_urls_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715427-79d0120b-fa48-4fdf-8410-8943a1325780.png) 110 | 111 | This command reports the ratio of exact text duplicated between the August and June/July Common Crawl OLM Datasets: 112 | 113 | ``` 114 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=text --split=train --num_proc=224 --plot_title="Text in the June/July plus the August CC OLM Datasets" --output_filename=duplicate_text_aug_jun_jul.png --load_from_hub_instead_of_disk 115 | ``` 116 | 117 | Here is the resulting figure: 118 | 119 | ![duplicate_text_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715436-4893263b-1fe9-4941-ae43-edd4732652c4.png) 120 | 121 | What about the duplicated URLs between two differently seeded OLM datasets from the same month? 122 | 123 | ``` 124 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=url --split=train --num_proc=224 --plot_title="URLs in two Differently Seeded Sep/Oct CC OLM Datasets" --output_filename=duplicate_urls_sep_oct_different_seeds.png --load_from_hub_instead_of_disk 125 | ``` 126 | 127 | Here is the resulting figure: 128 | 129 | ![duplicate_urls_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715575-fae99dcb-cef5-411e-a786-a6e20e53a003.png) 130 | 131 | And what about the text?
132 | 133 | ``` 134 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=text --split=train --num_proc=224 --plot_title="Text in two Differently Seeded Sep/Oct CC OLM Datasets" --output_filename=duplicate_text_sep_oct_different_seeds.png --load_from_hub_instead_of_disk 135 | ``` 136 | 137 | ![duplicate_text_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715583-1aa76245-14c5-4afe-88c5-539c8665d4d7.png) 138 | 139 | ## Documentation 140 | 141 | ``` 142 | python term_counts.py --help 143 | ``` 144 | 145 | ``` 146 | python url_dist.py --help 147 | ``` 148 | 149 | ``` 150 | python timestamp_dist.py --help 151 | ``` 152 | 153 | ``` 154 | python duplicates.py --help 155 | ``` 156 | -------------------------------------------------------------------------------- /analysis_scripts/duplicates.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk, concatenate_datasets 2 | import argparse 3 | import matplotlib.pyplot as plt 4 | 5 | parser = argparse.ArgumentParser(description="This script takes a list of datasets, concatenates them, and saves a pie chart for duplicate versus unique items in the specified column.") 6 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 7 | parser.add_argument("--analysis_column", required=True) 8 | parser.add_argument("--plot_title", required=True) 9 | parser.add_argument("--split", default=None, help="The dataset split to use. Some datasets don't have splits so this argument is optional.") 10 | parser.add_argument("--num_proc", type=int, required=True) 11 | parser.add_argument("--duplicate_label", default="Duplicate") 12 | parser.add_argument("--unique_label", default="Unique") 13 | parser.add_argument("--output_filename", required=True) 14 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 15 | args = parser.parse_args() 16 | 17 | datasets = [] 18 | for input_dataset_name in args.input_dataset_names: 19 | if args.load_from_hub_instead_of_disk: 20 | if args.split is None: 21 | ds = load_dataset(input_dataset_name) 22 | else: 23 | ds = load_dataset(input_dataset_name, split=args.split) 24 | else: 25 | if args.split is None: 26 | ds = load_from_disk(input_dataset_name) 27 | else: 28 | ds = load_from_disk(input_dataset_name)[args.split] 29 | 30 | datasets.append(ds) 31 | 32 | ds = concatenate_datasets(datasets) 33 | 34 | ds = ds.sort(args.analysis_column) 35 | 36 | max_index = len(ds) - 1 37 | def same_adjacent_entry(entry, index): 38 | if index == max_index: 39 | return ds[index - 1][args.analysis_column] == entry 40 | elif index == 0: 41 | return ds[index + 1][args.analysis_column] == entry 42 | return ds[index - 1][args.analysis_column] == entry or ds[index + 1][args.analysis_column] == entry 43 | 44 | num_examples = len(ds) 45 | ds = ds.filter(lambda example, index: same_adjacent_entry(example[args.analysis_column], index), num_proc=args.num_proc, with_indices=True) 46 | num_examples_only_duplicate_entries = len(ds) 47 | 48 | 49 | labels = [args.duplicate_label, args.unique_label] 50 | sizes = [num_examples_only_duplicate_entries, num_examples - num_examples_only_duplicate_entries] 51 | plt.pie(sizes, labels=labels, autopct='%1.1f%%') 52 | 53 | 
plt.title(args.plot_title, fontweight="bold") 54 | plt.rcParams["font.family"] = "Times New Roman" 55 | 56 | plt.savefig(args.output_filename, dpi=300) 57 | -------------------------------------------------------------------------------- /analysis_scripts/term_counts.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | from collections import Counter 6 | from multiprocessing import Manager 7 | from tqdm import tqdm 8 | from os import path, mkdir 9 | import pickle 10 | import statistics 11 | 12 | parser = argparse.ArgumentParser(description="This script takes in an ordered list of datasets and counts terms in each of them, in the specified column. It then plots a graph or a heatmap for how the count changed across datasets.") 13 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 14 | parser.add_argument("--input_dataset_pretty_names", nargs="+", required=True, help="The names of the datasets that you want to appear in the saved graph.") 15 | parser.add_argument("--terms", nargs="+", default=None, help="The terms that you want to count. If left as None, then you must specify --num_terms_to_find, and then the script will return the top --num_terms_to_find terms with the greatest percent change from the first dataset to the last dataset, out of the terms which have count > the mean count plus the standard deviation (so we don't get spurious results from low-count words).") 16 | parser.add_argument("--term_pretty_names", nargs="+", default=None) 17 | parser.add_argument("--analysis_column", required=True) 18 | parser.add_argument("--plot_title", required=True) 19 | parser.add_argument("--split", default=None, help="The dataset split to use.
Some datasets don't have splits so this argument is optional.") 20 | parser.add_argument("--num_proc", type=int, required=True) 21 | parser.add_argument("--output_filename", required=True) 22 | parser.add_argument("--as_heatmap", action="store_true") 23 | parser.add_argument("--samples", default=None, type=int) 24 | parser.add_argument("--num_terms_to_find", default=None, type=int) 25 | parser.add_argument("--normalize", action="store_true") 26 | parser.add_argument("--ylabel", required=True) 27 | parser.add_argument("--cache_dir", default="term_count_cache") 28 | parser.add_argument("--load_from_cache_dir", action="store_true") 29 | parser.add_argument("--heatmap_bar_label", default="") 30 | parser.add_argument("--annotation", default=None) 31 | parser.add_argument("--xlabel", default="Dataset") 32 | parser.add_argument("--normalize_axis", default=0, type=int) 33 | parser.add_argument("--percent_increase", action="store_true") 34 | parser.add_argument("--bottom", default=0.25, type=float) 35 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 36 | args = parser.parse_args() 37 | 38 | datasets = [] 39 | term_y_coords = None 40 | count_dicts = [] 41 | 42 | if args.load_from_cache_dir: 43 | if args.terms is None: 44 | count_dicts = pickle.load(open(path.join(args.cache_dir, "count_dicts.pkl"), "rb")) 45 | else: 46 | term_y_coords = pickle.load(open(path.join(args.cache_dir, "term_y_coords.pkl"), "rb")) 47 | cached_args = pickle.load(open(path.join(args.cache_dir, "args.pkl"), "rb")) 48 | if args != cached_args: 49 | print("Warning: argument mismatch between cached args and current args") 50 | print("Cached args: ", cached_args) 51 | print("Current args: ", args) 52 | 53 | if term_y_coords is None: 54 | for input_dataset_name in args.input_dataset_names: 55 | if args.load_from_hub_instead_of_disk: 56 | if args.split is None: 57 | ds = load_dataset(input_dataset_name) 58 | else: 59 | ds = load_dataset(input_dataset_name, split=args.split) 60 | else: 61 | if args.split is None: 62 | ds = load_from_disk(input_dataset_name) 63 | else: 64 | ds = load_from_disk(input_dataset_name)[args.split] 65 | 66 | if args.samples is not None: 67 | ds = ds.shuffle(seed=42) 68 | ds = ds.select(range(args.samples)) 69 | 70 | datasets.append(ds) 71 | 72 | if args.terms is None and not args.load_from_cache_dir: 73 | with Manager() as manager: 74 | shared_list = manager.list() 75 | def build_count_dict(examples): 76 | counts = None 77 | for text in examples[args.analysis_column]: 78 | if counts is None: 79 | counts = Counter(filter(lambda obj: obj.isalpha(), text.lower().split(" "))) 80 | else: 81 | counts += Counter(filter(lambda obj: obj.isalpha(), text.lower().split(" "))) 82 | shared_list.append(counts) 83 | 84 | ds.map(build_count_dict, num_proc=args.num_proc, batched=True, batch_size=len(ds) // args.num_proc, remove_columns=ds.column_names) 85 | 86 | count_dict = shared_list[0] 87 | for counts in tqdm(shared_list[1:]): 88 | count_dict += counts 89 | 90 | count_dicts.append(count_dict) 91 | 92 | if args.terms is None: 93 | if not path.exists(args.cache_dir): 94 | mkdir(args.cache_dir) 95 | pickle.dump(args, open(path.join(args.cache_dir, "args.pkl"), "wb")) 96 | pickle.dump(count_dicts, open(path.join(args.cache_dir, "count_dicts.pkl"), "wb")) 97 | 98 | if args.terms is None: 99 | 100 | intersection_count_set = set(count_dicts[0].keys()) 101 | for count_dict in 
count_dicts[1:]: 102 | intersection_count_set = intersection_count_set.intersection(set(count_dict.keys())) 103 | 104 | words_with_occurence_changes = [] 105 | counts = [] 106 | for word in intersection_count_set: 107 | count_sum = 0 108 | for count_dict in count_dicts: 109 | count_sum += count_dict[word] 110 | counts.append(count_sum) 111 | mean_count = statistics.mean(counts) 112 | std = statistics.stdev(counts) 113 | for word in intersection_count_set: 114 | count_sum = 0 115 | for count_dict in count_dicts: 116 | count_sum += count_dict[word] 117 | if count_sum > mean_count + std: 118 | change = count_dicts[-1][word]/count_dicts[0][word] 119 | words_with_occurence_changes.append((word, change)) 120 | 121 | words_with_occurence_changes.sort(key=lambda word_and_change: word_and_change[1], reverse=True) 122 | terms = [word_and_change[0] for word_and_change in words_with_occurence_changes[:args.num_terms_to_find]] 123 | 124 | else: 125 | terms = args.terms 126 | 127 | 128 | if term_y_coords is None: 129 | 130 | term_y_coords = {term: [] for term in terms} 131 | 132 | for ds in datasets: 133 | def term_counts(text): 134 | return {term + "_count": text.lower().count(term.lower()) for term in terms} 135 | 136 | ds = ds.map(lambda example: term_counts(example[args.analysis_column]), num_proc=args.num_proc) 137 | 138 | for term in terms: 139 | term_y_coords[term].append(sum(ds[term + "_count"])) 140 | 141 | if not path.exists(args.cache_dir): 142 | mkdir(args.cache_dir) 143 | pickle.dump(args, open(path.join(args.cache_dir, "args.pkl"), "wb")) 144 | pickle.dump(term_y_coords, open(path.join(args.cache_dir, "term_y_coords.pkl"), "wb")) 145 | 146 | plt.xticks(range(len(args.input_dataset_pretty_names)), args.input_dataset_pretty_names) 147 | 148 | if args.as_heatmap: 149 | matrix = [] 150 | for term in terms: 151 | matrix.append(term_y_coords[term]) 152 | matrix = np.array(matrix) 153 | if args.percent_increase: 154 | matrix = matrix.transpose() 155 | matrix = (matrix - matrix[0])/matrix[0] 156 | matrix = matrix.transpose() * 100 157 | 158 | if args.normalize: 159 | column_sums = matrix.sum(axis=args.normalize_axis) 160 | if args.normalize_axis == 0: 161 | normalized_matrix = matrix / column_sums 162 | if args.normalize_axis == 1: 163 | normalized_matrix = matrix.transpose() / column_sums 164 | normalized_matrix = normalized_matrix.transpose() 165 | plt.imshow(np.flipud(normalized_matrix), plt.cm.Blues) 166 | else: 167 | plt.imshow(np.flipud(matrix), plt.cm.Blues) 168 | plt.yticks(range(len(terms)), reversed(terms if args.term_pretty_names is None else args.term_pretty_names)) 169 | cbar = plt.colorbar() 170 | cbar.ax.set_ylabel(args.heatmap_bar_label, rotation=-90, va="bottom") 171 | plt.ylabel(args.ylabel, style='italic', fontweight="bold") 172 | else: 173 | for term in terms: 174 | if args.normalize: 175 | term_y_coords[term] = np.array(term_y_coords[term])/sum(term_y_coords[term]) 176 | plt.plot(term_y_coords[term], label=term, marker=".") 177 | plt.grid(linestyle=":") 178 | plt.legend(loc="upper left") 179 | plt.ylabel(args.ylabel, style='italic', fontweight="bold") 180 | 181 | if args.annotation is not None: 182 | plt.figtext(0.6, 0.01, args.annotation, wrap=True, horizontalalignment='center', fontsize=8) 183 | plt.subplots_adjust(bottom=args.bottom) 184 | plt.xlabel(args.xlabel, style='italic', fontweight="bold") 185 | plt.title(args.plot_title, fontweight="bold") 186 | plt.rcParams["font.family"] = "Times New Roman" 187 | plt.savefig(args.output_filename, dpi=300) 188 | 
-------------------------------------------------------------------------------- /analysis_scripts/timestamp_dist.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | import seaborn as sns 6 | import pandas as pd 7 | from datetime import datetime 8 | from os import path, mkdir 9 | import pickle 10 | 11 | parser = argparse.ArgumentParser(description="This script takes in an ordered list of datasets. It is assumed that each dataset has a timestamp column. The script plots the timestamp distribution histogram for each dataset.") 12 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 13 | parser.add_argument("--input_dataset_pretty_names", nargs="+", required=True, help="The names of the datasets that you want to appear in the saved graph.") 14 | parser.add_argument("--timestamp_column", required=True) 15 | parser.add_argument("--plot_title", required=True) 16 | parser.add_argument("--split", default=None, help="The dataset split to use. Some datasets don't have splits so this argument is optional.") 17 | parser.add_argument("--num_proc", type=int, required=True) 18 | parser.add_argument("--output_filename", required=True) 19 | parser.add_argument("--samples", default=None, type=int) 20 | parser.add_argument("--bins", default=100, help="The number of histogram bins to plot") 21 | parser.add_argument("--cache_dir", default="timestamp_dist_cache") 22 | parser.add_argument("--load_from_cache_dir", action="store_true") 23 | parser.add_argument("--annotation", default=None) 24 | parser.add_argument("--legend_title", default="Internet Snapshot") 25 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 26 | args = parser.parse_args() 27 | 28 | if args.load_from_cache_dir: 29 | data_array = np.load(open(path.join(args.cache_dir, "data_array.npy"), "rb")) 30 | cached_args = pickle.load(open(path.join(args.cache_dir, "args.pkl"), "rb")) 31 | if args != cached_args: 32 | print("Warning: argument mismatch between cached args and current args") 33 | print("Cached args: ", cached_args) 34 | print("Current args: ", args) 35 | else: 36 | 37 | # Remove timestamp outliers more than 10 median deviations away from the median. 38 | # This is important if the timestamp is the Last-Modified timestamp, which can sometimes be wrong 39 | # because websites can report whatever they want. We don't want one website that says it was created 40 | # a billion years ago to seriously affect the distribution. 41 | def reject_outliers(data, m = 10.): 42 | d = np.abs(data - np.median(data)) 43 | mdev = np.median(d) 44 | s = d/mdev if mdev else 0. 
45 | return data[s= args.hist_bins: 110 | break 111 | dataframe_dict["samples"] += [datum[name] for name in args.input_dataset_pretty_names] 112 | dataframe_dict["dataset"] += args.input_dataset_pretty_names 113 | dataframe_dict["domain"] += [datum["domain_name"]]*len(args.input_dataset_pretty_names) 114 | index += 1 115 | 116 | df = pd.DataFrame(dataframe_dict) 117 | color_palette = sns.color_palette("pastel") 118 | colors = color_palette[:len(args.input_dataset_names)] 119 | plot = sns.barplot(data=df, palette=colors, hue="dataset", y="domain", x="samples") 120 | if args.annotation is not None: 121 | plot.figure.text(0.4, 0.01, args.annotation, wrap=True, horizontalalignment="center", fontsize=8) 122 | plot.figure.subplots_adjust(bottom=0.15) 123 | plot.legend().set_title("") 124 | if args.no_hist_legend: 125 | plot.legend().remove() 126 | for item in plot.get_yticklabels(): 127 | item.set_fontsize(args.hist_bin_fontsize) 128 | plot.set_title(args.hist_plot_title, fontweight="bold") 129 | plot.set_xlabel("Count", style="italic", fontweight="bold") 130 | plot.set_ylabel("Domain", style="italic", fontweight="bold") 131 | plot.figure.savefig(args.output_hist_filename, dpi=300, bbox_inches="tight") 132 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/README.md: -------------------------------------------------------------------------------- 1 | ![olm_cc_pipeline](https://user-images.githubusercontent.com/20826878/199851707-64a7a026-c413-4d78-8b04-a825e07534b3.jpeg) 2 | 3 | # Quick start 4 | 5 | This section provides all the commands that you need to generate a deduplicated and filtered dataset from Common Crawl, ready for pretraining! 6 | 7 | ## One time only 8 | 9 | `bash download_pipeline_processing_models.sh` 10 | 11 | ## Every time 12 | 13 | Use the following commands to get a dataset. They should take only a few min if you have lots of CPUs. Adjust `--num_proc` to be equal to however many CPUs that you have. 14 | 15 | ``` 16 | python download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wet_downloads --num_proc=224 17 | python get_text_dataset_from_wet_downloads.py --download_dir=common_crawl_wet_downloads --output_dataset_name=cc_raw --num_proc=224 18 | python remove_wikipedia_urls.py --input_dataset_name=cc_raw --output_dataset_name=cc_no_wikipedia --url_column=url --split=en --num_proc=224 19 | python apply_bigscience_filters.py --input_dataset_name=cc_no_wikipedia --output_dataset_name=cc_filtered --lang_id=en --text_column=text --num_proc=224 20 | ulimit -Sn 1000000 && python deduplicate.py --input_dataset_name=cc_filtered --output_dataset_name=cc_olm --text_column=text --remove_whole_example --num_proc=224 21 | 22 | # Optionally, get the last-modified headers from the websites and add them to the dataset. --segment_sampling_ratios and --seed must be the same as above for this to work. 
23 | python download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wat_downloads --paths_type=wat --num_proc=224 24 | python get_last_modified_dataset_from_wat_downloads.py --download_dir=common_crawl_wat_downloads --output_dataset_name=cc_raw_last_modified --num_proc=224 25 | python combine_last_modified_with_text_dataset.py --text_dataset_name=cc_olm --last_modified_dataset_name=cc_raw_last_modified --output_dataset_name=cc_olm_with_last_modified --url_column=url --crawl_timestamp_column=crawl_timestamp --last_modified_timestamp_column=last_modified_timestamp --num_proc=224 26 | 27 | ``` 28 | 29 | You can then upload the final dataset to the Hugging Face Hub from a Python terminal like this: 30 | 31 | ``` 32 | from datasets import load_from_disk 33 | 34 | ds = load_from_disk("cc_olm") # Or cc_olm_with_last_modified if you did the optional step above. 35 | 36 | ds = ds.shuffle() # Optionally, shuffle the dataset so you can get an idea of what a random sample of the dataset looks like in the Hugging Face Hub dataset preview. 37 | 38 | ds.push_to_hub("cc_olm") # Or cc_olm_with_last_modified if you did the optional step above. 39 | ``` 40 | 41 | 42 | # Important notes 43 | 44 | ## Finding the latest Common Crawl snapshots 45 | 46 | They are displayed here: [https://commoncrawl.org/the-data/get-started/](https://commoncrawl.org/the-data/get-started/). Just enter the names of the snapshots you want as arguments to the `download_common_crawl.py` script. 47 | 48 | ## Intermediate dataset checkpoints 49 | 50 | Each of the Python scripts from the quick start commands saves a Hugging Face dataset to the disk. The dataset is then read by the next Python command. These intermediate datasets are not deleted by default, so you can observe what each step of the pipeline does. This also means that you should have a large disk. We use a 15 terabyte disk for the Online Language Modelling Project. 51 | 52 | ## How to specify the size of the dataset 53 | 54 | Increase `--segment_sampling_ratios` to get a larger dataset (it goes up to `1`). In the above quick start code, `0.0001` means that it only uses a sample of `0.01%` of the data from a Common Crawl snapshot. To generate a dataset for the Online Language Modelling Project, we are currently pulling about 1.45 terabytes from each Common Crawl snapshot, which is about 350 gigabytes after going through the BigScience filters and finally 30 gigabytes after going through the deduplication code. For the August 2022 snapshot, 1.45 terabytes is about 20% (i.e. `--segment_sampling_ratios 0.20`). Crawl sizes vary, though. For May 2022, 1.45 terabytes is about 14%. 55 | 56 | If you want to train a larger model than we do, then specify a higher value for `--segment_sampling_ratios`, or even use multiple Common Crawl snapshots like this: 57 | 58 | ``` 59 | python download_common_crawl.py --snapshots CC-MAIN-2022-27 CC-MAIN-2022-33 --segment_sampling_ratios 0.5 1 --download_dir=common_crawl_wet_downloads --num_proc=224 60 | ``` 61 | 62 | Keep in mind that, with more data, the deduplication script will need more RAM. Read on for limitations of the deduplication script. 63 | 64 | ## Limitations of the deduplication code 65 | 66 | There are tons of duplicates in Common Crawl data, which means that the deduplication script will need hundreds of gigabytes of RAM if you want to generate a 30 gigabyte dataset like ours :(.
If you want to get around this, there is also the option in the deduplication script for you to chunk the dataset and deduplicate each chunk individually. The main problem is this issue in the Google deduplication code: [https://github.com/google-research/deduplicate-text-datasets/issues/18](https://github.com/google-research/deduplicate-text-datasets/issues/18). 67 | 68 | 69 | # More documentation 70 | 71 | Run any of the python commands with the `--help` flag. For example, `python download_common_crawl.py --help`. 72 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/apply_bigscience_filters.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | from subprocess import run 4 | from os import path, mkdir 5 | from shutil import rmtree 6 | import sys 7 | import uuid 8 | 9 | sys.path.append("data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering") 10 | from filtering import DatasetFiltering 11 | 12 | parser = argparse.ArgumentParser(description="Applies the BigScience BLOOM filters which were used on OSCAR. They are designed to improve text quality and remove pornographic content.") 13 | parser.add_argument("--input_dataset_name", help="The name of the input dataset.", required=True) 14 | parser.add_argument("--output_dataset_name", help="The name of the output dataset.", required=True) 15 | parser.add_argument("--lang_id", help="The language id of your dataset. This is necessary because the BigScience filters use a list of language-specific pornographic words, and also language-specific hyperparameters for text quality improvement.", required=True) 16 | parser.add_argument("--split", default=None, help="The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.") 17 | parser.add_argument("--text_column", help="The name of the dataset column that contains the text.", required=True) 18 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 19 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 20 | parser.add_argument("--tmp_dir", default=".tmp_apply_bigscience_filters", help="Directory to store temporary files. It will be deleted afterwards. Defaults to .tmp_apply_bigscience_filters.") 21 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to pull the input dataset by name from the Hugging Face Hub. If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.") 22 | args = parser.parse_args() 23 | 24 | if args.load_from_hub_instead_of_disk: 25 | if args.split is None: 26 | ds = load_dataset(args.input_dataset_name) 27 | else: 28 | ds = load_dataset(args.input_dataset_name, split=args.split) 29 | else: 30 | if args.split is None: 31 | ds = load_from_disk(args.input_dataset_name) 32 | else: 33 | ds = load_from_disk(args.input_dataset_name)[args.split] 34 | 35 | # We have to do this if the text column is not named "text" in the dataset, 36 | # because DatasetFiltering assumes that the name is "text". 
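# Hypothetical illustration: with --text_column=content on a dataset that also has its own "text"
# column, the existing "text" column is parked under a random temporary name, "content" is renamed
# to "text" for the filters, and both renames are reversed after filtering further below.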
37 | temp_column_name = None 38 | if args.text_column != "text": 39 | if "text" in ds.column_names: 40 | temp_column_name = str(uuid.uuid4()) 41 | ds = ds.rename_column("text", temp_column_name) 42 | ds = ds.rename_column(args.text_column, "text") 43 | 44 | if path.exists(args.tmp_dir): 45 | run(f"rm -r {args.tmp_dir}", shell=True) 46 | 47 | mkdir(args.tmp_dir) 48 | tmp_dataset_name = path.join(args.tmp_dir, "intermediate_bigscience_filtered_dataset") 49 | 50 | dataset_filtering = DatasetFiltering( 51 | dataset=ds, 52 | lang_dataset_id=args.lang_id, 53 | path_fasttext_model="sp_kenlm_ft_models/lid.176.bin", 54 | path_sentencepiece_model=f"sp_kenlm_ft_models/{args.lang_id}.sp.model", 55 | path_kenlm_model=f"sp_kenlm_ft_models/{args.lang_id}.arpa.bin", 56 | num_proc=args.num_proc, 57 | path_dir_save_dataset=tmp_dataset_name, 58 | ) 59 | 60 | dataset_filtering.modifying_documents() 61 | dataset_filtering.filtering() 62 | dataset_filtering.save_dataset() 63 | 64 | ds = load_from_disk(path.join(tmp_dataset_name, args.lang_id)) 65 | 66 | # We have to do this if the text column is not named "text" in the dataset, 67 | # because DatasetFiltering assumes that the name is "text". 68 | if args.text_column != "text": 69 | ds = ds.rename_column("text", args.text_column) 70 | if temp_column_name is not None: 71 | ds = ds.rename_column(temp_column_name, "text") 72 | 73 | ds.save_to_disk(args.output_dataset_name) 74 | rmtree(args.tmp_dir) 75 | 76 | if args.push_to_hub: 77 | ds.push_to_hub(args.output_dataset_name) 78 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/combine_last_modified_with_text_dataset.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | from multiprocessing import Manager 4 | from tqdm import tqdm 5 | import uuid 6 | 7 | parser = argparse.ArgumentParser(description="This script takes in a text dataset with crawl timestamps and urls, and then a last-modified dataset with crawl timestamps and urls.
It uses the shared urls and crawl timestamps to add last-modified timestamps to the text dataset.") 8 | parser.add_argument("--text_dataset_name", required=True) 9 | parser.add_argument("--last_modified_dataset_name", required=True) 10 | parser.add_argument("--output_dataset_name", required=True) 11 | parser.add_argument("--text_dataset_split", default=None) 12 | parser.add_argument("--last_modified_dataset_split", default=None) 13 | parser.add_argument("--last_modified_timestamp_column", required=True) 14 | parser.add_argument("--crawl_timestamp_column", required=True) 15 | parser.add_argument("--url_column", required=True) 16 | parser.add_argument("--num_proc", type=int, required=True) 17 | parser.add_argument("--load_text_dataset_from_hub_instead_of_disk", action="store_true", help="Whether to load the text dataset from the Hugging Face hub instead of the disk (default is the disk).") 18 | parser.add_argument("--load_last_modified_dataset_from_hub_instead_of_disk", action="store_true", help="Whether to load the last modified dataset from the Hugging Face hub instead of the disk (default is the disk).") 19 | parser.add_argument("--push_to_hub", action="store_true") 20 | args = parser.parse_args() 21 | 22 | if args.load_text_dataset_from_hub_instead_of_disk: 23 | if args.text_dataset_split is None: 24 | text_ds = load_dataset(args.text_dataset_name) 25 | else: 26 | text_ds = load_dataset(args.text_dataset_name, split=args.text_dataset_split) 27 | else: 28 | if args.text_dataset_split is None: 29 | text_ds = load_from_disk(args.text_dataset_name) 30 | else: 31 | text_ds = load_from_disk(args.text_dataset_name)[args.text_dataset_split] 32 | 33 | if args.load_last_modified_dataset_from_hub_instead_of_disk: 34 | if args.last_modified_dataset_split is None: 35 | last_modified_ds = load_dataset(args.last_modified_dataset_name) 36 | else: 37 | last_modified_ds = load_dataset(args.last_modified_dataset_name, split=args.last_modified_dataset_split) 38 | else: 39 | if args.last_modified_dataset_split is None: 40 | last_modified_ds = load_from_disk(args.last_modified_dataset_name) 41 | else: 42 | last_modified_ds = load_from_disk(args.last_modified_dataset_name)[args.last_modified_dataset_split] 43 | 44 | 45 | with Manager() as manager: 46 | shared_list = manager.list() 47 | def build_last_modified_dict(examples): 48 | last_modified_dict = {} 49 | for url, crawl_timestamp, last_modified_tag_timestamp in zip(examples[args.url_column], examples[args.crawl_timestamp_column], examples[args.last_modified_timestamp_column]): 50 | last_modified_dict[(url, crawl_timestamp)] = last_modified_tag_timestamp 51 | shared_list.append(last_modified_dict) 52 | 53 | last_modified_ds.map(build_last_modified_dict, num_proc=args.num_proc, batched=True, batch_size=len(last_modified_ds) // args.num_proc) 54 | 55 | aggregate_last_modified_dict = {} 56 | for last_modified_dict in tqdm(shared_list): 57 | aggregate_last_modified_dict |= last_modified_dict 58 | 59 | # Set the new fingerprint manually so the map function doesn't take forever hashing the huge aggregate_last_modified_dict. 
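# (Without new_fingerprint, datasets would compute the fingerprint by hashing the lambda below, which captures the entire dict.)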
60 | text_ds = text_ds.map(lambda example: {args.last_modified_timestamp_column: aggregate_last_modified_dict.get((example[args.url_column], example[args.crawl_timestamp_column]), None)}, new_fingerprint=str(uuid.uuid4())) 61 | 62 | text_ds.save_to_disk(args.output_dataset_name) 63 | 64 | if args.push_to_hub: 65 | text_ds.push_to_hub(args.output_dataset_name) 66 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/deduplicate.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk, concatenate_datasets 2 | from text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator 3 | from shutil import rmtree 4 | from os import path 5 | import argparse 6 | import hashlib 7 | import uuid 8 | 9 | parser = argparse.ArgumentParser(description="Applies varying levels of exact deduplication or exact suffix array deduplication to a Hugging Face dataset.") 10 | parser.add_argument("--input_dataset_name", help="Name of the input dataset.", required=True) 11 | parser.add_argument("--output_dataset_name", help="Name of the output dataset.", required=True) 12 | parser.add_argument("--text_column", help="Name of the dataset's text column.", required=True) 13 | parser.add_argument("--split", default=None, help="The split of the dataset to apply deduplication on. Not all datasets have splits, so this argument is optional.") 14 | parser.add_argument("--num_proc", type=int, help="The minimum number of processes to use.", required=True) 15 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 16 | parser.add_argument("--remove_whole_example", action="store_true", help="If an example in our corpus has a byte string of 100 bytes or longer which is duplicated elsewhere in the corpus, then this option will result in the removal of the whole example. If this option is not specified, then only the substring is removed, not the whole example. In the paper for this deduplication method, they only remove the byte string, not the whole example. Removing the whole example will vastly shrink the size of the dataset, but it will ensure no gaps in text continuity.") 17 | parser.add_argument("--only_exact_duplicates", action="store_true", help="Use this option if you want to skip the suffix array deduplication and just get rid of examples that exactly match other examples in the dataset.") 18 | parser.add_argument("--chunks", type=int, default=1, help="Deduplication can be really memory-intensive. This option allows you to split the dataset up into n chunks, and perform deduplication independently on each of the chunks. Then the resulting deduplicated datasets are concatenated together at the end.") 19 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face Hub. 
If this argument is not used, then it is assumed that the input dataset is stored locally on the disk.") 20 | args = parser.parse_args() 21 | 22 | if args.load_from_hub_instead_of_disk: 23 | if args.split is None: 24 | ds = load_dataset(args.input_dataset_name) 25 | else: 26 | ds = load_dataset(args.input_dataset_name, split=args.split) 27 | else: 28 | if args.split is None: 29 | ds = load_from_disk(args.input_dataset_name) 30 | else: 31 | ds = load_from_disk(args.input_dataset_name)[args.split] 32 | 33 | deduplicated_ds_shard_list = [] 34 | for ds_shard_index in range(args.chunks): 35 | ds_shard = ds.shard(num_shards=args.chunks, index=ds_shard_index) 36 | 37 | if args.remove_whole_example: 38 | def check_for_ending_example_in_cluster(example, index, column, last_index): 39 | if index == last_index: 40 | return True 41 | return ds_shard[index+1][column] != example[column] 42 | 43 | # Sort the dataset so examples with the same first 100 bytes of text are grouped together. 44 | print("Sorting by first 100 bytes of text") 45 | temp_column_name = str(uuid.uuid4()) 46 | ds_shard = ds_shard.map(lambda example: {temp_column_name: example[args.text_column].encode("u8")[:100]}, num_proc=args.num_proc) 47 | ds_shard = ds_shard.sort(temp_column_name) 48 | 49 | # Filter away examples if their first 100 bytes of text exactly matches another example's first 100 bytes of text. 50 | # This gets rid of a subset of the examples that the next step (suffix array deduplication) gets rid of, so we technically 51 | # don't need to do it. But it speeds up the next step quite a bit to do this first. 52 | last_index = len(ds_shard) - 1 53 | len_before = len(ds_shard) 54 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 55 | ds_shard = ds_shard.remove_columns(temp_column_name) 56 | print(f"Got rid of all examples sharing first 100 bytes of text, as a speedup step. Removed {len_before - len(ds_shard)} from {len_before} examples.") 57 | 58 | # Do the same thing with the ending 100 bytes of text. 59 | print("Sorting by last 100 bytes of text") 60 | temp_column_name = str(uuid.uuid4()) 61 | ds_shard = ds_shard.map(lambda example: {temp_column_name: example[args.text_column].encode("u8")[-100:]}, num_proc=args.num_proc) 62 | ds_shard = ds_shard.sort(temp_column_name) 63 | 64 | last_index = len(ds_shard) - 1 65 | len_before = len(ds_shard) 66 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 67 | ds_shard = ds_shard.remove_columns(temp_column_name) 68 | print(f"Got rid of all examples sharing last 100 bytes of text, as a speedup step. 
Removed {len_before - len(ds_shard)} from {len_before} examples.") 69 | 70 | else: 71 | print("Getting rid of exact duplicates") 72 | def check_for_ending_example_in_cluster(example, index, column, last_index): 73 | if index == last_index: 74 | return True 75 | return ds_shard[index+1][column] != example[column] 76 | 77 | temp_column_name = str(uuid.uuid4()) 78 | ds_shard = ds_shard.map(lambda example: {temp_column_name: hashlib.md5(example[args.text_column].encode()).hexdigest()}, num_proc=args.num_proc) 79 | ds_shard = ds_shard.sort(temp_column_name) 80 | 81 | last_index = len(ds_shard) - 1 82 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 83 | ds_shard = ds_shard.remove_columns(temp_column_name) 84 | print("Got rid of exact duplicates") 85 | 86 | if path.exists(".cache"): 87 | rmtree(".cache") 88 | 89 | if not args.only_exact_duplicates: 90 | # Now, do Suffix Array Substring Exact Deduplication. 91 | 92 | deduplicator = GoogleSuffixArrayDeduplicator(k=100) 93 | 94 | # We need to create this iterator over the dataset text column 95 | # to ensure that not all of the text entries are loaded into memory at once. 96 | class DatasetColumnIterator(): 97 | def __init__(self, dataset, column): 98 | self.iterable_dataset = dataset.__iter__() 99 | self.column = column 100 | 101 | def __iter__(self): 102 | return self 103 | 104 | def __next__(self): 105 | return self.iterable_dataset.__next__()[self.column] 106 | 107 | slices = deduplicator.fit_predict(DatasetColumnIterator(ds_shard, args.text_column)) 108 | if args.remove_whole_example: 109 | ds_shard = ds_shard.filter(lambda example, index: slices[index] == [], num_proc=args.num_proc, with_indices=True) 110 | else: 111 | def remove_slice_list(string, slice_list): 112 | for s in slice_list: 113 | string = string.replace(string[s], "") 114 | return string 115 | # It's important to give this map function a uuid as its fingerprint. If we let it compute the fingerprint as a hash of the whole slice_list, then it will take too long. 116 | ds_shard = ds_shard.map(lambda example, index: {args.text_column: remove_slice_list(example[args.text_column], slices[index])}, num_proc=args.num_proc, with_indices=True, new_fingerprint=str(uuid.uuid4())) 117 | ds_shard = ds_shard.filter(lambda example: example[args.text_column] != "", num_proc=args.num_proc) 118 | 119 | if path.exists(".cache"): 120 | rmtree(".cache") 121 | 122 | deduplicated_ds_shard_list.append(ds_shard) 123 | 124 | ds = concatenate_datasets(deduplicated_ds_shard_list) 125 | 126 | ds.save_to_disk(args.output_dataset_name) 127 | 128 | if args.push_to_hub: 129 | ds.push_to_hub(args.output_dataset_name) 130 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/download_common_crawl.py: -------------------------------------------------------------------------------- 1 | from os import mkdir, path 2 | from subprocess import run 3 | import argparse 4 | import random 5 | 6 | parser = argparse.ArgumentParser(description="Downloads raw Common Crawl WET files, or WAT files if you specify --paths_type=wat.") 7 | parser.add_argument("--snapshots", nargs='+', help="The Common Crawl snapshots to download files from, such as CC-MAIN-2022-33 or CC-MAIN-2022-27. 
Several can be specified.", required=True) 8 | parser.add_argument("--download_dir", help="The name of the directory to create and download the WET (or WAT) files to.", required=True) 9 | parser.add_argument("--segment_sampling_ratios", type=float, nargs="+", help="The ratios of each Common Crawl snapshot to use. The higher the ratio, the larger the generated dataset (but also the longer the time that the OLM pipeline runs). You should specify one for each snapshot. For example, if you specify '--snapshots CC-MAIN-2022-33 CC-MAIN-2022-27', then --segment_sampling_ratios could be '0.15 0.11'. This means that 15 percent of the segments from CC-MAIN-2022-33 will be uniformly randomly sampled and used, and 11 percent of the segments from CC-MAIN-2022-27 will be uniformly randomly sampled and used.", required=True) 10 | parser.add_argument("--tmp_dir", default=".tmp_download_common_crawl", help="The directory where temporary files are stored. They are deleted when this script completes. Default is .tmp_download_common_crawl.") 11 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 12 | parser.add_argument("--seed", type=int, default=42) 13 | parser.add_argument("--paths_type", default="wet") 14 | args = parser.parse_args() 15 | 16 | random.seed(args.seed) 17 | 18 | if path.exists(args.download_dir): 19 | run(f"rm -r {args.download_dir}", shell=True) 20 | 21 | if path.exists(args.tmp_dir): 22 | run(f"rm -r {args.tmp_dir}", shell=True) 23 | 24 | run(f"mkdir {args.download_dir} {args.tmp_dir}", shell=True) 25 | for index in range(len(args.snapshots)): 26 | # Download the data for a certain common crawl snapshot 27 | tmp_download_dir_name = f"{args.tmp_dir}/ungoliant_downloads-{args.snapshots[index]}" 28 | run(f"mkdir {tmp_download_dir_name}", shell=True) 29 | run(f"wget https://data.commoncrawl.org/crawl-data/{args.snapshots[index]}/{args.paths_type}.paths.gz", shell=True) 30 | run(f"gzip -d {args.paths_type}.paths.gz", shell=True) 31 | paths_name = f"{args.paths_type}-{args.snapshots[index]}.paths" 32 | run(f"mv {args.paths_type}.paths {paths_name}", shell=True) 33 | segments = open(paths_name, "r").readlines() 34 | kept_segments = [] 35 | for segment in segments: 36 | if random.random() <= args.segment_sampling_ratios[index]: 37 | kept_segments.append(segment) 38 | open(paths_name, "w").writelines(kept_segments) 39 | run(f"ungoliant download -t={args.num_proc} {paths_name} {tmp_download_dir_name}", shell=True) 40 | run(f"rm {paths_name}", shell=True) 41 | 42 | # Now, add 0's to the filename for every downloaded file. We want the number of 0's to be different from the number used for other common crawl snapshots 43 | # because we want every file to have a unique name across multiple snapshot downloads. 44 | if index > 0: 45 | run(f"cd {tmp_download_dir_name} && for f in * ; do mv \"$f\" {'0'*index}\"$f\" ; done", shell=True) 46 | 47 | # Now we can move the downloaded files into the main download dir which has the downloads from the rest of this for loop. 
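# (The * glob in the mv command below is expanded by the shell, which is why shell=True is used.)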
48 | run(f"mv {tmp_download_dir_name}/* {args.download_dir}/", shell=True) 49 | run(f"rm -r {tmp_download_dir_name}", shell=True) 50 | 51 | run(f"rm -r {args.tmp_dir}", shell=True) 52 | run("rm -r errors.txt", shell=True) 53 | 54 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/download_pipeline_processing_models.sh: -------------------------------------------------------------------------------- 1 | # exit when any command fails 2 | set -e 3 | 4 | python data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering/download_sentencepiece_kenlm_models.py --output_dir_path=sp_kenlm_ft_models 5 | wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P sp_kenlm_ft_models/ 6 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/add_perplexity.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import sys 4 | sys.path.append("kenlm") 5 | from model import KenlmModel 6 | 7 | parser = argparse.ArgumentParser(description="This script simply uses a kenlm trained on English Wikipedia to compute the perplexity of each text example in the dataset. It then sorts the dataset by perplexity so that the user can then select the range of perplexities that they want their data to be in.") 8 | parser.add_argument("--input_dataset_name", help="The name of the input dataset.", required=True) 9 | parser.add_argument("--output_dataset_name", help="The name of the output dataset.", required=True) 10 | parser.add_argument("--split", default=None, help="The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.") 11 | parser.add_argument("--text_column", help="The name of the dataset column that contains the text.", required=True) 12 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 13 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 14 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to pull the input dataset by name from the Hugging Face Hub. 
If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.") 15 | args = parser.parse_args() 16 | 17 | if args.load_from_hub_instead_of_disk: 18 | if args.split is None: 19 | ds = load_dataset(args.input_dataset_name) 20 | else: 21 | ds = load_dataset(args.input_dataset_name, split=args.split) 22 | else: 23 | if args.split is None: 24 | ds = load_from_disk(args.input_dataset_name) 25 | else: 26 | ds = load_from_disk(args.input_dataset_name)[args.split] 27 | 28 | 29 | model = KenlmModel.from_pretrained("kenlm/wikipedia", "en") 30 | ds = ds.map(lambda example: {"kenlm_ppl": model.get_perplexity(example[args.text_column])}, num_proc=args.num_proc) 31 | ds = ds.sort("kenlm_ppl") 32 | ds.save_to_disk(args.output_dataset_name) 33 | 34 | if args.push_to_hub: 35 | ds.push_to_hub(args.output_dataset_name) 36 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/filter_for_only_updated_websites.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser(description="Experimental script to check and filter for a diff between examples with the same URL. It drastically reduces the size of the dataset in many cases, but it helps ensure that the text is up to date. The script only keeps an example if 1) the example shares a URL with other examples, 2) the example is the most recent example with that URL, and 3) there was a diff between the example and an earlier example with the same URL.") 5 | parser.add_argument("--input_dataset_name", required=True) 6 | parser.add_argument("--output_dataset_name", required=True) 7 | parser.add_argument("--text_column", required=True) 8 | parser.add_argument("--timestamp_column", required=True) 9 | parser.add_argument("--split", default=None, help="The split of the dataset to apply this filter to. Not all datasets have splits, so this argument is optional.") 10 | parser.add_argument("--url_column", required=True) 11 | parser.add_argument("--num_proc", type=int, required=True) 12 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 13 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face Hub. If this argument is not used, it is assumed that the input dataset is stored on the disk.") 14 | args = parser.parse_args() 15 | 16 | if args.load_from_hub_instead_of_disk: 17 | if args.split is None: 18 | ds = load_dataset(args.input_dataset_name) 19 | else: 20 | ds = load_dataset(args.input_dataset_name, split=args.split) 21 | else: 22 | if args.split is None: 23 | ds = load_from_disk(args.input_dataset_name) 24 | else: 25 | ds = load_from_disk(args.input_dataset_name)[args.split] 26 | 27 | # Group so examples with the same URL are next to each other in the dataset. 28 | ds = ds.sort(args.url_column) 29 | 30 | # Throw away examples with URLs occurring only once in the dataset. 
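# Since the dataset was just sorted by URL, a URL occurs more than once exactly when it matches the URL of an adjacent example.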
31 | last_index = len(ds) - 1 32 | def check_for_adjacent_duplicate_url(example, index): 33 | if index == last_index: 34 | return ds[index-1][args.url_column] == example[args.url_column] 35 | if index == 0: 36 | return ds[index+1][args.url_column] == example[args.url_column] 37 | return ds[index-1][args.url_column] == example[args.url_column] or ds[index+1][args.url_column] == example[args.url_column] 38 | 39 | ds = ds.filter(lambda example, index: check_for_adjacent_duplicate_url(example, index), num_proc=args.num_proc, with_indices=True) 40 | 41 | # Sort the dataset so that examples with the same URL are still grouped together, but also arrange by timestamp from oldest to newest. 42 | ds = ds.sort(args.timestamp_column) 43 | ds = ds.sort(args.url_column, kind="stable") 44 | 45 | # Keep only the pair of examples from each URL group with the oldest and newest timestamp. 46 | last_index = len(ds) - 1 47 | def check_for_ending_or_beginning_example_in_url_cluster(example, index): 48 | if index in (last_index, 0): 49 | return True 50 | return ds[index-1][args.url_column] != example[args.url_column] or ds[index+1][args.url_column] != example[args.url_column] 51 | 52 | ds = ds.filter(lambda example, index: check_for_ending_or_beginning_example_in_url_cluster(example, index), num_proc=args.num_proc, with_indices=True) 53 | 54 | # For each example pair, check to see if the text was modified between the old time and the new time. 55 | # If it was modified, keep the latest example and throw the old example out. We have evidence that this new example is up-to-date :D 56 | # If it wasn't modified, throw both examples out. We have no evidence that this new example is up-to-date :( 57 | last_index = len(ds) - 1 58 | def check_for_updated_example_in_url_pair(example, index): 59 | if index == 0 or ds[index-1][args.url_column] != example[args.url_column]: 60 | return False 61 | if ds[index-1][args.text_column] != example[args.text_column]: 62 | return True 63 | return False 64 | 65 | ds = ds.filter(lambda example, index: check_for_updated_example_in_url_pair(example, index), num_proc=args.num_proc, with_indices=True) 66 | 67 | ds.save_to_disk(args.output_dataset_name) 68 | 69 | if args.push_to_hub: 70 | ds.push_to_hub(args.output_dataset_name) 71 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 
23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. 
If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright 2021-2022 Eduardo González Ponferrada 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 
-------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/README.md: -------------------------------------------------------------------------------- 1 | --- 2 | language: 3 | - es 4 | - af 5 | - ar 6 | - arz 7 | - as 8 | - bn 9 | - fr 10 | - sw 11 | - eu 12 | - ca 13 | - zh 14 | - en 15 | - hi 16 | - ur 17 | - id 18 | - pt 19 | - vi 20 | - gu 21 | - kn 22 | - ml 23 | - mr 24 | - ta 25 | - te 26 | - yo 27 | tags: 28 | - kenlm 29 | - perplexity 30 | - n-gram 31 | - kneser-ney 32 | - bigscience 33 | license: "mit" 34 | datasets: 35 | - wikipedia 36 | - oscar 37 | --- 38 | 39 | Taken from the amazing repo here: [https://huggingface.co/edugp/kenlm](https://huggingface.co/edugp/kenlm) 40 | 41 | # KenLM models 42 | This repo contains several KenLM models trained on different tokenized datasets and languages. 43 | KenLM models are probabilistic n-gram language models. One use case of these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple non-informative sentences that could appear repeatedly (low perplexity). 44 | 45 | At the root of this repo you will find different directories named after the dataset models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files: 46 | * `{language}.arpa.bin`: The trained KenLM model binary 47 | * `{language}.sp.model`: The trained SentencePiece model used for tokenization 48 | * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model 49 | 50 | The models have been trained using some of the preprocessing steps from [cc_net](https://github.com/facebookresearch/cc_net), in particular replacing numbers with zeros and normalizing punctuation. So, it is important to keep the default values for the parameters: `lower_case`, `remove_accents`, `normalize_numbers` and `punctuation` when using the pre-trained models in order to replicate the same pre-processing steps at inference time. 51 | 52 | # Dependencies 53 | * KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip` 54 | * SentencePiece: `pip install sentencepiece` 55 | 56 | # Example: 57 | ``` 58 | from model import KenlmModel 59 | 60 | 61 | # Load model trained on English wikipedia 62 | model = KenlmModel.from_pretrained("wikipedia", "en") 63 | 64 | # Get perplexity 65 | model.get_perplexity("I am very perplexed") 66 | # 341.3 (low perplexity, since sentence style is formal and with no grammar mistakes) 67 | 68 | model.get_perplexity("im hella trippin") 69 | # 46793.5 (high perplexity, since the sentence is colloquial and contains grammar mistakes) 70 | ``` 71 | In the example above we see that, since Wikipedia is a collection of encyclopedic articles, a KenLM model trained on it will naturally give lower perplexity scores to sentences with formal language and no grammar mistakes than colloquial sentences with grammar mistakes. 
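For the dataset-filtering use case, a rough sketch could look like the following (the dataset name and the perplexity cutoff of 1000 are placeholders that you would replace for your own data):

```
from datasets import load_dataset

from model import KenlmModel

# Load model trained on English wikipedia
model = KenlmModel.from_pretrained("wikipedia", "en")

# Score each example, then keep only the examples below the chosen perplexity cutoff.
ds = load_dataset("your_dataset", split="train")  # hypothetical dataset with a "text" column
ds = ds.map(lambda example: {"kenlm_ppl": model.get_perplexity(example["text"])})
ds = ds.filter(lambda example: example["kenlm_ppl"] < 1000)
```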
72 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import unicodedata 4 | from typing import Dict 5 | 6 | import kenlm 7 | import sentencepiece 8 | from huggingface_hub import cached_download, hf_hub_url 9 | 10 | 11 | class SentencePiece: 12 | def __init__( 13 | self, 14 | model: str, 15 | ): 16 | super().__init__() 17 | self.sp = sentencepiece.SentencePieceProcessor() 18 | self.sp.load(str(model)) 19 | 20 | def do(self, text: dict) -> dict: 21 | tokenized = self.sp.encode_as_pieces(text) 22 | return " ".join(tokenized) 23 | 24 | 25 | class KenlmModel: 26 | digit_re: re.Pattern = re.compile(r"\d") 27 | unicode_punct: Dict[str, str] = { 28 | ",": ",", 29 | "。": ".", 30 | "、": ",", 31 | "„": '"', 32 | "”": '"', 33 | "“": '"', 34 | "«": '"', 35 | "»": '"', 36 | "1": '"', 37 | "」": '"', 38 | "「": '"', 39 | "《": '"', 40 | "》": '"', 41 | "´": "'", 42 | "∶": ":", 43 | ":": ":", 44 | "?": "?", 45 | "!": "!", 46 | "(": "(", 47 | ")": ")", 48 | ";": ";", 49 | "–": "-", 50 | "—": " - ", 51 | ".": ". ", 52 | "~": "~", 53 | "’": "'", 54 | "…": "...", 55 | "━": "-", 56 | "〈": "<", 57 | "〉": ">", 58 | "【": "[", 59 | "】": "]", 60 | "%": "%", 61 | "►": "-", 62 | } 63 | unicode_punct_re = re.compile(f"[{''.join(unicode_punct.keys())}]") 64 | non_printing_chars_re = re.compile( 65 | f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]" 66 | ) 67 | kenlm_model_dir = None 68 | sentence_piece_model_dir = None 69 | 70 | def __init__( 71 | self, 72 | model_dataset: str, 73 | language: str, 74 | lower_case: bool = False, 75 | remove_accents: bool = False, 76 | normalize_numbers: bool = True, 77 | punctuation: int = 1, 78 | ): 79 | self.model = kenlm.Model(os.path.join(model_dataset, f"{language}.arpa.bin")) 80 | self.tokenizer = SentencePiece(os.path.join(model_dataset, f"{language}.sp.model")) 81 | self.accent = remove_accents 82 | self.case = lower_case 83 | self.numbers = normalize_numbers 84 | self.punct = punctuation 85 | 86 | @classmethod 87 | def from_pretrained( 88 | cls, 89 | model_dataset: str, 90 | language: str, 91 | ): 92 | return cls( 93 | model_dataset, 94 | language, 95 | False, 96 | False, 97 | True, 98 | 1, 99 | ) 100 | 101 | def pp(self, log_score, length): 102 | return 10.0 ** (-log_score / length) 103 | 104 | def get_perplexity(self, doc: str, normalize_cc_net: bool = True): 105 | if normalize_cc_net: 106 | doc = self.normalize( 107 | doc, 108 | accent=self.accent, 109 | case=self.case, 110 | numbers=self.numbers, 111 | punct=self.punct, 112 | ) 113 | # Tokenize (after normalizing): See https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/mine.py#L352 for full pipeline 114 | doc = self.tokenizer.do(doc) 115 | doc_log_score, doc_length = 0, 0 116 | for line in doc.split("\n"): 117 | log_score = self.model.score(line) 118 | length = len(line.split()) + 1 119 | doc_log_score += log_score 120 | doc_length += length 121 | return round(self.pp(doc_log_score, doc_length), 1) 122 | 123 | def normalize( 124 | self, 125 | line: str, 126 | accent: bool = True, 127 | case: bool = True, 128 | numbers: bool = True, 129 | punct: int = 1, 130 | ) -> str: 131 | line = line.strip() 132 | if not line: 133 | return line 134 | if case: 135 | line = line.lower() 136 | if accent: 137 | line = self.strip_accents(line) 138 | if numbers: 139 | line = 
self.digit_re.sub("0", line) 140 | if punct == 1: 141 | line = self.replace_unicode_punct(line) 142 | elif punct == 2: 143 | line = self.remove_unicode_punct(line) 144 | line = self.remove_non_printing_char(line) 145 | return line 146 | 147 | def strip_accents(self, line: str) -> str: 148 | """Strips accents from a piece of text.""" 149 | nfd = unicodedata.normalize("NFD", line) 150 | output = [c for c in nfd if unicodedata.category(c) != "Mn"] 151 | if len(output) == line: 152 | return line 153 | return "".join(output) 154 | 155 | def replace_unicode_punct(self, text: str) -> str: 156 | return "".join(self.unicode_punct.get(c, c) for c in text) 157 | 158 | def remove_unicode_punct(self, text: str) -> str: 159 | """More aggressive version of replace_unicode_punct but also faster.""" 160 | return self.unicode_punct_re.sub("", text) 161 | 162 | def remove_non_printing_char(self, text: str) -> str: 163 | return self.non_printing_chars_re.sub("", text) 164 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.arpa.bin: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:04923fccbb4e63005c40f01d66112659416de01accd80d16e366a592289ee07a 3 | size 4444690658 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.model: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:cf8147a573770b4e6c0d4df1dcb75453baa88190706dab406be7711b84f059de 3 | size 931348 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.vocab: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:a9c3c51a7736d736cc620cbe9a4c9430533469e57a54bc29546067a252f7d872 3 | size 729017 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/get_last_modified_dataset_from_wat_downloads.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from tqdm import tqdm 3 | import pandas as pd 4 | import subprocess 5 | from multiprocessing import Process 6 | from os import walk, mkdir, path 7 | from shutil import rmtree 8 | import dateutil 9 | import dateparser 10 | import argparse 11 | import ujson 12 | 13 | parser = argparse.ArgumentParser(description="Turns WAT downloads from download_common_crawl.py into a Hugging Face dataset with Last-Modified timestamps, URLs, and crawl timestamps.") 14 | parser.add_argument("--download_dir", help="The directory of the downloaded WAT files.", required=True) 15 | parser.add_argument("--output_dataset_name", help="The name of the Hugging Face dataset which will be saved upon completion of this program.", required=True) 16 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 17 | parser.add_argument("--tmp_dir", default=".tmp_get_last_modified_dataset_from_wat_downloads") 18 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.") 19 | args = parser.parse_args() 20 | 21 | if 
path.exists(args.tmp_dir): 22 | rmtree(args.tmp_dir) 23 | 24 | mkdir(args.tmp_dir) 25 | 26 | filenames = next(walk(args.download_dir), (None, None, []))[2] 27 | 28 | def split_a_into_n_parts(a, n): 29 | k, m = divmod(len(a), n) 30 | return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)] 31 | 32 | filename_per_proc = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0] 33 | 34 | processes = [] 35 | for filenames in filename_per_proc: 36 | def get_dataset(filenames): 37 | for filename in tqdm(filenames): 38 | dataset_dict = {"last_modified_timestamp": [], "url": [], "crawl_timestamp": []} 39 | file_path = path.join(args.download_dir, filename) 40 | if filename.endswith(".gz"): 41 | subprocess.run(f"gzip -d {file_path}", shell=True) 42 | filename = filename[:-3] 43 | file_path = path.join(args.download_dir, filename) 44 | for line in open(file_path).readlines(): 45 | if line.startswith("{"): 46 | parsed_line = ujson.loads(line) 47 | last_modified = parsed_line.get("Envelope", {}).get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {}).get("Headers", {}).get("Last-Modified", None) 48 | url = parsed_line.get("Envelope", {}).get("WARC-Header-Metadata", {}).get("WARC-Target-URI", None) 49 | date = parsed_line.get("Envelope", {}).get("WARC-Header-Metadata", {}).get("WARC-Date", None) 50 | if None not in (last_modified, url, date): 51 | try: 52 | last_modified_timestamp = dateutil.parser.parse(last_modified).timestamp() 53 | except Exception: 54 | try: 55 | last_modified_timestamp = dateparser.parse(last_modified).timestamp() 56 | except Exception: 57 | last_modified_timestamp = None 58 | if last_modified_timestamp is not None: 59 | crawl_timestamp = dateutil.parser.parse(date).timestamp() 60 | dataset_dict["last_modified_timestamp"].append(last_modified_timestamp) 61 | dataset_dict["url"].append(url) 62 | dataset_dict["crawl_timestamp"].append(crawl_timestamp) 63 | # Zip the download file again to save space. 64 | subprocess.run(f"gzip {file_path}", shell=True) 65 | pd.DataFrame(dataset_dict).to_parquet(path.join(args.tmp_dir, filename + ".filtered.parquet")) 66 | p = Process(target=get_dataset, args=(filenames,)) 67 | p.start() 68 | processes.append(p) 69 | 70 | for p in processes: 71 | p.join() 72 | 73 | ds = load_dataset("parquet", data_files=path.join(args.tmp_dir, "*.parquet")) 74 | ds.save_to_disk(args.output_dataset_name) 75 | 76 | rmtree(args.tmp_dir) 77 | 78 | if args.push_to_hub: 79 | ds.push_to_hub(args.output_dataset_name) 80 | 81 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from tqdm import tqdm 3 | import pandas as pd 4 | import subprocess 5 | from multiprocessing import Process 6 | from os import walk, mkdir, path 7 | from shutil import move, rmtree 8 | import dateutil 9 | import argparse 10 | 11 | parser = argparse.ArgumentParser(description="Turns downloads from download_common_crawl.py into a Hugging Face dataset, split by language (language is identified using a FastText model). 
The dataset has a timestamp column for the time it was crawled, along with a url column and, of course, a text column.") 12 | parser.add_argument("--download_dir", help="The directory of the downloaded WET files.", required=True) 13 | parser.add_argument("--output_dataset_name", help="The name of the Hugging Face dataset which will be saved upon completion of this program.", required=True) 14 | parser.add_argument("--num_proc", type=int, help="The number of processes to use, at a minimum.", required=True) 15 | parser.add_argument("--tmp_dir", default=".tmp_get_dataset_from_downloads", help="The directory to store temporary files. The directory will be deleted upon completion of this script. Defaults to .tmp_get_dataset_from_downloads.") 16 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.") 17 | args = parser.parse_args() 18 | 19 | if path.exists(args.tmp_dir): 20 | rmtree(args.tmp_dir) 21 | 22 | mkdir(args.tmp_dir) 23 | 24 | tmp_download_dir = path.join(args.tmp_dir, "downloads") 25 | 26 | move(args.download_dir, tmp_download_dir) 27 | 28 | filenames = next(walk(tmp_download_dir), (None, None, []))[2] 29 | 30 | def split_a_into_n_parts(a, n): 31 | k, m = divmod(len(a), n) 32 | return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)] 33 | 34 | ungoliant_pipeline_output_dirs = [] 35 | filename_per_directory = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0] 36 | num_files_awaiting_processing = 0 37 | dirs_awaiting_processing = [] 38 | def do_parallel_pipeline_processing(dirs_awaiting_processing): 39 | processes = [] 40 | for obj in dirs_awaiting_processing: 41 | p = subprocess.Popen(f"ungoliant pipeline --lid-path=sp_kenlm_ft_models/lid.176.bin {obj['download_chunk_dir']} {obj['pipeline_output_dir']}", shell=True) 42 | processes.append(p) 43 | for p in processes: 44 | p.wait() 45 | 46 | # This loop runs the ungoliant pipeline num_proc times, producing num_proc output directories. 47 | # The ungoliant pipeline is already parallelized, so we don't do this to make the ungoliant pipeline itself run faster. 48 | # Instead, we do this so that we end up with num_proc output files, which we can then load in parallel into 49 | # pandas dataframes that will eventually be turned into a Hugging Face dataset. 
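# Each ungoliant run writes one {language_id}_meta.jsonl file per detected language into its output dir; those files are converted to parquet and loaded below.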
50 | ungoliant_pipeline_results = path.join(args.tmp_dir, "ungoliant_pipeline_results") 51 | mkdir(ungoliant_pipeline_results) 52 | for i in range(len(filename_per_directory)): 53 | download_chunk_dir = path.join(tmp_download_dir, "chunk_" + str(i)) 54 | mkdir(download_chunk_dir) 55 | for filename in filename_per_directory[i]: 56 | num_files_awaiting_processing += 1 57 | move(path.join(tmp_download_dir, filename), path.join(download_chunk_dir, filename)) 58 | pipeline_output_dir = path.join(ungoliant_pipeline_results, "chunk_" + str(i)) 59 | mkdir(pipeline_output_dir) 60 | ungoliant_pipeline_output_dirs.append(pipeline_output_dir) 61 | dirs_awaiting_processing.append({"pipeline_output_dir": pipeline_output_dir, "download_chunk_dir": download_chunk_dir}) 62 | if num_files_awaiting_processing >= args.num_proc: 63 | do_parallel_pipeline_processing(dirs_awaiting_processing) 64 | num_files_awaiting_processing = 0 65 | dirs_awaiting_processing = [] 66 | 67 | do_parallel_pipeline_processing(dirs_awaiting_processing) 68 | 69 | # For some reason, datasets errors out if we try to load directly from the jsonl, so we need to do this first. 70 | processes = [] 71 | for ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs: 72 | language_filenames = [name for name in next(walk(ungoliant_pipeline_output_dir), (None, None, []))[2] if name.endswith("_meta.jsonl")] 73 | language_ids = [language_filename.split("_")[0] for language_filename in language_filenames] 74 | def convert_to_parquet_and_reformat(ungoliant_pipeline_output_dir): 75 | for language_filename in language_filenames: 76 | language_id = language_filename.split("_")[0] 77 | i = 0 78 | print("Chunking the ungoliant json into several parquet files and reformatting before loading into huggingface dataset.") 79 | parquet_file_dir = path.join(ungoliant_pipeline_output_dir, language_id + "_parquet") 80 | mkdir(parquet_file_dir) 81 | for chunk in tqdm(pd.read_json(path.join(ungoliant_pipeline_output_dir, language_id + "_meta.jsonl"), lines=True, chunksize=10000)): 82 | parquet_file_path = path.join(parquet_file_dir, str(i) + ".parquet") 83 | chunk["url"] = chunk.apply(lambda row: row["warc_headers"]["warc-target-uri"], axis=1) 84 | chunk["crawl_timestamp"] = chunk.apply(lambda row: dateutil.parser.parse(row["warc_headers"]["warc-date"]).timestamp(), axis=1) 85 | chunk.drop(columns=["warc_headers", "metadata"], inplace=True) 86 | chunk.rename(columns={"content": "text"}, inplace=True) 87 | chunk.to_parquet(parquet_file_path) 88 | i += 1 89 | p = Process(target=convert_to_parquet_and_reformat, args=(ungoliant_pipeline_output_dir,)) 90 | p.start() 91 | processes.append(p) 92 | 93 | for p in processes: 94 | p.join() 95 | 96 | data_files = {language_id: [path.join(ungoliant_pipeline_output_dir, language_id + "_parquet", "*.parquet") for ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs] for language_id in language_ids} 97 | ds = load_dataset("parquet", data_files=data_files) 98 | ds.save_to_disk(args.output_dataset_name) 99 | rmtree(args.tmp_dir) 100 | 101 | if args.push_to_hub: 102 | ds.push_to_hub(args.output_dataset_name) 103 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/remove_wikipedia_urls.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser(description="Removes all examples from a Hugging Face dataset if they have 
a Wikipedia URL. This script is intended to be used if you eventually want to merge the dataset with a Wikipedia snapshot. In that case, examples from Wikipedia in this dataset are redundant.") 5 | parser.add_argument("--input_dataset_name", help="Input dataset name.", required=True) 6 | parser.add_argument("--output_dataset_name", help="Output dataset name.", required=True) 7 | parser.add_argument("--url_column", help="Name of the URL column of the dataset.", required=True) 8 | parser.add_argument("--split", default=None, help="The split of the dataset to use. Some datasets don't have splits, so it is optional.") 9 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.") 10 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face hub after saving to the disk.") 11 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset by name from the Hugging Face hub. If this argument isn't specified then the input dataset will be loaded from a directory of the same name on the disk.") 12 | args = parser.parse_args() 13 | 14 | if args.load_from_hub_instead_of_disk: 15 | if args.split is None: 16 | ds = load_dataset(args.input_dataset_name) 17 | else: 18 | ds = load_dataset(args.input_dataset_name, split=args.split) 19 | else: 20 | if args.split is None: 21 | ds = load_from_disk(args.input_dataset_name) 22 | else: 23 | ds = load_from_disk(args.input_dataset_name)[args.split] 24 | 25 | ds = ds.filter(lambda example: not example[args.url_column].startswith("https://en.wikipedia.org/wiki/"), num_proc=args.num_proc) 26 | 27 | ds.save_to_disk(args.output_dataset_name) 28 | 29 | if args.push_to_hub: 30 | ds.push_to_hub(args.output_dataset_name) 31 | -------------------------------------------------------------------------------- /pipeline_scripts/wikipedia/README.md: -------------------------------------------------------------------------------- 1 | Per the repository [here](https://huggingface.co/datasets/olm/wikipedia), just run this Python code. It uses all CPUs available and should take less than an hour if you have a lot of CPUs (on the order of 100). 2 | 3 | ``` 4 | from datasets import load_dataset 5 | 6 | ds = load_dataset("olm/wikipedia", language="en", date="20220920") 7 | 8 | ds.save_to_disk("wikipedia_en_20220920") 9 | ds.push_to_hub("wikipedia_en_20220920") 10 | ``` 11 | 12 | The code pulls the Wikipedia snapshot for the given date and language and does all the processing required to turn it into a clean pretraining dataset. You can get the dates for the latest Wikipedia snapshots here: [https://dumps.wikimedia.org/enwiki/](https://dumps.wikimedia.org/enwiki/). 13 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | datasets==2.6.1 2 | emoji==1.7.0 3 | fasttext==0.9.2 4 | sentencepiece==0.1.97 5 | pypi-kenlm==0.1.20220713 6 | text-dedup==0.2.1 7 | argparse==1.4.0 8 | dateparser==1.1.1 9 | mwparserfromhell==0.6.4 10 | matplotlib==3.6.2 11 | multiprocess==0.70.13 12 | --------------------------------------------------------------------------------