├── .gitmodules ├── LICENSE ├── README.md ├── analysis_scripts ├── README.md ├── duplicates.py ├── term_counts.py ├── timestamp_dist.py └── url_dist.py ├── pipeline_scripts ├── common_crawl │ ├── README.md │ ├── apply_bigscience_filters.py │ ├── combine_last_modified_with_text_dataset.py │ ├── deduplicate.py │ ├── download_common_crawl.py │ ├── download_pipeline_processing_models.sh │ ├── experimental │ │ ├── add_perplexity.py │ │ ├── filter_for_only_updated_websites.py │ │ └── kenlm │ │ │ ├── LICENSE │ │ │ ├── README.md │ │ │ ├── model.py │ │ │ └── wikipedia │ │ │ ├── en.arpa.bin │ │ │ ├── en.sp.model │ │ │ └── en.sp.vocab │ ├── get_last_modified_dataset_from_wat_downloads.py │ ├── get_text_dataset_from_wet_downloads.py │ └── remove_wikipedia_urls.py └── wikipedia │ └── README.md └── requirements.txt /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "data-preparation"] 2 | path = pipeline_scripts/common_crawl/data-preparation 3 | url = https://github.com/bigscience-workshop/data-preparation.git 4 | [submodule "deduplicate-text-datasets"] 5 | path = pipeline_scripts/common_crawl/deduplicate-text-datasets 6 | url = https://github.com/TristanThrush/deduplicate-text-datasets.git 7 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2022- The Hugging Face team. All rights reserved. 2 | 3 | Apache License 4 | Version 2.0, January 2004 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, 12 | and distribution as defined by Sections 1 through 9 of this document. 13 | 14 | "Licensor" shall mean the copyright owner or entity authorized by 15 | the copyright owner that is granting the License. 16 | 17 | "Legal Entity" shall mean the union of the acting entity and all 18 | other entities that control, are controlled by, or are under common 19 | control with that entity. For the purposes of this definition, 20 | "control" means (i) the power, direct or indirect, to cause the 21 | direction or management of such entity, whether by contract or 22 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 23 | outstanding shares, or (iii) beneficial ownership of such entity. 24 | 25 | "You" (or "Your") shall mean an individual or Legal Entity 26 | exercising permissions granted by this License. 27 | 28 | "Source" form shall mean the preferred form for making modifications, 29 | including but not limited to software source code, documentation 30 | source, and configuration files. 31 | 32 | "Object" form shall mean any form resulting from mechanical 33 | transformation or translation of a Source form, including but 34 | not limited to compiled object code, generated documentation, 35 | and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or 38 | Object form, made available under the License, as indicated by a 39 | copyright notice that is included in or attached to the work 40 | (an example is provided in the Appendix below). 
41 | 42 | "Derivative Works" shall mean any work, whether in Source or Object 43 | form, that is based on (or derived from) the Work and for which the 44 | editorial revisions, annotations, elaborations, or other modifications 45 | represent, as a whole, an original work of authorship. For the purposes 46 | of this License, Derivative Works shall not include works that remain 47 | separable from, or merely link (or bind by name) to the interfaces of, 48 | the Work and Derivative Works thereof. 49 | 50 | "Contribution" shall mean any work of authorship, including 51 | the original version of the Work and any modifications or additions 52 | to that Work or Derivative Works thereof, that is intentionally 53 | submitted to Licensor for inclusion in the Work by the copyright owner 54 | or by an individual or Legal Entity authorized to submit on behalf of 55 | the copyright owner. For the purposes of this definition, "submitted" 56 | means any form of electronic, verbal, or written communication sent 57 | to the Licensor or its representatives, including but not limited to 58 | communication on electronic mailing lists, source code control systems, 59 | and issue tracking systems that are managed by, or on behalf of, the 60 | Licensor for the purpose of discussing and improving the Work, but 61 | excluding communication that is conspicuously marked or otherwise 62 | designated in writing by the copyright owner as "Not a Contribution." 63 | 64 | "Contributor" shall mean Licensor and any individual or Legal Entity 65 | on behalf of whom a Contribution has been received by Licensor and 66 | subsequently incorporated within the Work. 67 | 68 | 2. Grant of Copyright License. Subject to the terms and conditions of 69 | this License, each Contributor hereby grants to You a perpetual, 70 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 71 | copyright license to reproduce, prepare Derivative Works of, 72 | publicly display, publicly perform, sublicense, and distribute the 73 | Work and such Derivative Works in Source or Object form. 74 | 75 | 3. Grant of Patent License. Subject to the terms and conditions of 76 | this License, each Contributor hereby grants to You a perpetual, 77 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 78 | (except as stated in this section) patent license to make, have made, 79 | use, offer to sell, sell, import, and otherwise transfer the Work, 80 | where such license applies only to those patent claims licensable 81 | by such Contributor that are necessarily infringed by their 82 | Contribution(s) alone or by combination of their Contribution(s) 83 | with the Work to which such Contribution(s) was submitted. If You 84 | institute patent litigation against any entity (including a 85 | cross-claim or counterclaim in a lawsuit) alleging that the Work 86 | or a Contribution incorporated within the Work constitutes direct 87 | or contributory patent infringement, then any patent licenses 88 | granted to You under this License for that Work shall terminate 89 | as of the date such litigation is filed. 90 | 91 | 4. Redistribution. 
You may reproduce and distribute copies of the 92 | Work or Derivative Works thereof in any medium, with or without 93 | modifications, and in Source or Object form, provided that You 94 | meet the following conditions: 95 | 96 | (a) You must give any other recipients of the Work or 97 | Derivative Works a copy of this License; and 98 | 99 | (b) You must cause any modified files to carry prominent notices 100 | stating that You changed the files; and 101 | 102 | (c) You must retain, in the Source form of any Derivative Works 103 | that You distribute, all copyright, patent, trademark, and 104 | attribution notices from the Source form of the Work, 105 | excluding those notices that do not pertain to any part of 106 | the Derivative Works; and 107 | 108 | (d) If the Work includes a "NOTICE" text file as part of its 109 | distribution, then any Derivative Works that You distribute must 110 | include a readable copy of the attribution notices contained 111 | within such NOTICE file, excluding those notices that do not 112 | pertain to any part of the Derivative Works, in at least one 113 | of the following places: within a NOTICE text file distributed 114 | as part of the Derivative Works; within the Source form or 115 | documentation, if provided along with the Derivative Works; or, 116 | within a display generated by the Derivative Works, if and 117 | wherever such third-party notices normally appear. The contents 118 | of the NOTICE file are for informational purposes only and 119 | do not modify the License. You may add Your own attribution 120 | notices within Derivative Works that You distribute, alongside 121 | or as an addendum to the NOTICE text from the Work, provided 122 | that such additional attribution notices cannot be construed 123 | as modifying the License. 124 | 125 | You may add Your own copyright statement to Your modifications and 126 | may provide additional or different license terms and conditions 127 | for use, reproduction, or distribution of Your modifications, or 128 | for any such Derivative Works as a whole, provided Your use, 129 | reproduction, and distribution of the Work otherwise complies with 130 | the conditions stated in this License. 131 | 132 | 5. Submission of Contributions. Unless You explicitly state otherwise, 133 | any Contribution intentionally submitted for inclusion in the Work 134 | by You to the Licensor shall be under the terms and conditions of 135 | this License, without any additional terms or conditions. 136 | Notwithstanding the above, nothing herein shall supersede or modify 137 | the terms of any separate license agreement you may have executed 138 | with Licensor regarding such Contributions. 139 | 140 | 6. Trademarks. This License does not grant permission to use the trade 141 | names, trademarks, service marks, or product names of the Licensor, 142 | except as required for reasonable and customary use in describing the 143 | origin of the Work and reproducing the content of the NOTICE file. 144 | 145 | 7. Disclaimer of Warranty. Unless required by applicable law or 146 | agreed to in writing, Licensor provides the Work (and each 147 | Contributor provides its Contributions) on an "AS IS" BASIS, 148 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 149 | implied, including, without limitation, any warranties or conditions 150 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 151 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 152 | appropriateness of using or redistributing the Work and assume any 153 | risks associated with Your exercise of permissions under this License. 154 | 155 | 8. Limitation of Liability. In no event and under no legal theory, 156 | whether in tort (including negligence), contract, or otherwise, 157 | unless required by applicable law (such as deliberate and grossly 158 | negligent acts) or agreed to in writing, shall any Contributor be 159 | liable to You for damages, including any direct, indirect, special, 160 | incidental, or consequential damages of any character arising as a 161 | result of this License or out of the use or inability to use the 162 | Work (including but not limited to damages for loss of goodwill, 163 | work stoppage, computer failure or malfunction, or any and all 164 | other commercial damages or losses), even if such Contributor 165 | has been advised of the possibility of such damages. 166 | 167 | 9. Accepting Warranty or Additional Liability. While redistributing 168 | the Work or Derivative Works thereof, You may choose to offer, 169 | and charge a fee for, acceptance of support, warranty, indemnity, 170 | or other liability obligations and/or rights consistent with this 171 | License. However, in accepting such obligations, You may act only 172 | on Your own behalf and on Your sole responsibility, not on behalf 173 | of any other Contributor, and only if You agree to indemnify, 174 | defend, and hold each Contributor harmless for any liability 175 | incurred by, or claims asserted against, such Contributor by reason 176 | of your accepting any such warranty or additional liability. 177 | 178 | END OF TERMS AND CONDITIONS 179 | 180 | APPENDIX: How to apply the Apache License to your work. 181 | 182 | To apply the Apache License to your work, attach the following 183 | boilerplate notice, with the fields enclosed by brackets "[]" 184 | replaced with your own identifying information. (Don't include 185 | the brackets!) The text should be enclosed in the appropriate 186 | comment syntax for the file format. We also recommend that a 187 | file or class name and description of purpose be included on the 188 | same "printed page" as the copyright notice for easier 189 | identification within third-party archives. 190 | 191 | Copyright [yyyy] [name of copyright owner] 192 | 193 | Licensed under the Apache License, Version 2.0 (the "License"); 194 | you may not use this file except in compliance with the License. 195 | You may obtain a copy of the License at 196 | 197 | http://www.apache.org/licenses/LICENSE-2.0 198 | 199 | Unless required by applicable law or agreed to in writing, software 200 | distributed under the License is distributed on an "AS IS" BASIS, 201 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 202 | See the License for the specific language governing permissions and 203 | limitations under the License. 204 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Online Language Modelling Dataset Pipeline 2 | 3 | This repo enables you to pull a large and up-to-date text corpus from the web. It uses state-of-the-art processing methods to produce a clean text dataset that you can immediately use to pretrain a large language model, like BERT, GPT, or BLOOM. 
The main use case for this repo is the Online Language Modelling Project, where we want to keep a language model up-to-date by pretraining it on the latest Common Crawl and Wikipedia dumps every month or so. You can see the models for the OLM project here: https://huggingface.co/olm. They actually get better performance than their original static counterparts. 4 | 5 | Specifically, this repo has modular Python commands that enable you to: 6 | * Specify Common Crawl web snapshots, or just Wikipedia snapshots. Then pull the data. 7 | * Filter the data for a particular language, like English or French. 8 | * Run the OSCAR filters used by BigScience for the BLOOM language model. These filters ensure some level of text quality and reduce pornographic content. 9 | * Deduplicate the data. 10 | 11 | This code is also fairly parallelized, although it can certainly be improved further. It can process over a terabyte from Common Crawl in a day or two, and all of English Wikipedia in less than an hour if you have: 12 | * A machine with a lot of CPUs and memory. 13 | * A fast internet connection. 14 | 15 | ## Setup 16 | 1. If you want to use this repo to generate a decent amount of data, get a machine with lots of CPUs and memory. We use an `n2d-standard-224` running `Ubuntu 20.04 LTS` on GCP. Add terabytes of disk space too. You may need an even larger machine if you want to process close to 100% of a Common Crawl snapshot or several snapshots, particularly due to how much memory the deduplication process uses. Alternatively, you can specify in the deduplication arguments that you want to deduplicate the dataset in chunks so your memory doesn't explode. 17 | 2. Clone with submodules: `git clone --recursive git@github.com:huggingface/olm-datasets.git` 18 | 3. Install cargo (the Rust package manager) with `curl https://sh.rustup.rs -sSf | sh`. Then install Ungoliant with `cargo install ungoliant@1.2.3`. You may need to install gcc and cmake first. 19 | 4. Set up a Python 3.9 environment, and run `pip install -r requirements.txt` 20 | 5. Run `huggingface-cli login`. This CLI should have been installed from `requirements.txt`. To log in, you need to paste a token from your account at [https://huggingface.co](https://huggingface.co). This step is necessary for the pipeline to push the generated datasets to your Hugging Face account. 21 | 22 | ## Getting a clean and up-to-date Common Crawl corpus 23 | 24 | Follow the instructions at [pipeline_scripts/common_crawl](pipeline_scripts/common_crawl). 25 | 26 | Here is the output dataset to expect from a 20% random segment sample of the August 2022 Common Crawl snapshot: [https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20](https://huggingface.co/datasets/Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20) 27 | 28 | ## Getting a clean and up-to-date Wikipedia corpus 29 | 30 | Follow the instructions at [pipeline_scripts/wikipedia](pipeline_scripts/wikipedia). 31 | 32 | Here is the output dataset to expect from a September 2022 snapshot of Wikipedia: [https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920](https://huggingface.co/datasets/Tristan/olm-wikipedia-20220920) 33 | 34 | ## Analyzing the corpora 35 | 36 | Follow the instructions at [analysis_scripts](analysis_scripts).
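If you just want to peek at one of these corpora before running the pipeline or the analysis scripts yourself, here is a minimal sketch (it assumes the `datasets` library installed from `requirements.txt` and uses the public August 2022 dataset linked above, which has `text` and `url` columns among others):

```
from datasets import load_dataset

# Load the published August 2022 OLM Common Crawl corpus from the Hugging Face Hub.
ds = load_dataset("Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20", split="train")

print(ds)                   # column names and number of rows
print(ds[0]["text"][:500])  # first 500 characters of one cleaned webpage
```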
37 | 38 | Here is a tweet thread which utilizes these scripts: [https://twitter.com/TristanThrush/status/1582356055794733057](https://twitter.com/TristanThrush/status/1582356055794733057) 39 | 40 | Here is another tweet thread that dives a little deeper: 41 | [https://twitter.com/TristanThrush/status/1588156731909029889](https://twitter.com/TristanThrush/status/1588156731909029889) 42 | 43 | And here is a Colab notebook where you can quickly run some of the analysis yourself! [https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing](https://colab.research.google.com/drive/18Wv7ghW2rRjEe3oWDqh2iz9qqO8O6XcX?usp=sharing) 44 | 45 | ## Citation 46 | 47 | ``` 48 | @misc{thrush2022pipeline, 49 | title={Online Language Modelling Data Pipeline}, 50 | author={Tristan Thrush and Helen Ngo and Nathan Lambert and Douwe Kiela}, 51 | year={2022}, 52 | howpublished={\url{https://github.com/huggingface/olm-datasets}} 53 | } 54 | ``` 55 | -------------------------------------------------------------------------------- /analysis_scripts/README.md: -------------------------------------------------------------------------------- 1 | # OLM Analysis 2 | 3 | ## To analyze for term counts across various datasets 4 | 5 | This command reports the count of terms associated with events that happened over summer 2022, across chronologically ordered summer 2022 OLM datasets. We would expect the counts to go up over the summer: 6 | 7 | ``` 8 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 --input_dataset_pretty_names "May" "June/July" "August" --terms "gentleminion" "monkeypox outbreak" "inflation reduction act of 2022" "quiet quitting" "jonas vingegaard" --plot_title="Count of Terms in 2022 Summer CC OLM Datasets" --analysis_column=text --split=train --num_proc=224 --output_filename=summer_2022_term_counts.png --load_from_hub_instead_of_disk --ylabel Count 9 | ``` 10 | 11 | Here is the resulting figure: 12 | 13 | ![summer_2022_term_counts](https://user-images.githubusercontent.com/20826878/200715141-6ce73388-7d6a-4d05-bbf4-88e1f2a3c62c.png) 14 | 15 | This command reports the count of the words with the highest usage increase between the start of summer 2022 and the fall of 2022, out of all of the frequent (count > mean + std) words in the dataset, where words are lowercased, split on spaces, and restricted to alphabetic characters only: 16 | 17 | ``` 18 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --num_terms_to_find 5 --plot_title="Top 5 Words with Highest Usage Increase" --analysis_column=text --split=train --num_proc=224 --output_filename=top_5_term_counts_heatmap.png --load_from_hub_instead_of_disk --ylabel "Word" --as_heatmap --heatmap_bar_label "Percent Increase" --xlabel "Internet Snapshot" --normalize_axis=1 --cache_dir=term_counts_cache_top_5 --percent_increase --annotation "To avoid spurious results from words with small counts, we only considered frequent words. A word is considered frequent if the count is greater than a standard deviation above the mean count. Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets."
19 | ``` 20 | 21 | Here is the resulting figure: 22 | 23 | ![top_5_term_counts_heatmap](https://user-images.githubusercontent.com/20826878/200715219-ce3b6fa4-e9f6-4dac-b594-caa052e759a0.png) 24 | 25 | This command reports the count of date mentions in the text between summer 2022 and fall 2022: 26 | 27 | ``` 28 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title="Relative Freq of Dates in Webpage Text" --analysis_column=text --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_text.png --load_from_hub_instead_of_disk --as_heatmap --ylabel "Date (YYYY/MM)" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_text --xlabel "Internet Snapshot" --annotation "Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets." --normalize 29 | ``` 30 | 31 | Here is the resulting figure: 32 | 33 | ![date_term_counts_heatmap_text](https://user-images.githubusercontent.com/20826878/200715272-e5dab35b-211c-4344-b685-881e0ce46bb0.png) 34 | 35 | This command reports the count of date mentions in the URLs between summer 2022 and fall 2022: 36 | 37 | ``` 38 | python term_counts.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names "May" "Jun/Jul" "Aug" "Sep/Oct" --terms 2022/05 2022/06 2022/07 2022/08 2022/09 --plot_title="Relative Freq of Dates in Webpage URLs" --analysis_column=url --split=train --num_proc=224 --output_filename=date_term_counts_heatmap_url.png --load_from_hub_instead_of_disk --as_heatmap --ylabel "Date (YYYY/MM)" --term_pretty_names May Jun Jul Aug Sep --cache_dir term_counts_cache_date_urls --xlabel "Internet Snapshot" --annotation "Snapshot datasets are from the OLM project: https://github.com/huggingface/olm-datasets." 
--normalize 39 | ``` 40 | 41 | Here is the resulting figure: 42 | 43 | ![date_term_counts_heatmap_url](https://user-images.githubusercontent.com/20826878/200715307-b3110b88-191b-419f-91ff-1e45ecfc6361.png) 44 | 45 | ## To analyze the timestamp distribution across and within various datasets 46 | 47 | This command reports the last-modified timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets: 48 | 49 | ``` 50 | python timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column last_modified_timestamp --plot_title "Last-Modified Timestamp Distributions from Webpages" --num_proc=224 --output_filename last_modified_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_last_modified --split=train 51 | ``` 52 | 53 | Here is the resulting figure: 54 | 55 | ![last_modified_dist](https://user-images.githubusercontent.com/20826878/200715332-203f5950-6d4d-4e3a-bfaa-ebbcf7603242.png) 56 | 57 | This command reports the crawl timestamp distribution for the summer 2022 through fall 2022 OLM CC datasets: 58 | 59 | ``` 60 | python timestamp_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names Sep/Oct Aug Jun/Jul May --timestamp_column crawl_timestamp --plot_title "Crawl Timestamp Distributions from Webpages" --num_proc=224 --output_filename crawl_dist.png --load_from_hub_instead_of_disk --cache_dir timestamp_dist_cache_crawl --split=train 61 | ``` 62 | 63 | Here is the resulting figure: 64 | 65 | ![crawl_dist](https://user-images.githubusercontent.com/20826878/200715349-562af902-8863-428a-8417-0975738164bf.png) 66 | 67 | ## To analyze the URL domain distribution across and within various datasets 68 | 69 | This command reports the domain distribution within the May 2022 OLM CC dataset: 70 | 71 | ``` 72 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 --input_dataset_pretty_names May --url_column url --hist_plot_title "URL Domain Distribution for May Internet Snapshot" --corr_plot_title "URL Domain Distribution Corr for May Internet Snapshot" --num_proc=224 --output_corr_filename url_corr_may.png --output_hist_filename url_hist_may.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_may --no_hist_legend --annotation "Only the top 25 domains are shown."
73 | ``` 74 | 75 | Here is the resulting figure: 76 | 77 | ![url_hist_may](https://user-images.githubusercontent.com/20826878/200715359-7c7bc37a-5749-454a-9e38-77b1116de7f0.png) 78 | 79 | This command reports the domain correlations between the summer 2022 through fall 2022 OLM CC datasets: 80 | 81 | ``` 82 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 --input_dataset_pretty_names May Jun/Jul Aug Sep/Oct --url_column url --hist_plot_title "URL Domain Distribution for Internet Snapshots" --corr_plot_title "URL Domain Distribution Corr for Internet Snapshots" --num_proc=224 --output_corr_filename url_corr.png --output_hist_filename url_hist.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all 83 | ``` 84 | 85 | Here is the resulting figure: 86 | 87 | ![url_corr](https://user-images.githubusercontent.com/20826878/200715384-d4793781-9775-4884-bffe-698b16677284.png) 88 | 89 | Does sampling about 15-20% of a Common Crawl Snapshot do anything surprising? How much correlation is there between the resulting OLM dataset from a Common Crawl sample from a random seed versus another random seed? This command reports the domain correlation between two Sep/Oct datasets where the only difference is the sampled segments based on different random seeds: 90 | 91 | ``` 92 | python url_dist.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --input_dataset_pretty_names "Sep/Oct Seed 1" "Sep/Oct Seed 2" --url_column url --hist_plot_title "URL Domain Distribution for Sep/Oct Snapshots" --corr_plot_title "URL Domain Distribution Corr for Sep/Oct Snapshots" --num_proc=224 --output_corr_filename url_corr_sep_oct_different_seeds.png --output_hist_filename url_hist_sep_oct_different_seeds.png --load_from_hub_instead_of_disk --cache_dir url_dist_cache_all --annotation="This plot shows two different OLM datasets. They were created with the same code from a 16% random sample of Sep/Oct Common Crawl WET files, but with different random seeds for the sampling." 
93 | ``` 94 | 95 | Here is the resulting figure: 96 | 97 | ![url_corr_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715404-5ccb3a1e-9e82-41be-82db-9e54e73785fe.png) 98 | 99 | ## To analyze for duplicates across various datasets 100 | 101 | This command reports the ratio of shared URLs between the August and June/July Common Crawl OLM Datasets: 102 | 103 | ``` 104 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=url --split=train --num_proc=224 --plot_title="URLs in the June/July plus the August CC OLM Datasets" --output_filename=duplicate_urls_aug_jun_jul.png --load_from_hub_instead_of_disk 105 | ``` 106 | 107 | Here is the resulting figure: 108 | 109 | ![duplicate_urls_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715427-79d0120b-fa48-4fdf-8410-8943a1325780.png) 110 | 111 | This command reports the ratio of exact text duplicated between the August and June/July Common Crawl OLM Datasets: 112 | 113 | ``` 114 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-33-sampling-ratio-0.20 Tristan/olm-CC-MAIN-2022-27-sampling-ratio-0.16142697881 --analysis_column=text --split=train --num_proc=224 --plot_title="Text in the June/July plus the August CC OLM Datasets" --output_filename=duplicate_text_aug_jun_jul.png --load_from_hub_instead_of_disk 115 | ``` 116 | 117 | Here is the resulting figure: 118 | 119 | ![duplicate_text_aug_jun_jul](https://user-images.githubusercontent.com/20826878/200715436-4893263b-1fe9-4941-ae43-edd4732652c4.png) 120 | 121 | What about the duplicated URLs between two differently seeded OLM datasets from the same month? 122 | 123 | ``` 124 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=url --split=train --num_proc=224 --plot_title="URLs in two Differently Seeded Sep/Oct CC OLM Datasets" --output_filename=duplicate_urls_sep_oct_different_seeds.png --load_from_hub_instead_of_disk 125 | ``` 126 | 127 | Here is the resulting figure: 128 | 129 | ![duplicate_urls_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715575-fae99dcb-cef5-411e-a786-a6e20e53a003.png) 130 | 131 | And what about the text?
132 | 133 | ``` 134 | python duplicates.py --input_dataset_names Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295 Tristan/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295-seed-69 --analysis_column=text --split=train --num_proc=224 --plot_title="Text in two Differently Seeded Sep/Oct CC OLM Datasets" --output_filename=duplicate_text_sep_oct_different_seeds.png --load_from_hub_instead_of_disk 135 | ``` 136 | 137 | ![duplicate_text_sep_oct_different_seeds](https://user-images.githubusercontent.com/20826878/200715583-1aa76245-14c5-4afe-88c5-539c8665d4d7.png) 138 | 139 | ## Documentation 140 | 141 | ``` 142 | python term_counts.py --help 143 | ``` 144 | 145 | ``` 146 | python url_dist.py --help 147 | ``` 148 | 149 | ``` 150 | python timestamp_dist.py --help 151 | ``` 152 | 153 | ``` 154 | python duplicates.py --help 155 | ``` 156 | -------------------------------------------------------------------------------- /analysis_scripts/duplicates.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk, concatenate_datasets 2 | import argparse 3 | import matplotlib.pyplot as plt 4 | 5 | parser = argparse.ArgumentParser(description="This script takes a list of datasets, concatenates them, and saves a pie chart for duplicate versus unique items in the specified column.") 6 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 7 | parser.add_argument("--analysis_column", required=True) 8 | parser.add_argument("--plot_title", required=True) 9 | parser.add_argument("--split", default=None, help="The dataset split to use. Some datasets don't have splits so this argument is optional.") 10 | parser.add_argument("--num_proc", type=int, required=True) 11 | parser.add_argument("--duplicate_label", default="Duplicate") 12 | parser.add_argument("--unique_label", default="Unique") 13 | parser.add_argument("--output_filename", required=True) 14 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 15 | args = parser.parse_args() 16 | 17 | datasets = [] 18 | for input_dataset_name in args.input_dataset_names: 19 | if args.load_from_hub_instead_of_disk: 20 | if args.split is None: 21 | ds = load_dataset(input_dataset_name) 22 | else: 23 | ds = load_dataset(input_dataset_name, split=args.split) 24 | else: 25 | if args.split is None: 26 | ds = load_from_disk(input_dataset_name) 27 | else: 28 | ds = load_from_disk(input_dataset_name)[args.split] 29 | 30 | datasets.append(ds) 31 | 32 | ds = concatenate_datasets(datasets) 33 | 34 | ds = ds.sort(args.analysis_column) 35 | 36 | max_index = len(ds) - 1 37 | def same_adjacent_entry(entry, index): 38 | if index == max_index: 39 | return ds[index - 1][args.analysis_column] == entry 40 | elif index == 0: 41 | return ds[index + 1][args.analysis_column] == entry 42 | return ds[index - 1][args.analysis_column] == entry or ds[index + 1][args.analysis_column] == entry 43 | 44 | num_examples = len(ds) 45 | ds = ds.filter(lambda example, index: same_adjacent_entry(example[args.analysis_column], index), num_proc=args.num_proc, with_indices=True) 46 | num_examples_only_duplicate_entries = len(ds) 47 | 48 | 49 | labels = [args.duplicate_label, args.unique_label] 50 | sizes = [num_examples_only_duplicate_entries, num_examples - num_examples_only_duplicate_entries] 51 | plt.pie(sizes, labels=labels, autopct='%1.1f%%') 52 | 53 | 
plt.title(args.plot_title, fontweight="bold") 54 | plt.rcParams["font.family"] = "Times New Roman" 55 | 56 | plt.savefig(args.output_filename, dpi=300) 57 | -------------------------------------------------------------------------------- /analysis_scripts/term_counts.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | from collections import Counter 6 | from multiprocessing import Manager 7 | from tqdm import tqdm 8 | from os import path, mkdir 9 | import pickle 10 | import statistics 11 | 12 | parser = argparse.ArgumentParser(description="This script takes in an ordered list of datasets and counts terms in each of them, in the specified column. It then plots a graph or a heatmap for how the count changed across datasets.") 13 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 14 | parser.add_argument("--input_dataset_pretty_names", nargs="+", required=True, help="The names of the datasets that you want to appear in the saved graph.") 15 | parser.add_argument("--terms", nargs="+", default=None, help="The terms that you want to count. If left as None, then you must specify --num_terms_to_find, and then the script will return the top --num_terms_to_find terms with the greatest percent change from the first dataset to the last dataset, out of the terms which have count > the mean count plus the standard deviation (so we don't get spurious results from low-count words).") 16 | parser.add_argument("--term_pretty_names", nargs="+", default=None) 17 | parser.add_argument("--analysis_column", required=True) 18 | parser.add_argument("--plot_title", required=True) 19 | parser.add_argument("--split", default=None, help="The dataset split to use.
Some datasets don't have splits so this argument is optional.") 20 | parser.add_argument("--num_proc", type=int, required=True) 21 | parser.add_argument("--output_filename", required=True) 22 | parser.add_argument("--as_heatmap", action="store_true") 23 | parser.add_argument("--samples", default=None, type=int) 24 | parser.add_argument("--num_terms_to_find", default=None, type=int) 25 | parser.add_argument("--normalize", action="store_true") 26 | parser.add_argument("--ylabel", required=True) 27 | parser.add_argument("--cache_dir", default="term_count_cache") 28 | parser.add_argument("--load_from_cache_dir", action="store_true") 29 | parser.add_argument("--heatmap_bar_label", default="") 30 | parser.add_argument("--annotation", default=None) 31 | parser.add_argument("--xlabel", default="Dataset") 32 | parser.add_argument("--normalize_axis", default=0, type=int) 33 | parser.add_argument("--percent_increase", action="store_true") 34 | parser.add_argument("--bottom", default=0.25, type=float) 35 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 36 | args = parser.parse_args() 37 | 38 | datasets = [] 39 | term_y_coords = None 40 | count_dicts = [] 41 | 42 | if args.load_from_cache_dir: 43 | if args.terms is None: 44 | count_dicts = pickle.load(open(path.join(args.cache_dir, "count_dicts.pkl"), "rb")) 45 | else: 46 | term_y_coords = pickle.load(open(path.join(args.cache_dir, "term_y_coords.pkl"), "rb")) 47 | cached_args = pickle.load(open(path.join(args.cache_dir, "args.pkl"), "rb")) 48 | if args != cached_args: 49 | print("Warning: argument mismatch between cached args and current args") 50 | print("Cached args: ", cached_args) 51 | print("Current args: ", args) 52 | 53 | if term_y_coords is None: 54 | for input_dataset_name in args.input_dataset_names: 55 | if args.load_from_hub_instead_of_disk: 56 | if args.split is None: 57 | ds = load_dataset(input_dataset_name) 58 | else: 59 | ds = load_dataset(input_dataset_name, split=args.split) 60 | else: 61 | if args.split is None: 62 | ds = load_from_disk(input_dataset_name) 63 | else: 64 | ds = load_from_disk(input_dataset_name)[args.split] 65 | 66 | if args.samples is not None: 67 | ds = ds.shuffle(seed=42) 68 | ds = ds.select(range(args.samples)) 69 | 70 | datasets.append(ds) 71 | 72 | if args.terms is None and not args.load_from_cache_dir: 73 | with Manager() as manager: 74 | shared_list = manager.list() 75 | def build_count_dict(examples): 76 | counts = None 77 | for text in examples[args.analysis_column]: 78 | if counts is None: 79 | counts = Counter(filter(lambda obj: obj.isalpha(), text.lower().split(" "))) 80 | else: 81 | counts += Counter(filter(lambda obj: obj.isalpha(), text.lower().split(" "))) 82 | shared_list.append(counts) 83 | 84 | ds.map(build_count_dict, num_proc=args.num_proc, batched=True, batch_size=len(ds) // args.num_proc, remove_columns=ds.column_names) 85 | 86 | count_dict = shared_list[0] 87 | for counts in tqdm(shared_list[1:]): 88 | count_dict += counts 89 | 90 | count_dicts.append(count_dict) 91 | 92 | if args.terms is None: 93 | if not path.exists(args.cache_dir): 94 | mkdir(args.cache_dir) 95 | pickle.dump(args, open(path.join(args.cache_dir, "args.pkl"), "wb")) 96 | pickle.dump(count_dicts, open(path.join(args.cache_dir, "count_dicts.pkl"), "wb")) 97 | 98 | if args.terms is None: 99 | 100 | intersection_count_set = set(count_dicts[0].keys()) 101 | for count_dict in 
count_dicts[1:]: 102 | intersection_count_set = intersection_count_set.intersection(set(count_dict.keys())) 103 | 104 | words_with_occurence_changes = [] 105 | counts = [] 106 | for word in intersection_count_set: 107 | count_sum = 0 108 | for count_dict in count_dicts: 109 | count_sum += count_dict[word] 110 | counts.append(count_sum) 111 | mean_count = statistics.mean(counts) 112 | std = statistics.stdev(counts) 113 | for word in intersection_count_set: 114 | count_sum = 0 115 | for count_dict in count_dicts: 116 | count_sum += count_dict[word] 117 | if count_sum > mean_count + std: 118 | change = count_dicts[-1][word]/count_dicts[0][word] 119 | words_with_occurence_changes.append((word, change)) 120 | 121 | words_with_occurence_changes.sort(key=lambda word_and_change: word_and_change[1], reverse=True) 122 | terms = [word_and_change[0] for word_and_change in words_with_occurence_changes[:args.num_terms_to_find]] 123 | 124 | else: 125 | terms = args.terms 126 | 127 | 128 | if term_y_coords is None: 129 | 130 | term_y_coords = {term: [] for term in terms} 131 | 132 | for ds in datasets: 133 | def term_counts(text): 134 | return {term + "_count": text.lower().count(term.lower()) for term in terms} 135 | 136 | ds = ds.map(lambda example: term_counts(example[args.analysis_column]), num_proc=args.num_proc) 137 | 138 | for term in terms: 139 | term_y_coords[term].append(sum(ds[term + "_count"])) 140 | 141 | if not path.exists(args.cache_dir): 142 | mkdir(args.cache_dir) 143 | pickle.dump(args, open(path.join(args.cache_dir, "args.pkl"), "wb")) 144 | pickle.dump(term_y_coords, open(path.join(args.cache_dir, "term_y_coords.pkl"), "wb")) 145 | 146 | plt.xticks(range(len(args.input_dataset_pretty_names)), args.input_dataset_pretty_names) 147 | 148 | if args.as_heatmap: 149 | matrix = [] 150 | for term in terms: 151 | matrix.append(term_y_coords[term]) 152 | matrix = np.array(matrix) 153 | if args.percent_increase: 154 | matrix = matrix.transpose() 155 | matrix = (matrix - matrix[0])/matrix[0] 156 | matrix = matrix.transpose() * 100 157 | 158 | if args.normalize: 159 | column_sums = matrix.sum(axis=args.normalize_axis) 160 | if args.normalize_axis == 0: 161 | normalized_matrix = matrix / column_sums 162 | if args.normalize_axis == 1: 163 | normalized_matrix = matrix.transpose() / column_sums 164 | normalized_matrix = normalized_matrix.transpose() 165 | plt.imshow(np.flipud(normalized_matrix), plt.cm.Blues) 166 | else: 167 | plt.imshow(np.flipud(matrix), plt.cm.Blues) 168 | plt.yticks(range(len(terms)), reversed(terms if args.term_pretty_names is None else args.term_pretty_names)) 169 | cbar = plt.colorbar() 170 | cbar.ax.set_ylabel(args.heatmap_bar_label, rotation=-90, va="bottom") 171 | plt.ylabel(args.ylabel, style='italic', fontweight="bold") 172 | else: 173 | for term in terms: 174 | if args.normalize: 175 | term_y_coords[term] = np.array(term_y_coords[term])/sum(term_y_coords[term]) 176 | plt.plot(term_y_coords[term], label=term, marker=".") 177 | plt.grid(linestyle=":") 178 | plt.legend(loc="upper left") 179 | plt.ylabel(args.ylabel, style='italic', fontweight="bold") 180 | 181 | if args.annotation is not None: 182 | plt.figtext(0.6, 0.01, args.annotation, wrap=True, horizontalalignment='center', fontsize=8) 183 | plt.subplots_adjust(bottom=args.bottom) 184 | plt.xlabel(args.xlabel, style='italic', fontweight="bold") 185 | plt.title(args.plot_title, fontweight="bold") 186 | plt.rcParams["font.family"] = "Times New Roman" 187 | plt.savefig(args.output_filename, dpi=300) 188 | 
-------------------------------------------------------------------------------- /analysis_scripts/timestamp_dist.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import numpy as np 4 | import matplotlib.pyplot as plt 5 | import seaborn as sns 6 | import pandas as pd 7 | from datetime import datetime 8 | from os import path, mkdir 9 | import pickle 10 | 11 | parser = argparse.ArgumentParser(description="This script takes in an ordered list of datasets. It is assumed that each dataset has a timestamp column. The script plots the timestamp distribution histogram for each dataset.") 12 | parser.add_argument("--input_dataset_names", nargs="+", required=True) 13 | parser.add_argument("--input_dataset_pretty_names", nargs="+", required=True, help="The names of the datasets that you want to appear in the saved graph.") 14 | parser.add_argument("--timestamp_column", required=True) 15 | parser.add_argument("--plot_title", required=True) 16 | parser.add_argument("--split", default=None, help="The dataset split to use. Some datasets don't have splits so this argument is optional.") 17 | parser.add_argument("--num_proc", type=int, required=True) 18 | parser.add_argument("--output_filename", required=True) 19 | parser.add_argument("--samples", default=None, type=int) 20 | parser.add_argument("--bins", default=100, help="The number of histogram bins to plot") 21 | parser.add_argument("--cache_dir", default="timestamp_dist_cache") 22 | parser.add_argument("--load_from_cache_dir", action="store_true") 23 | parser.add_argument("--annotation", default=None) 24 | parser.add_argument("--legend_title", default="Internet Snapshot") 25 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face hub instead of the disk (default is the disk).") 26 | args = parser.parse_args() 27 | 28 | if args.load_from_cache_dir: 29 | data_array = np.load(open(path.join(args.cache_dir, "data_array.npy"), "rb")) 30 | cached_args = pickle.load(open(path.join(args.cache_dir, "args.pkl"), "rb")) 31 | if args != cached_args: 32 | print("Warning: argument mismatch between cached args and current args") 33 | print("Cached args: ", cached_args) 34 | print("Current args: ", args) 35 | else: 36 | 37 | # Remove timestamp outliers more than 10 median deviations away from the median. 38 | # This is important if the timestamp is the Last-Modified timestamp, which can sometimes be wrong 39 | # because websites can report whatever they want. We don't want one website that says it was created 40 | # a billion years ago to seriously affect the distribution. 41 | def reject_outliers(data, m = 10.): 42 | d = np.abs(data - np.median(data)) 43 | mdev = np.median(d) 44 | s = d/mdev if mdev else 0. 
45 | return data[s= args.hist_bins: 110 | break 111 | dataframe_dict["samples"] += [datum[name] for name in args.input_dataset_pretty_names] 112 | dataframe_dict["dataset"] += args.input_dataset_pretty_names 113 | dataframe_dict["domain"] += [datum["domain_name"]]*len(args.input_dataset_pretty_names) 114 | index += 1 115 | 116 | df = pd.DataFrame(dataframe_dict) 117 | color_palette = sns.color_palette("pastel") 118 | colors = color_palette[:len(args.input_dataset_names)] 119 | plot = sns.barplot(data=df, palette=colors, hue="dataset", y="domain", x="samples") 120 | if args.annotation is not None: 121 | plot.figure.text(0.4, 0.01, args.annotation, wrap=True, horizontalalignment="center", fontsize=8) 122 | plot.figure.subplots_adjust(bottom=0.15) 123 | plot.legend().set_title("") 124 | if args.no_hist_legend: 125 | plot.legend().remove() 126 | for item in plot.get_yticklabels(): 127 | item.set_fontsize(args.hist_bin_fontsize) 128 | plot.set_title(args.hist_plot_title, fontweight="bold") 129 | plot.set_xlabel("Count", style="italic", fontweight="bold") 130 | plot.set_ylabel("Domain", style="italic", fontweight="bold") 131 | plot.figure.savefig(args.output_hist_filename, dpi=300, bbox_inches="tight") 132 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/README.md: -------------------------------------------------------------------------------- 1 | ![olm_cc_pipeline](https://user-images.githubusercontent.com/20826878/199851707-64a7a026-c413-4d78-8b04-a825e07534b3.jpeg) 2 | 3 | # Quick start 4 | 5 | This section provides all the commands that you need to generate a deduplicated and filtered dataset from Common Crawl, ready for pretraining! 6 | 7 | ## One time only 8 | 9 | `bash download_pipeline_processing_models.sh` 10 | 11 | ## Every time 12 | 13 | Use the following commands to get a dataset. They should take only a few min if you have lots of CPUs. Adjust `--num_proc` to be equal to however many CPUs that you have. 14 | 15 | ``` 16 | python download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wet_downloads --num_proc=224 17 | python get_text_dataset_from_wet_downloads.py --download_dir=common_crawl_wet_downloads --output_dataset_name=cc_raw --num_proc=224 18 | python remove_wikipedia_urls.py --input_dataset_name=cc_raw --output_dataset_name=cc_no_wikipedia --url_column=url --split=en --num_proc=224 19 | python apply_bigscience_filters.py --input_dataset_name=cc_no_wikipedia --output_dataset_name=cc_filtered --lang_id=en --text_column=text --num_proc=224 20 | ulimit -Sn 1000000 && python deduplicate.py --input_dataset_name=cc_filtered --output_dataset_name=cc_olm --text_column=text --remove_whole_example --num_proc=224 21 | 22 | # Optionally, get the last-modified headers from the websites and add them to the dataset. --segment_sampling_ratios and --seed must be the same as above for this to work. 
23 | python download_common_crawl.py --snapshots CC-MAIN-2022-33 --segment_sampling_ratios 0.0001 --seed=42 --download_dir=common_crawl_wat_downloads --paths_type=wat --num_proc=224 24 | python get_last_modified_dataset_from_wat_downloads.py --download_dir=common_crawl_wat_downloads --output_dataset_name=cc_raw_last_modified --num_proc=224 25 | python combine_last_modified_with_text_dataset.py --text_dataset_name=cc_olm --last_modified_dataset_name=cc_raw_last_modified --output_dataset_name=cc_olm_with_last_modified --url_column=url --crawl_timestamp_column=crawl_timestamp --last_modified_timestamp_column=last_modified_timestamp --num_proc=224 26 | 27 | ``` 28 | 29 | You can then upload the final dataset to the Hugging Face Hub from a Python terminal like this: 30 | 31 | ``` 32 | from datasets import load_from_disk 33 | 34 | ds = load_from_disk("cc_olm") # Or cc_olm_with_last_modified if you did the optional step above. 35 | 36 | ds = ds.shuffle() # Optionally, shuffle the dataset so you can get an idea of what a random sample of the dataset looks like in the Hugging Face Hub dataset preview. 37 | 38 | ds.push_to_hub("cc_olm") # Or cc_olm_with_last_modified if you did the optional step above. 39 | ``` 40 | 41 | 42 | # Important notes 43 | 44 | ## Finding the latest Common Crawl snapshots 45 | 46 | They are displayed here: [https://commoncrawl.org/the-data/get-started/](https://commoncrawl.org/the-data/get-started/). Just enter the names of the snapshots you want as arguments to the `download_common_crawl.py` script. 47 | 48 | ## Intermediate dataset checkpoints 49 | 50 | Each of the Python scripts from the quick start commands saves a Hugging Face dataset to the disk. The dataset is then read by the next Python command. These intermediate datasets are not deleted by default, so you can observe what each step of the pipeline does. This also means that you should have a large disk. We use a 15 terabyte disk for the Online Language Modelling Project. 51 | 52 | ## How to specify the size of the dataset 53 | 54 | Increase `--segment_sampling_ratios` to get a larger dataset (it goes up to `1`). In the above quick start code, `0.0001` means that it only uses a sample of `0.01%` of the data from a Common Crawl snapshot. To generate a dataset for the Online Language Modelling Project, we are currently pulling about 1.45 terabytes from each Common Crawl snapshot, which is about 350 gigabytes after going through the BigScience filters and finally 30 gigabytes after going through the deduplication code. For the August 2022 snapshot, 1.45 terabytes is about 20% (i.e. `--segment_sampling_ratios 0.20`). Crawl sizes vary, though. For May 2022, 1.45 terabytes is about 14%. 55 | 56 | If you want to train a larger model than we do, then specify a higher value for `--segment_sampling_ratios`, or even use multiple Common Crawl snapshots like this: 57 | 58 | ``` 59 | python download_common_crawl.py --snapshots CC-MAIN-2022-27 CC-MAIN-2022-33 --segment_sampling_ratios 0.5 1 --download_dir=common_crawl_wet_downloads --num_proc=224 60 | ``` 61 | 62 | Keep in mind that, with more data, the deduplication script will need more RAM. Read on for limitations of the deduplication script. 63 | 64 | ## Limitations of the deduplication code 65 | 66 | There are tons of duplicates in Common Crawl data, which means that the deduplication script will need hundreds of gigabytes of RAM if you want to generate a 30 gigabyte dataset like ours :(.
If you want to get around this, there is also the option in the deduplication script for you to chunk the dataset and deduplicate each chunk individually. The main problem is this issue in the Google deduplication code: [https://github.com/google-research/deduplicate-text-datasets/issues/18](https://github.com/google-research/deduplicate-text-datasets/issues/18). 67 | 68 | 69 | # More documentation 70 | 71 | Run any of the python commands with the `--help` flag. For example, `python download_common_crawl.py --help`. 72 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/apply_bigscience_filters.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | from subprocess import run 4 | from os import path, mkdir 5 | from shutil import rmtree 6 | import sys 7 | import uuid 8 | 9 | sys.path.append("data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering") 10 | from filtering import DatasetFiltering 11 | 12 | parser = argparse.ArgumentParser(description="Applies the BigScience BLOOM filters which were used on OSCAR. They are designed to improve text quality and remove pornographic content.") 13 | parser.add_argument("--input_dataset_name", help="The name of the input dataset.", required=True) 14 | parser.add_argument("--output_dataset_name", help="The name of the output dataset.", required=True) 15 | parser.add_argument("--lang_id", help="The language id of your dataset. This is necessary because the BigScience filters use a list of language-specific pornographic words, and also language-specific hyperparameters for text quality improvement.", required=True) 16 | parser.add_argument("--split", default=None, help="The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.") 17 | parser.add_argument("--text_column", help="The name of the dataset column that contains the text.", required=True) 18 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 19 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 20 | parser.add_argument("--tmp_dir", default=".tmp_apply_bigscience_filters", help="Directory to store temporary files. It will be deleted afterwards. Defaults to .tmp_apply_bigscience_filters.") 21 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to pull the input dataset by name from the Hugging Face Hub. If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.") 22 | args = parser.parse_args() 23 | 24 | if args.load_from_hub_instead_of_disk: 25 | if args.split is None: 26 | ds = load_dataset(args.input_dataset_name) 27 | else: 28 | ds = load_dataset(args.input_dataset_name, split=args.split) 29 | else: 30 | if args.split is None: 31 | ds = load_from_disk(args.input_dataset_name) 32 | else: 33 | ds = load_from_disk(args.input_dataset_name)[args.split] 34 | 35 | # We have to do this if the text column is not named "text" in the dataset, 36 | # because DatasetFiltering assumes that the name is "text". 
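# Hypothetical illustration: with --text_column=content on a dataset that also has its own "text"
# column, the existing "text" column is parked under a random temporary name, "content" is renamed
# to "text" for the filters, and both renames are reversed after filtering further below.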
37 | temp_column_name = None 38 | if args.text_column != "text": 39 | if "text" in ds.column_names: 40 | temp_column_name = str(uuid.uuid4()) 41 | ds = ds.rename_column("text", temp_column_name) 42 | ds = ds.rename_column(args.text_column, "text") 43 | 44 | if path.exists(args.tmp_dir): 45 | run(f"rm -r {args.tmp_dir}", shell=True) 46 | 47 | mkdir(args.tmp_dir) 48 | tmp_dataset_name = path.join(args.tmp_dir, "intermediate_bigscience_filtered_dataset") 49 | 50 | dataset_filtering = DatasetFiltering( 51 | dataset=ds, 52 | lang_dataset_id=args.lang_id, 53 | path_fasttext_model="sp_kenlm_ft_models/lid.176.bin", 54 | path_sentencepiece_model=f"sp_kenlm_ft_models/{args.lang_id}.sp.model", 55 | path_kenlm_model=f"sp_kenlm_ft_models/{args.lang_id}.arpa.bin", 56 | num_proc=args.num_proc, 57 | path_dir_save_dataset=tmp_dataset_name, 58 | ) 59 | 60 | dataset_filtering.modifying_documents() 61 | dataset_filtering.filtering() 62 | dataset_filtering.save_dataset() 63 | 64 | ds = load_from_disk(path.join(tmp_dataset_name, args.lang_id)) 65 | 66 | # We have to do this if the text column is not named "text" in the dataset, 67 | # because DatasetFiltering assumes that the name is "text". 68 | if args.text_column != "text": 69 | ds = ds.rename_column("text", args.text_column) 70 | if temp_column_name is not None: 71 | ds = ds.rename_column(temp_column_name, "text") 72 | 73 | ds.save_to_disk(args.output_dataset_name) 74 | rmtree(args.tmp_dir) 75 | 76 | if args.push_to_hub: 77 | ds.push_to_hub(args.output_dataset_name) 78 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/combine_last_modified_with_text_dataset.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | from multiprocessing import Manager 4 | from tqdm import tqdm 5 | import uuid 6 | 7 | parser = argparse.ArgumentParser(description="This script takes in a text dataset with crawl timestamps and urls, and then a last-modified dataset with crawl timestamps and urls.
It uses the shared urls and crawl timestamps to add last-modified timestamps to the text dataset.") 8 | parser.add_argument("--text_dataset_name", required=True) 9 | parser.add_argument("--last_modified_dataset_name", required=True) 10 | parser.add_argument("--output_dataset_name", required=True) 11 | parser.add_argument("--text_dataset_split", default=None) 12 | parser.add_argument("--last_modified_dataset_split", default=None) 13 | parser.add_argument("--last_modified_timestamp_column", required=True) 14 | parser.add_argument("--crawl_timestamp_column", required=True) 15 | parser.add_argument("--url_column", required=True) 16 | parser.add_argument("--num_proc", type=int, required=True) 17 | parser.add_argument("--load_text_dataset_from_hub_instead_of_disk", action="store_true", help="Whether to load the text dataset from the Hugging Face hub instead of the disk (default is the disk).") 18 | parser.add_argument("--load_last_modified_dataset_from_hub_instead_of_disk", action="store_true", help="Whether to load the last modified dataset from the Hugging Face hub instead of the disk (default is the disk).") 19 | parser.add_argument("--push_to_hub", action="store_true") 20 | args = parser.parse_args() 21 | 22 | if args.load_text_dataset_from_hub_instead_of_disk: 23 | if args.text_dataset_split is None: 24 | text_ds = load_dataset(args.text_dataset_name) 25 | else: 26 | text_ds = load_dataset(args.text_dataset_name, split=args.text_dataset_split) 27 | else: 28 | if args.text_dataset_split is None: 29 | text_ds = load_from_disk(args.text_dataset_name) 30 | else: 31 | text_ds = load_from_disk(args.text_dataset_name)[args.text_dataset_split] 32 | 33 | if args.load_last_modified_dataset_from_hub_instead_of_disk: 34 | if args.last_modified_dataset_split is None: 35 | last_modified_ds = load_dataset(args.last_modified_dataset_name) 36 | else: 37 | last_modified_ds = load_dataset(args.last_modified_dataset_name, split=args.last_modified_dataset_split) 38 | else: 39 | if args.last_modified_dataset_split is None: 40 | last_modified_ds = load_from_disk(args.last_modified_dataset_name) 41 | else: 42 | last_modified_ds = load_from_disk(args.last_modified_dataset_name)[args.last_modified_dataset_split] 43 | 44 | 45 | with Manager() as manager: 46 | shared_list = manager.list() 47 | def build_last_modified_dict(examples): 48 | last_modified_dict = {} 49 | for url, crawl_timestamp, last_modified_tag_timestamp in zip(examples[args.url_column], examples[args.crawl_timestamp_column], examples[args.last_modified_timestamp_column]): 50 | last_modified_dict[(url, crawl_timestamp)] = last_modified_tag_timestamp 51 | shared_list.append(last_modified_dict) 52 | 53 | last_modified_ds.map(build_last_modified_dict, num_proc=args.num_proc, batched=True, batch_size=len(last_modified_ds) // args.num_proc) 54 | 55 | aggregate_last_modified_dict = {} 56 | for last_modified_dict in tqdm(shared_list): 57 | aggregate_last_modified_dict |= last_modified_dict 58 | 59 | # Set the new fingerprint manually so the map function doesn't take forever hashing the huge aggregate_last_modified_dict. 
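# (Without new_fingerprint, datasets would compute the fingerprint by hashing the lambda below, which captures the entire dict.)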
60 | text_ds = text_ds.map(lambda example: {args.last_modified_timestamp_column: aggregate_last_modified_dict.get((example[args.url_column], example[args.crawl_timestamp_column]), None)}, new_fingerprint=str(uuid.uuid4())) 61 | 62 | text_ds.save_to_disk(args.output_dataset_name) 63 | 64 | if args.push_to_hub: 65 | text_ds.push_to_hub(args.output_dataset_name) 66 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/deduplicate.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk, concatenate_datasets 2 | from text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator 3 | from shutil import rmtree 4 | from os import path 5 | import argparse 6 | import hashlib 7 | import uuid 8 | 9 | parser = argparse.ArgumentParser(description="Applies varying levels of exact deduplication or exact suffix array deduplication to a Hugging Face dataset.") 10 | parser.add_argument("--input_dataset_name", help="Name of the input dataset.", required=True) 11 | parser.add_argument("--output_dataset_name", help="Name of the output dataset.", required=True) 12 | parser.add_argument("--text_column", help="Name of the dataset's text column.", required=True) 13 | parser.add_argument("--split", default=None, help="The split of the dataset to apply deduplication on. Not all datasets have splits, so this argument is optional.") 14 | parser.add_argument("--num_proc", type=int, help="The minimum number of processes to use.", required=True) 15 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 16 | parser.add_argument("--remove_whole_example", action="store_true", help="If an example in our corpus has a byte string of 100 bytes or longer which is duplicated elsewhere in the corpus, then this option will result in the removal of the whole example. If this option is not specified, then only the substring is removed, not the whole example. In the paper for this deduplication method, they only remove the byte string, not the whole example. Removing the whole example will vastly shrink the size of the dataset, but it will ensure no gaps in text continuity.") 17 | parser.add_argument("--only_exact_duplicates", action="store_true", help="Use this option if you want to skip the suffix array deduplication and just get rid of examples that exactly match other examples in the dataset.") 18 | parser.add_argument("--chunks", type=int, default=1, help="Deduplication can be really memory-intensive. This option allows you to split the dataset up into n chunks, and perform deduplication independently on each of the chunks. Then the resulting deduplicated datasets are concatenated together at the end.") 19 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face Hub. 
If this argument is not used, then it is assumed that the input dataset is stored locally on the disk.") 20 | args = parser.parse_args() 21 | 22 | if args.load_from_hub_instead_of_disk: 23 | if args.split is None: 24 | ds = load_dataset(args.input_dataset_name) 25 | else: 26 | ds = load_dataset(args.input_dataset_name, split=args.split) 27 | else: 28 | if args.split is None: 29 | ds = load_from_disk(args.input_dataset_name) 30 | else: 31 | ds = load_from_disk(args.input_dataset_name)[args.split] 32 | 33 | deduplicated_ds_shard_list = [] 34 | for ds_shard_index in range(args.chunks): 35 | ds_shard = ds.shard(num_shards=args.chunks, index=ds_shard_index) 36 | 37 | if args.remove_whole_example: 38 | def check_for_ending_example_in_cluster(example, index, column, last_index): 39 | if index == last_index: 40 | return True 41 | return ds_shard[index+1][column] != example[column] 42 | 43 | # Sort the dataset so examples with the same first 100 bytes of text are grouped together. 44 | print("Sorting by first 100 bytes of text") 45 | temp_column_name = str(uuid.uuid4()) 46 | ds_shard = ds_shard.map(lambda example: {temp_column_name: example[args.text_column].encode("u8")[:100]}, num_proc=args.num_proc) 47 | ds_shard = ds_shard.sort(temp_column_name) 48 | 49 | # Filter away examples if their first 100 bytes of text exactly matches another example's first 100 bytes of text. 50 | # This gets rid of a subset of the examples that the next step (suffix array deduplication) gets rid of, so we technically 51 | # don't need to do it. But it speeds up the next step quite a bit to do this first. 52 | last_index = len(ds_shard) - 1 53 | len_before = len(ds_shard) 54 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 55 | ds_shard = ds_shard.remove_columns(temp_column_name) 56 | print(f"Got rid of all examples sharing first 100 bytes of text, as a speedup step. Removed {len_before - len(ds_shard)} from {len_before} examples.") 57 | 58 | # Do the same thing with the ending 100 bytes of text. 59 | print("Sorting by last 100 bytes of text") 60 | temp_column_name = str(uuid.uuid4()) 61 | ds_shard = ds_shard.map(lambda example: {temp_column_name: example[args.text_column].encode("u8")[-100:]}, num_proc=args.num_proc) 62 | ds_shard = ds_shard.sort(temp_column_name) 63 | 64 | last_index = len(ds_shard) - 1 65 | len_before = len(ds_shard) 66 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 67 | ds_shard = ds_shard.remove_columns(temp_column_name) 68 | print(f"Got rid of all examples sharing last 100 bytes of text, as a speedup step. 
Removed {len_before - len(ds_shard)} from {len_before} examples.") 69 | 70 | else: 71 | print("Getting rid of exact duplicates") 72 | def check_for_ending_example_in_cluster(example, index, column, last_index): 73 | if index == last_index: 74 | return True 75 | return ds_shard[index+1][column] != example[column] 76 | 77 | temp_column_name = str(uuid.uuid4()) 78 | ds_shard = ds_shard.map(lambda example: {temp_column_name: hashlib.md5(example[args.text_column].encode()).hexdigest()}, num_proc=args.num_proc) 79 | ds_shard = ds_shard.sort(temp_column_name) 80 | 81 | last_index = len(ds_shard) - 1 82 | ds_shard = ds_shard.filter(lambda example, index: check_for_ending_example_in_cluster(example, index, temp_column_name, last_index), num_proc=args.num_proc, with_indices=True) 83 | ds_shard = ds_shard.remove_columns(temp_column_name) 84 | print("Got rid of exact duplicates") 85 | 86 | if path.exists(".cache"): 87 | rmtree(".cache") 88 | 89 | if not args.only_exact_duplicates: 90 | # Now, do Suffix Array Substring Exact Deduplication. 91 | 92 | deduplicator = GoogleSuffixArrayDeduplicator(k=100) 93 | 94 | # We need to create this iterator over the dataset text column 95 | # to ensure that not all of the text entries are loaded into memory at once. 96 | class DatasetColumnIterator(): 97 | def __init__(self, dataset, column): 98 | self.iterable_dataset = dataset.__iter__() 99 | self.column = column 100 | 101 | def __iter__(self): 102 | return self 103 | 104 | def __next__(self): 105 | return self.iterable_dataset.__next__()[self.column] 106 | 107 | slices = deduplicator.fit_predict(DatasetColumnIterator(ds_shard, args.text_column)) 108 | if args.remove_whole_example: 109 | ds_shard = ds_shard.filter(lambda example, index: slices[index] == [], num_proc=args.num_proc, with_indices=True) 110 | else: 111 | def remove_slice_list(string, slice_list): 112 | for s in slice_list: 113 | string = string.replace(string[s], "") 114 | return string 115 | # It's important to give this map function a uuid as its fingerprint. If we let it compute the fingerprint as a hash of the whole slice_list, then it will take too long. 116 | ds_shard = ds_shard.map(lambda example, index: {args.text_column: remove_slice_list(example[args.text_column], slices[index])}, num_proc=args.num_proc, with_indices=True, new_fingerprint=str(uuid.uuid4())) 117 | ds_shard = ds_shard.filter(lambda example: example[args.text_column] != "", num_proc=args.num_proc) 118 | 119 | if path.exists(".cache"): 120 | rmtree(".cache") 121 | 122 | deduplicated_ds_shard_list.append(ds_shard) 123 | 124 | ds = concatenate_datasets(deduplicated_ds_shard_list) 125 | 126 | ds.save_to_disk(args.output_dataset_name) 127 | 128 | if args.push_to_hub: 129 | ds.push_to_hub(args.output_dataset_name) 130 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/download_common_crawl.py: -------------------------------------------------------------------------------- 1 | from os import mkdir, path 2 | from subprocess import run 3 | import argparse 4 | import random 5 | 6 | parser = argparse.ArgumentParser(description="Downloads raw Common Crawl WET files, or WAT files if you specify --paths_type=wat.") 7 | parser.add_argument("--snapshots", nargs='+', help="The Common Crawl snapshots to download files from, such as CC-MAIN-2022-33 or CC-MAIN-2022-27. 
Several can be specified.", required=True) 8 | parser.add_argument("--download_dir", help="The name of the directory to create and download the WET (or WAT) files to.", required=True) 9 | parser.add_argument("--segment_sampling_ratios", type=float, nargs="+", help="The ratios of each Common Crawl snapshot to use. The higher the ratio, the larger the generated dataset (but also the longer the time that the OLM pipeline runs). You should specify one for each snapshot. For example, if you specify '--snapshots CC-MAIN-2022-33 CC-MAIN-2022-27', then --segment_sampling_ratios could be '0.15 0.11'. This means that 15 percent of the segments from CC-MAIN-2022-33 will be uniformly randomly sampled and used, and 11 percent of the segments from CC-MAIN-2022-27 will be uniformly randomly sampled and used.", required=True) 10 | parser.add_argument("--tmp_dir", default=".tmp_download_common_crawl", help="The directory where temporary files are stored. They are deleted when this script completes. Default is .tmp_download_common_crawl.") 11 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 12 | parser.add_argument("--seed", type=int, default=42) 13 | parser.add_argument("--paths_type", default="wet") 14 | args = parser.parse_args() 15 | 16 | random.seed(args.seed) 17 | 18 | if path.exists(args.download_dir): 19 | run(f"rm -r {args.download_dir}", shell=True) 20 | 21 | if path.exists(args.tmp_dir): 22 | run(f"rm -r {args.tmp_dir}", shell=True) 23 | 24 | run(f"mkdir {args.download_dir} {args.tmp_dir}", shell=True) 25 | for index in range(len(args.snapshots)): 26 | # Download the data for a certain common crawl snapshot 27 | tmp_download_dir_name = f"{args.tmp_dir}/ungoliant_downloads-{args.snapshots[index]}" 28 | run(f"mkdir {tmp_download_dir_name}", shell=True) 29 | run(f"wget https://data.commoncrawl.org/crawl-data/{args.snapshots[index]}/{args.paths_type}.paths.gz", shell=True) 30 | run(f"gzip -d {args.paths_type}.paths.gz", shell=True) 31 | paths_name = f"{args.paths_type}-{args.snapshots[index]}.paths" 32 | run(f"mv {args.paths_type}.paths {paths_name}", shell=True) 33 | segments = open(paths_name, "r").readlines() 34 | kept_segments = [] 35 | for segment in segments: 36 | if random.random() <= args.segment_sampling_ratios[index]: 37 | kept_segments.append(segment) 38 | open(paths_name, "w").writelines(kept_segments) 39 | run(f"ungoliant download -t={args.num_proc} {paths_name} {tmp_download_dir_name}", shell=True) 40 | run(f"rm {paths_name}", shell=True) 41 | 42 | # Now, add 0's to the filename for every downloaded file. We want the number of 0's to be different from the number used for other common crawl snapshots 43 | # because we want every file to have a unique name across multiple snapshot downloads. 44 | if index > 0: 45 | run(f"cd {tmp_download_dir_name} && for f in * ; do mv \"$f\" {'0'*index}\"$f\" ; done", shell=True) 46 | 47 | # Now we can move the downloaded files into the main download dir which has the downloads from the rest of this for loop. 
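# (The * glob in the mv command below is expanded by the shell, which is why shell=True is used.)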
48 | run(f"mv {tmp_download_dir_name}/* {args.download_dir}/", shell=True) 49 | run(f"rm -r {tmp_download_dir_name}", shell=True) 50 | 51 | run(f"rm -r {args.tmp_dir}", shell=True) 52 | run("rm -r errors.txt", shell=True) 53 | 54 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/download_pipeline_processing_models.sh: -------------------------------------------------------------------------------- 1 | # exit when any command fails 2 | set -e 3 | 4 | python data-preparation/preprocessing/training/01b_oscar_cleaning_and_filtering/download_sentencepiece_kenlm_models.py --output_dir_path=sp_kenlm_ft_models 5 | wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P sp_kenlm_ft_models/ 6 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/add_perplexity.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | import sys 4 | sys.path.append("kenlm") 5 | from model import KenlmModel 6 | 7 | parser = argparse.ArgumentParser(description="This script simply uses a kenlm trained on English Wikipedia to compute the perplexity of each text example in the dataset. It then sorts the dataset by perplexity so that the user can then select the range of perplexities that they want their data to be in.") 8 | parser.add_argument("--input_dataset_name", help="The name of the input dataset.", required=True) 9 | parser.add_argument("--output_dataset_name", help="The name of the output dataset.", required=True) 10 | parser.add_argument("--split", default=None, help="The split of the dataset to apply the filters to. Not all datasets have splits, so this is not a required argument.") 11 | parser.add_argument("--text_column", help="The name of the dataset column that contains the text.", required=True) 12 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 13 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 14 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to pull the input dataset by name from the Hugging Face Hub. 
If this argument is not used, it is assumed that there is a dataset saved to the disk with the input dataset name.") 15 | args = parser.parse_args() 16 | 17 | if args.load_from_hub_instead_of_disk: 18 | if args.split is None: 19 | ds = load_dataset(args.input_dataset_name) 20 | else: 21 | ds = load_dataset(args.input_dataset_name, split=args.split) 22 | else: 23 | if args.split is None: 24 | ds = load_from_disk(args.input_dataset_name) 25 | else: 26 | ds = load_from_disk(args.input_dataset_name)[args.split] 27 | 28 | 29 | model = KenlmModel.from_pretrained("kenlm/wikipedia", "en") 30 | ds = ds.map(lambda example: {"kenlm_ppl": model.get_perplexity(example[args.text_column])}, num_proc=args.num_proc) 31 | ds = ds.sort("kenlm_ppl") 32 | ds.save_to_disk(args.output_dataset_name) 33 | 34 | if args.push_to_hub: 35 | ds.push_to_hub(args.output_dataset_name) 36 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/filter_for_only_updated_websites.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser(description="Experimental script to check and filter for a diff between examples with the same URL. It drastically reduces the size of the dataset in many cases, but it helps ensure that the text is up to date. The script only keeps an example if 1) the example shares a URL with other examples, 2) the example is the most recent example with that URL, and 3) there was a diff between the example and an earlier example with the same URL.") 5 | parser.add_argument("--input_dataset_name", required=True) 6 | parser.add_argument("--output_dataset_name", required=True) 7 | parser.add_argument("--text_column", required=True) 8 | parser.add_argument("--timestamp_column", required=True) 9 | parser.add_argument("--split", default=None, help="The split of the dataset to apply this filter to. Not all datasets have splits, so this argument is optional.") 10 | parser.add_argument("--url_column", required=True) 11 | parser.add_argument("--num_proc", type=int, required=True) 12 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face Hub after saving it to the disk.") 13 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset from the Hugging Face Hub. If this argument is not used, it is assumed that the input dataset is stored on the disk.") 14 | args = parser.parse_args() 15 | 16 | if args.load_from_hub_instead_of_disk: 17 | if args.split is None: 18 | ds = load_dataset(args.input_dataset_name) 19 | else: 20 | ds = load_dataset(args.input_dataset_name, split=args.split) 21 | else: 22 | if args.split is None: 23 | ds = load_from_disk(args.input_dataset_name) 24 | else: 25 | ds = load_from_disk(args.input_dataset_name)[args.split] 26 | 27 | # Group so examples with the same URL are next to each other in the dataset. 28 | ds = ds.sort(args.url_column) 29 | 30 | # Throw away examples with URLs occurring only once in the dataset. 
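# Since the dataset was just sorted by URL, a URL occurs more than once exactly when it matches the URL of an adjacent example.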
31 | last_index = len(ds) - 1 32 | def check_for_adjacent_duplicate_url(example, index): 33 | if index == last_index: 34 | return ds[index-1][args.url_column] == example[args.url_column] 35 | if index == 0: 36 | return ds[index+1][args.url_column] == example[args.url_column] 37 | return ds[index-1][args.url_column] == example[args.url_column] or ds[index+1][args.url_column] == example[args.url_column] 38 | 39 | ds = ds.filter(lambda example, index: check_for_adjacent_duplicate_url(example, index), num_proc=args.num_proc, with_indices=True) 40 | 41 | # Sort the dataset so that examples with the same URL are still grouped together, but also arrange by timestamp from oldest to newest. 42 | ds = ds.sort(args.timestamp_column) 43 | ds = ds.sort(args.url_column, kind="stable") 44 | 45 | # Keep only the pair of examples from each URL group with the oldest and newest timestamp. 46 | last_index = len(ds) - 1 47 | def check_for_ending_or_beginning_example_in_url_cluster(example, index): 48 | if index in (last_index, 0): 49 | return True 50 | return ds[index-1][args.url_column] != example[args.url_column] or ds[index+1][args.url_column] != example[args.url_column] 51 | 52 | ds = ds.filter(lambda example, index: check_for_ending_or_beginning_example_in_url_cluster(example, index), num_proc=args.num_proc, with_indices=True) 53 | 54 | # For each example pair, check to see if the text was modified between the old time and the new time. 55 | # If it was modified, keep the latest example and throw the old example out. We have evidence that this new example is up-to-date :D 56 | # If it wasn't modified, throw both examples out. We have no evidence that this new example is up-to-date :( 57 | last_index = len(ds) - 1 58 | def check_for_updated_example_in_url_pair(example, index): 59 | if index == 0 or ds[index-1][args.url_column] != example[args.url_column]: 60 | return False 61 | if ds[index-1][args.text_column] != example[args.text_column]: 62 | return True 63 | return False 64 | 65 | ds = ds.filter(lambda example, index: check_for_updated_example_in_url_pair(example, index), num_proc=args.num_proc, with_indices=True) 66 | 67 | ds.save_to_disk(args.output_dataset_name) 68 | 69 | if args.push_to_hub: 70 | ds.push_to_hub(args.output_dataset_name) 71 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 
23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. 
If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright 2021-2022 Eduardo González Ponferrada 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 
-------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/README.md: -------------------------------------------------------------------------------- 1 | --- 2 | language: 3 | - es 4 | - af 5 | - ar 6 | - arz 7 | - as 8 | - bn 9 | - fr 10 | - sw 11 | - eu 12 | - ca 13 | - zh 14 | - en 15 | - hi 16 | - ur 17 | - id 18 | - pt 19 | - vi 20 | - gu 21 | - kn 22 | - ml 23 | - mr 24 | - ta 25 | - te 26 | - yo 27 | tags: 28 | - kenlm 29 | - perplexity 30 | - n-gram 31 | - kneser-ney 32 | - bigscience 33 | license: "mit" 34 | datasets: 35 | - wikipedia 36 | - oscar 37 | --- 38 | 39 | Taken from the amazing repo here: [https://huggingface.co/edugp/kenlm](https://huggingface.co/edugp/kenlm) 40 | 41 | # KenLM models 42 | This repo contains several KenLM models trained on different tokenized datasets and languages. 43 | KenLM models are probabilistic n-gram language models. One use case of these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple non-informative sentences that could appear repeatedly (low perplexity). 44 | 45 | At the root of this repo you will find different directories named after the dataset models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files: 46 | * `{language}.arpa.bin`: The trained KenLM model binary 47 | * `{language}.sp.model`: The trained SentencePiece model used for tokenization 48 | * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model 49 | 50 | The models have been trained using some of the preprocessing steps from [cc_net](https://github.com/facebookresearch/cc_net), in particular replacing numbers with zeros and normalizing punctuation. So, it is important to keep the default values for the parameters: `lower_case`, `remove_accents`, `normalize_numbers` and `punctuation` when using the pre-trained models in order to replicate the same pre-processing steps at inference time. 51 | 52 | # Dependencies 53 | * KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip` 54 | * SentencePiece: `pip install sentencepiece` 55 | 56 | # Example: 57 | ``` 58 | from model import KenlmModel 59 | 60 | 61 | # Load model trained on English wikipedia 62 | model = KenlmModel.from_pretrained("wikipedia", "en") 63 | 64 | # Get perplexity 65 | model.get_perplexity("I am very perplexed") 66 | # 341.3 (low perplexity, since sentence style is formal and with no grammar mistakes) 67 | 68 | model.get_perplexity("im hella trippin") 69 | # 46793.5 (high perplexity, since the sentence is colloquial and contains grammar mistakes) 70 | ``` 71 | In the example above we see that, since Wikipedia is a collection of encyclopedic articles, a KenLM model trained on it will naturally give lower perplexity scores to sentences with formal language and no grammar mistakes than colloquial sentences with grammar mistakes. 
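For the dataset-filtering use case, a rough sketch could look like the following (the dataset name and the perplexity cutoff of 1000 are placeholders that you would replace for your own data):

```
from datasets import load_dataset

from model import KenlmModel

# Load model trained on English wikipedia
model = KenlmModel.from_pretrained("wikipedia", "en")

# Score each example, then keep only the examples below the chosen perplexity cutoff.
ds = load_dataset("your_dataset", split="train")  # hypothetical dataset with a "text" column
ds = ds.map(lambda example: {"kenlm_ppl": model.get_perplexity(example["text"])})
ds = ds.filter(lambda example: example["kenlm_ppl"] < 1000)
```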
72 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import unicodedata 4 | from typing import Dict 5 | 6 | import kenlm 7 | import sentencepiece 8 | from huggingface_hub import cached_download, hf_hub_url 9 | 10 | 11 | class SentencePiece: 12 | def __init__( 13 | self, 14 | model: str, 15 | ): 16 | super().__init__() 17 | self.sp = sentencepiece.SentencePieceProcessor() 18 | self.sp.load(str(model)) 19 | 20 | def do(self, text: dict) -> dict: 21 | tokenized = self.sp.encode_as_pieces(text) 22 | return " ".join(tokenized) 23 | 24 | 25 | class KenlmModel: 26 | digit_re: re.Pattern = re.compile(r"\d") 27 | unicode_punct: Dict[str, str] = { 28 | ",": ",", 29 | "。": ".", 30 | "、": ",", 31 | "„": '"', 32 | "”": '"', 33 | "“": '"', 34 | "«": '"', 35 | "»": '"', 36 | "1": '"', 37 | "」": '"', 38 | "「": '"', 39 | "《": '"', 40 | "》": '"', 41 | "´": "'", 42 | "∶": ":", 43 | ":": ":", 44 | "?": "?", 45 | "!": "!", 46 | "(": "(", 47 | ")": ")", 48 | ";": ";", 49 | "–": "-", 50 | "—": " - ", 51 | ".": ". ", 52 | "~": "~", 53 | "’": "'", 54 | "…": "...", 55 | "━": "-", 56 | "〈": "<", 57 | "〉": ">", 58 | "【": "[", 59 | "】": "]", 60 | "%": "%", 61 | "►": "-", 62 | } 63 | unicode_punct_re = re.compile(f"[{''.join(unicode_punct.keys())}]") 64 | non_printing_chars_re = re.compile( 65 | f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]" 66 | ) 67 | kenlm_model_dir = None 68 | sentence_piece_model_dir = None 69 | 70 | def __init__( 71 | self, 72 | model_dataset: str, 73 | language: str, 74 | lower_case: bool = False, 75 | remove_accents: bool = False, 76 | normalize_numbers: bool = True, 77 | punctuation: int = 1, 78 | ): 79 | self.model = kenlm.Model(os.path.join(model_dataset, f"{language}.arpa.bin")) 80 | self.tokenizer = SentencePiece(os.path.join(model_dataset, f"{language}.sp.model")) 81 | self.accent = remove_accents 82 | self.case = lower_case 83 | self.numbers = normalize_numbers 84 | self.punct = punctuation 85 | 86 | @classmethod 87 | def from_pretrained( 88 | cls, 89 | model_dataset: str, 90 | language: str, 91 | ): 92 | return cls( 93 | model_dataset, 94 | language, 95 | False, 96 | False, 97 | True, 98 | 1, 99 | ) 100 | 101 | def pp(self, log_score, length): 102 | return 10.0 ** (-log_score / length) 103 | 104 | def get_perplexity(self, doc: str, normalize_cc_net: bool = True): 105 | if normalize_cc_net: 106 | doc = self.normalize( 107 | doc, 108 | accent=self.accent, 109 | case=self.case, 110 | numbers=self.numbers, 111 | punct=self.punct, 112 | ) 113 | # Tokenize (after normalizing): See https://github.com/facebookresearch/cc_net/blob/bda555bd1cf1ee2e0b925363e62a61cd46c8b60d/cc_net/mine.py#L352 for full pipeline 114 | doc = self.tokenizer.do(doc) 115 | doc_log_score, doc_length = 0, 0 116 | for line in doc.split("\n"): 117 | log_score = self.model.score(line) 118 | length = len(line.split()) + 1 119 | doc_log_score += log_score 120 | doc_length += length 121 | return round(self.pp(doc_log_score, doc_length), 1) 122 | 123 | def normalize( 124 | self, 125 | line: str, 126 | accent: bool = True, 127 | case: bool = True, 128 | numbers: bool = True, 129 | punct: int = 1, 130 | ) -> str: 131 | line = line.strip() 132 | if not line: 133 | return line 134 | if case: 135 | line = line.lower() 136 | if accent: 137 | line = self.strip_accents(line) 138 | if numbers: 139 | line = 
self.digit_re.sub("0", line) 140 | if punct == 1: 141 | line = self.replace_unicode_punct(line) 142 | elif punct == 2: 143 | line = self.remove_unicode_punct(line) 144 | line = self.remove_non_printing_char(line) 145 | return line 146 | 147 | def strip_accents(self, line: str) -> str: 148 | """Strips accents from a piece of text.""" 149 | nfd = unicodedata.normalize("NFD", line) 150 | output = [c for c in nfd if unicodedata.category(c) != "Mn"] 151 | if len(output) == line: 152 | return line 153 | return "".join(output) 154 | 155 | def replace_unicode_punct(self, text: str) -> str: 156 | return "".join(self.unicode_punct.get(c, c) for c in text) 157 | 158 | def remove_unicode_punct(self, text: str) -> str: 159 | """More aggressive version of replace_unicode_punct but also faster.""" 160 | return self.unicode_punct_re.sub("", text) 161 | 162 | def remove_non_printing_char(self, text: str) -> str: 163 | return self.non_printing_chars_re.sub("", text) 164 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.arpa.bin: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:04923fccbb4e63005c40f01d66112659416de01accd80d16e366a592289ee07a 3 | size 4444690658 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.model: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:cf8147a573770b4e6c0d4df1dcb75453baa88190706dab406be7711b84f059de 3 | size 931348 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/experimental/kenlm/wikipedia/en.sp.vocab: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:a9c3c51a7736d736cc620cbe9a4c9430533469e57a54bc29546067a252f7d872 3 | size 729017 4 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/get_last_modified_dataset_from_wat_downloads.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from tqdm import tqdm 3 | import pandas as pd 4 | import subprocess 5 | from multiprocessing import Process 6 | from os import walk, mkdir, path 7 | from shutil import rmtree 8 | import dateutil 9 | import dateparser 10 | import argparse 11 | import ujson 12 | 13 | parser = argparse.ArgumentParser(description="Turns WAT downloads from download_common_crawl.py into a Hugging Face dataset with Last-Modified timestamps, URLs, and crawl timestamps.") 14 | parser.add_argument("--download_dir", help="The directory of the downloaded WAT files.", required=True) 15 | parser.add_argument("--output_dataset_name", help="The name of the Hugging Face dataset which will be saved upon completion of this program.", required=True) 16 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.", required=True) 17 | parser.add_argument("--tmp_dir", default=".tmp_get_last_modified_dataset_from_wat_downloads") 18 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.") 19 | args = parser.parse_args() 20 | 21 | if 
path.exists(args.tmp_dir): 22 | rmtree(args.tmp_dir) 23 | 24 | mkdir(args.tmp_dir) 25 | 26 | filenames = next(walk(args.download_dir), (None, None, []))[2] 27 | 28 | def split_a_into_n_parts(a, n): 29 | k, m = divmod(len(a), n) 30 | return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)] 31 | 32 | filename_per_proc = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0] 33 | 34 | processes = [] 35 | for filenames in filename_per_proc: 36 | def get_dataset(filenames): 37 | for filename in tqdm(filenames): 38 | dataset_dict = {"last_modified_timestamp": [], "url": [], "crawl_timestamp": []} 39 | file_path = path.join(args.download_dir, filename) 40 | if filename.endswith(".gz"): 41 | subprocess.run(f"gzip -d {file_path}", shell=True) 42 | filename = filename[:-3] 43 | file_path = path.join(args.download_dir, filename) 44 | for line in open(file_path).readlines(): 45 | if line.startswith("{"): 46 | parsed_line = ujson.loads(line) 47 | last_modified = parsed_line.get("Envelope", {}).get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {}).get("Headers", {}).get("Last-Modified", None) 48 | url = parsed_line.get("Envelope", {}).get("WARC-Header-Metadata", {}).get("WARC-Target-URI", None) 49 | date = parsed_line.get("Envelope", {}).get("WARC-Header-Metadata", {}).get("WARC-Date", None) 50 | if None not in (last_modified, url, date): 51 | try: 52 | last_modified_timestamp = dateutil.parser.parse(last_modified).timestamp() 53 | except Exception: 54 | try: 55 | last_modified_timestamp = dateparser.parse(last_modified).timestamp() 56 | except Exception: 57 | last_modified_timestamp = None 58 | if last_modified_timestamp is not None: 59 | crawl_timestamp = dateutil.parser.parse(date).timestamp() 60 | dataset_dict["last_modified_timestamp"].append(last_modified_timestamp) 61 | dataset_dict["url"].append(url) 62 | dataset_dict["crawl_timestamp"].append(crawl_timestamp) 63 | # Zip the download file again to save space. 64 | subprocess.run(f"gzip {file_path}", shell=True) 65 | pd.DataFrame(dataset_dict).to_parquet(path.join(args.tmp_dir, filename + ".filtered.parquet")) 66 | p = Process(target=get_dataset, args=(filenames,)) 67 | p.start() 68 | processes.append(p) 69 | 70 | for p in processes: 71 | p.join() 72 | 73 | ds = load_dataset("parquet", data_files=path.join(args.tmp_dir, "*.parquet")) 74 | ds.save_to_disk(args.output_dataset_name) 75 | 76 | rmtree(args.tmp_dir) 77 | 78 | if args.push_to_hub: 79 | ds.push_to_hub(args.output_dataset_name) 80 | 81 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/get_text_dataset_from_wet_downloads.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | from tqdm import tqdm 3 | import pandas as pd 4 | import subprocess 5 | from multiprocessing import Process 6 | from os import walk, mkdir, path 7 | from shutil import move, rmtree 8 | import dateutil 9 | import argparse 10 | 11 | parser = argparse.ArgumentParser(description="Turns downloads from download_common_crawl.py into a Hugging Face dataset, split by language (language is identified using a FastText model). 
The dataset has a timestamp column for the time it was crawled, along with a url column and, of course, a text column.") 12 | parser.add_argument("--download_dir", help="The directory of the downloaded WET files.", required=True) 13 | parser.add_argument("--output_dataset_name", help="The name of the Hugging Face dataset which will be saved upon completion of this program.", required=True) 14 | parser.add_argument("--num_proc", type=int, help="The number of processes to use, at a minimum.", required=True) 15 | parser.add_argument("--tmp_dir", default=".tmp_get_dataset_from_downloads", help="The directory to store temporary files. The directory will be deleted upon completion of this script. Defaults to .tmp_get_dataset_from_downloads.") 16 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the Hugging Face dataset to the Hugging Face Hub after saving a copy to the disk.") 17 | args = parser.parse_args() 18 | 19 | if path.exists(args.tmp_dir): 20 | rmtree(args.tmp_dir) 21 | 22 | mkdir(args.tmp_dir) 23 | 24 | tmp_download_dir = path.join(args.tmp_dir, "downloads") 25 | 26 | move(args.download_dir, tmp_download_dir) 27 | 28 | filenames = next(walk(tmp_download_dir), (None, None, []))[2] 29 | 30 | def split_a_into_n_parts(a, n): 31 | k, m = divmod(len(a), n) 32 | return [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)] 33 | 34 | ungoliant_pipeline_output_dirs = [] 35 | filename_per_directory = [names for names in split_a_into_n_parts(filenames, args.num_proc) if len(names) != 0] 36 | num_files_awaiting_processing = 0 37 | dirs_awaiting_processing = [] 38 | def do_parallel_pipeline_processing(dirs_awaiting_processing): 39 | processes = [] 40 | for obj in dirs_awaiting_processing: 41 | p = subprocess.Popen(f"ungoliant pipeline --lid-path=sp_kenlm_ft_models/lid.176.bin {obj['download_chunk_dir']} {obj['pipeline_output_dir']}", shell=True) 42 | processes.append(p) 43 | for p in processes: 44 | p.wait() 45 | 46 | # This loop runs the ungoliant pipeline num_proc times, producing num_proc output directories. 47 | # The ungoliant pipeline is already parallelized, so we don't do this to make the ungoliant pipeline itself run faster. 48 | # Instead, we do this so that we end up with num_proc output files, which we can then load in parallel into 49 | # pandas dataframes that will eventually be turned into a Hugging Face dataset. 
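# Each ungoliant run writes one {language_id}_meta.jsonl file per detected language into its output dir; those files are converted to parquet and loaded below.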
50 | ungoliant_pipeline_results = path.join(args.tmp_dir, "ungoliant_pipeline_results") 51 | mkdir(ungoliant_pipeline_results) 52 | for i in range(len(filename_per_directory)): 53 | download_chunk_dir = path.join(tmp_download_dir, "chunk_" + str(i)) 54 | mkdir(download_chunk_dir) 55 | for filename in filename_per_directory[i]: 56 | num_files_awaiting_processing += 1 57 | move(path.join(tmp_download_dir, filename), path.join(download_chunk_dir, filename)) 58 | pipeline_output_dir = path.join(ungoliant_pipeline_results, "chunk_" + str(i)) 59 | mkdir(pipeline_output_dir) 60 | ungoliant_pipeline_output_dirs.append(pipeline_output_dir) 61 | dirs_awaiting_processing.append({"pipeline_output_dir": pipeline_output_dir, "download_chunk_dir": download_chunk_dir}) 62 | if num_files_awaiting_processing >= args.num_proc: 63 | do_parallel_pipeline_processing(dirs_awaiting_processing) 64 | num_files_awaiting_processing = 0 65 | dirs_awaiting_processing = [] 66 | 67 | do_parallel_pipeline_processing(dirs_awaiting_processing) 68 | 69 | # For some reason, datasets errors out if we try to load directly from the jsonl, so we need to do this first. 70 | processes = [] 71 | for ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs: 72 | language_filenames = [name for name in next(walk(ungoliant_pipeline_output_dir), (None, None, []))[2] if name.endswith("_meta.jsonl")] 73 | language_ids = [language_filename.split("_")[0] for language_filename in language_filenames] 74 | def convert_to_parquet_and_reformat(ungoliant_pipeline_output_dir): 75 | for language_filename in language_filenames: 76 | language_id = language_filename.split("_")[0] 77 | i = 0 78 | print("Chunking the ungoliant json into several parquet files and reformatting before loading into huggingface dataset.") 79 | parquet_file_dir = path.join(ungoliant_pipeline_output_dir, language_id + "_parquet") 80 | mkdir(parquet_file_dir) 81 | for chunk in tqdm(pd.read_json(path.join(ungoliant_pipeline_output_dir, language_id + "_meta.jsonl"), lines=True, chunksize=10000)): 82 | parquet_file_path = path.join(parquet_file_dir, str(i) + ".parquet") 83 | chunk["url"] = chunk.apply(lambda row: row["warc_headers"]["warc-target-uri"], axis=1) 84 | chunk["crawl_timestamp"] = chunk.apply(lambda row: dateutil.parser.parse(row["warc_headers"]["warc-date"]).timestamp(), axis=1) 85 | chunk.drop(columns=["warc_headers", "metadata"], inplace=True) 86 | chunk.rename(columns={"content": "text"}, inplace=True) 87 | chunk.to_parquet(parquet_file_path) 88 | i += 1 89 | p = Process(target=convert_to_parquet_and_reformat, args=(ungoliant_pipeline_output_dir,)) 90 | p.start() 91 | processes.append(p) 92 | 93 | for p in processes: 94 | p.join() 95 | 96 | data_files = {language_id: [path.join(ungoliant_pipeline_output_dir, language_id + "_parquet", "*.parquet") for ungoliant_pipeline_output_dir in ungoliant_pipeline_output_dirs] for language_id in language_ids} 97 | ds = load_dataset("parquet", data_files=data_files) 98 | ds.save_to_disk(args.output_dataset_name) 99 | rmtree(args.tmp_dir) 100 | 101 | if args.push_to_hub: 102 | ds.push_to_hub(args.output_dataset_name) 103 | -------------------------------------------------------------------------------- /pipeline_scripts/common_crawl/remove_wikipedia_urls.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset, load_from_disk 2 | import argparse 3 | 4 | parser = argparse.ArgumentParser(description="Removes all examples from a Hugging Face dataset if they have 
a Wikipedia URL. This script is intended to be used if you eventually want to merge the dataset with a Wikipedia snapshot. In that case, examples from Wikipedia in this dataset are redundant.") 5 | parser.add_argument("--input_dataset_name", help="Input dataset name.", required=True) 6 | parser.add_argument("--output_dataset_name", help="Output dataset name.", required=True) 7 | parser.add_argument("--url_column", help="Name of the URL column of the dataset.", required=True) 8 | parser.add_argument("--split", default=None, help="The split of the dataset to use. Some datasets don't have splits, so it is optional.") 9 | parser.add_argument("--num_proc", type=int, help="The number of processes to use.") 10 | parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the output dataset to the Hugging Face hub after saving to the disk.") 11 | parser.add_argument("--load_from_hub_instead_of_disk", action="store_true", help="Whether to load the input dataset by name from the Hugging Face hub. If this argument isn't specified then the input dataset will be loaded from a directory of the same name on the disk.") 12 | args = parser.parse_args() 13 | 14 | if args.load_from_hub_instead_of_disk: 15 | if args.split is None: 16 | ds = load_dataset(args.input_dataset_name) 17 | else: 18 | ds = load_dataset(args.input_dataset_name, split=args.split) 19 | else: 20 | if args.split is None: 21 | ds = load_from_disk(args.input_dataset_name) 22 | else: 23 | ds = load_from_disk(args.input_dataset_name)[args.split] 24 | 25 | ds = ds.filter(lambda example: not example[args.url_column].startswith("https://en.wikipedia.org/wiki/"), num_proc=args.num_proc) 26 | 27 | ds.save_to_disk(args.output_dataset_name) 28 | 29 | if args.push_to_hub: 30 | ds.push_to_hub(args.output_dataset_name) 31 | -------------------------------------------------------------------------------- /pipeline_scripts/wikipedia/README.md: -------------------------------------------------------------------------------- 1 | Per the repository [here](https://huggingface.co/datasets/olm/wikipedia), just run this Python code. It uses all CPUs available and should take less than an hour if you have a lot of CPUs (on the order of 100). 2 | 3 | ``` 4 | from datasets import load_dataset 5 | 6 | ds = load_dataset("olm/wikipedia", language="en", date="20220920") 7 | 8 | ds.save_to_disk("wikipedia_en_20220920") 9 | ds.push_to_hub("wikipedia_en_20220920") 10 | ``` 11 | 12 | The code pulls the Wikipedia snapshot for the given date and language and does all the processing required to turn it into a clean pretraining dataset. You can get the dates for the latest Wikipedia snapshots here: [https://dumps.wikimedia.org/enwiki/](https://dumps.wikimedia.org/enwiki/). 13 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | datasets==2.6.1 2 | emoji==1.7.0 3 | fasttext==0.9.2 4 | sentencepiece==0.1.97 5 | pypi-kenlm==0.1.20220713 6 | text-dedup==0.2.1 7 | argparse==1.4.0 8 | dateparser==1.1.1 9 | mwparserfromhell==0.6.4 10 | matplotlib==3.6.2 11 | multiprocess==0.70.13 12 | --------------------------------------------------------------------------------