├── LICENSE ├── README.md ├── assets └── images │ └── logo.jpg ├── audit ├── GlotCC-audit.csv └── annotated_glotlid_vs_nllb_top20.txt ├── filters └── filter-v1.ipynb └── statistics ├── analyze-stat-2.ipynb └── v1.0-stat-2.zip /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 
31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. 
To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GlotCC HomePage 2 | 3 | 4 | 5 | arXiv 6 | 7 | 8 | **GlotCC is a multilingual corpus built from CommonCrawl using GlotLID language identification and the cisnlp/Ungoliant pipeline.** 9 | 10 | The latest version supports more than **1000 languages** and is filtered with filters adapted from C4, CCNet, MADLAD-400, RedPajama-Data-v2, OSCAR, Gopher, RefinedWeb, FineWeb, Datatrove, Dolma, Pile-CC, Pretrainer's Guide, and GlotScript. ™ The logo features a llama, drawn in the style of C.C. from the Code Geass anime, reading a book. 11 | 12 | ## Dataset 13 | 14 | GlotCC Dataset, Version 1: [https://huggingface.co/datasets/cis-lmu/GlotCC-V1](https://huggingface.co/datasets/cis-lmu/GlotCC-V1) 15 | 16 | 17 | ## Running the pipeline 18 | 19 | We forked oscar-project/ungoliant to cisnlp/ungoliant and made the necessary changes to integrate it with the [GlotLID](https://github.com/cisnlp/glotlid) language identification model. 20 | 21 | For detailed instructions on running the pipeline, refer to the [cisnlp/ungoliant repository](https://github.com/cisnlp/ungoliant). Its README is up to date. 22 | 23 | ## Acknowledgements 24 | 25 | - We appreciate the collaborators who are collectively advancing the frontier of open datasets and LLMs.
26 | - Thanks to the community and friends whose help made a higher-quality audit of this dataset possible, and to everyone contributing to the GlotCC dataset. 27 | - Our gratitude extends to the OSCAR team for pioneering the development of open pipelines and datasets from CommonCrawl, as well as to the CommonCrawl team. 28 | 29 | ## License 30 | 31 | - GlotCC data is released under the following licensing scheme: We do not own any of the text from which this data has been extracted. The data is licensed under the terms of the CommonCrawl [Terms of Use](https://commoncrawl.org/terms-of-use). We license the actual packaging, metadata, and annotations of this data under the Creative Commons [CC0 license](https://github.com/cisnlp/GlotCC/blob/main/LICENSE). 32 | - The Ungoliant license remains unchanged: [Apache License 2.0](https://github.com/cisnlp/ungoliant/blob/main/LICENSE). 33 | - The GlotLID license remains unchanged: [Apache License 2.0](https://github.com/cisnlp/GlotLID/blob/main/LICENSE).
34 | 35 | ## Citation 36 | 37 | If you find our repo and data useful for your research, please cite: 38 | 39 | ``` 40 | @article{kargaran2024glotcc, 41 | title = {Glot{CC}: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages}, 42 | author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich}, 43 | journal = {Advances in Neural Information Processing Systems}, 44 | year = {2024}, 45 | url = {https://arxiv.org/abs/2410.23825} 46 | } 47 | ``` 48 | 49 | 50 | -------------------------------------------------------------------------------- /assets/images/logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cisnlp/GlotCC/824b072bc82558dd0ea8d1176976aba1f207b57c/assets/images/logo.jpg -------------------------------------------------------------------------------- /audit/GlotCC-audit.csv: -------------------------------------------------------------------------------- 1 | codes,https://huggingface.co/datasets/cis-lmu/GlotCC-V1,comment (https://iso639-3.sil.org/code/prs) 2 | aau-Latn,20/20, 3 | aaz-Latn,15/15,non-bible and bible 4 | abi-Latn,0/1,out of model cousin -- kpo 5 | abk-Cyrl,20/20, 6 | abq-Cyrl,15/15,mostly https://abaza.org/abq 7 | abs-Latn,?,out of model cousin -- mkn | the rest is lyrics 8 | abt-Latn,6/6, 9 | abx-Latn,3/3, 10 | aby-Latn,1/1, 11 | ace-Arab,1/1, 12 | ace-Latn,0/1,malay in arabic script 13 | acf-Latn,5/6,"list of words end with ""ppy"" from https://anagrams.app/" 14 | ach-Latn,10/10, 15 | acm-Arab,0/1,"misclassification, true label: msa_Arab or zsm_Arab" 16 | acn-Latn,1/1, 17 | acr-Latn,3/3, 18 | acu-Latn,20/20, 19 | ada-Latn,3/3, 20 | adi-Latn,12/12, 21 | adj-Latn,5/5, 22 | ady-Cyrl,20/20,mostly http://www.adyglife.com & https://www.circassianpress.com & https://dunai.adigeatoday.ru/ 23 | adz-Latn,6/6, 24 | aeb-Arab,?,"It's mostly pages with the word ""Tunis"". 
But many of them also come from pages with "".tn"" as domain" 25 | aer-Latn,1/1, 26 | aey-Latn,2/2, 27 | afr-Latn,20/20, 28 | agd-Latn,20/20, 29 | agm-Latn,11/11, 30 | agn-Latn,2/2,bible-related 31 | agr-Latn,17/17,udhr & bible 32 | agt-Latn,2/2, 33 | agu-Latn,1/1, 34 | agw-Latn,1/1, 35 | aii-Syrc,20/20,https://reimagine.today 36 | aim-Latn,2/2, 37 | ain-Latn,2/2, 38 | ajg-Latn,0/5,out-of-model-cousin; change the name to sxw: https://saxwe.net/ 39 | ajp-Arab,?,ajp code is deprecated. join to apc. 40 | ajz-Latn,2/2, 41 | akb-Latn,17/20,out of model cousin from btm 42 | akh-Latn,2/2, 43 | aln-Latn,, 44 | alp-Latn,4/4, 45 | alq-Latn,1/1, 46 | als-Latn,, 47 | alt-Cyrl,, 48 | aly-Latn,, 49 | ame-Latn,0/1,list case 4epoch 50 | amf-Latn,15/15, 51 | amh-Ethi,, 52 | ami-Latn,, 53 | amm-Latn,1/1, 54 | amp-Latn,0/2,list noise starts with yikv and yifoa 55 | amu-Latn,8/8, 56 | ang-Latn,, 57 | ann-Latn,, 58 | anp-Deva,, 59 | any-Latn,1/1, 60 | aoi-Latn,, 61 | aoj-Latn,, 62 | aom-Latn,7/7, 63 | aoz-Latn,2/4,out-of-model cousin bkx; https://www.globalrecordings.net/es/script/bkx/395 64 | apb-Latn,, 65 | apc-Arab,, 66 | ape-Latn,7/8,incorrect from aon -- out of model cousin -- but we know aon was too close to ape 67 | apr-Latn,4/4, 68 | apt-Latn,, 69 | apw-Latn,0/1,repeated ngram of ííííííjjjjjj 70 | apz-Latn,, 71 | arb-Arab,, 72 | arb-Latn,, 73 | arg-Latn,, 74 | arl-Latn,, 75 | arn-Latn,, 76 | arq-Arab,2/3, 77 | ars-Arab,, 78 | ary-Arab,, 79 | arz-Arab,, 80 | asm-Beng,, 81 | asm-Latn,, 82 | aso-Latn,4/4, 83 | ast-Latn,, 84 | ata-Latn,, 85 | atb-Latn,9/9, 86 | atd-Latn,1/1, 87 | atj-Latn,, 88 | atq-Latn,1/1, 89 | att-Latn,23/23, 90 | aui-Latn,4/4, 91 | auy-Latn,1/1, 92 | ava-Cyrl,, 93 | avk-Latn,, 94 | avt-Latn,34/34, 95 | avu-Latn,7/7, 96 | awa-Deva,, 97 | awx-Latn,, 98 | ayr-Latn,, 99 | azb-Arab,, 100 | azg-Latn,12/12, 101 | azj-Cyrl,, 102 | azj-Latn,20/20, 103 | azz-Latn,0/2,list case noise tat... 
104 | bak-Cyrl,20/20, 105 | bam-Latn,, 106 | ban-Latn,, 107 | bar-Latn,, 108 | bas-Latn,2/2, 109 | bba-Latn,, 110 | bbb-Latn,8/8, 111 | bbc-Latn,, 112 | bbr-Latn,13/13, 113 | bcc-Arab,20/20,change the name to bal-Arab; it has other cousins. This is also a great resource for bgp: https://incubator.m.wikimedia.org/wiki/Special:PrefixIndex/Wp/bgp/ 114 | bch-Latn,, 115 | bci-Latn,18/18, 116 | bcl-Latn,, 117 | bdd-Latn,8/8, 118 | bdh-Latn,2/2, 119 | bdq-Latn,, 120 | bef-Latn,1/1, 121 | bel-Cyrl,, 122 | bem-Latn,, 123 | ben-Beng,, 124 | ben-Latn,, 125 | bew-Latn,, 126 | bgr-Latn,6/6, 127 | bgs-Latn,6/6, 128 | bgz-Latn,1/1, 129 | bhg-Latn,12/14,repeated ngrams: remove https://tenymalagasy.org/ and https://malagasyword.org 130 | bhl-Latn,4/4, 131 | bho-Deva,, 132 | bhw-Latn,2/2, 133 | big-Latn,4/4, 134 | bim-Latn,, 135 | bin-Latn,, 136 | bis-Latn,, 137 | biu-Latn,1/1, 138 | bjn-Arab,, 139 | bjn-Latn,, 140 | bjp-Latn,, 141 | bjr-Latn,5/5, 142 | bjv-Latn,0/3,out of model cousin for: mge 143 | bkd-Latn,18/18, 144 | bkl-Latn,1/1, 145 | blk-Mymr,20/20, 146 | blw-Latn,4/4, 147 | blz-Latn,3/3, 148 | bmh-Latn,11/11, 149 | bmu-Latn,20/20, 150 | bno-Latn,2/2, 151 | bnp-Latn,13/13, 152 | boa-Latn,20/20, 153 | bod-Tibt,, 154 | boj-Latn,2/2, 155 | bom-Latn,1/1, 156 | bon-Latn,, 157 | bos-Latn,, 158 | bpr-Latn,1/1, 159 | bps-Latn,, 160 | bpy-Beng,, 161 | bqc-Latn,2/2, 162 | bqj-Latn,1/1, 163 | bre-Latn,, 164 | brh-Arab,20/20, 165 | bru-Latn,, 166 | brx-Deva,, 167 | brx-Latn,0/1,phba repeated word 168 | bsp-Latn,, 169 | btd-Latn,3/3, 170 | bts-Latn,, 171 | btx-Latn,, 172 | bug-Latn,, 173 | buk-Latn,13/13, 174 | bul-Cyrl,, 175 | bum-Latn,, 176 | bus-Latn,, 177 | bvr-Latn,9/9, 178 | bwd-Latn,1/1, 179 | bxh-Latn,, 180 | bxr-Cyrl,, 181 | byr-Latn,20/20, 182 | bzd-Latn,2/2, 183 | bzi-Thai,2/2, 184 | bzj-Latn,3/3, 185 | caa-Latn,7/7, 186 | cab-Latn,, 187 | cac-Latn,10/11,start with tzyu 188 | cag-Latn,1/1, 189 | cak-Latn,, 190 | cao-Latn,15/15, 191 | car-Latn,13/13, 192 | cat-Latn,, 193 | 
cax-Latn,3/3, 194 | cbi-Latn,1/1, 195 | cbk-Latn,, 196 | cbr-Latn,, 197 | cbs-Latn,1/1, 198 | cbt-Latn,, 199 | cbu-Latn,, 200 | ccp-Latn,, 201 | ceb-Latn,, 202 | ces-Latn,, 203 | cfm-Latn,, 204 | cgc-Latn,10/12,noise from month names 205 | cha-Latn,3/3, 206 | chd-Latn,19/19, 207 | che-Cyrl,, 208 | chk-Latn,, 209 | cho-Latn,4/4, 210 | chu-Cyrl,, 211 | chv-Cyrl,, 212 | chw-Latn,2/3,repeated ddddddddddddddddddddddddddddddddddddd 213 | cjk-Latn,, 214 | cjv-Latn,20/20, 215 | ckb-Arab,, 216 | ckm-Latn,, 217 | ckt-Cyrl,, 218 | clu-Latn,8/8, 219 | cmn-Hani,20/20, 220 | cmo-Latn,, 221 | cmr-Latn,, 222 | cnh-Latn,, 223 | cni-Latn,20/20, 224 | cnk-Latn,, 225 | cnr-Latn,, 226 | cnw-Latn,1/1, 227 | con-Latn,1/1, 228 | cop-Copt,, 229 | cor-Latn,, 230 | cos-Latn,, 231 | cot-Latn,11/11, 232 | cpb-Latn,1/1, 233 | cpc-Latn,20/20, 234 | cpy-Latn,14/14, 235 | crh-Cyrl,, 236 | crh-Latn,, 237 | cri-Latn,0/3,noise 238 | crk-Cans,2/2, 239 | crk-Latn,, 240 | crl-Cans,2/2, 241 | crn-Latn,13/13, 242 | crs-Latn,, 243 | crx-Latn,4/4, 244 | csb-Latn,, 245 | csw-Latn,6/6, 246 | csy-Latn,, 247 | cta-Latn,3/3, 248 | ctd-Latn,, 249 | ctu-Latn,2/2, 250 | cub-Latn,5/5, 251 | cuk-Latn,, 252 | cut-Latn,18/18, 253 | cwe-Latn,1/1, 254 | cwt-Latn,2/2, 255 | cya-Latn,4/4, 256 | cym-Latn,, 257 | czt-Latn,4/4, 258 | dad-Latn,20/20, 259 | dag-Latn,0/20,only keep wikipedia -- the training data of dag is not very clean 260 | dah-Latn,7/7, 261 | dak-Latn,16/16, 262 | dan-Latn,, 263 | dar-Cyrl,, 264 | ddg-Latn,1/1, 265 | ded-Latn,4/4, 266 | deu-Latn,20/20, 267 | dga-Latn,20/20, 268 | dgc-Latn,10/10, 269 | dgz-Latn,13/13, 270 | dhg-Latn,1/1, 271 | dhm-Latn,1/1, 272 | dhv-Latn,3/3, 273 | did-Latn,1/1, 274 | dik-Latn,20/20, 275 | diq-Latn,, 276 | dis-Latn,1/1, 277 | div-Thaa,, 278 | dje-Latn,5/5, 279 | djk-Latn,7/7, 280 | djr-Latn,1/1, 281 | dng-Cyrl,, 282 | dob-Latn,, 283 | doi-Deva,, 284 | dop-Latn,, 285 | dru-Latn,4/4, 286 | dsb-Latn,, 287 | dtp-Latn,, 288 | dts-Latn,2/2, 289 | dua-Latn,, 290 | 
dwr-Latn,0/17,noise: tags # 291 | dww-Latn,1/1, 292 | dyi-Latn,2/2, 293 | dyo-Latn,1/1, 294 | dyu-Latn,, 295 | dzo-Tibt,, 296 | efi-Latn,, 297 | ekk-Latn,, 298 | eko-Latn,4/4, 299 | ell-Grek,, 300 | emi-Latn,20/20, 301 | eml-Latn,, 302 | emp-Latn,4/4, 303 | eng-Latn,20/20, 304 | enl-Latn,2/2, 305 | enm-Latn,?,remove those in which every word starts with # 306 | enq-Latn,20/20, 307 | epo-Latn,, 308 | eri-Latn,3/3, 309 | esk-Latn,, 310 | esu-Latn,, 311 | etr-Latn,, 312 | eus-Latn,, 313 | eve-Cyrl,, 314 | ewe-Latn,, 315 | ewo-Latn,4/4, 316 | ext-Latn,, 317 | eza-Latn,1/1, 318 | faa-Latn,6/6, 319 | fai-Latn,5/5, 320 | fao-Latn,, 321 | fas-Arab,20/20, 322 | fat-Latn,2/2, 323 | ffm-Latn,20/20,content in fuc: https://contafrica.org/fr/index.php?/contes/conte-nat/sambayel-gujjo-tanum-pennowo 324 | fij-Latn,, 325 | fil-Latn,, 326 | fin-Latn,20/20, 327 | fit-Latn,, 328 | fkv-Latn,, 329 | fon-Latn,20/20, 330 | for-Latn,, 331 | fra-Latn,, 332 | fro-Latn,, 333 | frp-Latn,, 334 | frr-Latn,, 335 | fry-Latn,, 336 | fub-Latn,, 337 | fud-Latn,3/3, 338 | fue-Latn,, 339 | fuf-Latn,, 340 | fuh-Latn,20/20, 341 | fuq-Latn,0/1,noise 342 | fur-Latn,, 343 | fuv-Latn,, 344 | gaa-Latn,, 345 | gag-Latn,, 346 | gah-Latn,, 347 | gai-Latn,1/1, 348 | gam-Latn,16/16, 349 | gaw-Latn,20/20, 350 | gaz-Latn,, 351 | gbi-Latn,2/2, 352 | gcf-Latn,, 353 | gcr-Latn,, 354 | gdg-Latn,13/13, 355 | gdn-Latn,19/19, 356 | geb-Latn,18/18, 357 | gfk-Latn,, 358 | ghs-Latn,20/20, 359 | gil-Latn,, 360 | gjn-Latn,2/2, 361 | gla-Latn,, 362 | gle-Latn,, 363 | glg-Latn,, 364 | glk-Arab,5/20,"remove ""قالب بلاگ "" The classifier obsesses over locations // keep the glk.wikipedia.org https://v6rg.com https://gilakmedia.com https://www.fekrazad.com/hachin/ -- also later move the fas contents to the training of glotlid -- so it becomes robust towards glk" 365 | glv-Latn,, 366 | gmh-Latn,, 367 | gmv-Ethi,2/2, 368 | gnb-Latn,4/4, 369 | gnd-Latn,0/3,complete noise 370 | gng-Latn,3/3, 371 | gnn-Latn,20/20, 372 | gnw-Latn,1/1, 373 | 
gof-Ethi,0/4,dwr: https://ebible.org/dwrENT/MRK01.htm 374 | goh-Latn,, 375 | gom-Deva,, 376 | gom-Latn,20/20, 377 | gor-Latn,20/20, 378 | gos-Latn,, 379 | got-Goth,2/2, 380 | got-Latn,15/15, 381 | grc-Grek,, 382 | gsw-Latn,, 383 | gub-Latn,1/1, 384 | guc-Latn,6/6, 385 | gug-Latn,, 386 | gui-Latn,0/3,noise 387 | guj-Gujr,, 388 | guj-Latn,, 389 | gul-Latn,, 390 | gum-Latn,, 391 | gur-Latn,1/1, 392 | guw-Latn,20/20, 393 | gux-Latn,2/2, 394 | guz-Latn,, 395 | gvn-Latn,1/5,change 4 of the records to gup -- out of model cousin 396 | gwi-Latn,12/12, 397 | gwr-Latn,1/1, 398 | gym-Latn,1/1, 399 | gyr-Latn,9/9, 400 | hac-Arab,, 401 | hag-Latn,1/1, 402 | hak-Hani,0/2, 403 | hak-Latn,20/20, 404 | hat-Latn,, 405 | hau-Latn,, 406 | hav-Latn,0/1,noise 407 | haw-Latn,20/20, 408 | hbo-Hebr,, 409 | hch-Latn,7/7, 410 | heb-Hebr,, 411 | heg-Latn,7/7, 412 | her-Latn,20/20, 413 | hif-Latn,, 414 | hil-Latn,, 415 | hin-Deva,20/20, 416 | hin-Latn,, 417 | hla-Latn,2/2, 418 | hlt-Latn,0/5,noise 419 | hmo-Latn,20/20, 420 | hmr-Latn,, 421 | hne-Deva,, 422 | hnj-Latn,, 423 | hns-Latn,20/20, 424 | hot-Latn,20/20, 425 | hra-Latn,1/1, 426 | hrv-Latn,, 427 | hsb-Latn,, 428 | hto-Latn,2/2, 429 | hub-Latn,, 430 | hui-Latn,, 431 | hun-Latn,, 432 | hus-Latn,6/6, 433 | huu-Latn,, 434 | hvn-Latn,, 435 | hwc-Latn,, 436 | hye-Armn,, 437 | hyw-Armn,, 438 | ian-Latn,18/18, 439 | iba-Latn,, 440 | ibg-Latn,, 441 | ibo-Latn,, 442 | icr-Latn,2/3,one jam.wikipedia should be changed 443 | ido-Latn,, 444 | idu-Latn,1/1, 445 | ifu-Latn,0/1,noise 446 | ify-Latn,20/20, 447 | ike-Cans,20/20, 448 | ikk-Latn,1/1, 449 | ikt-Latn,, 450 | ikw-Latn,2/2, 451 | ile-Latn,, 452 | ilo-Latn,20/20, 453 | ina-Latn,, 454 | inb-Latn,1/1, 455 | ind-Latn,20/20, 456 | inh-Cyrl,, 457 | ino-Latn,5/5, 458 | iou-Latn,19/20,out of model cousin from mcr 459 | ipi-Latn,20/20, 460 | iqw-Latn,1/1, 461 | isd-Latn,3/3, 462 | ish-Latn,1/1,nice web: https://www.esanland.org/2021/06/esan-adage-and-meaning.html 463 | isl-Latn,20/20, 464 | 
iso-Latn,20/20,jw 465 | ita-Latn,20/20, 466 | itv-Latn,1/1, 467 | ium-Latn,20/20,"bible, add http://kinhthanh.httlvn.org to the list of religious sites" 468 | ivb-Latn,2/2, 469 | ivv-Latn,0/1,out of model cousin: tao language 470 | iws-Latn,20/20, 471 | ixl-Latn,15/15, 472 | jac-Latn,20/20, 473 | jae-Latn,8/8, 474 | jam-Latn,, 475 | jav-Latn,, 476 | jbo-Latn,, 477 | jiv-Latn,, 478 | jpn-Jpan,, 479 | jra-Latn,, 480 | jvn-Latn,6/7, 481 | kaa-Cyrl,19/19, 482 | kaa-Latn,, 483 | kab-Latn,, 484 | kac-Latn,, 485 | kak-Latn,1/1, 486 | kal-Latn,, 487 | kam-Latn,4/4, 488 | kan-Knda,, 489 | kan-Latn,, 490 | kaq-Latn,19/20,amc from UDHR is part of this corpora - the rest are bible 491 | kas-Arab,, 492 | kas-Deva,, 493 | kas-Latn,, 494 | kat-Geor,, 495 | kaz-Cyrl,, 496 | kbc-Latn,1/1, 497 | kbd-Cyrl,, 498 | kbh-Latn,20/20, 499 | kbm-Latn,10/10, 500 | kbo-Latn,7/7, 501 | kbp-Latn,, 502 | kbq-Latn,4/4, 503 | kby-Latn,4/4, 504 | kca-Cyrl,14/15, 505 | kcg-Latn,2/2, 506 | kck-Latn,3/3, 507 | kdc-Latn,9/9, 508 | kde-Latn,, 509 | kdi-Latn,2/2, 510 | kdl-Latn,20/20, 511 | kea-Latn,, 512 | kek-Latn,7/7, 513 | kew-Latn,9/9, 514 | kgk-Latn,, 515 | kgr-Latn,2/2, 516 | kha-Latn,, 517 | khk-Cyrl,, 518 | khm-Khmr,, 519 | khz-Latn,8/8, 520 | kij-Latn,, 521 | kik-Latn,20/20, 522 | kin-Latn,, 523 | kir-Cyrl,, 524 | kiu-Latn,, 525 | kix-Latn,, 526 | kjb-Latn,, 527 | kjh-Cyrl,, 528 | kjs-Latn,, 529 | kkc-Latn,, 530 | kkl-Latn,, 531 | klt-Latn,, 532 | klv-Latn,, 533 | kmb-Latn,, 534 | kmd-Latn,, 535 | kmg-Latn,, 536 | kmh-Latn,, 537 | kmk-Latn,, 538 | kmm-Latn,, 539 | kmr-Cyrl,, 540 | kmr-Latn,, 541 | kms-Latn,, 542 | kmu-Latn,, 543 | kmy-Latn,, 544 | knc-Arab,, 545 | knc-Latn,, 546 | kne-Latn,, 547 | kng-Latn,, 548 | knv-Latn,, 549 | kog-Latn,, 550 | koi-Cyrl,, 551 | kor-Hang,20/20, 552 | kos-Latn,, 553 | kpe-Latn,, 554 | kpf-Latn,, 555 | kpg-Latn,, 556 | kpj-Latn,, 557 | kpr-Latn,, 558 | kpv-Cyrl,20/20, 559 | kpw-Latn,, 560 | kpx-Latn,, 561 | kqc-Latn,, 562 | kqf-Latn,, 563 | kqn-Latn,, 564 | 
kqw-Latn,, 565 | krc-Cyrl,, 566 | kri-Latn,, 567 | krl-Latn,, 568 | kru-Deva,, 569 | ksd-Latn,, 570 | ksh-Latn,, 571 | ksr-Latn,, 572 | ksw-Mymr,, 573 | ktm-Latn,, 574 | kto-Latn,, 575 | ktu-Latn,, 576 | ktz-Latn,, 577 | kua-Latn,, 578 | kum-Cyrl,, 579 | kup-Latn,, 580 | kvg-Latn,, 581 | kwf-Latn,, 582 | kwj-Latn,, 583 | kwn-Latn,, 584 | kwy-Latn,, 585 | kyc-Latn,, 586 | kyg-Latn,, 587 | kyq-Latn,, 588 | kyu-Kali,, 589 | kyu-Latn,, 590 | kyz-Latn,, 591 | kze-Latn,, 592 | kzf-Latn,, 593 | kzj-Latn,, 594 | lad-Latn,, 595 | laj-Latn,, 596 | lam-Latn,, 597 | lao-Laoo,, 598 | lat-Latn,, 599 | lbb-Latn,, 600 | lbe-Cyrl,, 601 | lbj-Tibt,13/13, 602 | lbk-Latn,, 603 | lcm-Latn,, 604 | leu-Latn,10/10, 605 | lew-Latn,, 606 | lex-Latn,6/6, 607 | lez-Cyrl,20/20, 608 | lfn-Cyrl,, 609 | lfn-Latn,, 610 | lgg-Latn,1/1, 611 | lgl-Latn,9/9, 612 | lhi-Latn,, 613 | lhu-Latn,0/1,all is NOISE 614 | lia-Latn,1/1, 615 | lid-Latn,, 616 | lif-Deva,20/20, 617 | lif-Limb,, 618 | lij-Latn,, 619 | lim-Latn,, 620 | lin-Latn,, 621 | lis-Lisu,1/1, 622 | lit-Latn,, 623 | liv-Latn,20/20, 624 | ljp-Latn,, 625 | lki-Arab,, 626 | lld-Latn,, 627 | llg-Latn,, 628 | lmk-Latn,1/1, 629 | lmo-Latn,, 630 | lnd-Latn,2/2, 631 | loz-Latn,, 632 | lrc-Arab,, 633 | lsi-Latn,5/5, 634 | ltg-Latn,, 635 | ltz-Latn,, 636 | lua-Latn,2/2, 637 | lub-Latn,6/6, 638 | lud-Latn,, 639 | lue-Latn,13/13, 640 | lug-Latn,, 641 | lun-Latn,1/1, 642 | luo-Latn,, 643 | lus-Latn,, 644 | lvs-Latn,, 645 | lwg-Latn,1/1, 646 | lww-Latn,20/20, 647 | lzh-Hani,, 648 | maa-Latn,, 649 | mad-Latn,, 650 | mag-Deva,, 651 | mah-Latn,, 652 | mai-Deva,, 653 | mak-Latn,, 654 | mal-Latn,, 655 | mal-Mlym,, 656 | mam-Latn,10/10, 657 | mar-Deva,, 658 | mar-Latn,, 659 | mas-Latn,15/15, 660 | mau-Latn,, 661 | maw-Latn,, 662 | maz-Latn,5/5, 663 | mbb-Latn,, 664 | mbf-Latn,, 665 | mbh-Latn,5/5, 666 | mbs-Latn,, 667 | mbt-Latn,3/3, 668 | mcd-Latn,2/2, 669 | mck-Latn,1/1, 670 | mcq-Latn,, 671 | mdf-Cyrl,, 672 | mdy-Ethi,, 673 | med-Latn,, 674 | mej-Latn,2/2, 675 
| mek-Latn,5/5, 676 | men-Latn,, 677 | met-Latn,, 678 | meu-Latn,, 679 | mfe-Latn,, 680 | mfi-Latn,, 681 | mfq-Latn,10/10, 682 | mfy-Latn,, 683 | mgh-Latn,, 684 | mhi-Latn,, 685 | mhr-Cyrl,, 686 | mhw-Latn,9/9, 687 | mhx-Latn,4/4, 688 | mic-Latn,, 689 | mie-Latn,, 690 | mif-Latn,2/2, 691 | min-Latn,, 692 | mip-Latn,7/7, 693 | miq-Latn,1/1, 694 | mjc-Latn,4/4, 695 | mjw-Latn,, 696 | mkd-Cyrl,, 697 | mkn-Latn,15/15, 698 | mlh-Latn,0/7,"wrong label: mhl, but there are also sources that mention mhl for the same verses" 699 | mlt-Latn,, 700 | mmo-Latn,10/10, 701 | mmx-Latn,20/20, 702 | mna-Latn,20/20, 703 | mnb-Latn,1/2,List case 704 | mni-Beng,, 705 | mni-Latn,, 706 | mni-Mtei,19/19, 707 | mnk-Latn,9/9, 708 | mns-Cyrl,20/20, 709 | mnw-Mymr,20/20, 710 | moa-Latn,0/10,out of model cousin: mxx 711 | moc-Latn,2/2,interesting website: https://cartografiachaco.wikimedia.org.ar/ 712 | mog-Latn,, 713 | moh-Latn,, 714 | mos-Latn,20/20, 715 | mph-Latn,3/3, 716 | mpm-Latn,2/2, 717 | mpp-Latn,, 718 | mps-Latn,20/20, 719 | mpx-Latn,10/10, 720 | mqj-Latn,1/1, 721 | mqy-Latn,2/2, 722 | mri-Latn,, 723 | mrj-Cyrl,, 724 | mrw-Latn,, 725 | msb-Latn,7/7, 726 | msc-Latn,4/4, 727 | msk-Latn,4/4, 728 | msm-Latn,0/4,All Noise 729 | msy-Latn,20/20, 730 | mtg-Latn,2/2, 731 | mti-Latn,, 732 | muh-Latn,1/1, 733 | mui-Latn,, 734 | mup-Deva,0/6,"out of model cousin dhd,mtr" 735 | mur-Latn,20/20, 736 | mux-Latn,17/17, 737 | mva-Latn,3/3, 738 | mvn-Latn,9/9, 739 | mvp-Latn,13/13, 740 | mwc-Latn,2/2, 741 | mwf-Latn,1/1, 742 | mwl-Latn,, 743 | mwm-Latn,1/1, 744 | mwn-Latn,1/1, 745 | mwp-Latn,2/2, 746 | mwq-Latn,0/35,out of model cousin dao 747 | mwv-Latn,7/7, 748 | mww-Latn,, 749 | mxb-Latn,1/1, 750 | mxt-Latn,3/3, 751 | mxv-Latn,6/6, 752 | mya-Mymr,, 753 | myb-Latn,2/2, 754 | myk-Latn,20/20, 755 | myu-Latn,1/1, 756 | myv-Cyrl,, 757 | myw-Latn,3/3, 758 | myx-Latn,0/4,"List case, repeated words" 759 | myy-Latn,15/15, 760 | mza-Latn,5/5, 761 | mzh-Latn,1/1, 762 | mzn-Arab,, 763 | naf-Latn,3/3, 764 | nah-Latn,, 
765 | nak-Latn,6/6, 766 | nan-Hani,0/15,"noise, the training data for nan_Hani is not very strong." 767 | nan-Latn,12/20,nan_Latn + cdo_Latn (out of model cousin) 768 | nap-Latn,, 769 | naq-Latn,16/16, 770 | nas-Latn,18/18, 771 | nav-Latn,, 772 | nbc-Latn,3/3, 773 | nbe-Latn,2/2, 774 | nbl-Latn,, 775 | nbu-Latn,, 776 | nca-Latn,6/6, 777 | nch-Latn,, 778 | ncj-Latn,, 779 | ncl-Latn,0/1, 780 | ncx-Latn,1/1, 781 | ndc-Latn,7/7, 782 | nde-Latn,, 783 | ndh-Latn,0/1,misclassification : SUK 784 | ndo-Latn,, 785 | nds-Latn,, 786 | new-Deva,, 787 | nfa-Latn,2/2, 788 | ngl-Latn,12/12, 789 | ngu-Latn,5/5, 790 | nhe-Latn,, 791 | nhg-Latn,, 792 | nhi-Latn,16/16, 793 | nho-Latn,, 794 | nhr-Latn,3/3, 795 | nhw-Latn,, 796 | nhx-Latn,2/2, 797 | nia-Latn,, 798 | nif-Latn,3/3, 799 | nii-Latn,3/3, 800 | nij-Latn,1/1, 801 | nin-Latn,16/17,https://www.tokei.cn/zx/xzdj/7291.html. it seems Noise 802 | niu-Latn,20/20, 803 | njm-Latn,2/2, 804 | njn-Latn,, 805 | njz-Latn,2/2, 806 | nld-Latn,, 807 | nlg-Latn,, 808 | nmf-Latn,, 809 | nmz-Latn,4/4, 810 | nnb-Latn,, 811 | nnh-Latn,1/1, 812 | nno-Latn,, 813 | nnp-Latn,1/1, 814 | nnw-Latn,2/2, 815 | nob-Latn,, 816 | nog-Cyrl,, 817 | non-Latn,26/68,Remove https://edl.ecml.at 818 | nop-Latn,20/20, 819 | not-Latn,1/1, 820 | nov-Latn,11/20,keep http://interlanguages.net wikipedia and https://ia801707.us.archive.org http://novial.nyelv.info 821 | nph-Latn,1/1, 822 | npi-Deva,, 823 | npi-Latn,, 824 | npy-Latn,7/7, 825 | nqo-Nkoo,, 826 | nrm-Latn,, 827 | nsa-Latn,1/1, 828 | nsn-Latn,, 829 | nso-Latn,, 830 | nss-Latn,11/11, 831 | nst-Latn,1/1, 832 | nsu-Latn,4/4, 833 | ntp-Latn,, 834 | ntu-Latn,1/2,remove https://loderi.com/kci 835 | nuj-Latn,8/8, 836 | nus-Latn,5/5, 837 | nuy-Latn,1/1, 838 | nvm-Latn,20/20, 839 | nwi-Latn,, 840 | nya-Latn,, 841 | nyk-Latn,, 842 | nyn-Latn,, 843 | nyu-Latn,, 844 | nyy-Latn,15/15, 845 | nzi-Latn,2/2, 846 | nzm-Latn,3/3, 847 | obo-Latn,3/3, 848 | oci-Latn,, 849 | ojb-Cans,1/1, 850 | ojb-Latn,, 851 | okv-Latn,17/17, 852 | 
olo-Latn,, 853 | omb-Latn,, 854 | omw-Latn,7/7, 855 | ong-Latn,, 856 | opm-Latn,, 857 | orv-Cyrl,, 858 | ory-Latn,, 859 | ory-Orya,, 860 | oss-Cyrl,, 861 | ota-Arab,, 862 | otd-Latn,1/1, 863 | ote-Latn,6/6, 864 | ots-Latn,4/4, 865 | otw-Latn,, 866 | pad-Latn,5/5, 867 | pag-Latn,11/14,out-of model cousin: ibl from bible 868 | pam-Latn,, 869 | pan-Guru,, 870 | pan-Latn,, 871 | pao-Latn,20/20, 872 | pap-Latn,, 873 | pau-Latn,, 874 | pbb-Latn,1/1, 875 | pbt-Arab,, 876 | pcd-Latn,, 877 | pck-Latn,, 878 | pcm-Latn,, 879 | pdc-Latn,, 880 | pdt-Latn,, 881 | pfl-Latn,, 882 | pib-Latn,6/6, 883 | pis-Latn,, 884 | pjt-Latn,, 885 | plg-Latn,1/1, 886 | pls-Latn,1/1, 887 | plt-Latn,, 888 | plu-Latn,1/1, 889 | plw-Latn,, 890 | pma-Latn,, 891 | pmf-Latn,, 892 | pms-Latn,, 893 | pmx-Latn,5/5, 894 | pnb-Arab,, 895 | pnt-Grek,, 896 | poe-Latn,1/1, 897 | poh-Latn,2/2, 898 | poi-Latn,20/20, 899 | pol-Latn,, 900 | pon-Latn,, 901 | por-Latn,, 902 | pot-Latn,1/1, 903 | ppk-Latn,17/17, 904 | ppo-Latn,7/7, 905 | pps-Latn,1/1, 906 | prg-Latn,, 907 | ptp-Latn,1/1, 908 | ptu-Latn,3/3, 909 | pui-Latn,, 910 | pwg-Latn,5/5, 911 | pwn-Latn,, 912 | qub-Latn,, 913 | quc-Latn,, 914 | quf-Latn,1/1, 915 | qug-Latn,, 916 | quh-Latn,, 917 | qul-Latn,3/3, 918 | qup-Latn,, 919 | quw-Latn,, 920 | quy-Latn,, 921 | quz-Latn,, 922 | qvc-Latn,, 923 | qvh-Latn,, 924 | qvi-Latn,, 925 | qvn-Latn,3/3, 926 | qvs-Latn,1/1, 927 | qvw-Latn,4/4, 928 | qxl-Latn,18/18, 929 | qxn-Latn,3/5,"Remove https://ja.wiktionary.org/wiki/proteger, http://wordsendingin.com/ag.html" 930 | qxo-Latn,, 931 | qxr-Latn,0/1,All Noise 932 | rad-Latn,, 933 | rap-Latn,, 934 | rar-Latn,, 935 | raw-Latn,, 936 | rcf-Latn,, 937 | rej-Latn,, 938 | rgu-Latn,1/1, 939 | rhg-Latn,, 940 | ria-Latn,2/2, 941 | rjs-Deva,, 942 | rkb-Latn,, 943 | rmc-Latn,, 944 | rml-Latn,, 945 | rmn-Cyrl,10/10, 946 | rmn-Latn,, 947 | rmo-Latn,1/1, 948 | rmq-Latn,7/7,all rows have same value 949 | rmy-Cyrl,, 950 | rmy-Latn,, 951 | rnl-Latn,1/1, 952 | roh-Latn,, 953 | 
ron-Cyrl,, 954 | ron-Latn,, 955 | roo-Latn,, 956 | rop-Latn,, 957 | row-Latn,, 958 | rro-Latn,, 959 | rtm-Latn,4/4, 960 | rue-Cyrl,, 961 | ruf-Latn,1/1, 962 | rug-Latn,, 963 | run-Latn,, 964 | rup-Latn,, 965 | rus-Cyrl,20/20, 966 | rwo-Latn,20/20, 967 | sab-Latn,3/3, 968 | sag-Latn,20/20, 969 | sah-Cyrl,, 970 | san-Deva,, 971 | san-Latn,, 972 | sas-Latn,2/2, 973 | sat-Latn,4/4, 974 | sat-Olck,, 975 | sbd-Latn,, 976 | sbe-Latn,20/20, 977 | scn-Latn,20/20, 978 | sco-Latn,20/20, 979 | sda-Latn,9/9, 980 | sdc-Latn,4/4,nice website: http://sardegnachiamasardegna.eu 981 | sdh-Arab,, 982 | seh-Latn,25/25, 983 | ses-Latn,1/1, 984 | sgb-Latn,0/1,out of model cousin: abp 985 | sgc-Latn,3/3, 986 | sgh-Cyrl,3/3, 987 | sgs-Latn,20/20, 988 | sgw-Ethi,, 989 | shi-Latn,, 990 | shk-Latn,, 991 | shn-Mymr,20/20, 992 | shp-Latn,7/7, 993 | shu-Arab,2/2, 994 | sid-Latn,, 995 | sil-Latn,10/10, 996 | sim-Latn,2/2, 997 | sin-Sinh,, 998 | sju-Latn,0/1,noise yji... 999 | skg-Latn,, 1000 | skr-Arab,16/16, 1001 | slk-Latn,, 1002 | sll-Latn,1/1, 1003 | slv-Latn,, 1004 | sma-Latn,, 1005 | sme-Latn,, 1006 | smj-Latn,20/20, 1007 | smk-Latn,19/19, 1008 | sml-Latn,11/11, 1009 | smn-Latn,, 1010 | smo-Latn,, 1011 | sms-Latn,, 1012 | sna-Latn,, 1013 | snc-Latn,17/17, 1014 | snd-Arab,, 1015 | snd-Deva,, 1016 | snd-Latn,, 1017 | snf-Latn,4/4, 1018 | snn-Latn,30/30, 1019 | snp-Latn,20/20, 1020 | snw-Latn,3/3, 1021 | sny-Latn,20/20, 1022 | som-Latn,, 1023 | sop-Latn,2/2, 1024 | soq-Latn,20/20, 1025 | sot-Latn,, 1026 | spa-Latn,, 1027 | spl-Latn,, 1028 | spm-Latn,12/12, 1029 | spp-Latn,4/4, 1030 | sps-Latn,2/2, 1031 | srd-Latn,, 1032 | srm-Latn,2/2, 1033 | srn-Latn,20/20, 1034 | srp-Cyrl,, 1035 | srp-Latn,, 1036 | srr-Latn,1/1, 1037 | ssd-Latn,20/20, 1038 | ssg-Latn,1/1, 1039 | ssw-Latn,, 1040 | ssx-Latn,2/2, 1041 | stq-Latn,20/20, 1042 | suc-Latn,10/10, 1043 | sue-Latn,18/18, 1044 | suk-Latn,20/20, 1045 | sun-Latn,20/20, 1046 | sus-Arab,, 1047 | sus-Latn,20/20, 1048 | suz-Deva,3/3, 1049 | 
swb-Latn,17/18,remove https://www.eol.org/search?q=Hallodapus 1050 | swc-Latn,3/3, 1051 | swe-Latn,20/20, 1052 | swg-Latn,, 1053 | swh-Latn,, 1054 | swp-Latn,3/3, 1055 | sxb-Latn,4/4, 1056 | sxn-Latn,4/4, 1057 | syb-Latn,13/13, 1058 | syc-Syrc,, 1059 | syl-Beng,6/6, 1060 | syl-Latn,1/5,just keep https://ebible.org/study/content/texts/syll/J31.html 1061 | szl-Latn,, 1062 | szy-Latn,, 1063 | tab-Cyrl,, 1064 | tah-Latn,, 1065 | taj-Deva,3/3, 1066 | tam-Latn,, 1067 | tam-Taml,, 1068 | tap-Latn,1/1, 1069 | taq-Latn,4/5,noise from programming language 1070 | taq-Tfng,, 1071 | tar-Latn,1/1,"a language in Parral, Mexico -- Tarahumara is a good fit." 1072 | tat-Cyrl,, 1073 | tat-Latn,, 1074 | tav-Latn,0/1,https://rhymebrain.com list case 1075 | taw-Latn,2/2, 1076 | tay-Latn,14/14, 1077 | tbg-Latn,9/9, 1078 | tbo-Latn,15/15, 1079 | tbw-Latn,1/1, 1080 | tby-Latn,3/3, 1081 | tbz-Latn,1/1, 1082 | tca-Latn,7/7, 1083 | tcs-Latn,2/2, 1084 | tcy-Knda,, 1085 | tcz-Latn,, 1086 | tdt-Latn,20/20,maybe the name should be tet? the difference between tdt and tet is small. 
1087 | tdx-Latn,1/2,only keep the bible -- Ty ren repeated word 1088 | tel-Latn,, 1089 | tel-Telu,, 1090 | teo-Latn,, 1091 | tet-Latn,, 1092 | tfr-Latn,2/2, 1093 | tgk-Cyrl,20/20, 1094 | tgo-Latn,, 1095 | tgp-Latn,20/20, 1096 | tha-Thai,, 1097 | thl-Deva,, 1098 | tif-Latn,1/1, 1099 | tig-Ethi,, 1100 | tih-Latn,2/2, 1101 | tir-Ethi,, 1102 | tiv-Latn,2/4,just keep omniglot 1103 | tke-Latn,7/7, 1104 | tkl-Latn,, 1105 | tkr-Cyrl,, 1106 | tku-Latn,15/15, 1107 | tlb-Latn,2/2, 1108 | tlf-Latn,2/2, 1109 | tlh-Latn,, 1110 | tll-Latn,3/13,delete http://anagramy.net 1111 | tly-Latn,, 1112 | tmd-Latn,7/7, 1113 | tnc-Latn,1/1, 1114 | tnk-Latn,20/20, 1115 | tnn-Latn,, 1116 | tnp-Latn,26/26, 1117 | tnr-Latn,1/1, 1118 | tob-Latn,, 1119 | tod-Latn,, 1120 | tog-Latn,, 1121 | toi-Latn,, 1122 | toj-Latn,1/1, 1123 | tok-Latn,, 1124 | ton-Latn,, 1125 | too-Latn,2/2, 1126 | top-Latn,5/5, 1127 | tos-Latn,1/1, 1128 | tpi-Latn,20/20, 1129 | trc-Latn,1/1, 1130 | trn-Latn,1/1, 1131 | trp-Latn,, 1132 | trq-Latn,9/9, 1133 | trv-Latn,3/3, 1134 | tsg-Latn,2/2, 1135 | tsn-Latn,, 1136 | tso-Latn,, 1137 | tsz-Latn,, 1138 | ttc-Latn,6/6, 1139 | tte-Latn,6/6, 1140 | ttq-Latn,5/5, 1141 | tuc-Latn,20/20, 1142 | tuf-Latn,5/5, 1143 | tui-Latn,1/1, 1144 | tuk-Arab,1/1, 1145 | tuk-Cyrl,1/2,remove https://vsememy.ru 1146 | tuk-Latn,, 1147 | tum-Latn,20/20, 1148 | tur-Latn,20/20, 1149 | tvk-Latn,20/20, 1150 | tvl-Latn,13/13, 1151 | twb-Latn,0/10,tag word as noise 1152 | twi-Latn,, 1153 | twu-Latn,3/3, 1154 | txu-Latn,1/1, 1155 | tyv-Cyrl,, 1156 | tzh-Latn,2/2, 1157 | tzj-Latn,14/14, 1158 | tzm-Tfng,, 1159 | tzo-Latn,20/20, 1160 | ubr-Latn,20/20, 1161 | ubu-Latn,20/20, 1162 | udm-Cyrl,, 1163 | uig-Arab,, 1164 | uig-Cyrl,, 1165 | uig-Latn,, 1166 | ukr-Cyrl,20/20, 1167 | umb-Latn,15/15, 1168 | und-Dsrt,3/3, 1169 | und-Gran,3/3, 1170 | und-Hung,20/20,from mki.gov.hu -- change the name to hun-Hung 1171 | und-Mong,20/20,change to mvf-Mong 1172 | und-Newa,, 1173 | und-Shaw,, 1174 | und-Sylo,20/20,change name to 
syl-Sylo 1175 | upv-Latn,, 1176 | urd-Arab,20/20, 1177 | urd-Latn,16/20, 1178 | urh-Latn,, 1179 | usa-Latn,, 1180 | usp-Latn,1/1, 1181 | uvh-Latn,13/13, 1182 | uvl-Latn,5/5, 1183 | uzn-Cyrl,20/20, 1184 | uzn-Latn,20/20, 1185 | uzs-Arab,17/20,remove https://sudeasaran.ir https://turkmensnews.com/ https://elioshop.co https://bbrouzi.com https://irancooling.com 1186 | vap-Latn,1/1, 1187 | vec-Latn,, 1188 | ven-Latn,, 1189 | vep-Latn,, 1190 | vid-Latn,1/1, 1191 | vie-Latn,, 1192 | vls-Latn,, 1193 | vmw-Latn,, 1194 | vmy-Latn,, 1195 | vol-Latn,, 1196 | vro-Latn,, 1197 | vun-Latn,1/1, 1198 | waj-Latn,1/1, 1199 | wal-Latn,8/9,check https://www.suomisanakirja.fi/riimit/suututtaa 1200 | war-Latn,, 1201 | wat-Latn,2/2, 1202 | way-Latn,2/2, 1203 | wbm-Latn,2/2, 1204 | wbp-Latn,4/4, 1205 | wed-Latn,1/1, 1206 | wes-Latn,, 1207 | wln-Latn,, 1208 | wls-Latn,17/17, 1209 | wlv-Latn,2/2, 1210 | wlx-Latn,, 1211 | wmw-Latn,14/14, 1212 | wnc-Latn,2/2, 1213 | wnu-Latn,, 1214 | wol-Latn,, 1215 | wos-Latn,8/8, 1216 | wrs-Latn,20/20, 1217 | wsg-Telu,2/2, 1218 | wsk-Latn,20/20, 1219 | wuu-Hani,, 1220 | wuv-Latn,9/9, 1221 | xal-Cyrl,, 1222 | xbi-Latn,, 1223 | xho-Latn,, 1224 | xla-Latn,1/1, 1225 | xmf-Geor,20/20, 1226 | xmm-Latn,, 1227 | xmv-Latn,, 1228 | xog-Latn,, 1229 | xon-Latn,1/1, 1230 | xsm-Latn,, 1231 | xsr-Deva,1/1, 1232 | xtd-Latn,1/1, 1233 | xtm-Latn,2/2, 1234 | xtn-Latn,6/6, 1235 | yaa-Latn,1/1, 1236 | yal-Latn,, 1237 | yao-Latn,, 1238 | yap-Latn,, 1239 | yby-Latn,1/1, 1240 | ydd-Hebr,20/20, 1241 | yka-Latn,, 1242 | yle-Latn,, 1243 | yli-Latn,4/4, 1244 | yml-Latn,, 1245 | yom-Latn,, 1246 | yon-Latn,1/1, 1247 | yor-Latn,, 1248 | yrk-Cyrl,, 1249 | yrl-Latn,4/4, 1250 | yss-Latn,20/20, 1251 | yua-Latn,, 1252 | yue-Hani,, 1253 | yuj-Latn,, 1254 | yut-Latn,16/16, 1255 | yuw-Latn,2/2, 1256 | yva-Latn,3/3, 1257 | zac-Latn,18/18, 1258 | zai-Latn,6/6, 1259 | zam-Latn,, 1260 | zao-Latn,14/14, 1261 | zas-Latn,19/19, 1262 | zat-Latn,20/20, 1263 | zdj-Latn,, 1264 | zea-Latn,, 1265 | zgh-Tfng,, 
1266 | zia-Latn,11/11, 1267 | zom-Latn,, 1268 | zos-Latn,1/1, 1269 | zpm-Latn,4/5,it seems noise: http://www.bangmywifeplease.com/zi_z/ 1270 | zpo-Latn,8/8, 1271 | zpt-Latn,13/13, 1272 | zpu-Latn,3/3, 1273 | zsm-Latn,, 1274 | zul-Latn,, 1275 | zyb-Latn,20/20, 1276 | zyp-Latn,4/5,out of model cousin from mrh: https://newchristianbiblestudy.org/bible/mara/luke/15/ 1277 | entries,657, -------------------------------------------------------------------------------- /audit/annotated_glotlid_vs_nllb_top20.txt: -------------------------------------------------------------------------------- 1 | bar-Latn_meta MISS sha1:YUWAWMJA7WJXSQ5EJCZOT5RDZV2Z3SQN https://bar.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser 2 | eval glotlid 1 call nllb 0 miss 3 | bavarian wikipedia 4 | 5 | wes-Latn_meta MISS sha1:AXNHIE6BZBDIFQIYSU6KDV4S577IMPAE https://www.vitalpina.info/en/our-hotels/request-vitalpina-hotels-south-tyrol-which-one-suits-you/54-810242.html 6 | eval glotlid 0 call nllb 1 miss 7 | random english words (or web form content) 8 | 9 | ven-Latn_meta tso-Latn_meta sha1:WDUXCJ2F23FZMCNK2CEVD7QKDNQ7SHDH https://globalrecordings.net/id/script/ven/423 10 | eval glotlid 1 call nllb 0 call 11 | the URL says "ven" (but tso and ven are both southern bantu) 12 | 13 | otw-Latn_meta MISS sha1:IBZCS5IA5WKQX7C46CTXY2S5FXOZZYQO https://www.anishinabek.ca/indian-residential-schools-translated/ 14 | eval glotlid 1 call nllb 0 miss 15 | found website that claims this is otw: "Eve gii-bzindwaan gnebgoon miinwaa gii-miijin e-miinwang." 
16 | 17 | fit-Latn_meta fin-Latn_meta sha1:C2UBGRYADGIOOSRQGLF34XVMAMUDVN2H http://liipetti.net/aviisi/2016/07/ 18 | eval glotlid -1 call nllb -1 call 19 | mixed: contains some swedish 20 | 21 | kha-Latn_meta MISS sha1:GSAWMTADCFP2BJJRMPHLCZVVYQREN66G https://wyrta.com/hei-sien-nyngkong-da-tip-yei-tynre-u-thomas-jones-deiwa-thoh-dei-ktien-khasi/ 22 | eval glotlid 1 call nllb 0 miss 23 | Khasi Author Society 24 | 25 | szl-Latn_meta MISS sha1:AAIBZBGDXC2ZN3JXZ75N6XKFLERKLHEI https://szl.wikipedia.org/wiki/Istanbu%C5%82 26 | eval glotlid 1 call nllb 0 miss 27 | szl wikipedia 28 | 29 | pis-Latn_meta MISS sha1:LDGLLOIPZHR7VBQFKXZX56BTAZXTSNA2 https://www.globalrecordings.net/es/script/pis/425 30 | eval glotlid 1 call nllb 0 miss 31 | pis in url, recognizable as similar to tok pisin 32 | 33 | goh-Latn_meta deu-Latn_meta sha1:5RTPPOXD6DZ7BPBLXVRMRYEHIRXEJCQN https://libcoll.mpiwg-berlin.mpg.de/libview?tocMode=none&start=161&viewMode=text&mode=imagepath&url=/mpiwg/online/permanent/library/YD9NH338/pageimg&pn=112 34 | eval glotlid 0 call nllb 0 call 35 | (middle high german) 1595 Besson, Jacques: Theatrum oder Schawbuch allerley Werckzeug und Rüstungen 36 | 37 | sms-Latn_meta MISS sha1:RXMGTQ7G2GT6WO7RD34ANPPN5KOXIZ4H https://samediggi.fi/nuo/tuajjlazkadd/joonas-saari-4/ 38 | eval unclear 39 | pictures suggest sami; google translate recognizes estonian for some sents, finnish for others on the same website, but it's a finnish url 40 | more comment: glotlid correctly predicted that the page is in "Nuõrttsääʹmǩiõll" 41 | 42 | MISS tzm-Tfng_meta sha1:ZY5JWP4L3IMKJKZIHSDFGH752WVQGKBE 
https://www.mapnews.ma/am/actualites/%E2%B4%B0%E2%B5%8E%E2%B4%B0%E2%B5%9C%E2%B4%B0%E2%B5%A2/%E2%B4%B0%E2%B5%96%E2%B4%B0%E2%B5%A1%E2%B4%B0%E2%B5%99-%E2%B5%8F-%E2%B5%89%E2%B4%B3%E2%B5%A3%E2%B5%A3%E2%B5%93%E2%B5%8E%E2%B5%8F-%E2%B5%93%E2%B5%8E%E2%B5%99%E2%B5%8F%E2%B5%8F%E2%B5%8A%E2%B5%89-%E2%B5%89%E2%B5%8E%E2%B5%93%E2%B5%9C%E2%B5%9C%E2%B5%89%E2%B5%8F-%E2%B5%9C%E2%B4%B0%E2%B5%8E%E2%B4%B0%E2%B5%94%E2%B5%99%E2%B4%B0%E2%B5%8D%E2%B5%9C-%E2%B5%8E%E2%B5%93%E2%B5%83%E2%B5%8E%E2%B5%8E%E2%B4%B7-%E2%B5%A1%E2%B5%89%E2%B5%99%E2%B5%99-%E2%B5%99%E2%B5%8E%E2%B5%8E%E2%B5%93%E2%B5%99-%E2%B4%B0%E2%B5%99%E2%B5%8F%E2%B5%8E%E2%B4%B0%E2%B5%8D%E2%B4%B0-%E2%B5%8F-%E2%B5%93%E2%B5%8E%E2%B5%A3%E2%B5%A3%E2%B5%89%E2%B5%A3%E2%B5%8D-%E2%B5%8F 43 | eval glotlid 0 call nllb 1 call 44 | not clear which tamazight language? 45 | 46 | szl-Latn_meta MISS sha1:GKINNOLNLF6ESMPHQYK52LIDKMK23ODK https://szl.wikipedia.org/wiki/Tortefontaine 47 | eval glotlid 1 call nllb 0 miss 48 | 49 | ban-Latn_meta MISS sha1:YOPWIOWFBDFVORIOQRYS4W6F5N4CZMZY https://dictionary.basabali.org/Baudanda 50 | eval glotlid 1 call nllb 0 miss 51 | basa bali = balinese = ban 52 | 53 | cbk-Latn_meta spa-Latn_meta sha1:FMCA2HW2MODLGYG5NJ2LRKUXNLNWLQXR https://rpnradio.com/zamboanga-2-hvts-ya-aresta-p1-million-shabu-ya-confisca/ 54 | eval glotlid 0 call nllb 1 call 55 | cbk is spanish-based creole in philippines, but this looks very much like spanish 56 | 57 | MISS bho-Deva_meta sha1:LGSOZB6JBSL2ZC7H5226XGGVHQLAFEA4 https://bh.wikipedia.org/wiki/%E0%A4%B5%E0%A4%BE%E0%A4%B0%E0%A5%8D%E0%A4%A4%E0%A4%BE%E0%A4%B2%E0%A4%BE%E0%A4%AA:%E0%A4%85%E0%A4%A8%E0%A5%81%E0%A4%B7%E0%A5%8D%E0%A4%95%E0%A4%BE_%E0%A4%B6%E0%A4%B0%E0%A5%8D%E0%A4%AE%E0%A4%BE 58 | eval glotlid 0 miss nllb 1 call 59 | based on bho = bh.wikipedia.org 60 | 61 | zea-Latn_meta nld-Latn_meta sha1:JYOX4XDFNAOR6VYGRZ5KKRDYDPMKINBA https://zea.wikipedia.org/wiki/Incourt 62 | eval glotlid 1 call nllb 0 call 63 | wikipedia url 64 | 65 | tat-Latn_meta crh-Latn_meta 
sha1:ZPPXXVPC4QOB4IGOPKHHOPE6LLQ5YVWF http://selet.biz/lt/smena5/ 66 | eval glotlid 1 call nllb 0 call 67 | the website is about kazan in Tatarstan 68 | 69 | dzo-Tibt_meta MISS sha1:CZKIFI3SNDRX5LVK2S7AZ7LOAL6BS6X2 http://gasa.gov.bt/index.php/dz/gewogs/lung-ng-n-rged-og-gi-skor 70 | eval glotlid 1 call nllb 0 miss 71 | website is a bhutan website (available in dzo and english) 72 | 73 | kha-Latn_meta MISS sha1:KZMML6QUMI7QMEXOO4HYBKXUN3GLQFRO https://wyrta.com/ujor-ka-ejnc-wym-biang-i-wan-saam-bai-bordin/ 74 | eval glotlid 1 call nllb 0 miss 75 | based on categorization of page from same site above 76 | 77 | MISS abk-Cyrl_meta sha1:R3GT6CMQCXMJLIXWWTUBOT5CNMU2EXG6 https://www.abaza.org/abk/eventsfeed?d=2024-02-23 78 | eval glotlid 0 miss nllb 1 call 79 | based on URL 80 | -------------------------------------------------------------------------------- /filters/filter-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "pycharm": { 8 | "name": "#%%\n" 9 | } 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "import re\n", 14 | "import json\n", 15 | "from typing import List, Tuple\n", 16 | "from collections import Counter\n", 17 | "from urllib.parse import urlparse\n", 18 | "from GlotScript import sp\n", 19 | "from tqdm import tqdm \n", 20 | "import time\n", 21 | "import string\n", 22 | "\n", 23 | "\n", 24 | "class Filters:\n", 25 | " def __init__(self):\n", 26 | " self.filters = []\n", 27 | "\n", 28 | " def add_filter(self, filter_func, warning):\n", 29 | " self.filters.append((filter_func, warning))\n", 30 | "\n", 31 | " def apply_filters(self, sentence, quality_warnings):\n", 32 | " for filter_func, warning in self.filters:\n", 33 | " if filter_func(sentence):\n", 34 | " quality_warnings.append(warning)\n", 35 | " return quality_warnings" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | 
"metadata": { 42 | "pycharm": { 43 | "name": "#%%\n" 44 | } 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "CURSED_SUBSTRINGS = [\" №\", \"���\", \"\\\\|\\\\s*$\", \" nr\\\\.$\", \"aute irure dolor \", \" sunt in culpa qui \", \"orem ipsum \", \" quis nostrud \", \" adipisicing \", \" dolore eu \", \" cupidatat \", \"autem vel eum\", \"wisi enim ad\", \" sex \", \" porn \", \"黄色电影\", \"mp3\", \"ownload\", \"Vol\\\\.\", \" Ep\\\\.\", \"Episode\", \" г\\\\.\\\\s*$\", \" кг\\\\.\\\\s*$\", \" шт\\\\.\", \"Develop\", \"Facebook\", \" crusher \", \" xxx \", \" ... ... ... ... ... ... ... ... ...\", \" .... .... .... .... .... .... .... .... ....\", \" [^ ] [^ ] [^ ] [^ ] [^ ] [^ ] [^ ] [^ ] [^ ]\", \", ..,,? ..,,? ..,,? ..,,?\"]\n", 49 | "ADULT_SIGNALS = \"caoporn caoprom caopron caoporen caoponrn caoponav caopom caoorn 99re dy888 caopro hezyo re99 4438x zooskool xfplay 7tav xxoo xoxo 52av freexx 91chinese anquye cao97 538porm 87fuli 91pron 91porn 26uuu 4438x 182tv kk4444 777me ae86 91av 720lu yy6080 6080yy qqchub paa97 aiai777 yy4480 videossexo 91free 一级特黄大片 偷拍久久国产视频 日本毛片免费视频观看 久久免费热在线精品 高清毛片在线看 日本毛片高清免费视频 一级黄色录像影片 亚洲男人天堂 久久精品视频在线看 自拍区偷拍亚洲视频 亚洲人成视频在线播放 色姑娘综合站 丁香五月啪啪 在线视频成人社区 亚洲人成视频在线播放 久久国产自偷拍 一本道 大香蕉无码 香港经典三级 亚洲成在人线免费视频 天天色综合网 大香蕉伊人久草 欧美一级高清片 天天鲁夜夜啪视频在线 免费黄片视频在线观看 加比勒久久综合 久草热久草在线视频 韩国三级片大全在线观看 青青草在线视频 美国一级毛片 久草在线福利资源 啪啪啪视频在线观看免费 成人福利视频在线观看 婷婷我去也 老司机在线国产 久久成人视频 手机看片福利永久国产 高清国产偷拍在线 大香蕉在线影院 日本高清免费一本视频 男人的天堂东京热 影音先锋男人资源 五月婷婷开心中文字幕 亚洲香蕉视频在线播放 天天啪久久爱视频精品 超碰久久人人摸人人搞\".split()\n", 50 | "EXTRA_ADULT_SGINALS = \"日本一级特黄大片 qq的天天中彩票 一本道 一级特黄大片 三级片 下三烂 个老子的 中国福利彩票天天 久久免费热在线精品 久久国产视频 久久精品国产 久久综合久久 乳交 乳波臀浪 五分彩 亚洲男人天堂 亚洲精品 人人摸 人人摸人人 人人操 人人澡 人人爽人人 人人碰 人人碰人人 人人碰免费公开视频 人人碰免费视频 仆街 他奶娘的 他妈 他妈ㄉ王八蛋 他妈地 他妈的 他马的 伦理在线黄影片 伦理在线黄电影 伦理在线黄网影片 伦理在线黄网电影 伦理在线黄网视频 伦理在线黄视频 伦理影片 伦理影片观看 伦理片 伦理片免费 伦理片观看 伦理电影 伦理电影片观看 伦理电影观看 伦理电影观看平台 伦理电影观看网 伦理电影观看网址 伦理电视 伦理视频 伦理视频观看 伦理黄影片 伦理黄影片观看 伦理黄片 伦理黄片网站 伦理黄电影 伦理黄电影片 伦理黄电影观看 伦理黄电视 伦理黄电视片 伦理黄网影片 伦理黄网片 伦理黄网电影 伦理黄网视频 伦理黄网视频观看 伦理黄视频 伦理黄视频片 伦理黄视频网 伦理黄视频观看 伦理黄视频频道 伦理黄频道 你个傻比 
你他马的 你全家 你奶奶的 你她马的 你妈的 你娘卡好 你娘咧 你它妈的 你它马的 你是鸡 你是鸭 你马的 做爱 傻比 傻逼 免费人成视频 免费伦理 免费伦理影片 免费伦理视频 免费在线伦理黄影片 免费在线伦理黄电影 免费在线伦理黄网影片 免费在线伦理黄网电影 免费在线伦理黄网视频 免费在线伦理黄视频 免费在线成人视频 免费在线成人黄影片 免费在线成人黄电影 免费在线成人黄网影片 免费在线成人黄网电影 免费在线成人黄网视频 免费在线成人黄视频 免费在线黄影片 免费在线黄电影 免费在线黄网影片 免费在线黄网电影 免费在线黄网视频 免费在线黄视频 免费成人电影 免费成人视频 免费成人视频观看 免费无码 免费色情影片 免费色情电影 免费色情视频 免费视频在线观看 六合彩 册那 军妓 分分彩 北京赛车开奖 午夜电影 午夜福利 卖B 卖比 卖淫 博彩 口交 口肯 吃屎 吹箫 啪啪啪视频 国产av 国产精品 国产自拍 国产自拍片 国产自拍视频 在线av 在线av在线 在线av直播 在线av网址 在线av视频 在线伦理黄影片 在线伦理黄片 在线伦理黄电影 在线伦理黄网影片 在线伦理黄网电影 在线伦理黄网视频 在线伦理黄视频 在线国产 在线大香蕉 在线成人黄影片 在线成人黄片 在线成人黄电影 在线成人黄网影片 在线成人黄网电影 在线成人黄网视频 在线成人黄视频 在线观看中文字幕 在线观看免费 在线黄影片 在线黄片 在线黄电影 在线黄网影片 在线黄网电影 在线黄网视频 在线黄视频 塞你公 塞你娘 塞你母 塞你爸 塞你老师 塞你老母 夜夜啪视频在线观看 大卵子 大卵泡 大发云 大发官网 大发彩票 大发彩票官网 大发快三 大发快三和值 大发快三大小单双 大发快三如何 大发快三官网 大发快三开奖 大发快三开奖结果 大发快三怎么 大发快三怎么看 大发快三是不是 大发快三是什么 大发快三是国家 大发快三计划 大发快三走势图 大发扑克 大发时时彩 大发时时彩开奖 大发时时彩是 大发时时彩计划 大发棋牌 大发游戏 大发游戏官网 大香蕉 大香蕉伊人 大鸡巴 天天中彩票 天天中彩票app 天天中彩票微信 天天中彩票怎么 天天中彩票是 天天中彩票的 天天啪 天天啪在线视频 天天彩票 天天彩票网 天天爱彩票 天天赢彩票 夫妻性生活 奸你 她妈地 她妈的 她马的 妈B 妈个B 妈个比 妈个老比 妈妈的 妈比 妈的 妈的B 妈逼 妓 妓女 妳她妈的 妳妈的 妳娘的 妳老母的 妳马的 娱乐平台开户 娱乐彩票 开奖结果 彩神争霸 彩神争霸大发快三 彩神争霸怎么 彩神争霸是 彩神争霸网站 彩神争霸邀请码 彩票 彩票天天 彩票娱乐 彩票娱乐注册 彩票平台 彩经彩票 微信的天天中彩票 性交 性感写真 性感写真视频 性感美女 性感视频 性爱 性爱视频 情色图片 情色电影 情色网站 情色视频 情色论坛 成人下载 成人书刊 成人动漫 成人动漫网 成人图片 成人在线影片 成人在线电影 成人在线网 成人在线视频 成人在线黄影片 成人在线黄电影 成人在线黄网影片 成人在线黄网电影 成人在线黄网视频 成人在线黄视频 成人小说 成人影城 成人影片 成人影片观看 成人影视 成人文学 成人文学网 成人漫画 成人片 成人电影 成人电影下载 成人电影免费 成人电影在线 成人电影播放 成人电影片 成人电影片观看 成人电影观看 成人电影观看免费 成人电影观看平台 成人电影观看网 成人电影观看网址 成人电视 成人秀 成人网站 成人自拍 成人视频 成人视频免费网站 成人视频啪啪啪 成人视频在线 成人视频在线观看 成人视频平台 成人视频播放 成人视频片 成人视频秀 成人视频网 成人视频自拍 成人视频观看 成人视频观看免费 成人视频观看平台 成人视频观看网站 成人视频频道 成人论坛 成人频道 成人黄影片 成人黄影片观看 成人黄片 成人黄片网站 成人黄电影 成人黄电影片 成人黄电影观看 成人黄电视 成人黄电视片 成人黄网影片 成人黄网片 成人黄网电影 成人黄网视频 成人黄网视频观看 成人黄视频 成人黄视频片 成人黄视频网 成人黄视频观看 成人黄视频频道 成人黄频道 无码 无码av 无码一区二区三区 无码不卡高清免费 无码不卡高清免费v 无码中文字幕 无码国产自拍 日本一本道 日本毛片免费视频观看 日韩av 时时彩 时时彩计划 淫秽 激情小说 激情裸聊 激情裸舞 激情视频 热久久精品 热在线精品 热这里只有精品 男人的天堂 真人性生活直播 真人性生活视频 真人性直播 真人性行为直播 真人性行为视频 真人性视频 真人秀 真人秀视频 真人色情 真人色情视频 真人裸聊 真人裸舞 真人视频 真人视频秀 神彩争霸 福利彩票 福利视频 精品一区二区三区 精品国产 美女裸聊 美女裸舞 自拍视频 色情 色情图片 色情影片 色情电影 
色情电影观看 色情电视 色情网 色情网站 色情视频 色情视频在线 色情视频在线观看 色情视频网 色情视频频道 色情频道 色电影 色网 色网站 色视频 色视频在线 色视频网站 赌 赌债 赌博 赌城 赌局 赌徒 赌桌 赌注 赌王 赌瘾 赌盘 赌赛 赌钱 赌鬼 重庆时时彩 重庆时时彩杀 露点 高清无码 黄影片 黄片 黄片网站 黄电影 黄电影片 黄电视 黄电视片 黄网 黄网影片 黄网片 黄网电影 黄网视频 黄色 黄色录像 黄色录像影片 黄色录像片 黄色录像片影片 黄色录像片电影 黄色录像片网影片 黄色录像片网电影 黄色录像片网视频 黄色录像片视频 黄色录像电影 黄色录像网影片 黄色录像网电影 黄色录像网视频 黄色录像视频 黄视频 黄视频片 黄视频网\".split()\n", 51 | "\n", 52 | "\n", 53 | "\n", 54 | "\n", 55 | "POLICY_SUBSTRINGS = [\n", 56 | " \"terms of use\",\n", 57 | " \"privacy policy\",\n", 58 | " \"cookie policy\",\n", 59 | " \"uses cookies\",\n", 60 | " \"use of cookies\",\n", 61 | " \"use cookies\",\n", 62 | "]\n", 63 | "\n", 64 | "\n", 65 | "def list_case_filter(sentence):\n", 66 | " tokens = sentence.split()\n", 67 | " capital_tokens = [token for token in tokens if token[0].isupper() or all(char.isdigit() or char in string.punctuation for char in token)]\n", 68 | " warning = len(tokens) >= 12 and (len(capital_tokens) / len(tokens)) > 0.5\n", 69 | " return warning\n", 70 | "\n", 71 | "\n", 72 | "def danger_chars_filter(sentence):\n", 73 | " danger_chars_count = sum(1 for char in sentence if char in '0123456789{}+/()>')\n", 74 | " warning = (danger_chars_count / len(sentence)) > 0.2\n", 75 | " return warning\n", 76 | "\n", 77 | "\n", 78 | "def cursedness_filter(sentence):\n", 79 | " warning = any(curse in sentence for curse in CURSED_SUBSTRINGS)\n", 80 | " return warning\n", 81 | "\n", 82 | "def adult_filter(sentence):\n", 83 | " warning = any(curse in sentence for curse in ADULT_SIGNALS + EXTRA_ADULT_SGINALS)\n", 84 | " return warning\n", 85 | "\n", 86 | "def detect_long_words(sentence, max_chars=100):\n", 87 | " words = sentence.split()\n", 88 | " long_words = [word for word in words if len(word) > max_chars]\n", 89 | " return len(long_words) > 0\n", 90 | "\n", 91 | "def detect_js_warning(sentence):\n", 92 | "\n", 93 | " # Check if \"Javascript\" is present in the text\n", 94 | " if 'Javascript' in sentence or 'JavaScript' in sentence:\n", 95 | " return True\n", 96 | " \n", 97 | " 
return False\n", 98 | "\n", 99 | "\n", 100 | "def detect_lorem_ipsum(sentence):\n", 101 | " # Convert text to lowercase to make the search case-insensitive\n", 102 | " sentence = sentence.lower()\n", 103 | " \n", 104 | " # Check if the phrase \"lorem ipsum\" is present in the text\n", 105 | " if \"lorem ipsum\" in sentence:\n", 106 | " return True\n", 107 | " \n", 108 | " return False\n", 109 | "\n", 110 | "\n", 111 | "def detect_curly_bracket(sentence):\n", 112 | " # Check if the curly bracket '{' is present in the text\n", 113 | " if \"{\" in sentence:\n", 114 | " return True\n", 115 | " \n", 116 | " return False\n", 117 | "\n", 118 | "\n", 119 | "def detect_policy(text):\n", 120 | " text_lower = text.lower()\n", 121 | " for substring in POLICY_SUBSTRINGS:\n", 122 | " if substring in text_lower:\n", 123 | " return True\n", 124 | " return False\n" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": { 131 | "pycharm": { 132 | "name": "#%%\n" 133 | } 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "class GopherRepetitionFilter():\n", 138 | " \n", 139 | " \"\"\"Check if there is repeated content in the input text. Excessive\n", 140 | " repetition is often linked with uninformative content and can be used to\n", 141 | " determine whether it is low-quality text. This function implements\n", 142 | " \"Repetition Removal\" as described in Gopher_.\n", 143 | "\n", 144 | " .. 
_Gopher: https://arxiv.org/abs/2112.11446\n", 145 | " \"\"\"\n", 146 | " \n", 147 | " @staticmethod\n", 148 | " def get_n_grams(words: List[str], n: int) -> List[str]:\n", 149 | " return [\" \".join(words[i : i + n]) for i in range(len(words) - n + 1)]\n", 150 | "\n", 151 | " @staticmethod\n", 152 | " def find_top_duplicate(x: List[str]) -> int:\n", 153 | " counter = Counter()\n", 154 | " for element in x:\n", 155 | " counter[element] += 1\n", 156 | " top_n_gram = counter.most_common(1)[0]\n", 157 | " return len(top_n_gram[0]) * top_n_gram[1]\n", 158 | "\n", 159 | "\n", 160 | " def filter(self, text, top_n_grams):\n", 161 | " \n", 162 | " text = self.add_space_between_numbers(text)\n", 163 | " words = text.split(' ')\n", 164 | "\n", 165 | " for n, n_frac in top_n_grams:\n", 166 | " n_grams = self.get_n_grams(words, n)\n", 167 | " if not n_grams:\n", 168 | " continue\n", 169 | " top_char_length = self.find_top_duplicate(n_grams)\n", 170 | " if top_char_length / len(text) > n_frac:\n", 171 | " return True\n", 172 | "\n", 173 | " return False\n", 174 | "\n", 175 | " \n", 176 | " @staticmethod\n", 177 | " def add_space_between_numbers(numbers):\n", 178 | " result = ''\n", 179 | " i = 0\n", 180 | " while i < len(numbers):\n", 181 | " if numbers[i].isdigit():\n", 182 | " result += numbers[i]\n", 183 | " i += 1\n", 184 | " while i < len(numbers) and numbers[i].isdigit():\n", 185 | " result += numbers[i]\n", 186 | " i += 1\n", 187 | " if i < len(numbers) and not numbers[i].isdigit():\n", 188 | " result += ' '\n", 189 | " else:\n", 190 | " result += numbers[i]\n", 191 | " if i+1 < len(numbers) and numbers[i+1].isdigit():\n", 192 | " result += ' '\n", 193 | " i += 1\n", 194 | " return result\n", 195 | " \n", 196 | " def filter_para(self, text):\n", 197 | " \n", 198 | " return self.filter(text, ((2, 0.2), (3, 0.18), (4, 0.16)))\n", 199 | " \n", 200 | " def filter_sent(self, text):\n", 201 | " \n", 202 | " sents = [s for s in text.split('\\n') if len(s.split(' ')) > 20]\n", 
203 | " warning = any([self.filter(s, ((1, 0.5), (2, 0.3))) for s in sents])\n", 204 | " \n", 205 | " return warning\n", 206 | "\n", 207 | "gopher_repetition = GopherRepetitionFilter().filter_para\n", 208 | "gopher_repetition_sent = GopherRepetitionFilter().filter_sent" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": { 215 | "pycharm": { 216 | "name": "#%%\n" 217 | } 218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "filters = Filters()\n", 222 | "filters.add_filter(list_case_filter, \"list_case\")\n", 223 | "filters.add_filter(danger_chars_filter, \"danger_chars\")\n", 224 | "filters.add_filter(cursedness_filter, \"cursed_regex\")\n", 225 | "filters.add_filter(adult_filter, \"adult_signals\")\n", 226 | "filters.add_filter(detect_long_words, \"long_word\")\n", 227 | "filters.add_filter(detect_policy, \"detect_policy\")\n", 228 | "filters.add_filter(gopher_repetition, \"repetition\")\n", 229 | "filters.add_filter(gopher_repetition_sent, \"repetition_sent\")\n", 230 | "filters.add_filter(detect_js_warning, \"js_warning\")\n", 231 | "filters.add_filter(detect_lorem_ipsum, \"lorem_ipsum\")\n", 232 | "filters.add_filter(detect_curly_bracket, \"curly_bracket\")" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": { 239 | "pycharm": { 240 | "name": "#%%\n" 241 | } 242 | }, 243 | "outputs": [], 244 | "source": [ 245 | "class DomainLabeler:\n", 246 | " def __init__(self):\n", 247 | " self.domain_groups = {}\n", 248 | " \n", 249 | " def add_domain_group(self, label, domains):\n", 250 | " self.domain_groups[label] = domains\n", 251 | " \n", 252 | " def apply(self, uri, categories):\n", 253 | " domain = urlparse(uri).netloc\n", 254 | " for label, domains in self.domain_groups.items():\n", 255 | " if any(domain.endswith(d) for d in domains):\n", 256 | " categories.append(label)\n", 257 | " break\n", 258 | " return categories\n", 259 | "\n", 260 | "\n", 261 | "labeler = 
DomainLabeler()\n", 262 | "labeler.add_domain_group(\"religious\", [\"bible.com\", \"ebible.org\", \"png.bible\", \"jw.org\", \"wol.jw.org\", \"breakeveryyoke.com\", \"scriptureearth.org\", \"live.bible.is\", \"bible.is\", \"faithcomesbyhearing.com\", \"download.sabda.org\", \"sabda.org\", \"alkitab.mobi\", \"biblerevelation.org\", \"gospelgo.com\", \"mykitabsuci.org\", \"aboriginalbibles.org.au\", \"wikiislam.net\", \"stepbible.org\", \"e-alkitab.org\"])\n", 263 | "labeler.add_domain_group(\"wikipedia\", [\"wikipedia.org\", \"wikimedia.org\"])" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "pycharm": { 271 | "name": "#%%\n" 272 | } 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "import ipaddress\n", 277 | "import re\n", 278 | "from functools import partial\n", 279 | "\n", 280 | "\n", 281 | "class PIIReplacer:\n", 282 | " def __init__(\n", 283 | " self, regex: str, replacements, validator\n", 284 | " ):\n", 285 | " self.regex: re.Pattern = re.compile(regex)\n", 286 | " self.replacements = (\n", 287 | " replacements\n", 288 | " if type(replacements) is tuple\n", 289 | " else (tuple(replacements) if not isinstance(replacements, str) else (replacements,))\n", 290 | " )\n", 291 | " self.validator = validator # extra validation for a match\n", 292 | " self._replace_i = 0\n", 293 | "\n", 294 | " def replace(self, text: str):\n", 295 | " def get_replacement(matchobj):\n", 296 | " if self.validator and not self.validator(matchobj.group(0)):\n", 297 | " # not a valid match. 
replace with itself\n", 298 | " return matchobj.group(0)\n", 299 | " replacement = self.replacements[self._replace_i]\n", 300 | " self._replace_i = (self._replace_i + 1) % len(self.replacements)\n", 301 | " return replacement\n", 302 | "\n", 303 | " return self.regex.sub(get_replacement, text)\n", 304 | "\n", 305 | "\n", 306 | "def public_ip_validator(ip, public_only: bool = True) -> bool:\n", 307 | " try:\n", 308 | " ip = ipaddress.ip_address(ip)\n", 309 | " return not public_only or ip.is_global\n", 310 | " except ValueError:\n", 311 | " return False\n", 312 | "\n", 313 | "\n", 314 | "class PIIFormatter():\n", 315 | " \"\"\"\n", 316 | " Replaces email addresses and ip addresses in the document text.\n", 317 | " Args:\n", 318 | " remove_emails: Replace email addresses\n", 319 | " remove_ips: Replace IP addresses\n", 320 | " only_remove_public_ips: by default we only replace public (and thus PII) IPs\n", 321 | " email_replacement: tuple of strings to use as replacement. They will be used in a circular way\n", 322 | " ip_replacement same as email_replacement but for IP addresses\n", 323 | " \"\"\"\n", 324 | "\n", 325 | " name = \"📞 PII\"\n", 326 | "\n", 327 | " def __init__(\n", 328 | " self,\n", 329 | " remove_emails: bool = True,\n", 330 | " remove_ips: bool = True,\n", 331 | " only_remove_public_ips: bool = True,\n", 332 | " # example.com/org are actually maintained as an example\n", 333 | " email_replacement = (\"email@example.com\", \"firstname.lastname@example.org\"),\n", 334 | " # randomly generated list of ips. 
they did not respond to ping requests at the time the list was created\n", 335 | " ip_replacement = (\n", 336 | " \"22.214.171.124\",\n", 337 | " \"126.96.36.199\",\n", 338 | " \"188.8.131.52\",\n", 339 | " \"184.108.40.206\",\n", 340 | " \"220.127.116.11\",\n", 341 | " \"18.104.22.168\",\n", 342 | " ),\n", 343 | " ):\n", 344 | " super().__init__()\n", 345 | " self.remove_emails = remove_emails\n", 346 | " self.remove_ips = remove_ips\n", 347 | "\n", 348 | " self.emails_replacer = PIIReplacer(\n", 349 | " r\"\\b[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:(?:[A-Za-z0-9](?:[\"\n", 350 | " r\"A-Za-z0-9-]*[A-Za-z0-9])?\\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[\"\n", 351 | " r\"01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[A-Za-z0-9-]*[A-Za-z0-9]:)])\",\n", 352 | " email_replacement,\n", 353 | " None\n", 354 | " )\n", 355 | "\n", 356 | " self.ip_replacer = PIIReplacer(\n", 357 | " r\"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\",\n", 358 | " validator=partial(public_ip_validator, public_only=only_remove_public_ips),\n", 359 | " replacements=ip_replacement,\n", 360 | " )\n", 361 | "\n", 362 | " def format(self, text: str) -> str:\n", 363 | " if self.remove_emails:\n", 364 | " text = self.emails_replacer.replace(text)\n", 365 | " if self.remove_ips:\n", 366 | " text = self.ip_replacer.replace(text)\n", 367 | " return text\n", 368 | " \n", 369 | " \n", 370 | "pii_format = PIIFormatter().format" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": { 377 | "pycharm": { 378 | "name": "#%%\n" 379 | } 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "def detect_script(text, lang):\n", 384 | " \n", 385 | " main_script, percentage, details = sp(text)\n", 386 | " if lang == 'jpn':\n", 387 | " main_script = 'Jpan'\n", 388 | " percentage = details['details'].get('Hani', 0) + 
details['details'].get('Hira', 0) + details['details'].get('Kana', 0)\n", 389 | " \n", 390 | " return main_script, percentage" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": { 397 | "pycharm": { 398 | "name": "#%%\n" 399 | } 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "def process_json(json_data, lang, script):\n", 404 | " \n", 405 | " # pii\n", 406 | " json_data[\"content\"] = pii_format(json_data[\"content\"])\n", 407 | " content = json_data[\"content\"]\n", 408 | " \n", 409 | " warc_headers = json_data[\"warc_headers\"]\n", 410 | " uri = warc_headers[\"warc-target-uri\"]\n", 411 | "\n", 412 | " metadata = json_data[\"metadata\"]\n", 413 | " \n", 414 | " quality_warnings = metadata.get(\"quality_warnings\", [])\n", 415 | " categories = metadata.get(\"categories\", [])\n", 416 | " \n", 417 | " if categories is None:\n", 418 | " categories = []\n", 419 | "\n", 420 | " if quality_warnings is None:\n", 421 | " quality_warnings = []\n", 422 | " \n", 423 | " quality_warnings = filters.apply_filters(content, quality_warnings)\n", 424 | " categories = labeler.apply(uri, categories)\n", 425 | "\n", 426 | " metadata[\"quality_warnings\"] = sorted(set(quality_warnings))\n", 427 | " metadata[\"categories\"] = sorted(set(categories))\n", 428 | "\n", 429 | " \n", 430 | " # run script identification\n", 431 | " script_label, script_percentage = detect_script(content, lang)\n", 432 | " metadata[\"script\"] = {\"label\": script_label, \"percentage\": float(\"{:.2f}\".format(script_percentage))}\n", 433 | "\n", 434 | " \n", 435 | " ## lang identification consistency\n", 436 | " \n", 437 | " label = metadata['identification']['label']\n", 438 | " sent_labels = [int(isent['label']==label) for isent in metadata['sentence_identifications'] if isinstance(isent, dict)]\n", 439 | " metadata['identification_consistency'] = {\"percentage\": float(\"{:.2f}\".format(sum(sent_labels)/len(sent_labels))), 'num_sents': 
len(sent_labels)}\n", 440 | " \n", 441 | " \n", 442 | " return json_data\n", 443 | "\n" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "metadata": { 450 | "pycharm": { 451 | "name": "#%%\n" 452 | } 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "def Hani_nonspace_percentage(text):\n", 457 | " # Split the text on whitespace\n", 458 | " \n", 459 | " words = text.split()\n", 460 | " \n", 461 | " # Sum the lengths of tokens with 20 or more characters\n", 462 | " long_words_length = sum(len(word) for word in words if len(word) >= 20)\n", 463 | " \n", 464 | " # Fraction of the text covered by long tokens (0.0 for empty text)\n", 465 | " percentage = long_words_length / len(text) if text else 0.0\n", 466 | " \n", 467 | " return percentage" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": null, 473 | "metadata": { 474 | "pycharm": { 475 | "name": "#%%\n" 476 | } 477 | }, 478 | "outputs": [], 479 | "source": [ 480 | "def post_annotation_filter(json_data, lang, script):\n", 481 | " \n", 482 | " metadata = json_data['metadata']\n", 483 | " \n", 484 | " if 'wikipedia' in metadata[\"categories\"]:\n", 485 | " quality = set(metadata['quality_warnings']) - set([\"short_sentences\",\"header\", \"footer\", \"tiny\", \"long_word\", \"repetition\", \"repetition_sent\"])\n", 486 | " \n", 487 | " # OSCAR's 'tiny' warning is based on 5 sentences; we lower the threshold to 3\n", 488 | " if 'tiny' in metadata['quality_warnings']:\n", 489 | " \n", 490 | " if metadata['identification_consistency']['num_sents'] >= 3:\n", 491 | " metadata['quality_warnings'].remove('tiny')\n", 492 | " \n", 493 | " \n", 494 | " if metadata['script']['label'] != script and script not in ['Hani', 'Hans', 'Hant']:\n", 495 | " return False\n", 496 | " \n", 497 | " if script in ['Hani', 'Hans', 'Hant'] and metadata['script']['label'] not in ['Hani', 'Hans', 'Hant']:\n", 498 | " return False\n", 499 | " \n", 500 | " if metadata['script']['percentage'] < 0.9:\n", 501 | " return False\n", 502 | " \n", 503 | " if lang
!= 'und' and metadata['identification_consistency']['percentage'] < 0.6:\n", 504 | " return False\n", 505 | "\n", 506 | " if script in ['Hani', 'Jpan']:\n", 507 | " \n", 508 | " if Hani_nonspace_percentage(json_data['content']) < 0.3:\n", 509 | " metadata['quality_warnings'].append('hani_list_case')\n", 510 | " \n", 511 | " \n", 512 | " # for Chinese and Japanese do not remove tiny. \n", 513 | " quality = set(metadata['quality_warnings']) - set([\"short_sentences\",\"header\", \"footer\", \"tiny\", \"long_word\", \"repetition\", \"repetition_sent\"])\n", 514 | "\n", 515 | "\n", 516 | " if script in ['Latn', 'Cyrl', 'Arab', 'Grek', 'Hebr', 'Deva', 'Beng']:\n", 517 | " quality = set(metadata['quality_warnings']) - set([\"short_sentences\",\"header\", \"footer\"])\n", 518 | "\n", 519 | " else:\n", 520 | " quality = set(metadata['quality_warnings']) - set([\"short_sentences\",\"header\", \"footer\", \"long_word\", \"repetition\", \"repetition_sent\"])\n", 521 | " \n", 522 | " \n", 523 | " # quality = quality - set(['curly_bracket'])\n", 524 | " \n", 525 | " if len(quality)!=0:\n", 526 | " return False\n", 527 | " \n", 528 | " \n", 529 | " return True" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": { 536 | "pycharm": { 537 | "name": "#%%\n" 538 | } 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "import json\n", 543 | "import os\n", 544 | "from tqdm import tqdm\n", 545 | "from multiprocessing import Pool, cpu_count\n", 546 | "\n", 547 | "# Batch size\n", 548 | "batch_size = 2000\n", 549 | "\n", 550 | "\n", 551 | "# Annotate one JSONL file and write both the annotated and the filtered records\n", 552 | "def process_json_file(file_path):\n", 553 | " output_file_path = os.path.join(output_dir, os.path.basename(file_path))\n", 554 | " filter_file_path = os.path.join(filter_dir, os.path.basename(file_path))\n", 555 | "\n", 556 | " lang = os.path.basename(file_path).split('-')[0]\n", 557 | " script = os.path.basename(file_path).split('-')[-1].split('_')[0]\n", 558
| " \n", 559 | " if lang == 'multi':\n", 560 | " return None\n", 561 | " \n", 562 | " with open(file_path, 'r', encoding='utf-8') as input_file, open(output_file_path, 'w', encoding='utf-8') as output_file, open(filter_file_path, 'w', encoding='utf-8') as filter_file:\n", 563 | " batch = []\n", 564 | " batch_filter = []\n", 565 | " for line in tqdm(input_file, desc=f'Processing {file_path}'):\n", 566 | " # Parse the line as JSON; skip malformed lines\n", 567 | " try:\n", 568 | " data = json.loads(line)\n", 569 | " except json.JSONDecodeError:\n", 570 | " continue \n", 571 | " # Process the JSON object\n", 572 | " processed_data = process_json(data, lang, script)\n", 573 | " # Add processed data to the batch\n", 574 | " batch.append(processed_data)\n", 575 | "\n", 576 | " if post_annotation_filter(processed_data, lang, script):\n", 577 | " batch_filter.append(processed_data)\n", 578 | "\n", 579 | " # If the batch is full, write it to the output file\n", 580 | " if len(batch) == batch_size:\n", 581 | " for item in batch:\n", 582 | " output_file.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 583 | " batch = []\n", 584 | "\n", 585 | " # If the batch_filter is full, write it to the filter file\n", 586 | " if len(batch_filter) == batch_size:\n", 587 | " for item in batch_filter:\n", 588 | " filter_file.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 589 | " batch_filter = []\n", 590 | "\n", 591 | " # Write remaining data in the last batch to the output file\n", 592 | " for item in batch:\n", 593 | " output_file.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 594 | "\n", 595 | " # Write remaining data in the last batch_filter to the filter file\n", 596 | " for item in batch_filter:\n", 597 | " filter_file.write(json.dumps(item, ensure_ascii=False) + '\\n')\n" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": null, 603 | "metadata": { 604 | "pycharm": { 605 | "name": "#%%\n" 606 | } 607 | }, 608 | "outputs": [], 609 | "source": [ 610 | "num_cores = 40" 611 | ] 612
| }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": null, 616 | "metadata": { 617 | "pycharm": { 618 | "name": "#%%\n" 619 | } 620 | }, 621 | "outputs": [], 622 | "source": [ 623 | "import os\n", 624 | "\n", 625 | "# Function to filter paths based on file size\n", 626 | "def filter_paths_by_size(paths, max_size_mb):\n", 627 | " max_size_bytes = max_size_mb * 1024 * 1024 # Convert MB to bytes\n", 628 | " filtered_paths = [path for path in paths if os.path.isfile(path) and os.path.getsize(path) < max_size_bytes]\n", 629 | " return filtered_paths\n", 630 | "\n", 631 | "input_dir = 'res/corpus/'\n", 632 | "output_dir = 'res/annotation/'\n", 633 | "filter_dir = 'res/filter/'\n", 634 | "\n", 635 | "# Get list of input files\n", 636 | "input_files = [os.path.join(input_dir, file) for file in os.listdir(input_dir) if file.endswith('.jsonl')]\n", 637 | "input_files = sorted(input_files, key=os.path.getsize)\n", 638 | "\n", 639 | "input_files_100mgb = filter_paths_by_size(input_files, 100)\n", 640 | "\n", 641 | "rest_input_files = list(set(input_files) - set(input_files_100mgb))" 642 | ] 643 | }, 644 | { 645 | "cell_type": "code", 646 | "execution_count": null, 647 | "metadata": { 648 | "pycharm": { 649 | "name": "#%%\n" 650 | } 651 | }, 652 | "outputs": [], 653 | "source": [ 654 | "# Define the number of processes to run in parallel\n", 655 | "num_processes = min(len(input_files_100mgb), num_cores)\n", 656 | "\n", 657 | "# Create a pool of processes\n", 658 | "with Pool(processes=num_processes) as pool:\n", 659 | " # Map the processing function to each input file\n", 660 | " pool.map(process_json_file, input_files_100mgb)\n" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "metadata": { 667 | "pycharm": { 668 | "name": "#%%\n" 669 | } 670 | }, 671 | "outputs": [], 672 | "source": [ 673 | "import multiprocessing as mp\n", 674 | "import subprocess\n", 675 | "\n", 676 | "def process_batch(lines, lang, script):\n", 677 | " 
\n", 678 | " batch = []\n", 679 | " batch_filter = []\n", 680 | " for line in lines:\n", 681 | " # Parse the line as JSON; skip malformed lines\n", 682 | " try:\n", 683 | " data = json.loads(line)\n", 684 | " except json.JSONDecodeError:\n", 685 | " continue \n", 686 | " # Process the JSON object\n", 687 | " processed_data = process_json(data, lang, script)\n", 688 | " # Add processed data to the batch\n", 689 | " batch.append(processed_data)\n", 690 | "\n", 691 | " if post_annotation_filter(processed_data, lang, script):\n", 692 | " batch_filter.append(processed_data)\n", 693 | " \n", 694 | " return batch, batch_filter\n", 695 | "\n", 696 | "\n", 697 | "def check_and_remove_empty_file(file_path):\n", 698 | " \"\"\"\n", 699 | " Check if a file at file_path is empty and remove it if it is.\n", 700 | " \n", 701 | " Parameters:\n", 702 | " file_path (str): The path to the file to be checked and potentially removed.\n", 703 | " \"\"\"\n", 704 | " if os.path.isfile(file_path) and os.path.getsize(file_path) == 0:\n", 705 | " os.remove(file_path)\n", 706 | "\n", 707 | "\n", 708 | "def process_chunk(start_line, end_line, input_file_path, chunk_index, batch_size=2000):\n", 709 | " \n", 710 | " print(\"chunk_index\", chunk_index)\n", 711 | " \n", 712 | " output_file_path = os.path.join(output_dir, os.path.basename(input_file_path).replace('.jsonl', f'_{chunk_index}.jsonl'))\n", 713 | " filter_file_path = os.path.join(filter_dir, os.path.basename(input_file_path).replace('.jsonl', f'_{chunk_index}.jsonl'))\n", 714 | " \n", 715 | " with open(input_file_path, 'r', encoding='utf-8') as infile, open(output_file_path, 'w', encoding='utf-8') as outfile, open(filter_file_path, 'w', encoding='utf-8') as filterfile:\n", 716 | " \n", 717 | " lang = os.path.basename(input_file_path).split('-')[0]\n", 718 | " script = os.path.basename(input_file_path).split('-')[-1].split('_')[0]\n", 719 | "\n", 720 | " \n", 721 | " current_line = 0\n", 722 | " lines = []\n", 723 | " \n", 724 | " for line in infile:\n", 725 | " if current_line >=
start_line:\n", 726 | " lines.append(line)\n", 727 | " current_line += 1\n", 728 | "\n", 729 | " if current_line >= end_line:\n", 730 | " break\n", 731 | "\n", 732 | " if len(lines) == batch_size:\n", 733 | " # Process the lines (replace this with your actual processing function)\n", 734 | " processed_lines, processed_filtered_lines = process_batch(lines, lang, script)\n", 735 | " \n", 736 | " for item in processed_lines:\n", 737 | " outfile.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 738 | "\n", 739 | " for item in processed_filtered_lines:\n", 740 | " filterfile.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 741 | "\n", 742 | " lines = []\n", 743 | "\n", 744 | " # Process any remaining lines in the last batch\n", 745 | " if lines:\n", 746 | " processed_lines, processed_filtered_lines = process_batch(lines, lang, script)\n", 747 | "\n", 748 | " for item in processed_lines:\n", 749 | " outfile.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 750 | "\n", 751 | " for item in processed_filtered_lines:\n", 752 | " filterfile.write(json.dumps(item, ensure_ascii=False) + '\\n')\n", 753 | "\n", 754 | "\n", 755 | " check_and_remove_empty_file(output_file_path)\n", 756 | " check_and_remove_empty_file(filter_file_path)\n", 757 | " \n", 758 | " return True\n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | "# def count_lines(file_path):\n", 763 | "# result = subprocess.run(['sed', '-n', '$=', file_path], capture_output=True, text=True)\n", 764 | "# return int(result.stdout.strip())\n", 765 | " \n", 766 | "\n", 767 | "def count_lines(filename):\n", 768 | " count = 0\n", 769 | " with open(filename, 'rb') as f:\n", 770 | " while chunk := f.read(1024*1024*1024):\n", 771 | " count += chunk.count(b'\\n')\n", 772 | " return count \n", 773 | "\n", 774 | "def chunkify(file_path, num_chunks):\n", 775 | " # with open(file_path, 'r') as infile:\n", 776 | " # total_lines = sum(1 for _ in infile)\n", 777 | " print(file_path)\n", 778 | " total_lines = 
count_lines(file_path)\n", 779 | " chunk_size = total_lines // num_chunks\n", 780 | " chunks = []\n", 781 | " for i in range(num_chunks):\n", 782 | " start_line = i * chunk_size\n", 783 | " end_line = start_line + chunk_size if i != num_chunks - 1 else total_lines\n", 784 | " chunks.append((start_line, end_line))\n", 785 | " \n", 786 | " return chunks\n", 787 | "\n", 788 | "def parallel(input_file, num_cores):\n", 789 | " \n", 790 | " print(\"start\", input_file)\n", 791 | " chunks = chunkify(input_file, num_cores)\n", 792 | " print(\"chunks\", len(chunks))\n", 793 | " \n", 794 | " processes = []\n", 795 | " for chunk_index, (start_line, end_line) in enumerate(chunks):\n", 796 | " p = mp.Process(target=process_chunk, args=(start_line, end_line, input_file, chunk_index))\n", 797 | " processes.append(p)\n", 798 | " p.start()\n", 799 | " \n", 800 | " for p in processes:\n", 801 | " p.join()\n" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": null, 807 | "metadata": { 808 | "pycharm": { 809 | "name": "#%%\n" 810 | } 811 | }, 812 | "outputs": [], 813 | "source": [ 814 | "rest_input_files = sorted(rest_input_files, key=os.path.getsize)\n", 815 | "\n", 816 | "for r in rest_input_files:\n", 817 | " parallel(r, num_cores)\n" 818 | ] 819 | } 820 | ], 821 | "metadata": { 822 | "kernelspec": { 823 | "display_name": "lid", 824 | "language": "python", 825 | "name": "lid" 826 | }, 827 | "language_info": { 828 | "codemirror_mode": { 829 | "name": "ipython", 830 | "version": 3 831 | }, 832 | "file_extension": ".py", 833 | "mimetype": "text/x-python", 834 | "name": "python", 835 | "nbconvert_exporter": "python", 836 | "pygments_lexer": "ipython3", 837 | "version": "3.8.16" 838 | } 839 | }, 840 | "nbformat": 4, 841 | "nbformat_minor": 2 842 | } -------------------------------------------------------------------------------- /statistics/analyze-stat-2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": 
[ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import os\n", 10 | "import pandas as pd\n", 11 | "import json\n", 12 | "import duckdb" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "# Define the main directory\n", 22 | "main_dir = 'v1.0'\n", 23 | "output_dir = 'v1.0-stat-2'" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# Ensure the output directory exists\n", 33 | "os.makedirs(output_dir, exist_ok=True)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import duckdb\n", 43 | "import os\n", 44 | "import json\n", 45 | "\n", 46 | "def process_parquet(file_path):\n", 47 | " # Connect to an in-memory DuckDB instance\n", 48 | " con = duckdb.connect()\n", 49 | "\n", 50 | " # Load the Parquet file into a DuckDB table\n", 51 | " con.execute(f\"CREATE TABLE parquet_table AS SELECT * FROM '{file_path}'\")\n", 52 | "\n", 53 | " # Calculate the number of records\n", 54 | " num_records = con.execute(\"SELECT COUNT(*) FROM parquet_table\").fetchone()[0]\n", 55 | "\n", 56 | " # Add a new column 'num_words' (whitespace-delimited token count of 'content')\n", 57 | " con.execute(\"ALTER TABLE parquet_table ADD COLUMN num_words INT\")\n", 58 | " con.execute(\"UPDATE parquet_table SET num_words = array_length(str_split_regex(content, '\\\\s+'), 1)\")\n", 59 | "\n", 60 | " # Select the relevant columns and calculate the summations\n", 61 | " query = \"\"\"\n", 62 | " SELECT\n", 63 | " SUM(\"content-length\") AS total_content_length_sum,\n", 64 | " SUM(\"num-sents\") AS num_sents_sum,\n", 65 | " SUM(num_words) AS num_words_sum,\n", 66 | " SUM(CASE WHEN 'religious' = ANY(categories) OR 'associations_religieuses' = ANY(categories) THEN 1 ELSE 0 END) AS count_religious,\n", 67 | "
SUM(CASE WHEN 'wikipedia' = ANY(categories) THEN 1 ELSE 0 END) AS count_wikipedia\n", 68 | " FROM parquet_table\n", 69 | " \"\"\"\n", 70 | " result = con.execute(query).fetchone()\n", 71 | "\n", 72 | " # Create a result dictionary\n", 73 | " result_dict = {\n", 74 | " 'file_path': file_path,\n", 75 | " 'num_records': str(num_records),\n", 76 | " 'total_content_length_sum': str(result[0]),\n", 77 | " 'num_sents_sum': str(result[1]),\n", 78 | " 'num_words_sum': str(result[2]),\n", 79 | " 'religious_num_records': str(result[3]),\n", 80 | " 'wikipedia_num_records': str(result[4])\n", 81 | " }\n", 82 | "\n", 83 | " # Define output path\n", 84 | " relative_path = os.path.relpath(file_path, main_dir)\n", 85 | " output_path = os.path.join(output_dir, relative_path + '.json')\n", 86 | "\n", 87 | " # Ensure the output directory exists\n", 88 | " os.makedirs(os.path.dirname(output_path), exist_ok=True)\n", 89 | "\n", 90 | " # Save the result to a JSON file\n", 91 | " with open(output_path, 'w') as f:\n", 92 | " json.dump(result_dict, f, indent=4)\n", 93 | "\n", 94 | " return None\n" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "# Collect all parquet files\n", 104 | "parquet_files = []\n", 105 | "for root, dirs, files in os.walk(main_dir):\n", 106 | " for file in files:\n", 107 | " if file.endswith('.parquet'):\n", 108 | " file_path = os.path.join(root, file)\n", 109 | " parquet_files.append(file_path)\n", 110 | " \n", 111 | "parquet_files.sort(key=os.path.getsize)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "from tqdm import tqdm\n", 121 | "\n", 122 | "for file_path in tqdm(parquet_files):\n", 123 | " process_parquet(file_path)\n", 124 | "\n" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | 
"source": [ 133 | "! zip -r v1.0-stat-2.zip v1.0-stat-2/*" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [] 142 | } 143 | ], 144 | "metadata": { 145 | "kernelspec": { 146 | "display_name": "lid", 147 | "language": "python", 148 | "name": "lid" 149 | }, 150 | "language_info": { 151 | "codemirror_mode": { 152 | "name": "ipython", 153 | "version": 3 154 | }, 155 | "file_extension": ".py", 156 | "mimetype": "text/x-python", 157 | "name": "python", 158 | "nbconvert_exporter": "python", 159 | "pygments_lexer": "ipython3", 160 | "version": "3.8.16" 161 | } 162 | }, 163 | "nbformat": 4, 164 | "nbformat_minor": 5 165 | } 166 | -------------------------------------------------------------------------------- /statistics/v1.0-stat-2.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cisnlp/GlotCC/824b072bc82558dd0ea8d1176976aba1f207b57c/statistics/v1.0-stat-2.zip --------------------------------------------------------------------------------