├── .gitignore ├── CODE_LICENSE ├── DATA_LICENSE.md ├── README.md ├── mmc4_arxiv.pdf ├── mmc4_logo.png └── scripts ├── compute_assignments.py ├── download_images.py ├── linear_assignment.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # General 2 | .DS_Store 3 | .AppleDouble 4 | .LSOverride 5 | 6 | # File Related 7 | *.zip -------------------------------------------------------------------------------- /CODE_LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Allen Institute for AI 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
-------------------------------------------------------------------------------- /DATA_LICENSE.md: -------------------------------------------------------------------------------- 1 | # ODC Attribution License (ODC-By) 2 | 3 | ### Preamble 4 | 5 | The Open Data Commons Attribution License is a license agreement 6 | intended to allow users to freely share, modify, and use this Database 7 | subject only to the attribution requirements set out in Section 4. 8 | 9 | Databases can contain a wide variety of types of content (images, 10 | audiovisual material, and sounds all in the same database, for example), 11 | and so this license only governs the rights over the Database, and not 12 | the contents of the Database individually. Licensors may therefore wish 13 | to use this license together with another license for the contents. 14 | 15 | Sometimes the contents of a database, or the database itself, can be 16 | covered by other rights not addressed here (such as private contracts, 17 | trademark over the name, or privacy rights / data protection rights 18 | over information in the contents), and so you are advised that you may 19 | have to consult other documents or clear other rights before doing 20 | activities not covered by this License. 21 | 22 | ------ 23 | 24 | The Licensor (as defined below) 25 | 26 | and 27 | 28 | You (as defined below) 29 | 30 | agree as follows: 31 | 32 | ### 1.0 Definitions of Capitalised Words 33 | 34 | "Collective Database" - Means this Database in unmodified form as part 35 | of a collection of independent databases in themselves that together are 36 | assembled into a collective whole. A work that constitutes a Collective 37 | Database will not be considered a Derivative Database. 38 | 39 | "Convey" - As a verb, means Using the Database, a Derivative Database, 40 | or the Database as part of a Collective Database in any way that enables 41 | a Person to make or receive copies of the Database or a Derivative 42 | Database. 
Conveying does not include interaction with a user through a 43 | computer network, or creating and Using a Produced Work, where no 44 | transfer of a copy of the Database or a Derivative Database occurs. 45 | 46 | "Contents" - The contents of this Database, which includes the 47 | information, independent works, or other material collected into the 48 | Database. For example, the contents of the Database could be factual 49 | data or works such as images, audiovisual material, text, or sounds. 50 | 51 | "Database" - A collection of material (the Contents) arranged in a 52 | systematic or methodical way and individually accessible by electronic 53 | or other means offered under the terms of this License. 54 | 55 | "Database Directive" - Means Directive 96/9/EC of the European 56 | Parliament and of the Council of 11 March 1996 on the legal protection 57 | of databases, as amended or succeeded. 58 | 59 | "Database Right" - Means rights resulting from the Chapter III ("sui 60 | generis") rights in the Database Directive (as amended and as transposed 61 | by member states), which includes the Extraction and Re-utilisation of 62 | the whole or a Substantial part of the Contents, as well as any similar 63 | rights available in the relevant jurisdiction under Section 10.4. 64 | 65 | "Derivative Database" - Means a database based upon the Database, and 66 | includes any translation, adaptation, arrangement, modification, or any 67 | other alteration of the Database or of a Substantial part of the 68 | Contents. This includes, but is not limited to, Extracting or 69 | Re-utilising the whole or a Substantial part of the Contents in a new 70 | Database. 71 | 72 | "Extraction" - Means the permanent or temporary transfer of all or a 73 | Substantial part of the Contents to another medium by any means or in 74 | any form. 75 | 76 | "License" - Means this license agreement and is both a license of rights 77 | such as copyright and Database Rights and an agreement in contract. 
78 | 79 | "Licensor" - Means the Person that offers the Database under the terms 80 | of this License. 81 | 82 | "Person" - Means a natural or legal person or a body of persons 83 | corporate or incorporate. 84 | 85 | "Produced Work" - a work (such as an image, audiovisual material, text, 86 | or sounds) resulting from using the whole or a Substantial part of the 87 | Contents (via a search or other query) from this Database, a Derivative 88 | Database, or this Database as part of a Collective Database. 89 | 90 | "Publicly" - means to Persons other than You or under Your control by 91 | either more than 50% ownership or by the power to direct their 92 | activities (such as contracting with an independent consultant). 93 | 94 | "Re-utilisation" - means any form of making available to the public all 95 | or a Substantial part of the Contents by the distribution of copies, by 96 | renting, by online or other forms of transmission. 97 | 98 | "Substantial" - Means substantial in terms of quantity or quality or a 99 | combination of both. The repeated and systematic Extraction or 100 | Re-utilisation of insubstantial parts of the Contents may amount to the 101 | Extraction or Re-utilisation of a Substantial part of the Contents. 102 | 103 | "Use" - As a verb, means doing any act that is restricted by copyright 104 | or Database Rights whether in the original medium or any other; and 105 | includes without limitation distributing, copying, publicly performing, 106 | publicly displaying, and preparing derivative works of the Database, as 107 | well as modifying the Database as may be technically necessary to use it 108 | in a different mode or format. 109 | 110 | "You" - Means a Person exercising rights under this License who has not 111 | previously violated the terms of this License with respect to the 112 | Database, or who has received express permission from the Licensor to 113 | exercise rights under this License despite a previous violation. 
114 | 115 | Words in the singular include the plural and vice versa. 116 | 117 | ### 2.0 What this License covers 118 | 119 | 2.1. Legal effect of this document. This License is: 120 | 121 | a. A license of applicable copyright and neighbouring rights; 122 | 123 | b. A license of the Database Right; and 124 | 125 | c. An agreement in contract between You and the Licensor. 126 | 127 | 2.2 Legal rights covered. This License covers the legal rights in the 128 | Database, including: 129 | 130 | a. Copyright. Any copyright or neighbouring rights in the Database. 131 | The copyright licensed includes any individual elements of the 132 | Database, but does not cover the copyright over the Contents 133 | independent of this Database. See Section 2.4 for details. Copyright 134 | law varies between jurisdictions, but is likely to cover: the Database 135 | model or schema, which is the structure, arrangement, and organisation 136 | of the Database, and can also include the Database tables and table 137 | indexes; the data entry and output sheets; and the Field names of 138 | Contents stored in the Database; 139 | 140 | b. Database Rights. Database Rights only extend to the Extraction and 141 | Re-utilisation of the whole or a Substantial part of the Contents. 142 | Database Rights can apply even when there is no copyright over the 143 | Database. Database Rights can also apply when the Contents are removed 144 | from the Database and are selected and arranged in a way that would 145 | not infringe any applicable copyright; and 146 | 147 | c. Contract. This is an agreement between You and the Licensor for 148 | access to the Database. In return you agree to certain conditions of 149 | use on this access as outlined in this License. 150 | 151 | 2.3 Rights not covered. 152 | 153 | a. This License does not apply to computer programs used in the making 154 | or operation of the Database; 155 | 156 | b. 
This License does not cover any patents over the Contents or the 157 | Database; and 158 | 159 | c. This License does not cover any trademarks associated with the 160 | Database. 161 | 162 | 2.4 Relationship to Contents in the Database. The individual items of 163 | the Contents contained in this Database may be covered by other rights, 164 | including copyright, patent, data protection, privacy, or personality 165 | rights, and this License does not cover any rights (other than Database 166 | Rights or in contract) in individual Contents contained in the Database. 167 | For example, if used on a Database of images (the Contents), this 168 | License would not apply to copyright over individual images, which could 169 | have their own separate licenses, or one single license covering all of 170 | the rights over the images. 171 | 172 | ### 3.0 Rights granted 173 | 174 | 3.1 Subject to the terms and conditions of this License, the Licensor 175 | grants to You a worldwide, royalty-free, non-exclusive, terminable (but 176 | only under Section 9) license to Use the Database for the duration of 177 | any applicable copyright and Database Rights. These rights explicitly 178 | include commercial use, and do not exclude any field of endeavour. To 179 | the extent possible in the relevant jurisdiction, these rights may be 180 | exercised in all media and formats whether now known or created in the 181 | future. 182 | 183 | The rights granted cover, for example: 184 | 185 | a. Extraction and Re-utilisation of the whole or a Substantial part of 186 | the Contents; 187 | 188 | b. Creation of Derivative Databases; 189 | 190 | c. Creation of Collective Databases; 191 | 192 | d. Creation of temporary or permanent reproductions by any means and 193 | in any form, in whole or in part, including of any Derivative 194 | Databases or as a part of Collective Databases; and 195 | 196 | e. 
Distribution, communication, display, lending, making available, or 197 | performance to the public by any means and in any form, in whole or in 198 | part, including of any Derivative Database or as a part of Collective 199 | Databases. 200 | 201 | 3.2 Compulsory license schemes. For the avoidance of doubt: 202 | 203 | a. Non-waivable compulsory license schemes. In those jurisdictions in 204 | which the right to collect royalties through any statutory or 205 | compulsory licensing scheme cannot be waived, the Licensor reserves 206 | the exclusive right to collect such royalties for any exercise by You 207 | of the rights granted under this License; 208 | 209 | b. Waivable compulsory license schemes. In those jurisdictions in 210 | which the right to collect royalties through any statutory or 211 | compulsory licensing scheme can be waived, the Licensor waives the 212 | exclusive right to collect such royalties for any exercise by You of 213 | the rights granted under this License; and, 214 | 215 | c. Voluntary license schemes. The Licensor waives the right to collect 216 | royalties, whether individually or, in the event that the Licensor is 217 | a member of a collecting society that administers voluntary licensing 218 | schemes, via that society, from any exercise by You of the rights 219 | granted under this License. 220 | 221 | 3.3 The right to release the Database under different terms, or to stop 222 | distributing or making available the Database, is reserved. Note that 223 | this Database may be multiple-licensed, and so You may have the choice 224 | of using alternative licenses for this Database. Subject to Section 225 | 10.4, all other rights not expressly granted by Licensor are reserved. 226 | 227 | ### 4.0 Conditions of Use 228 | 229 | 4.1 The rights granted in Section 3 above are expressly made subject to 230 | Your complying with the following conditions of use. 
These are important 231 | conditions of this License, and if You fail to follow them, You will be 232 | in material breach of its terms. 233 | 234 | 4.2 Notices. If You Publicly Convey this Database, any Derivative 235 | Database, or the Database as part of a Collective Database, then You 236 | must: 237 | 238 | a. Do so only under the terms of this License; 239 | 240 | b. Include a copy of this License or its Uniform Resource Identifier (URI) 241 | with the Database or Derivative Database, including both in the 242 | Database or Derivative Database and in any relevant documentation; 243 | 244 | c. Keep intact any copyright or Database Right notices and notices 245 | that refer to this License; and 246 | 247 | d. If it is not possible to put the required notices in a particular 248 | file due to its structure, then You must include the notices in a 249 | location (such as a relevant directory) where users would be likely to 250 | look for it. 251 | 252 | 4.3 Notice for using output (Contents). Creating and Using a Produced 253 | Work does not require the notice in Section 4.2. However, if you 254 | Publicly Use a Produced Work, You must include a notice associated with 255 | the Produced Work reasonably calculated to make any Person that uses, 256 | views, accesses, interacts with, or is otherwise exposed to the Produced 257 | Work aware that Content was obtained from the Database, Derivative 258 | Database, or the Database as part of a Collective Database, and that it 259 | is available under this License. 260 | 261 | a. Example notice. The following text will satisfy notice under 262 | Section 4.3: 263 | 264 | Contains information from DATABASE NAME which is made available 265 | under the ODC Attribution License. 266 | 267 | DATABASE NAME should be replaced with the name of the Database and a 268 | hyperlink to the location of the Database. "ODC Attribution License" 269 | should contain a hyperlink to the URI of the text of this License. 
If 270 | hyperlinks are not possible, You should include the plain text of the 271 | required URI's with the above notice. 272 | 273 | 4.4 Licensing of others. You may not sublicense the Database. Each time 274 | You communicate the Database, the whole or Substantial part of the 275 | Contents, or any Derivative Database to anyone else in any way, the 276 | Licensor offers to the recipient a license to the Database on the same 277 | terms and conditions as this License. You are not responsible for 278 | enforcing compliance by third parties with this License, but You may 279 | enforce any rights that You have over a Derivative Database. You are 280 | solely responsible for any modifications of a Derivative Database made 281 | by You or another Person at Your direction. You may not impose any 282 | further restrictions on the exercise of the rights granted or affirmed 283 | under this License. 284 | 285 | ### 5.0 Moral rights 286 | 287 | 5.1 Moral rights. This section covers moral rights, including any rights 288 | to be identified as the author of the Database or to object to treatment 289 | that would otherwise prejudice the author's honour and reputation, or 290 | any other derogatory treatment: 291 | 292 | a. For jurisdictions allowing waiver of moral rights, Licensor waives 293 | all moral rights that Licensor may have in the Database to the fullest 294 | extent possible by the law of the relevant jurisdiction under Section 295 | 10.4; 296 | 297 | b. If waiver of moral rights under Section 5.1 a in the relevant 298 | jurisdiction is not possible, Licensor agrees not to assert any moral 299 | rights over the Database and waives all claims in moral rights to the 300 | fullest extent possible by the law of the relevant jurisdiction under 301 | Section 10.4; and 302 | 303 | c. 
For jurisdictions not allowing waiver or an agreement not to assert 304 | moral rights under Section 5.1 a and b, the author may retain their 305 | moral rights over certain aspects of the Database. 306 | 307 | Please note that some jurisdictions do not allow for the waiver of moral 308 | rights, and so moral rights may still subsist over the Database in some 309 | jurisdictions. 310 | 311 | ### 6.0 Fair dealing, Database exceptions, and other rights not affected 312 | 313 | 6.1 This License does not affect any rights that You or anyone else may 314 | independently have under any applicable law to make any use of this 315 | Database, including without limitation: 316 | 317 | a. Exceptions to the Database Right including: Extraction of Contents 318 | from non-electronic Databases for private purposes, Extraction for 319 | purposes of illustration for teaching or scientific research, and 320 | Extraction or Re-utilisation for public security or an administrative 321 | or judicial procedure. 322 | 323 | b. Fair dealing, fair use, or any other legally recognised limitation 324 | or exception to infringement of copyright or other applicable laws. 325 | 326 | 6.2 This License does not affect any rights of lawful users to Extract 327 | and Re-utilise insubstantial parts of the Contents, evaluated 328 | quantitatively or qualitatively, for any purposes whatsoever, including 329 | creating a Derivative Database (subject to other rights over the 330 | Contents, see Section 2.4). The repeated and systematic Extraction or 331 | Re-utilisation of insubstantial parts of the Contents may however amount 332 | to the Extraction or Re-utilisation of a Substantial part of the 333 | Contents. 334 | 335 | ### 7.0 Warranties and Disclaimer 336 | 337 | 7.1 The Database is licensed by the Licensor "as is" and without any 338 | warranty of any kind, either express, implied, or arising by statute, 339 | custom, course of dealing, or trade usage. 
Licensor specifically 340 | disclaims any and all implied warranties or conditions of title, 341 | non-infringement, accuracy or completeness, the presence or absence of 342 | errors, fitness for a particular purpose, merchantability, or otherwise. 343 | Some jurisdictions do not allow the exclusion of implied warranties, so 344 | this exclusion may not apply to You. 345 | 346 | ### 8.0 Limitation of liability 347 | 348 | 8.1 Subject to any liability that may not be excluded or limited by law, 349 | the Licensor is not liable for, and expressly excludes, all liability 350 | for loss or damage however and whenever caused to anyone by any use 351 | under this License, whether by You or by anyone else, and whether caused 352 | by any fault on the part of the Licensor or not. This exclusion of 353 | liability includes, but is not limited to, any special, incidental, 354 | consequential, punitive, or exemplary damages such as loss of revenue, 355 | data, anticipated profits, and lost business. This exclusion applies 356 | even if the Licensor has been advised of the possibility of such 357 | damages. 358 | 359 | 8.2 If liability may not be excluded by law, it is limited to actual and 360 | direct financial loss to the extent it is caused by proved negligence on 361 | the part of the Licensor. 362 | 363 | ### 9.0 Termination of Your rights under this License 364 | 365 | 9.1 Any breach by You of the terms and conditions of this License 366 | automatically terminates this License with immediate effect and without 367 | notice to You. For the avoidance of doubt, Persons who have received the 368 | Database, the whole or a Substantial part of the Contents, Derivative 369 | Databases, or the Database as part of a Collective Database from You 370 | under this License will not have their licenses terminated provided 371 | their use is in full compliance with this License or a license granted 372 | under Section 4.8 of this License. 
Sections 1, 2, 7, 8, 9 and 10 will 373 | survive any termination of this License. 374 | 375 | 9.2 If You are not in breach of the terms of this License, the Licensor 376 | will not terminate Your rights under it. 377 | 378 | 9.3 Unless terminated under Section 9.1, this License is granted to You 379 | for the duration of applicable rights in the Database. 380 | 381 | 9.4 Reinstatement of rights. If you cease any breach of the terms and 382 | conditions of this License, then your full rights under this License 383 | will be reinstated: 384 | 385 | a. Provisionally and subject to permanent termination until the 60th 386 | day after cessation of breach; 387 | 388 | b. Permanently on the 60th day after cessation of breach unless 389 | otherwise reasonably notified by the Licensor; or 390 | 391 | c. Permanently if reasonably notified by the Licensor of the 392 | violation, this is the first time You have received notice of 393 | violation of this License from the Licensor, and You cure the 394 | violation prior to 30 days after your receipt of the notice. 395 | 396 | 9.5 Notwithstanding the above, Licensor reserves the right to release 397 | the Database under different license terms or to stop distributing or 398 | making available the Database. Releasing the Database under different 399 | license terms or stopping the distribution of the Database will not 400 | withdraw this License (or any other license that has been, or is 401 | required to be, granted under the terms of this License), and this 402 | License will continue in full force and effect unless terminated as 403 | stated above. 404 | 405 | ### 10.0 General 406 | 407 | 10.1 If any provision of this License is held to be invalid or 408 | unenforceable, that must not affect the validity or enforceability of 409 | the remainder of the terms and conditions of this License and each 410 | remaining provision of this License shall be valid and enforced to the 411 | fullest extent permitted by law. 
412 | 413 | 10.2 This License is the entire agreement between the parties with 414 | respect to the rights granted here over the Database. It replaces any 415 | earlier understandings, agreements or representations with respect to 416 | the Database. 417 | 418 | 10.3 If You are in breach of the terms of this License, You will not be 419 | entitled to rely on the terms of this License or to complain of any 420 | breach by the Licensor. 421 | 422 | 10.4 Choice of law. This License takes effect in and will be governed by 423 | the laws of the relevant jurisdiction in which the License terms are 424 | sought to be enforced. If the standard suite of rights granted under 425 | applicable copyright law and Database Rights in the relevant 426 | jurisdiction includes additional rights not granted under this License, 427 | these additional rights are granted in this License in order to meet the 428 | terms of this License. 429 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | 3 |

4 | 5 |

:camera: :memo: Multimodal C4 (mmc4) :memo: :camera:

6 | 7 |

An open, billion-scale corpus of images interleaved with text.

8 |

arXiv paper with curation details out now!

9 | 10 |
11 | 12 | ## Updates 13 | 14 | - **mmc4 is available once again!** A huge thanks to [Weizhi Wang](https://victorwz.github.io/) and [Zekun Li](https://github.com/Leezekun/) for saving mmc4-ff and mmc4-core-ff! 15 | - The original copies of mmc4 at ai2 were accidentally deleted in Feb 2025. [If you have any of the original copies of the dataset from before Feb. 2025, do let me know!](#missing-data) 16 | - Released mmc4 version 1.1 :fire:, which fixes https://github.com/allenai/mmc4/issues/11 and https://github.com/allenai/mmc4/issues/10 17 | 18 | ## Corpus stats (v1.1) 19 | 20 | | | # images | # docs | # tokens | 21 | |-----------------------------------------------------|----------|--------|----------| 22 | | Multimodal-C4 (mmc4) | 571M | 101.2M | 43B | 23 | | Multimodal-C4 fewer-faces** (mmc4-ff) | 375M | 77.7M | 33B | 24 | | Multimodal-C4 core (mmc4-core) | 29.9M | 7.3M | 2.4B | 25 | | Multimodal-C4 core fewer-faces** (mmc4-core-ff) | 22.4M | 5.5M | 1.8B | 26 | 27 | ** = available for direct download 28 | 29 | More details about these datasets and our processing steps [can be found in our paper](https://arxiv.org/abs/2304.06939). 30 | 31 | ## Accessing mmc4-ff 32 | 33 | ### Documents 34 | 35 | Now hosted on Hugging Face: 36 | 37 | - mmc4 fewer faces (~218GB): [jmhessel/mmc4-ff](https://huggingface.co/datasets/jmhessel/mmc4-ff) 38 | - mmc4 core fewer faces (~20GB): [jmhessel/mmc4-core-ff](https://huggingface.co/datasets/jmhessel/mmc4-core-ff) 39 | 40 | The dataset is split into JSONL shards. 41 | - Shard numbers range from 0 to 23098. [14 shards are missing and are not included in the dataset](#the-missing-shards-%EF%B8%8F). 42 | - Each shard is a JSONL file of documents; each line is one document. 43 | 44 | Documents contain text, image URLs, assignments of images to sentences, and image-by-text CLIP ViT-L/14 similarity matrices. 
45 | 46 | Specifically: 47 | 48 | - `text_list`: a list of sentences comprising the text of the document 49 | - `url`: the original URL where the document was hosted 50 | - `image_info`: a list of images. Each entry contains: 51 | - `image_name`: a filename you can save the downloaded image to 52 | - `raw_url`: the original URL the image was hosted at 53 | - `face_detections`: `None` if no faces are detected (which should be the case in "fewer faces") 54 | - `matched_text_index`: the index within `text_list` of the sentence that this image is matched to 55 | - `matched_sim`: the CLIP ViT-L/14 similarity between the image and the sentence at the matched index 56 | - `similarity_matrix`: a matrix of shape `len(image_info) x len(text_list)` where `similarity_matrix[i, j]` is the CLIP ViT-L/14 similarity between image `i` and sentence `j`. 57 | - `could_have_url_duplicate`: a small number of URLs (~3%) in the corpus may have duplicate entries because Common Crawl collects multiple snapshots over time. We downsample such that, in expectation, each URL occurs once, but duplicates are technically possible. You can discard all entries with `could_have_url_duplicate` equal to 1 if you want a more strictly deduplicated set. 
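As a minimal sketch of how these fields fit together (the shard filename, the 0.25 similarity threshold, and the helper name below are illustrative, not part of the release), one shard can be streamed like so:

```python
import json

def load_shard(path, sim_threshold=0.25, drop_possible_duplicates=True):
    """Yield (image_name, matched sentence) pairs from one mmc4 shard jsonl.

    The 0.25 default threshold is illustrative, not an official cutoff.
    """
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            # Optionally enforce strict URL-level dedup via `could_have_url_duplicate`.
            if drop_possible_duplicates and doc.get("could_have_url_duplicate", 0) == 1:
                continue
            for image in doc["image_info"]:
                # Keep only reasonably confident CLIP image-sentence matches.
                if image["matched_sim"] >= sim_threshold:
                    yield image["image_name"], doc["text_list"][image["matched_text_index"]]
```

For example, `list(load_shard('docs_shard_0_v2.jsonl'))` yields the confident image-sentence pairs of one downloaded shard.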
57 | 58 | Here's an example: 59 | 60 | ``` 61 | {'image_info': [{'face_detections': None, 62 | 'image_name': 'b9040a0dbb22.jpg', 63 | 'matched_sim': 0.27694183588027954, 64 | 'matched_text_index': 2, 65 | 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'}, 66 | {'face_detections': None, 67 | 'image_name': 'db1c21bc8474.jpg', 68 | 'matched_sim': 0.3234919607639313, 69 | 'matched_text_index': 1, 70 | 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}], 71 | 'similarity_matrix': [[0.24363446235656738, 72 | 0.31758785247802734, 73 | 0.27694183588027954], 74 | [0.2233106791973114, 75 | 0.3234919607639313, 76 | 0.26118797063827515]], 77 | 'text_list': ['When you lock the door using the lock tab on the driver’s ' 78 | 'door, all of the other doors and tailgate lock at the same ' 79 | 'time.', 80 | 'Press the master door lock switch in as shown to lock or ' 81 | 'unlock all doors and the tailgate.', 82 | 'When you lock/unlock the driver’s door and tailgate using the ' 83 | 'master lock switch, all the other doors lock/ unlock at the ' 84 | 'same time.'], 85 | 'url': 'http://www.hfitinfo.com/hofi-48.html', 86 | 'could_have_url_duplicate': 0 } 87 | ``` 88 | The assignments of images to sentences are computed using [compute_assignments.py](https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py). 89 | 90 | ## Accessing raw images 91 | 92 | Raw images can be downloaded from the provided URLs in the documents using [this script](scripts/download_images.py). The intent is to respect folks who have removed images from the web, and not to redistribute those images. 93 | 94 | However, we understand that some of the URLs may be stale, which can harm reproducibility efforts. If you're interested in updates regarding raw image availability, you can contact us using [this Google form](https://forms.gle/fPSXY359MT1VvF1g8). 95 | 96 | ## The missing shards ⛏️💎🔍 97 | 98 | 14 of the 23099 shards (~0.06%) are missing from the corpus. 
These were not included in any statistics or experiments, so they are not part of mmc4. The missing shards are: 99 | 100 | ``` 101 | 3218,3267,5064,5146,7119,8991,9750,11899,15127,15252,16996,17369,17499,17818 102 | ``` 103 | 104 | ## License 105 | 106 | - The new contributions of mmc4 beyond text-only c4 (e.g., the similarity matrices/image-text alignments) are released under [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). 107 | - By using mmc4, be aware that you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/). 108 | 109 | ## Citation 110 | 111 | If you found our work useful, please consider citing: 112 | ``` 113 | @article{zhu2023multimodal, 114 | title={{Multimodal C4}: An Open, Billion-scale Corpus of Images Interleaved With Text}, 115 | author={Wanrong Zhu and Jack Hessel and Anas Awadalla and Samir Yitzhak Gadre and Jesse Dodge and Alex Fang and Youngjae Yu and Ludwig Schmidt and William Yang Wang and Yejin Choi}, 116 | journal={arXiv preprint arXiv:2304.06939}, 117 | year={2023} 118 | } 119 | ``` 120 | 121 | ## Missing data 122 | 123 | In Feb 2025, the original copy of mmc4 hosted at AI2 was accidentally deleted. Thanks to heroic efforts from [Weizhi Wang](https://victorwz.github.io/) and [Zekun Li](https://github.com/Leezekun/), who kindly provided their locally saved copies of mmc4 to be re-hosted, the corpus is (partially!) available again. Specifically: the "fewer faces" splits (both full and core) are available. The remaining missing files are: 124 | 125 | - mmc4, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-gated-public-41423/data_v1.1/docs_shard_{$SHARD}_v2.jsonl.zip`. 
126 | - mmc4-core, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-gated-public-41423/data_core_v1.1/docs_shard_{$SHARD}_v3.jsonl` 127 | - CLIP ViT-L/14 image features, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-public/images/clip_vitl14_shard_{$SHARD}_features.pkl` 128 | 129 | If you have access to any of these files and are willing to make them available so we can once again host them for the broader community, [please let me know!](mailto:jmhessel@gmail.com) 130 | -------------------------------------------------------------------------------- /mmc4_arxiv.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/allenai/mmc4/5304ddd7a608c88d8f4ee1a6b3a4356104215517/mmc4_arxiv.pdf -------------------------------------------------------------------------------- /mmc4_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/allenai/mmc4/5304ddd7a608c88d8f4ee1a6b3a4356104215517/mmc4_logo.png -------------------------------------------------------------------------------- /scripts/compute_assignments.py: -------------------------------------------------------------------------------- 1 | ''' 2 | example usage: 3 | python compute_assignments.py docs_shard_{$SHARD}_v2.jsonl 4 | ''' 5 | import argparse 6 | import json 7 | import numpy as np 8 | import linear_assignment 9 | import tqdm 10 | 11 | 12 | def parse_args(): 13 | parser = argparse.ArgumentParser() 14 | parser.add_argument('input_jsonl') 15 | return parser.parse_args() 16 | 17 | 18 | def get_image_assignments(im2txt): 19 | ''' 20 | returns a list `assignments` of length N_images such that assignments[i] is the sentence index that image i was assigned to. 
21 | ''' 22 | im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt) 23 | im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))} 24 | if im2txt.shape[0] > im2txt.shape[1]: 25 | # there are more images than sentences; we don't want to discard images, so each unassigned image is paired with its highest-similarity sentence. 26 | for imidx in range(len(im2txt)): 27 | if imidx not in im2txt_idxs: 28 | im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx])) 29 | 30 | return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))] 31 | 32 | 33 | def main(): 34 | args = parse_args() 35 | 36 | docs = [] 37 | with open(args.input_jsonl) as f: 38 | for line in f: 39 | docs.append(json.loads(line)) 40 | 41 | for d in docs: 42 | im2txt = np.array(d['similarity_matrix']) 43 | assignment = get_image_assignments(im2txt) 44 | 45 | for im_idx, im in enumerate(d['image_info']): 46 | im['matched_text_index'] = int(assignment[im_idx]) 47 | im['matched_sim'] = float(im2txt[im_idx, assignment[im_idx]]) 48 | 49 | with open(args.input_jsonl, 'w') as f:  # note: overwrites the input jsonl in place, now with the assignment fields added 50 | f.write('\n'.join([json.dumps(d) for d in docs])) 51 | 52 | 53 | if __name__ == '__main__': 54 | main() 55 | -------------------------------------------------------------------------------- /scripts/download_images.py: -------------------------------------------------------------------------------- 1 | """ 2 | Adapted from: 3 | https://github.com/igorbrigadir/DownloadConceptualCaptions/blob/master/download_data.py 4 | 5 | Requirements: 6 | - ImageMagick 7 | - See requirements.txt for python dependencies 8 | 9 | Example Usage: 10 | python download_images.py --input_jsonl ./data_core/docs_no_face_shard_0_v3.jsonl 11 | OR 12 | python download_images.py --input_shards "https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_{0..23098}_v2.jsonl.zip" --output_image_dir mmc4_images/ 13 | """ 14 | 15 | import pandas as pd 16 | import requests 17 | import os 18 | import shelve 19 | import magic 20 |
from multiprocessing import Pool 21 | import tqdm 22 | import argparse 23 | import json 24 | import subprocess 25 | import time 26 | import glob 27 | from pathlib import Path 28 | import urllib.request, urllib.error 29 | import braceexpand 30 | import zipfile 31 | from PIL import Image 32 | 33 | 34 | headers = { 35 | 'User-Agent':'Googlebot-Image/1.0', # Pretend to be googlebot 36 | 'X-Forwarded-For': '64.18.15.200' 37 | } 38 | 39 | 40 | def parse_args(): 41 | parser = argparse.ArgumentParser() 42 | 43 | parser.add_argument('--input_jsonl', type=str, default=None, help='Local path to the input jsonl file') 44 | parser.add_argument("--input_shards", type=str, default=None, help='URL to shards') 45 | parser.add_argument('--output_image_dir', type=str, default=None, help='Local path to the directory that stores the downloaded images') 46 | parser.add_argument('--num_process', type=int, default=16, help='Number of processes in the pool; can be larger than the number of cores') 47 | parser.add_argument('--chunk_size', type=int, default=100, help='Number of images per chunk per process') 48 | parser.add_argument('--shard_name', type=str, default=None) 49 | parser.add_argument('--report_dir', type=str, default='./status_report/', help='Local path to the directory that stores the downloading status') 50 | 51 | args = parser.parse_args() 52 | 53 | assert args.input_jsonl is not None or args.input_shards is not None 54 | 55 | if args.input_jsonl is not None: 56 | assert args.input_jsonl.endswith('.jsonl') 57 | 58 | if args.shard_name is None: 59 | args.shard_name = Path(args.input_jsonl).stem 60 | elif args.input_shards is not None: 61 | assert args.output_image_dir is not None 62 | 63 | if args.output_image_dir is None: 64 | args.output_image_dir = f'./{args.shard_name}_images/' 65 | 66 | return args 67 | 68 | 69 | def call(cmd): 70 | subprocess.call(cmd, shell=True) 71 | 72 | 73 | def _df_split_apply(tup_arg): 74 | split_ind, subset, func = tup_arg 75 | r = subset.apply(func, axis=1) 76 | return (split_ind,
r) 77 | 78 | 79 | def download_images_multiprocess(args, df, func): 80 | """Download images with multiprocessing""" 81 | 82 | chunk_size = args.chunk_size 83 | num_process = args.num_process 84 | 85 | print('Generating parts...') 86 | 87 | shelve_filename = '%s_%s_%s_results.tmp' % (args.shard_name, func.__name__, chunk_size) 88 | with shelve.open(shelve_filename) as results: 89 | 90 | pbar = tqdm.tqdm(total=len(df), position=0) 91 | # Resume: 92 | finished_chunks = set([int(k) for k in results.keys()]) 93 | pbar.desc = "Resuming" 94 | for k in results.keys(): 95 | pbar.update(len(results[str(k)][1])) 96 | 97 | pool_data = ((index, df[i:i + chunk_size], func) for index, i in enumerate(range(0, len(df), chunk_size)) if index not in finished_chunks) 98 | pbar.write(f'\t{-(-len(df) // chunk_size)} parts. Using {num_process} processes.')  # ceil division 99 | 100 | pbar.desc = "Downloading" 101 | with Pool(num_process) as pool: 102 | for i, result in enumerate(pool.imap_unordered(_df_split_apply, pool_data, 2)): 103 | results[str(result[0])] = result 104 | pbar.update(len(result[1])) 105 | pbar.close() 106 | 107 | print(f'Finished downloading images for shard {args.shard_name}\nImages saved at {args.output_image_dir}') 108 | 109 | return shelve_filename 110 | 111 | 112 | def _get_local_image_filename(row): 113 | return row['folder'] + '/' + row['local_identifier'] 114 | 115 | 116 | def download_image(row): 117 | fname = _get_local_image_filename(row) 118 | 119 | # Skip already downloaded images, retry others later 120 | if os.path.isfile(fname): 121 | row['status'] = 200 122 | row['file'] = fname 123 | row['mimetype'] = magic.from_file(row['file'], mime=True) 124 | row['size'] = os.stat(row['file']).st_size 125 | return row 126 | 127 | try: 128 | # Use smaller timeout to skip errors, but can result in failed downloads 129 | response = requests.get(row['url'], stream=False, timeout=10, allow_redirects=True, headers=headers) 130 | row['status'] = response.status_code 131 | rate_limit_idx = 0
132 | while response.status_code == 429: 133 | print(f'RATE LIMIT {rate_limit_idx} for {row["local_identifier"]}, will try again in 2s') 134 | time.sleep(2)  # back off before retrying 135 | response = requests.get(row['url'], stream=False, timeout=10, allow_redirects=True, headers=headers) 136 | row['status'] = response.status_code 137 | rate_limit_idx += 1 138 | if rate_limit_idx == 5: 139 | print(f'Reached rate limit for {row["local_identifier"]} ({row["url"]}). Will skip this image for now.') 140 | row['status'] = 429 141 | return row 142 | 143 | except Exception as e: 144 | # log errors later, set error as 408 timeout 145 | row['status'] = 408 146 | return row 147 | 148 | if response.ok: 149 | try: 150 | with open(fname, 'wb') as out_file: 151 | # requests has already decoded any gzip/deflate transport encoding 152 | # by the time response.content is accessed, so write the bytes directly 153 | out_file.write(response.content) 154 | 155 | # Resize image if it is too big 156 | call('mogrify -resize "800x800>" "{}"'.format(fname)) 157 | 158 | # Use the following if mogrify doesn't exist or can't be found 159 | # img = Image.open(fname) 160 | # if max(img.size) > 800: 161 | #     img.thumbnail((800, 800))  # preserves aspect ratio, like '800x800>' 162 | #     img.save(fname) 163 | 164 | 165 | row['mimetype'] = magic.from_file(fname, mime=True) 166 | row['size'] = os.stat(fname).st_size 167 | except Exception: 168 | # This is if it times out during a download or decode 169 | row['status'] = 408 170 | return row 171 | row['file'] = fname 172 | return row 173 | 174 | 175 | def save_status(args, shelve_filename): 176 | print('Generating DataFrame from results...') 177 | with shelve.open(shelve_filename) as results: 178 | keylist = sorted([int(k) for k in results.keys()]) 179 | df = pd.concat([results[str(k)][1] for k in keylist], sort=True) 180 | 181 | report_filename = os.path.join(args.report_dir, f'{args.shard_name}.tsv.gz') 182 | df.to_csv(report_filename, sep='\t', compression='gzip', header=False, index=False) 183 | print(f'Status report saved to {report_filename}')
184 | 185 | print('Cleaning up...') 186 | matched_files = glob.glob(f'{shelve_filename}*') 187 | for fn in matched_files: 188 | os.remove(fn) 189 | 190 | 191 | def gather_image_info(args): 192 | """Gather image info from the input jsonl""" 193 | data = [] 194 | with open(args.input_jsonl) as f: 195 | for line in tqdm.tqdm(f): 196 | info = json.loads(line.strip()) 197 | for img_item in info['image_info']: 198 | data.append({ 199 | 'local_identifier': img_item['image_name'], 200 | 'url': img_item['raw_url'], 201 | }) 202 | return data 203 | 204 | 205 | def gather_image_info_shard(json_file): 206 | """Gather image info from shard""" 207 | data = [] 208 | for sample_data in tqdm.tqdm(json_file): 209 | # get image names from json 210 | sample_data = json.loads(sample_data) 211 | for img_item in sample_data['image_info']: 212 | data.append({ 213 | 'local_identifier': img_item['image_name'], 214 | 'url': img_item['raw_url'], 215 | }) 216 | return data 217 | 218 | 219 | def local(args): 220 | # Load image info for current shard 221 | data = gather_image_info(args) 222 | for d in data: 223 | d['folder'] = args.output_image_dir 224 | df = pd.DataFrame(data) 225 | 226 | # Download images 227 | shelve_filename = download_images_multiprocess( 228 | args=args, 229 | df=df, 230 | func=download_image, 231 | ) 232 | 233 | # Save status & cleaning up 234 | save_status( 235 | args=args, 236 | shelve_filename=shelve_filename, 237 | ) 238 | 239 | 240 | def main(): 241 | args = parse_args() 242 | 243 | # Prepare directory 244 | for _dir in [args.output_image_dir, args.report_dir]: 245 | if not os.path.exists(_dir): 246 | os.makedirs(_dir) 247 | 248 | if args.input_jsonl is not None: 249 | local(args) 250 | else: 251 | doc_shards = list(braceexpand.braceexpand(args.input_shards)) 252 | 253 | for idx in range(len(doc_shards)): 254 | # image_tar = tarfile.open(image_shards[idx]) 255 | print("Downloading zip for shard", idx) 256 | try: 257 | urllib.request.urlretrieve(doc_shards[idx], 
"temp.zip") 258 | 259 | # Open the ZIP archive and extract the JSON file 260 | with zipfile.ZipFile("temp.zip", "r") as zip_file: 261 | # Assumes the JSON file is the first file in the archive 262 | json_filename = zip_file.namelist()[0] 263 | with zip_file.open(json_filename, "r") as json_file: 264 | data = gather_image_info_shard(json_file) 265 | 266 | shard_folder = args.output_image_dir + "/" + str(idx) 267 | if not os.path.exists(shard_folder): 268 | os.makedirs(shard_folder) 269 | 270 | for d in data: 271 | d['folder'] = shard_folder 272 | 273 | df = pd.DataFrame(data) 274 | 275 | args.shard_name = idx 276 | 277 | # Download images 278 | shelve_filename = download_images_multiprocess( 279 | args=args, 280 | df=df, 281 | func=download_image, 282 | ) 283 | 284 | # Save status & cleaning up 285 | save_status( 286 | args=args, 287 | shelve_filename=shelve_filename, 288 | ) 289 | 290 | except urllib.error.HTTPError as e: 291 | print(e) 292 | print("Skipping shard", idx) 293 | continue 294 | 295 | 296 | if __name__ == '__main__': 297 | main() -------------------------------------------------------------------------------- /scripts/linear_assignment.py: -------------------------------------------------------------------------------- 1 | ''' 2 | code for computing linear assignments using lapjv 3 | ''' 4 | 5 | import numpy as np 6 | import unittest 7 | from numpy import array, dstack, float32, float64, linspace, meshgrid, random, sqrt 8 | from scipy.spatial.distance import cdist 9 | from lapjv import lapjv 10 | 11 | 12 | def base_solve(W, max_dummy_cost_value=1000): 13 | ''' 14 | Hungarian-style solve for a (possibly non-square) cost matrix; roughly adapted from: 15 | 16 | https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py 17 | 18 | NOTE: this ** MINIMIZES COST **. So, if you're handing it similarities, make sure to negate them! 19 | 20 | returns i_s, j_s, cost such that: 21 | for i, j in zip(i_s, j_s), 22 | 23 | (i, j) are the selected row/column entries, and
24 | 25 | cost is sum( W[i, j] for i, j in zip(i_s, j_s) ) 26 | 27 | ''' 28 | if np.sum(np.abs(W)) > max_dummy_cost_value: 29 | print('Warning: the values in your matrix may be too large; consider raising max_dummy_cost_value') 30 | 31 | 32 | orig_shape = W.shape 33 | if orig_shape[0] != orig_shape[1]: 34 | if orig_shape[0] > orig_shape[1]: 35 | pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]] 36 | col_pad = True 37 | else: 38 | pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]] 39 | col_pad = False 40 | W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value) 41 | 42 | sol, _, cost = lapjv(W) 43 | 44 | i_s = np.arange(len(sol)) 45 | j_s = sol[i_s] 46 | 47 | sort_idxs = np.argsort(-W[i_s, j_s]) 48 | i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s]) 49 | 50 | if orig_shape[0] != orig_shape[1]: 51 | if col_pad: 52 | valid_idxs = np.where(j_s < orig_shape[1])[0] 53 | else: 54 | valid_idxs = np.where(i_s < orig_shape[0])[0] 55 | i_s, j_s = i_s[valid_idxs], j_s[valid_idxs] 56 | 57 | m_cost = 0.0 58 | for i, j in zip(i_s, j_s): 59 | m_cost += W[i, j] 60 | 61 | return i_s, j_s, m_cost 62 | 63 | 64 | # unit tests adapted from https://github.com/src-d/lapjv/blob/master/test.py, except test_basic, which is non-square.
65 | class LapjvTests(unittest.TestCase): 66 | def test_basic(self): 67 | arr = -np.array([[1.0, 1.0], 68 | [1.5, 1.0], 69 | [3.0, 2.6]]) 70 | # should be 1.5 and 2.6 71 | i_s, j_s, _ = base_solve(arr) 72 | 73 | assert set(zip(i_s, j_s)) == set([(1, 0), (2,1)]) 74 | 75 | 76 | def _test_random_100(self, dtype): 77 | random.seed(777) 78 | size = 100 79 | dots = random.random((size, 2)) 80 | grid = dstack(meshgrid(linspace(0, 1, int(sqrt(size))), 81 | linspace(0, 1, int(sqrt(size))))).reshape(-1, 2) 82 | cost = cdist(dots, grid, "sqeuclidean").astype(dtype) 83 | cost *= 100000 / cost.max() 84 | row_ind_lapjv, col_ind_lapjv, _ = base_solve(cost) 85 | # Obtained from pyLAPJV on Python 2.7 86 | row_ind_original = array([ 87 | 32, 51, 99, 77, 62, 1, 35, 69, 57, 42, 13, 24, 96, 26, 82, 52, 65, 88 | 6, 95, 7, 63, 47, 28, 45, 74, 89 | 61, 34, 14, 94, 31, 25, 3, 71, 49, 58, 83, 91, 93, 23, 98, 36, 40, 90 | 4, 97, 21, 92, 89, 90, 29, 46, 91 | 79, 2, 76, 84, 72, 64, 33, 37, 41, 15, 59, 85, 70, 78, 81, 20, 18, 92 | 30, 8, 66, 38, 87, 44, 67, 68, 93 | 39, 86, 54, 11, 50, 16, 17, 56, 0, 5, 80, 10, 48, 60, 73, 53, 75, 94 | 55, 19, 22, 12, 9, 88, 43, 27]) 95 | 96 | # we have to do this conversion to get to the (r, c) format... 
97 | A = np.zeros((100, 100)) 98 | for i in range(100): 99 | A[i, row_ind_original[i]] = 1 100 | 101 | row_ind_original = np.arange(A.shape[0]) 102 | col_ind_original = np.argmax(A, axis=1) 103 | 104 | # make sure the set of index pairs is the same 105 | orig_pairs = set(zip(row_ind_original, col_ind_original)) 106 | new_pairs = set(zip(row_ind_lapjv, col_ind_lapjv)) 107 | assert orig_pairs == new_pairs, (orig_pairs, new_pairs) 108 | 109 | 110 | def test_random_100_float64(self): 111 | self._test_random_100(np.float64) 112 | 113 | def test_random_100_float32(self): 114 | self._test_random_100(np.float32) 115 | 116 | def test_1024(self): 117 | random.seed(777) 118 | size = 1024 119 | dots = random.random((size, 2)) 120 | grid = dstack(meshgrid(linspace(0, 1, int(sqrt(size))), 121 | linspace(0, 1, int(sqrt(size))))).reshape(-1, 2) 122 | cost = cdist(dots, grid, "sqeuclidean") 123 | cost *= 100000 / cost.max() 124 | row_ind_lapjv32, col_ind_lapjv32, _ = base_solve(cost.astype(float32)) 125 | self.assertEqual(len(set(col_ind_lapjv32)), dots.shape[0]) 126 | self.assertEqual(len(set(row_ind_lapjv32)), dots.shape[0]) 127 | row_ind_lapjv64, col_ind_lapjv64, _ = base_solve(cost.astype(float64)) 128 | 129 | f32_pairs = set(zip(row_ind_lapjv32, col_ind_lapjv32)) 130 | f64_pairs = set(zip(row_ind_lapjv64, col_ind_lapjv64)) 131 | assert f32_pairs == f64_pairs 132 | 133 | 134 | if __name__ == '__main__': 135 | unittest.main() 136 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | python-magic 2 | tqdm 3 | pandas 4 | requests 5 | braceexpand 6 | Pillow 7 | lapjv 8 | numpy 9 | scipy 10 | --------------------------------------------------------------------------------
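For reference, the image-to-sentence assignment logic in `scripts/compute_assignments.py` can be sketched end-to-end without the `lapjv` dependency. This is a minimal sketch that uses `scipy.optimize.linear_sum_assignment` as a stand-in solver (it computes the same optimal bipartite matching as the padded `base_solve`); the `assign_images` helper and the similarity values below are made up for illustration:

```python
# Sketch of the assignment logic, assuming scipy as a stand-in for lapjv.
import numpy as np
from scipy.optimize import linear_sum_assignment


def assign_images(im2txt):
    """Return, for each image row, the index of its assigned sentence column.

    Solves a bipartite matching that maximizes total similarity; when there
    are more images than sentences, leftover images fall back to argmax,
    mirroring get_image_assignments in compute_assignments.py.
    """
    n_images, _ = im2txt.shape
    # linear_sum_assignment minimizes cost, so negate the similarities.
    row_idx, col_idx = linear_sum_assignment(-im2txt)
    assignment = dict(zip(row_idx.tolist(), col_idx.tolist()))
    for i in range(n_images):
        if i not in assignment:  # more images than sentences
            assignment[i] = int(np.argmax(im2txt[i]))
    return [assignment[i] for i in range(n_images)]


# Toy document: 3 images, 2 sentences.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.7, 0.6]])
print(assign_images(sim))  # -> [0, 1, 0]: image 2 falls back to its argmax
```

Images 0 and 1 win the matching (total similarity 0.9 + 0.8 is optimal), and image 2, which cannot be matched one-to-one, is attached to its highest-similarity sentence.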