├── .gitignore
├── CODE_LICENSE
├── DATA_LICENSE.md
├── README.md
├── mmc4_arxiv.pdf
├── mmc4_logo.png
└── scripts
├── compute_assignments.py
├── download_images.py
├── linear_assignment.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # General
2 | .DS_Store
3 | .AppleDouble
4 | .LSOverride
5 |
6 | # File Related
7 | *.zip
--------------------------------------------------------------------------------
/CODE_LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2023 Allen Institute for AI
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/DATA_LICENSE.md:
--------------------------------------------------------------------------------
1 | # ODC Attribution License (ODC-By)
2 |
3 | ### Preamble
4 |
5 | The Open Data Commons Attribution License is a license agreement
6 | intended to allow users to freely share, modify, and use this Database
7 | subject only to the attribution requirements set out in Section 4.
8 |
9 | Databases can contain a wide variety of types of content (images,
10 | audiovisual material, and sounds all in the same database, for example),
11 | and so this license only governs the rights over the Database, and not
12 | the contents of the Database individually. Licensors may therefore wish
13 | to use this license together with another license for the contents.
14 |
15 | Sometimes the contents of a database, or the database itself, can be
16 | covered by other rights not addressed here (such as private contracts,
17 | trademark over the name, or privacy rights / data protection rights
18 | over information in the contents), and so you are advised that you may
19 | have to consult other documents or clear other rights before doing
20 | activities not covered by this License.
21 |
22 | ------
23 |
24 | The Licensor (as defined below)
25 |
26 | and
27 |
28 | You (as defined below)
29 |
30 | agree as follows:
31 |
32 | ### 1.0 Definitions of Capitalised Words
33 |
34 | "Collective Database" - Means this Database in unmodified form as part
35 | of a collection of independent databases in themselves that together are
36 | assembled into a collective whole. A work that constitutes a Collective
37 | Database will not be considered a Derivative Database.
38 |
39 | "Convey" - As a verb, means Using the Database, a Derivative Database,
40 | or the Database as part of a Collective Database in any way that enables
41 | a Person to make or receive copies of the Database or a Derivative
42 | Database. Conveying does not include interaction with a user through a
43 | computer network, or creating and Using a Produced Work, where no
44 | transfer of a copy of the Database or a Derivative Database occurs.
45 |
46 | "Contents" - The contents of this Database, which includes the
47 | information, independent works, or other material collected into the
48 | Database. For example, the contents of the Database could be factual
49 | data or works such as images, audiovisual material, text, or sounds.
50 |
51 | "Database" - A collection of material (the Contents) arranged in a
52 | systematic or methodical way and individually accessible by electronic
53 | or other means offered under the terms of this License.
54 |
55 | "Database Directive" - Means Directive 96/9/EC of the European
56 | Parliament and of the Council of 11 March 1996 on the legal protection
57 | of databases, as amended or succeeded.
58 |
59 | "Database Right" - Means rights resulting from the Chapter III ("sui
60 | generis") rights in the Database Directive (as amended and as transposed
61 | by member states), which includes the Extraction and Re-utilisation of
62 | the whole or a Substantial part of the Contents, as well as any similar
63 | rights available in the relevant jurisdiction under Section 10.4.
64 |
65 | "Derivative Database" - Means a database based upon the Database, and
66 | includes any translation, adaptation, arrangement, modification, or any
67 | other alteration of the Database or of a Substantial part of the
68 | Contents. This includes, but is not limited to, Extracting or
69 | Re-utilising the whole or a Substantial part of the Contents in a new
70 | Database.
71 |
72 | "Extraction" - Means the permanent or temporary transfer of all or a
73 | Substantial part of the Contents to another medium by any means or in
74 | any form.
75 |
76 | "License" - Means this license agreement and is both a license of rights
77 | such as copyright and Database Rights and an agreement in contract.
78 |
79 | "Licensor" - Means the Person that offers the Database under the terms
80 | of this License.
81 |
82 | "Person" - Means a natural or legal person or a body of persons
83 | corporate or incorporate.
84 |
85 | "Produced Work" - a work (such as an image, audiovisual material, text,
86 | or sounds) resulting from using the whole or a Substantial part of the
87 | Contents (via a search or other query) from this Database, a Derivative
88 | Database, or this Database as part of a Collective Database.
89 |
90 | "Publicly" - means to Persons other than You or under Your control by
91 | either more than 50% ownership or by the power to direct their
92 | activities (such as contracting with an independent consultant).
93 |
94 | "Re-utilisation" - means any form of making available to the public all
95 | or a Substantial part of the Contents by the distribution of copies, by
96 | renting, by online or other forms of transmission.
97 |
98 | "Substantial" - Means substantial in terms of quantity or quality or a
99 | combination of both. The repeated and systematic Extraction or
100 | Re-utilisation of insubstantial parts of the Contents may amount to the
101 | Extraction or Re-utilisation of a Substantial part of the Contents.
102 |
103 | "Use" - As a verb, means doing any act that is restricted by copyright
104 | or Database Rights whether in the original medium or any other; and
105 | includes without limitation distributing, copying, publicly performing,
106 | publicly displaying, and preparing derivative works of the Database, as
107 | well as modifying the Database as may be technically necessary to use it
108 | in a different mode or format.
109 |
110 | "You" - Means a Person exercising rights under this License who has not
111 | previously violated the terms of this License with respect to the
112 | Database, or who has received express permission from the Licensor to
113 | exercise rights under this License despite a previous violation.
114 |
115 | Words in the singular include the plural and vice versa.
116 |
117 | ### 2.0 What this License covers
118 |
119 | 2.1. Legal effect of this document. This License is:
120 |
121 | a. A license of applicable copyright and neighbouring rights;
122 |
123 | b. A license of the Database Right; and
124 |
125 | c. An agreement in contract between You and the Licensor.
126 |
127 | 2.2 Legal rights covered. This License covers the legal rights in the
128 | Database, including:
129 |
130 | a. Copyright. Any copyright or neighbouring rights in the Database.
131 | The copyright licensed includes any individual elements of the
132 | Database, but does not cover the copyright over the Contents
133 | independent of this Database. See Section 2.4 for details. Copyright
134 | law varies between jurisdictions, but is likely to cover: the Database
135 | model or schema, which is the structure, arrangement, and organisation
136 | of the Database, and can also include the Database tables and table
137 | indexes; the data entry and output sheets; and the Field names of
138 | Contents stored in the Database;
139 |
140 | b. Database Rights. Database Rights only extend to the Extraction and
141 | Re-utilisation of the whole or a Substantial part of the Contents.
142 | Database Rights can apply even when there is no copyright over the
143 | Database. Database Rights can also apply when the Contents are removed
144 | from the Database and are selected and arranged in a way that would
145 | not infringe any applicable copyright; and
146 |
147 | c. Contract. This is an agreement between You and the Licensor for
148 | access to the Database. In return you agree to certain conditions of
149 | use on this access as outlined in this License.
150 |
151 | 2.3 Rights not covered.
152 |
153 | a. This License does not apply to computer programs used in the making
154 | or operation of the Database;
155 |
156 | b. This License does not cover any patents over the Contents or the
157 | Database; and
158 |
159 | c. This License does not cover any trademarks associated with the
160 | Database.
161 |
162 | 2.4 Relationship to Contents in the Database. The individual items of
163 | the Contents contained in this Database may be covered by other rights,
164 | including copyright, patent, data protection, privacy, or personality
165 | rights, and this License does not cover any rights (other than Database
166 | Rights or in contract) in individual Contents contained in the Database.
167 | For example, if used on a Database of images (the Contents), this
168 | License would not apply to copyright over individual images, which could
169 | have their own separate licenses, or one single license covering all of
170 | the rights over the images.
171 |
172 | ### 3.0 Rights granted
173 |
174 | 3.1 Subject to the terms and conditions of this License, the Licensor
175 | grants to You a worldwide, royalty-free, non-exclusive, terminable (but
176 | only under Section 9) license to Use the Database for the duration of
177 | any applicable copyright and Database Rights. These rights explicitly
178 | include commercial use, and do not exclude any field of endeavour. To
179 | the extent possible in the relevant jurisdiction, these rights may be
180 | exercised in all media and formats whether now known or created in the
181 | future.
182 |
183 | The rights granted cover, for example:
184 |
185 | a. Extraction and Re-utilisation of the whole or a Substantial part of
186 | the Contents;
187 |
188 | b. Creation of Derivative Databases;
189 |
190 | c. Creation of Collective Databases;
191 |
192 | d. Creation of temporary or permanent reproductions by any means and
193 | in any form, in whole or in part, including of any Derivative
194 | Databases or as a part of Collective Databases; and
195 |
196 | e. Distribution, communication, display, lending, making available, or
197 | performance to the public by any means and in any form, in whole or in
198 | part, including of any Derivative Database or as a part of Collective
199 | Databases.
200 |
201 | 3.2 Compulsory license schemes. For the avoidance of doubt:
202 |
203 | a. Non-waivable compulsory license schemes. In those jurisdictions in
204 | which the right to collect royalties through any statutory or
205 | compulsory licensing scheme cannot be waived, the Licensor reserves
206 | the exclusive right to collect such royalties for any exercise by You
207 | of the rights granted under this License;
208 |
209 | b. Waivable compulsory license schemes. In those jurisdictions in
210 | which the right to collect royalties through any statutory or
211 | compulsory licensing scheme can be waived, the Licensor waives the
212 | exclusive right to collect such royalties for any exercise by You of
213 | the rights granted under this License; and,
214 |
215 | c. Voluntary license schemes. The Licensor waives the right to collect
216 | royalties, whether individually or, in the event that the Licensor is
217 | a member of a collecting society that administers voluntary licensing
218 | schemes, via that society, from any exercise by You of the rights
219 | granted under this License.
220 |
221 | 3.3 The right to release the Database under different terms, or to stop
222 | distributing or making available the Database, is reserved. Note that
223 | this Database may be multiple-licensed, and so You may have the choice
224 | of using alternative licenses for this Database. Subject to Section
225 | 10.4, all other rights not expressly granted by Licensor are reserved.
226 |
227 | ### 4.0 Conditions of Use
228 |
229 | 4.1 The rights granted in Section 3 above are expressly made subject to
230 | Your complying with the following conditions of use. These are important
231 | conditions of this License, and if You fail to follow them, You will be
232 | in material breach of its terms.
233 |
234 | 4.2 Notices. If You Publicly Convey this Database, any Derivative
235 | Database, or the Database as part of a Collective Database, then You
236 | must:
237 |
238 | a. Do so only under the terms of this License;
239 |
240 | b. Include a copy of this License or its Uniform Resource Identifier (URI)
241 | with the Database or Derivative Database, including both in the
242 | Database or Derivative Database and in any relevant documentation;
243 |
244 | c. Keep intact any copyright or Database Right notices and notices
245 | that refer to this License; and
246 |
247 | d. If it is not possible to put the required notices in a particular
248 | file due to its structure, then You must include the notices in a
249 | location (such as a relevant directory) where users would be likely to
250 | look for it.
251 |
252 | 4.3 Notice for using output (Contents). Creating and Using a Produced
253 | Work does not require the notice in Section 4.2. However, if you
254 | Publicly Use a Produced Work, You must include a notice associated with
255 | the Produced Work reasonably calculated to make any Person that uses,
256 | views, accesses, interacts with, or is otherwise exposed to the Produced
257 | Work aware that Content was obtained from the Database, Derivative
258 | Database, or the Database as part of a Collective Database, and that it
259 | is available under this License.
260 |
261 | a. Example notice. The following text will satisfy notice under
262 | Section 4.3:
263 |
264 | Contains information from DATABASE NAME which is made available
265 | under the ODC Attribution License.
266 |
267 | DATABASE NAME should be replaced with the name of the Database and a
268 | hyperlink to the location of the Database. "ODC Attribution License"
269 | should contain a hyperlink to the URI of the text of this License. If
270 | hyperlinks are not possible, You should include the plain text of the
271 | required URI's with the above notice.
272 |
273 | 4.4 Licensing of others. You may not sublicense the Database. Each time
274 | You communicate the Database, the whole or Substantial part of the
275 | Contents, or any Derivative Database to anyone else in any way, the
276 | Licensor offers to the recipient a license to the Database on the same
277 | terms and conditions as this License. You are not responsible for
278 | enforcing compliance by third parties with this License, but You may
279 | enforce any rights that You have over a Derivative Database. You are
280 | solely responsible for any modifications of a Derivative Database made
281 | by You or another Person at Your direction. You may not impose any
282 | further restrictions on the exercise of the rights granted or affirmed
283 | under this License.
284 |
285 | ### 5.0 Moral rights
286 |
287 | 5.1 Moral rights. This section covers moral rights, including any rights
288 | to be identified as the author of the Database or to object to treatment
289 | that would otherwise prejudice the author's honour and reputation, or
290 | any other derogatory treatment:
291 |
292 | a. For jurisdictions allowing waiver of moral rights, Licensor waives
293 | all moral rights that Licensor may have in the Database to the fullest
294 | extent possible by the law of the relevant jurisdiction under Section
295 | 10.4;
296 |
297 | b. If waiver of moral rights under Section 5.1 a in the relevant
298 | jurisdiction is not possible, Licensor agrees not to assert any moral
299 | rights over the Database and waives all claims in moral rights to the
300 | fullest extent possible by the law of the relevant jurisdiction under
301 | Section 10.4; and
302 |
303 | c. For jurisdictions not allowing waiver or an agreement not to assert
304 | moral rights under Section 5.1 a and b, the author may retain their
305 | moral rights over certain aspects of the Database.
306 |
307 | Please note that some jurisdictions do not allow for the waiver of moral
308 | rights, and so moral rights may still subsist over the Database in some
309 | jurisdictions.
310 |
311 | ### 6.0 Fair dealing, Database exceptions, and other rights not affected
312 |
313 | 6.1 This License does not affect any rights that You or anyone else may
314 | independently have under any applicable law to make any use of this
315 | Database, including without limitation:
316 |
317 | a. Exceptions to the Database Right including: Extraction of Contents
318 | from non-electronic Databases for private purposes, Extraction for
319 | purposes of illustration for teaching or scientific research, and
320 | Extraction or Re-utilisation for public security or an administrative
321 | or judicial procedure.
322 |
323 | b. Fair dealing, fair use, or any other legally recognised limitation
324 | or exception to infringement of copyright or other applicable laws.
325 |
326 | 6.2 This License does not affect any rights of lawful users to Extract
327 | and Re-utilise insubstantial parts of the Contents, evaluated
328 | quantitatively or qualitatively, for any purposes whatsoever, including
329 | creating a Derivative Database (subject to other rights over the
330 | Contents, see Section 2.4). The repeated and systematic Extraction or
331 | Re-utilisation of insubstantial parts of the Contents may however amount
332 | to the Extraction or Re-utilisation of a Substantial part of the
333 | Contents.
334 |
335 | ### 7.0 Warranties and Disclaimer
336 |
337 | 7.1 The Database is licensed by the Licensor "as is" and without any
338 | warranty of any kind, either express, implied, or arising by statute,
339 | custom, course of dealing, or trade usage. Licensor specifically
340 | disclaims any and all implied warranties or conditions of title,
341 | non-infringement, accuracy or completeness, the presence or absence of
342 | errors, fitness for a particular purpose, merchantability, or otherwise.
343 | Some jurisdictions do not allow the exclusion of implied warranties, so
344 | this exclusion may not apply to You.
345 |
346 | ### 8.0 Limitation of liability
347 |
348 | 8.1 Subject to any liability that may not be excluded or limited by law,
349 | the Licensor is not liable for, and expressly excludes, all liability
350 | for loss or damage however and whenever caused to anyone by any use
351 | under this License, whether by You or by anyone else, and whether caused
352 | by any fault on the part of the Licensor or not. This exclusion of
353 | liability includes, but is not limited to, any special, incidental,
354 | consequential, punitive, or exemplary damages such as loss of revenue,
355 | data, anticipated profits, and lost business. This exclusion applies
356 | even if the Licensor has been advised of the possibility of such
357 | damages.
358 |
359 | 8.2 If liability may not be excluded by law, it is limited to actual and
360 | direct financial loss to the extent it is caused by proved negligence on
361 | the part of the Licensor.
362 |
363 | ### 9.0 Termination of Your rights under this License
364 |
365 | 9.1 Any breach by You of the terms and conditions of this License
366 | automatically terminates this License with immediate effect and without
367 | notice to You. For the avoidance of doubt, Persons who have received the
368 | Database, the whole or a Substantial part of the Contents, Derivative
369 | Databases, or the Database as part of a Collective Database from You
370 | under this License will not have their licenses terminated provided
371 | their use is in full compliance with this License or a license granted
372 | under Section 4.8 of this License. Sections 1, 2, 7, 8, 9 and 10 will
373 | survive any termination of this License.
374 |
375 | 9.2 If You are not in breach of the terms of this License, the Licensor
376 | will not terminate Your rights under it.
377 |
378 | 9.3 Unless terminated under Section 9.1, this License is granted to You
379 | for the duration of applicable rights in the Database.
380 |
381 | 9.4 Reinstatement of rights. If you cease any breach of the terms and
382 | conditions of this License, then your full rights under this License
383 | will be reinstated:
384 |
385 | a. Provisionally and subject to permanent termination until the 60th
386 | day after cessation of breach;
387 |
388 | b. Permanently on the 60th day after cessation of breach unless
389 | otherwise reasonably notified by the Licensor; or
390 |
391 | c. Permanently if reasonably notified by the Licensor of the
392 | violation, this is the first time You have received notice of
393 | violation of this License from the Licensor, and You cure the
394 | violation prior to 30 days after your receipt of the notice.
395 |
396 | 9.5 Notwithstanding the above, Licensor reserves the right to release
397 | the Database under different license terms or to stop distributing or
398 | making available the Database. Releasing the Database under different
399 | license terms or stopping the distribution of the Database will not
400 | withdraw this License (or any other license that has been, or is
401 | required to be, granted under the terms of this License), and this
402 | License will continue in full force and effect unless terminated as
403 | stated above.
404 |
405 | ### 10.0 General
406 |
407 | 10.1 If any provision of this License is held to be invalid or
408 | unenforceable, that must not affect the validity or enforceability of
409 | the remainder of the terms and conditions of this License and each
410 | remaining provision of this License shall be valid and enforced to the
411 | fullest extent permitted by law.
412 |
413 | 10.2 This License is the entire agreement between the parties with
414 | respect to the rights granted here over the Database. It replaces any
415 | earlier understandings, agreements or representations with respect to
416 | the Database.
417 |
418 | 10.3 If You are in breach of the terms of this License, You will not be
419 | entitled to rely on the terms of this License or to complain of any
420 | breach by the Licensor.
421 |
422 | 10.4 Choice of law. This License takes effect in and will be governed by
423 | the laws of the relevant jurisdiction in which the License terms are
424 | sought to be enforced. If the standard suite of rights granted under
425 | applicable copyright law and Database Rights in the relevant
426 | jurisdiction includes additional rights not granted under this License,
427 | these additional rights are granted in this License in order to meet the
428 | terms of this License.
429 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | :camera: :memo: Multimodal C4 (mmc4) :memo: :camera:
6 |
7 | An open, billion-scale corpus of images interleaved with text.
8 |
9 |
10 |
11 |
12 | ## Updates
13 |
14 | - **mmc4 is available once again!** A huge thanks to [Weizhi Wang](https://victorwz.github.io/) and [Zekun Li](https://github.com/Leezekun/) for saving mmc4-ff and mmc4-core-ff!
15 | - The original copies of mmc4 at AI2 were accidentally deleted in Feb 2025. [If you have any of the original copies of the dataset from before Feb. 2025, do let me know!](#missing-data)
16 | - Released mmc4 version 1.1 :fire:, which fixes https://github.com/allenai/mmc4/issues/11 and https://github.com/allenai/mmc4/issues/10
17 |
18 | ## Corpus stats (v1.1)
19 |
20 | | | # images | # docs | # tokens |
21 | |-----------------------------------------------------|----------|--------|----------|
22 | | Multimodal-C4 (mmc4) | 571M | 101.2M | 43B |
23 | | Multimodal-C4 fewer-faces** (mmc4-ff) | 375M | 77.7M | 33B |
24 | | Multimodal-C4 core (mmc4-core) | 29.9M | 7.3M | 2.4B |
25 | | Multimodal-C4 core fewer-faces** (mmc4-core-ff) | 22.4M | 5.5M | 1.8B |
26 |
27 | ** = available for direct download
28 |
29 | More details about these datasets and our processing steps [can be found in our paper](https://arxiv.org/abs/2304.06939).
30 |
31 | ## Accessing mmc4-ff
32 |
33 | ### Documents
34 |
35 | Now hosted on huggingface:
36 |
37 | - mmc4 fewer faces (~218GB): [jmhessel/mmc4-ff](https://huggingface.co/datasets/jmhessel/mmc4-ff)
38 | - mmc4 core fewer faces (~20GB): [jmhessel/mmc4-core-ff](https://huggingface.co/datasets/jmhessel/mmc4-core-ff)
39 |
40 | The dataset is split into jsonl shards.
41 | - Shard numbers range from 0 to 23098. [14 shards are missing and are not included in the dataset](#the-missing-shards-%EF%B8%8F).
42 | - Each shard is a jsonl file of documents: each line is one document.
43 |
44 | Documents contain text, image URLs, assignments of images to sentences, and image-by-text CLIP ViT-L/14 similarity matrices.
45 |
46 | Specifically:
47 |
48 | - `text_list`: a list of sentences comprising the text of the document
49 | - `url`: the original url where the document was hosted
50 | - `image_info`: a key mapping to a list of images. Each image entry contains:
51 | - `image_name`: a filename that you could download the image to
52 | - `face_detections`: `None` if no faces are detected (which should be the case in "fewer faces")
53 | - `matched_text_index`: the index within `text_list` representing the sentence that this image is matched to
54 | - `matched_sim`: the CLIP ViT-L/14 similarity between the image and the sentence at the matched index
55 | - `similarity_matrix`: a matrix of shape `len(image_info) x len(text_list)` where `similarity_matrix[i, j]` is the CLIP ViT-L/14 similarity between image `i` and sentence `j`.
56 | - `could_have_url_duplicate`: a small number of URLs (~3%) in the corpus may have duplicate entries because Common Crawl collects multiple snapshots over time. We downsample such that, in expectation, each URL occurs once, but duplicates are technically possible. If you want a more strictly deduplicated set, you can discard all entries with `could_have_url_duplicate` equal to 1.
57 |
58 | Here's an example:
59 |
60 | ```
61 | {'image_info': [{'face_detections': None,
62 | 'image_name': 'b9040a0dbb22.jpg',
63 | 'matched_sim': 0.27694183588027954,
64 | 'matched_text_index': 2,
65 | 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'},
66 | {'face_detections': None,
67 | 'image_name': 'db1c21bc8474.jpg',
68 | 'matched_sim': 0.3234919607639313,
69 | 'matched_text_index': 1,
70 | 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}],
71 | 'similarity_matrix': [[0.24363446235656738,
72 | 0.31758785247802734,
73 | 0.27694183588027954],
74 | [0.2233106791973114,
75 | 0.3234919607639313,
76 | 0.26118797063827515]],
77 | 'text_list': ['When you lock the door using the lock tab on the driver’s '
78 | 'door, all of the other doors and tailgate lock at the same '
79 | 'time.',
80 | 'Press the master door lock switch in as shown to lock or '
81 | 'unlock all doors and the tailgate.',
82 | 'When you lock/unlock the driver’s door and tailgate using the '
83 | 'master lock switch, all the other doors lock/ unlock at the '
84 | 'same time.'],
85 | 'url': 'http://www.hfitinfo.com/hofi-48.html',
86 | 'could_have_url_duplicate': 0 }
87 | ```
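
To work with these records programmatically, a shard can be loaded line by line. Here's a minimal sketch; the helper names `load_shard` and `strict_dedup` (and the commented-out shard filename) are our own illustrations, not part of the released tooling:

```python
import json

def load_shard(path):
    """Load one mmc4 shard: a jsonl file with one document per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def strict_dedup(docs):
    """Keep only documents whose URL is guaranteed unique
    (i.e., drop any with could_have_url_duplicate == 1)."""
    return [d for d in docs if d.get('could_have_url_duplicate', 0) == 0]

# hypothetical filename; use whichever shard you downloaded:
# docs = strict_dedup(load_shard('docs_shard_0_v2.jsonl'))
```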
88 | The assignments of images to sentences are computed using [compute_assignments.py](https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py).
89 |
90 | ## Accessing raw images
91 |
92 | Raw images can be downloaded from the URLs provided in the documents using [this script](scripts/download_images.py). The intent is to respect folks who have removed images from the web by not redistributing those images.
93 |
94 | However, we understand that some of the URLs may be stale, which can harm reproducibility efforts. If you're interested in updates regarding raw image availability, you can contact us using [this Google form](https://forms.gle/fPSXY359MT1VvF1g8).
95 |
96 | ## The missing shards ⛏️💎🔍
97 |
98 | 14 of the 23099 shards (<0.1%) are missing from the corpus. These were not included in any statistics or experiments, so they are not part of mmc4. The missing shards are:
99 |
100 | ```
101 | 3218,3267,5064,5146,7119,8991,9750,11899,15127,15252,16996,17369,17499,17818
102 | ```
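
When enumerating shards, it's convenient to skip these ids up front. A small sketch (the name `valid_shard_ids` is our own, not part of the released tooling):

```python
# The 14 shard ids missing from the corpus (from the list above).
MISSING_SHARDS = {3218, 3267, 5064, 5146, 7119, 8991, 9750,
                  11899, 15127, 15252, 16996, 17369, 17499, 17818}

def valid_shard_ids(n_shards=23099):
    """Shard ids 0..n_shards-1, excluding the missing ones."""
    return [i for i in range(n_shards) if i not in MISSING_SHARDS]
```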
103 |
104 | ## License
105 |
106 | - The new contributions of mmc4 beyond text-only c4 (e.g., the similarity matrices/image-text alignments) are released under [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).
107 | - By using mmc4, be aware that you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/).
108 |
109 | ## Citation
110 |
111 | If you find our work useful, please consider citing:
112 | ```
113 | @article{zhu2023multimodal,
114 | title={{Multimodal C4}: An Open, Billion-scale Corpus of Images Interleaved With Text},
115 | author={Wanrong Zhu and Jack Hessel and Anas Awadalla and Samir Yitzhak Gadre and Jesse Dodge and Alex Fang and Youngjae Yu and Ludwig Schmidt and William Yang Wang and Yejin Choi},
116 | journal={arXiv preprint arXiv:2304.06939},
117 | year={2023}
118 | }
119 | ```
120 |
121 | ## Missing data
122 |
123 | In Feb 2025, the original copy of mmc4 hosted at AI2 was accidentally deleted. Thanks to heroic efforts from [Weizhi Wang](https://victorwz.github.io/) and [Zekun Li](https://github.com/Leezekun/), who kindly provided their locally saved copies of mmc4 for re-hosting, the corpus is (partially!) available again. Specifically: the "fewer faces" splits (both full and core) are available. The remaining missing files are:
124 |
125 | - mmc4, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-gated-public-41423/data_v1.1/docs_shard_{$SHARD}_v2.jsonl.zip`.
126 | - mmc4-core, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-gated-public-41423/data_core_v1.1/docs_shard_{$SHARD}_v3.jsonl`
127 | - CLIP ViT/L-14 image features, originally hosted at `https://storage.googleapis.com/ai2-jackh-mmc4-public/images/clip_vitl14_shard_{$SHARD}_features.pkl`
128 |
129 | If you have access to any of these files and are willing to make them available so we can once again host them for the broader community, [please let me know!](mailto:jmhessel@gmail.com)
130 |
--------------------------------------------------------------------------------
/mmc4_arxiv.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/allenai/mmc4/5304ddd7a608c88d8f4ee1a6b3a4356104215517/mmc4_arxiv.pdf
--------------------------------------------------------------------------------
/mmc4_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/allenai/mmc4/5304ddd7a608c88d8f4ee1a6b3a4356104215517/mmc4_logo.png
--------------------------------------------------------------------------------
/scripts/compute_assignments.py:
--------------------------------------------------------------------------------
1 | '''
2 | example usage:
3 | python compute_assignments.py docs_shard_{$SHARD}_v2.jsonl
4 | '''
5 | import argparse
6 | import json
7 | import numpy as np
8 | import linear_assignment
9 | import tqdm
10 |
11 |
12 | def parse_args():
13 | parser = argparse.ArgumentParser()
14 | parser.add_argument('input_jsonl')
15 | return parser.parse_args()
16 |
17 |
18 | def get_image_assignments(im2txt):
19 | '''
20 | returns a list assignments of length N_images such that assignments[i] is the sentence index that image i was assigned to.
21 | '''
22 | im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt)
23 | im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
24 | if im2txt.shape[0] > im2txt.shape[1]:
25 |         # There are more images than sentences; we don't want to discard images, so each unassigned image is paired with its highest-similarity sentence.
26 | for imidx in range(len(im2txt)):
27 | if imidx not in im2txt_idxs:
28 | im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))
29 |
30 | return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]
31 |
32 |
33 | def main():
34 | args = parse_args()
35 |
36 | docs = []
37 | with open(args.input_jsonl) as f:
38 | for line in f:
39 | docs.append(json.loads(line))
40 |
41 | for d in docs:
42 | im2txt = np.array(d['similarity_matrix'])
43 | assignment = get_image_assignments(im2txt)
44 |
45 | for im_idx, im in enumerate(d['image_info']):
46 | im['matched_text_index'] = int(assignment[im_idx])
47 | im['matched_sim'] = float(im2txt[im_idx, assignment[im_idx]])
48 |
49 | with open(args.input_jsonl, 'w') as f:
50 | f.write('\n'.join([json.dumps(d) for d in docs]))
51 |
52 |
53 | if __name__ == '__main__':
54 | main()
55 |
--------------------------------------------------------------------------------
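The assignment step in `compute_assignments.py` above can be sketched without lapjv: scipy's `linear_sum_assignment` finds the same optimal matching. This is an illustrative re-implementation under that assumption, not the repo's code; the function name `assign_images` and the toy similarity matrix are mine:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_images(im2txt):
    # Maximize similarity == minimize negated similarity.
    rows, cols = linear_sum_assignment(-im2txt)
    matched = dict(zip(rows, cols))
    # More images than sentences: fall back to each image's best sentence.
    for i in range(im2txt.shape[0]):
        matched.setdefault(i, np.argmax(im2txt[i]))
    return [int(matched[i]) for i in range(im2txt.shape[0])]

# 3 images, 2 sentences: image 2 has no sentence left, so it takes its argmax.
sim = np.array([[0.9, 0.1],
                [0.8, 0.7],
                [0.2, 0.6]])
print(assign_images(sim))  # -> [0, 1, 1]
```

Images 0 and 1 get the optimal one-to-one matching; image 2, left unassigned by the rectangular solve, is paired with its highest-similarity sentence, mirroring the fallback in `get_image_assignments`.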
/scripts/download_images.py:
--------------------------------------------------------------------------------
1 | """
2 | Adapted from:
3 | https://github.com/igorbrigadir/DownloadConceptualCaptions/blob/master/download_data.py
4 |
5 | Requirements:
6 | - ImageMagick
7 | - See requirements.txt for python dependencies
8 |
9 | Example Usage:
10 | python download_images.py --input_jsonl ./data_core/docs_no_face_shard_0_v3.jsonl
11 | OR
12 | python download_images.py --input_shards "https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_{0..23098}_v2.jsonl.zip" --output_image_dir mmc4_images/
13 | """
14 |
15 | import pandas as pd
16 | import requests
17 | import os
18 | import shelve
19 | import magic
20 | from multiprocessing import Pool
21 | import tqdm
22 | import argparse
23 | import json
24 | import subprocess
25 | import time
26 | import glob
27 | from pathlib import Path
28 | import urllib
29 | import braceexpand
30 | import zipfile
31 | from PIL import Image
32 |
33 |
34 | headers = {
35 | 'User-Agent':'Googlebot-Image/1.0', # Pretend to be googlebot
36 | 'X-Forwarded-For': '64.18.15.200'
37 | }
38 |
39 |
40 | def parse_args():
41 | parser = argparse.ArgumentParser()
42 |
43 | parser.add_argument('--input_jsonl', type=str, default=None, help='Local path to the input jsonl file')
44 | parser.add_argument("--input_shards", type=str, default=None, help='URL to shards')
45 | parser.add_argument('--output_image_dir', type=str, default=None, help='Local path to the directory that stores the downloaded images')
46 |     parser.add_argument('--num_process', type=int, default=16, help='Number of processes in the pool (can be larger than the number of cores)')
47 | parser.add_argument('--chunk_size', type=int, default=100, help='Number of images per chunk per process')
48 | parser.add_argument('--shard_name', type=str, default=None)
49 | parser.add_argument('--report_dir', type=str, default='./status_report/', help='Local path to the directory that stores the downloading status')
50 |
51 | args = parser.parse_args()
52 |
53 | assert args.input_jsonl is not None or args.input_shards is not None
54 |
55 | if args.input_jsonl is not None:
56 | assert args.input_jsonl.endswith('.jsonl')
57 |
58 | if args.shard_name is None:
59 | args.shard_name = Path(args.input_jsonl).stem
60 | elif args.input_shards is not None:
61 | assert args.output_image_dir is not None
62 |
63 | if args.output_image_dir is None:
64 | args.output_image_dir = f'./{args.shard_name}_images/'
65 |
66 | return args
67 |
68 |
69 | def call(cmd):
70 | subprocess.call(cmd, shell=True)
71 |
72 |
73 | def _df_split_apply(tup_arg):
74 | split_ind, subset, func = tup_arg
75 | r = subset.apply(func, axis=1)
76 | return (split_ind, r)
77 |
78 |
79 | def download_images_multiprocess(args, df, func):
80 | """Download images with multiprocessing"""
81 |
82 | chunk_size = args.chunk_size
83 | num_process = args.num_process
84 |
85 | print('Generating parts...')
86 |
87 | shelve_filename = '%s_%s_%s_results.tmp' % (args.shard_name, func.__name__, chunk_size)
88 | with shelve.open(shelve_filename) as results:
89 |
90 | pbar = tqdm.tqdm(total=len(df), position=0)
91 | # Resume:
92 | finished_chunks = set([int(k) for k in results.keys()])
93 | pbar.desc = "Resuming"
94 | for k in results.keys():
95 | pbar.update(len(results[str(k)][1]))
96 |
97 | pool_data = ((index, df[i:i + chunk_size], func) for index, i in enumerate(range(0, len(df), chunk_size)) if index not in finished_chunks)
98 | pbar.write(f'\t{int(len(df) / chunk_size)} parts. Using {num_process} processes.')
99 |
100 | pbar.desc = "Downloading"
101 | with Pool(num_process) as pool:
102 | for i, result in enumerate(pool.imap_unordered(_df_split_apply, pool_data, 2)):
103 | results[str(result[0])] = result
104 | pbar.update(len(result[1]))
105 | pbar.close()
106 |
107 | print(f'Finished downloading images for {args.input_jsonl}\nImages saved at {args.output_image_dir}')
108 |
109 | return shelve_filename
110 |
111 |
112 | def _get_local_image_filename(row):
113 | return row['folder'] + '/' + row['local_identifier']
114 |
115 |
116 | def download_image(row):
117 | fname = _get_local_image_filename(row)
118 |
119 | # Skip already downloaded images, retry others later
120 | if os.path.isfile(fname):
121 | row['status'] = 200
122 | row['file'] = fname
123 | row['mimetype'] = magic.from_file(row['file'], mime=True)
124 | row['size'] = os.stat(row['file']).st_size
125 | return row
126 |
127 | try:
128 | # Use smaller timeout to skip errors, but can result in failed downloads
129 | response = requests.get(row['url'], stream=False, timeout=10, allow_redirects=True, headers=headers)
130 | row['status'] = response.status_code
131 | rate_limit_idx = 0
132 |         while response.status_code == 429:
133 |             print(f'RATE LIMIT {rate_limit_idx} for {row["local_identifier"]}, will try again in 2s')
134 |             time.sleep(2)
135 |             response = requests.get(row['url'], stream=False, timeout=10, allow_redirects=True, headers=headers)
136 |             row['status'] = response.status_code
137 |             rate_limit_idx += 1
138 |             if rate_limit_idx == 5:
139 |                 print(f'Reached rate limit for {row["local_identifier"]} ({row["url"]}). Will skip this image for now.')
140 |                 row['status'] = 429
141 |                 return row
142 |
143 |     except Exception:
144 |         # log errors later; treat any request failure as a timeout (status 408)
145 | row['status'] = 408
146 | return row
147 |
148 | if response.ok:
149 | try:
150 | with open(fname, 'wb') as out_file:
151 | # some sites respond with gzip transport encoding
152 | response.raw.decode_content = True
153 | out_file.write(response.content)
154 |
155 | # Resize image if it is too big
156 | call('mogrify -resize "800x800>" {}'.format(fname))
157 |
158 |             # Use the following if mogrify doesn't exist or can't be found
159 |             # (Image.thumbnail shrinks in place and preserves the aspect ratio):
160 |             # img = Image.open(fname)
161 |             # img.thumbnail((800, 800))
162 |             # img.save(fname)
163 |
164 |
165 | row['mimetype'] = magic.from_file(fname, mime=True)
166 | row['size'] = os.stat(fname).st_size
167 |         except Exception:
168 | # This is if it times out during a download or decode
169 | row['status'] = 408
170 | return row
171 | row['file'] = fname
172 | return row
173 |
174 |
175 | def save_status(args, shelve_filename):
176 | print(f'Generating Dataframe from results...')
177 | with shelve.open(shelve_filename) as results:
178 | keylist = sorted([int(k) for k in results.keys()])
179 | df = pd.concat([results[str(k)][1] for k in keylist], sort=True)
180 |
181 | report_filename = os.path.join(args.report_dir, f'{args.shard_name}.tsv.gz')
182 | df.to_csv(report_filename, sep='\t', compression='gzip', header=False, index=False)
183 | print(f'Status report saved to {report_filename}')
184 |
185 | print('Cleaning up...')
186 | matched_files = glob.glob(f'{shelve_filename}*')
187 | for fn in matched_files:
188 | os.remove(fn)
189 |
190 |
191 | def gather_image_info(args):
192 | """Gather image info from the input jsonl"""
193 | data = []
194 | with open(args.input_jsonl) as f:
195 | for line in tqdm.tqdm(f):
196 | info = json.loads(line.strip())
197 | for img_item in info['image_info']:
198 | data.append({
199 | 'local_identifier': img_item['image_name'],
200 | 'url': img_item['raw_url'],
201 | })
202 | return data
203 |
204 |
205 | def gather_image_info_shard(json_file):
206 | """Gather image info from shard"""
207 | data = []
208 | for sample_data in tqdm.tqdm(json_file):
209 | # get image names from json
210 | sample_data = json.loads(sample_data)
211 | for img_item in sample_data['image_info']:
212 | data.append({
213 | 'local_identifier': img_item['image_name'],
214 | 'url': img_item['raw_url'],
215 | })
216 | return data
217 |
218 |
219 | def local(args):
220 | # Load image info for current shard
221 | data = gather_image_info(args)
222 | for d in data:
223 | d['folder'] = args.output_image_dir
224 | df = pd.DataFrame(data)
225 |
226 | # Download images
227 | shelve_filename = download_images_multiprocess(
228 | args=args,
229 | df=df,
230 | func=download_image,
231 | )
232 |
233 | # Save status & cleaning up
234 | save_status(
235 | args=args,
236 | shelve_filename=shelve_filename,
237 | )
238 |
239 |
240 | def main():
241 | args = parse_args()
242 |
243 | # Prepare directory
244 | for _dir in [args.output_image_dir, args.report_dir]:
245 | if not os.path.exists(_dir):
246 | os.makedirs(_dir)
247 |
248 | if args.input_jsonl is not None:
249 | local(args)
250 | else:
251 | doc_shards = list(braceexpand.braceexpand(args.input_shards))
252 |
253 | for idx in range(len(doc_shards)):
254 | # image_tar = tarfile.open(image_shards[idx])
255 | print("Downloading zip for shard", idx)
256 | try:
257 | urllib.request.urlretrieve(doc_shards[idx], "temp.zip")
258 |
259 | # Open the ZIP archive and extract the JSON file
260 | with zipfile.ZipFile("temp.zip", "r") as zip_file:
261 | # Assumes the JSON file is the first file in the archive
262 | json_filename = zip_file.namelist()[0]
263 | with zip_file.open(json_filename, "r") as json_file:
264 | data = gather_image_info_shard(json_file)
265 |
266 | shard_folder = args.output_image_dir + "/" + str(idx)
267 | if not os.path.exists(shard_folder):
268 | os.makedirs(shard_folder)
269 |
270 | for d in data:
271 | d['folder'] = shard_folder
272 |
273 | df = pd.DataFrame(data)
274 |
275 | args.shard_name = idx
276 |
277 | # Download images
278 | shelve_filename = download_images_multiprocess(
279 | args=args,
280 | df=df,
281 | func=download_image,
282 | )
283 |
284 | # Save status & cleaning up
285 | save_status(
286 | args=args,
287 | shelve_filename=shelve_filename,
288 | )
289 |
290 | except urllib.error.HTTPError as e:
291 | print(e)
292 | print("Skipping shard", idx)
293 | continue
294 |
295 |
296 | if __name__ == '__main__':
297 | main()
--------------------------------------------------------------------------------
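The `mogrify -resize "800x800>"` call in `download_image` above shrinks images so neither side exceeds 800px while keeping the aspect ratio. A minimal sketch of the equivalent Pillow fallback, using `Image.thumbnail` (the 1600x1200 image here is a synthetic placeholder, not real downloaded data):

```python
from PIL import Image

# Image.thumbnail resizes in place, only ever shrinks, and preserves
# the aspect ratio -- matching mogrify's "800x800>" behavior.
img = Image.new('RGB', (1600, 1200))
img.thumbnail((800, 800))
print(img.size)  # -> (800, 600)
```

A plain `img.resize((800, 800))` would instead distort the image to a square, which is why the shrink-only, aspect-preserving variant is preferable here.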
/scripts/linear_assignment.py:
--------------------------------------------------------------------------------
1 | '''
2 | code for computing linear assignments using lapjv
3 | '''
4 |
5 | import numpy as np
6 | import unittest
7 | from numpy import array, dstack, linspace, meshgrid, random, sqrt
8 | from scipy.spatial.distance import cdist
9 | from lapjv import lapjv
10 |
11 |
12 | def base_solve(W, max_dummy_cost_value=1000):
13 |     '''
14 |     Hungarian-style solve for a (possibly non-square) cost matrix; roughly from:
15 |
16 |     https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py
17 |
18 |     NOTE: this ** MINIMIZES COST **. So, if you're handing it similarities, make sure to negate them!
19 |
20 |     returns i_s, j_s, cost such that:
21 |
22 |         for i, j in zip(i_s, j_s)
23 |
24 |     are the (i, j) row/column entries selected, and
25 |
26 |     cost = sum(W[i, j] for i, j in zip(i_s, j_s))
27 |     '''
28 | if np.sum(np.abs(W)) > max_dummy_cost_value:
29 |         print('Warning: the values in your matrix may be too big; please raise max_dummy_cost_value')
30 |
31 |
32 | orig_shape = W.shape
33 | if orig_shape[0] != orig_shape[1]:
34 | if orig_shape[0] > orig_shape[1]:
35 | pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]]
36 | col_pad = True
37 | else:
38 | pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]]
39 | col_pad = False
40 | W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)
41 |
42 | sol, _, cost = lapjv(W)
43 |
44 | i_s = np.arange(len(sol))
45 | j_s = sol[i_s]
46 |
47 | sort_idxs = np.argsort(-W[i_s, j_s])
48 | i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])
49 |
50 | if orig_shape[0] != orig_shape[1]:
51 | if col_pad:
52 | valid_idxs = np.where(j_s < orig_shape[1])[0]
53 | else:
54 | valid_idxs = np.where(i_s < orig_shape[0])[0]
55 | i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]
56 |
57 | m_cost = 0.0
58 | for i, j in zip(i_s, j_s):
59 | m_cost += W[i, j]
60 |
61 | return i_s, j_s, m_cost
62 |
63 |
64 | # Unit tests from https://github.com/src-d/lapjv/blob/master/test.py, except the last one, which is non-square.
65 | class LapjvTests(unittest.TestCase):
66 | def test_basic(self):
67 | arr = -np.array([[1.0, 1.0],
68 | [1.5, 1.0],
69 | [3.0, 2.6]])
70 | # should be 1.5 and 2.6
71 | i_s, j_s, _ = base_solve(arr)
72 |
73 | assert set(zip(i_s, j_s)) == set([(1, 0), (2,1)])
74 |
75 |
76 | def _test_random_100(self, dtype):
77 | random.seed(777)
78 | size = 100
79 | dots = random.random((size, 2))
80 | grid = dstack(meshgrid(linspace(0, 1, int(sqrt(size))),
81 | linspace(0, 1, int(sqrt(size))))).reshape(-1, 2)
82 | cost = cdist(dots, grid, "sqeuclidean").astype(dtype)
83 | cost *= 100000 / cost.max()
84 | row_ind_lapjv, col_ind_lapjv, _ = base_solve(cost)
85 | # Obtained from pyLAPJV on Python 2.7
86 | row_ind_original = array([
87 | 32, 51, 99, 77, 62, 1, 35, 69, 57, 42, 13, 24, 96, 26, 82, 52, 65,
88 | 6, 95, 7, 63, 47, 28, 45, 74,
89 | 61, 34, 14, 94, 31, 25, 3, 71, 49, 58, 83, 91, 93, 23, 98, 36, 40,
90 | 4, 97, 21, 92, 89, 90, 29, 46,
91 | 79, 2, 76, 84, 72, 64, 33, 37, 41, 15, 59, 85, 70, 78, 81, 20, 18,
92 | 30, 8, 66, 38, 87, 44, 67, 68,
93 | 39, 86, 54, 11, 50, 16, 17, 56, 0, 5, 80, 10, 48, 60, 73, 53, 75,
94 | 55, 19, 22, 12, 9, 88, 43, 27])
95 |
96 | # we have to do this conversion to get to the (r, c) format...
97 | A = np.zeros((100, 100))
98 | for i in range(100):
99 | A[i, row_ind_original[i]] = 1
100 |
101 | row_ind_original = np.arange(A.shape[0])
102 | col_ind_original = np.argmax(A, axis=1)
103 |
104 | # make sure the set of index pairs is the same
105 | orig_pairs = set(zip(row_ind_original, col_ind_original))
106 | new_pairs = set(zip(row_ind_lapjv, col_ind_lapjv))
107 | assert orig_pairs == new_pairs, (orig_pairs, new_pairs)
108 |
109 |
110 | def test_random_100_float64(self):
111 | self._test_random_100(np.float64)
112 |
113 | def test_random_100_float32(self):
114 | self._test_random_100(np.float32)
115 |
116 | def test_1024(self):
117 | random.seed(777)
118 | size = 1024
119 | dots = random.random((size, 2))
120 | grid = dstack(meshgrid(linspace(0, 1, int(sqrt(size))),
121 | linspace(0, 1, int(sqrt(size))))).reshape(-1, 2)
122 | cost = cdist(dots, grid, "sqeuclidean")
123 | cost *= 100000 / cost.max()
124 | row_ind_lapjv32, col_ind_lapjv32, _ = base_solve(cost)
125 | self.assertEqual(len(set(col_ind_lapjv32)), dots.shape[0])
126 | self.assertEqual(len(set(row_ind_lapjv32)), dots.shape[0])
127 | row_ind_lapjv64, col_ind_lapjv64, _ = base_solve(cost)
128 |
129 | f32_pairs = set(zip(row_ind_lapjv32, col_ind_lapjv32))
130 | f64_pairs = set(zip(row_ind_lapjv64, col_ind_lapjv64))
131 | assert f32_pairs == f64_pairs
132 |
133 |
134 | if __name__ == '__main__':
135 | unittest.main()
136 |
--------------------------------------------------------------------------------
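The padding trick in `base_solve` above can be illustrated in isolation: a rectangular cost matrix is padded to square with a large dummy cost so a square-only solver like lapjv can run, and assignments that land in the padded region are discarded afterwards. A minimal sketch with toy values:

```python
import numpy as np

# 2 rows, 3 cols: fewer rows than cols, so pad with dummy rows.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# Same pad layout as base_solve for the rows < cols case.
pad_idxs = [[0, W.shape[1] - W.shape[0]], [0, 0]]
Wp = np.pad(W, pad_idxs, 'constant', constant_values=1000)
print(Wp.shape)  # -> (3, 3)
```

The dummy cost (1000 by default) must dominate every real entry so the solver never prefers a dummy assignment over a real one; this is exactly what the `max_dummy_cost_value` warning in `base_solve` guards against.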
/scripts/requirements.txt:
--------------------------------------------------------------------------------
1 | python-magic
2 | tqdm
3 | pandas
4 | numpy
5 | scipy
6 | requests
7 | braceexpand
8 | Pillow
9 | lapjv
10 |
--------------------------------------------------------------------------------