├── CNAME
├── _config.yml
├── resources
└── thumbnail_kucing-top5-cnn.png
├── README.md
├── doc.md
└── downloads.md
/CNAME:
--------------------------------------------------------------------------------
1 | multilingual-images.org
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-minimal
2 | logo: https://multilingual-images.org/resources/thumbnail_kucing-top5-cnn.png
3 |
--------------------------------------------------------------------------------
/resources/thumbnail_kucing-top5-cnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/penn-nlp/mmid/HEAD/resources/thumbnail_kucing-top5-cnn.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## The Massively Multilingual Image Dataset (MMID)
2 |
3 | MMID is a large-scale, massively multilingual dataset of images paired with the words they represent collected at the [University of Pennsylvania](https://upenn.edu).
4 | The dataset is doubly parallel: for each language, words are stored parallel to images that represent the word, _and_ parallel to the word's translation into English (and corresponding images).
5 |
6 | By far the largest dataset of its kind, it has 98 languages (including English) and up to 10,000 words per language (and many more for English).
7 |
8 | ## Getting Started
9 |
10 | See the [documentation page](https://multilingual-images.org/doc.html).
11 |
12 | If you have questions, please email the MMID users list (mmid-users@googlegroups.com).
13 |
14 | ## Downloads
15 |
16 | We're happy to announce that MMID is available via the Amazon Public Datasets program!
17 | Through their generosity, we're able to provide all of MMID free of charge via a public S3 bucket.
18 |
19 | Check out the [downloads](https://multilingual-images.org/downloads.html) page for options on how to access the dataset.
20 |
21 | ## Citation
22 |
23 | We gratefully acknowledge the support of an Amazon Research Award and AWS Research Credits, which enabled the construction of MMID.
24 |
25 | If you use MMID for your research, please cite:
26 |
27 | Learning Translations via Images with a Massively Multilingual Image Dataset.
28 | John Hewitt\*, Daphne Ippolito\*, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya and Chris Callison-Burch.
29 | ACL 2018.
30 |
31 | ```
32 | @InProceedings{hewitt-et-al:2018:Long,
33 | author = {Hewitt, John and Ippolito, Daphne and Callahan, Brendan and Kriz, Reno and Wijaya, Derry Tanti and Callison-Burch, Chris},
34 | title = {Learning Translations via Images with a Massively Multilingual Image Dataset},
35 | booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
36 | month = {July},
37 | year = {2018},
38 | address = {Melbourne, Australia},
39 | publisher = {Association for Computational Linguistics}
40 | }
41 | ```
42 |
--------------------------------------------------------------------------------
/doc.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Documentation
3 | ---
4 |
5 | MMID provides images for words in 99 languages.
6 | For all 99 languages, it also provides images for each word's English translation.
7 | The dataset is packaged by language, so you can download only the languages that interest you.
8 | Because of its size, MMID is distributed for each language in several forms:
9 |
10 | - **[image package]** The image package contains 100 images for each of up to 10,000 words in each of the 99 languages, as well as the corresponding metadata.
11 | - **[mini image package]** The mini image package contains 1 image for each word in each of the 99 languages, as well as the corresponding metadata.
12 | - **[metadata]** The metadata is a `.jsonl` file which provides URLs to images, thumbnails, and the webpages they appeared on.
13 | - **[dictionary]** The dictionary file simply contains each word for each language and the index used to identify it in MMID.
14 | - **[text package]** The text package contains WARC files with the contents of each webpage that the images in MMID appeared on.
15 | - **[CNN package]** The CNN package contains CNN-featurized images for a subset of languages, used in our ACL images-for-translation paper.
16 |
17 | ### Image Package
18 |
19 | Each language has its own image package, named `-package.tar`.
20 | The structure of each image package is as follows, where `n` is the number of words represented for the language in the dataset:
21 |
22 | ```
23 | /
24 | 0.tar.gz
25 | word.txt
26 | metadata.json
27 | errors.json
28 | /
.png
29 | ...
30 | /
.png
31 | ...
32 | .tar.gz
33 | ```
34 | In this structure, each word in the dataset has its own gzipped tarball, named by the index of the word in the language's dictionary, e.g., `0.tar.gz`.
35 | In the tarball is `word.txt`, which contains the plaintext of the word, as well as `errors.json`, a log of errors encountered during the image scrape.
36 |
37 | The tarball also contains `metadata.json`, which includes crawl metadata like the URLs of the images stored in the tarball, with the following structure:
38 |
39 | ```
40 | {
41 | '': {
42 | 'google': {
43 | 'ru':
44 | }
45 | 'image_url':
46 | }
47 | '': { ... }
48 | ...
49 | '': { ... }
50 | }
51 | ```
52 | Finally, the tarball contains up to 100 image files, each named with a two- or three-digit numeral between 01 and 100 followed by `.png`.
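
The per-word tarball described above can be read without unpacking it to disk. The following is a minimal sketch, not part of the MMID code: the function name `read_word_tarball` is hypothetical, and it assumes that `word.txt` is UTF-8 text and that `metadata.json` is valid JSON. Files are matched by base name because the exact folder layout inside each tarball is not assumed here.

```python
import json
import tarfile


def read_word_tarball(path):
    """Return (word, metadata) from one per-word tarball such as 0.tar.gz."""
    word, metadata = None, None
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Match on the base name so the sketch does not depend on
            # whether files sit at the archive root or inside a folder.
            name = member.name.rsplit("/", 1)[-1]
            if name == "word.txt":
                word = tar.extractfile(member).read().decode("utf-8").strip()
            elif name == "metadata.json":
                metadata = json.load(tar.extractfile(member))
    return word, metadata
```

Either value is `None` if the corresponding file is missing from the tarball, which may happen for words whose scrape failed.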
53 |
54 | ### Mini Image Package
55 | The form of the mini image package is identical to that of the full image package, to aid rapid development. It just has 1 image per word instead of 100.
56 |
57 | ### Metadata Package
58 | This package provides all metadata for all words and their images (for a single language).
59 | It is in the JSON lines (`.jsonl`) format, in which each line of the file is a JSON object with information for a single word, of the following form:
60 | ```
61 | {
62 | 'word_string': WORD_STRING,
63 | 'word_index': WORD_INDEX,
64 | 'webpage_urls': {
65 | IMG_ID1: WEB_URL1,
66 | IMG_ID2: WEB_URL2,
67 | ...
68 | },
69 | 'image_original_urls': {
70 | IMG_ID1: IMG_URL1,
71 | IMG_ID2: IMG_URL2,
72 | ...
73 | },
74 | 'image_thumbnail_urls': {
75 | IMG_ID1: THUMB_URL1,
76 | IMG_ID2: THUMB_URL2,
77 | ...
78 | }
79 | }
80 | ```
81 | Here, `word_string` is the Unicode character sequence of the foreign word, `word_index` is the index folder under which the word's images are stored, `webpage_urls` maps each image to the webpage it appeared on, `image_original_urls` maps each image to a (possibly rotted) link to the original image, and `image_thumbnail_urls` maps each image to a thumbnail of the original image.
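
Because each line is an independent JSON object, the metadata file can be streamed one word at a time. The sketch below is illustrative only: the helper names `iter_metadata` and `image_urls_by_word` are hypothetical, and it assumes the files contain standard double-quoted JSON with the keys listed above.

```python
import json


def iter_metadata(lines):
    """Yield one dict per non-empty line of a metadata .jsonl file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def image_urls_by_word(lines):
    """Map each word_string to its {image id: original image URL} dict."""
    return {
        record["word_string"]: record.get("image_original_urls", {})
        for record in iter_metadata(lines)
    }
```

Both functions accept any iterable of strings, so an open file handle such as `open("metadata-afrikaans-package.jsonl", encoding="utf-8")` can be passed directly.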
82 |
83 | ### Text package
84 | In the raw dump, we present the text crawl corresponding to our web crawl in a completely unadulterated form.
85 | We crawled web pages using the Nutch crawler, and release the output of the crawls in [WARC](https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml), a standard, readable, maintainable format.
86 |
87 | The data for each language is contained at `-text.tar`.
88 | The crawl for each language is split arbitrarily into multiple segments.
89 | Each segment has a name similar to `20170223072936-part-00000.seg-00000.attempt-00000.warc.gz`, but the names are arbitrary.
90 | While we omit an in-depth description of the WARC format here, each segment consists of plain text (after unzipping) similar to
91 |
92 | ```
93 | WARC/1.0
94 | WARC-Record-ID:
95 | Content-Length: 65536
96 | WARC-Date: 2017-02-25T18:01:51Z
97 | WARC-Type: resource
98 | WARC-Target-URI: http://02elf.net/headlines/politics-headlines/nrw-cdu-will-fluechtlinge-rigoros-zurueckfuehren-962524
99 |
100 |
101 |
102 |
103 | ...content...
104 | ```
105 |
106 | Thus, the HTML of each page is preceded by metadata, and succeeded by the metadata of the next page.
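
The record layout above (a `WARC/1.0` line, header lines, a blank line, then `Content-Length` bytes of content) can be walked with the standard library alone. This is a rough sketch under the assumption that every record carries an accurate `Content-Length` header; the function names are hypothetical, and a dedicated WARC library is likely the more robust choice for real use.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) pairs from a binary stream of WARC records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            key, _, value = raw.partition(b":")
            headers[key.decode("utf-8", "replace").strip()] = (
                value.decode("utf-8", "replace").strip()
            )
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def read_segment(path):
    """Yield records from one gzipped segment file (path is illustrative)."""
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_records(stream)
```

Each body is returned as raw bytes, since the character encoding of the crawled pages is not stated here.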
107 |
108 | ### CNN package
109 |
110 | In this section, we'll discuss how to work with the CNN image feature downloads.
111 | Each language download has the following macro file structure, each folder of which is described in more depth below.
112 |
113 | ```
114 | /
115 | english-features/
116 | english-text/
117 | english-metadata/
118 | -features/
119 | -text/
120 | -metadata/
121 | dictionary/
122 | ```
123 |
124 | #### `english-features/`
125 |
126 | The `english-features/` folder contains CNN image features (the FC7 layer of a pretrained AlexNet network).
127 | The image features are distributed across 27 sections of the English vocabulary, labelled `English-01` through `English-27`.
128 | Each section contains some unique portion of the English vocabulary.
129 | Each word of a section has an ID that is unique within the section but not across sections. These `word_ID`s are integers, and each `word_ID` is a folder within a section.
130 | Each such `word_ID` folder has up to 100 `.pkl` files, one for each image that represents the given word.
131 | The mapping between `/` pairs and English word literals is given in `dictionaries/english_path_index.tsv`.
132 |
133 | ```
134 | english-features/
135 | English-01/
136 |
137 | .pkl
138 | .pkl
139 | ...
140 | .pkl
141 |
142 | English-02/
143 | ...
144 | ...
145 | English-27/
146 | ...
147 | ```
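
The section / word / image layout above can be indexed with a simple directory walk. The sketch below is an illustration, not MMID's own tooling: the function names are hypothetical, and it assumes each `.pkl` file is a standard Python pickle. The `encoding="latin1"` default is a guess that the files may have been written under Python 2; if the pickles hold NumPy arrays, NumPy must be installed to load them.

```python
import pickle
from pathlib import Path


def list_feature_files(features_root):
    """Map (section, word_ID) to the sorted .pkl paths found for that word."""
    index = {}
    for pkl_path in sorted(Path(features_root).glob("*/*/*.pkl")):
        section, word_id = pkl_path.parts[-3], pkl_path.parts[-2]
        index.setdefault((section, word_id), []).append(pkl_path)
    return index


def load_feature(pkl_path, encoding="latin1"):
    """Unpickle one feature file; only do this for files you trust."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)
```

Keys are `(section, word_ID)` tuples because a `word_ID` is unique only within its section. Unpickling can execute arbitrary code, so load only files obtained from the official download.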
148 |
149 | #### `english-text/`
150 | The `english-text` folder holds the tokenized plaintext on the webpages that images showed up on.
151 | The structure is identical to that of `english-features`, where each English word is identified by a `/` pair.
152 | Each image index in `english-features` corresponds to one `.gz` file, though text crawling failed for some images.
153 |
154 | #### `english-metadata/`
155 | The `english-metadata` folder holds the URLs of the images and corresponding websites for English images and words in the dataset.
156 | The folder has the structure `/.json`, so each English word has a single JSON metadata file.
157 | Each metadata file is a dictionary of the form:
158 | ```
159 | {
160 | '': {
161 | 'google': {
162 | 'ru':
163 | }
164 | 'image_url':
165 | }
166 | '': { ... }
167 | ...
168 | '': { ... }
169 | }
170 | ```
171 |
172 |
173 |
174 | #### `-features/`
175 |
176 | The `source-features/` folder holds the feature files for images of source language words.
177 | The `source-features/` folder contains `word_ID` files, each of which is a `.pkl` matrix that is of dimension `(k,4096)` where `k` is the number of images that represent the word.
178 | In the medium view, these images have already been filtered to those whose corresponding website had text in the `source` language.
179 | The mapping between `word_ID` and source language literal word is given in `dictionaries/dict.`.
180 |
181 | ```
182 | -features/
183 | .combined.pkl
184 | ```
185 |
186 | #### `-text/`
187 | The `-text` folder holds the tokenized plaintext on the webpages that images for the `` language showed up on.
188 | Each `word_ID` is a folder, in which each image has a single `.txt` file containing all text for that image.
189 | Thus, the structure is:
190 | ```
191 | -text/
192 | /
193 | .txt
194 | ```
195 |
196 | #### `-metadata/`
197 | The `-metadata` folder holds the URLs of the images and corresponding websites for `` language images and words in the dataset.
198 | Each word has a single metadata file, `.json`, in the directory.
199 | Each metadata file is of identical structure to the English metadata files, above.
200 |
201 |
202 | #### `dictionary/`
203 | The `dictionary/` folder holds the gold-standard translations between `source` language words and English words in `dict.`. It also contains a mapping from language name to language ID, used in our software, at `langcodes.csv`.
204 |
205 |
206 | ### Running a translation experiment
207 |
208 | (in progress)
209 |
210 | ```
211 | python code/evaluate_package_cnn_combined.py \
212 | -f \
213 | -e /nlp \
214 | -d \
215 | -o \
216 | -l \
217 | -t 25 \
218 | -tc 25
220 |
--------------------------------------------------------------------------------
/downloads.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Downloads
3 | ---
4 |
5 |
6 |
7 | ## MMID packages
8 | Below are links for the full MMID image/word dataset for each language (100 images), a smaller view of MMID with only 1 image per word (1 image), the metadata of all images and the webpages they showed up on, and the dictionary containing just the words we have images for in each language, as well as their canonical MMID ID within the language. For more information, see our [documentation page](https://multilingual-images.org/doc.html).
9 |
10 | MMID was constructed by building translations for the bilingual dictionaries found [here](http://www.seas.upenn.edu/~nlp/resources/TACL-data-release/dictionaries.tar.gz), which were built as described in the paper [The Language Demographics of Amazon Mechanical Turk](https://cs.brown.edu/people/epavlick/papers/language_demographics_mturk.pdf).
11 |
12 | Through the generosity of the Amazon Public Datasets program, each download is available via a public S3 bucket!
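
Every link in the table below follows the same URL pattern, so the five downloads for a language can be derived from its name. The sketch below simply mirrors the pattern visible in the table; the function name `package_urls` is hypothetical, and the pattern is assumed to hold for every language listed.

```python
BASE_URL = "https://s3.amazonaws.com/mmid-pds"


def package_urls(language):
    """Return the five download URLs for a language name used in the table."""
    return {
        "100 images": f"{BASE_URL}/language_image_packages/scale-{language}-package.tgz",
        "1 image": f"{BASE_URL}/mini_language_image_packages/mini-{language}-package.tgz",
        "metadata": f"{BASE_URL}/language_metadata_files/metadata-{language}-package.jsonl",
        "dictionary": f"{BASE_URL}/language_index_files/index-{language}-package.tsv",
        "web text": f"{BASE_URL}/language_text_warcs/{language}-text-warcs.tgz",
    }
```

For English, pass the section name as it appears in the table, for example `package_urls("english-01")`.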
13 |
14 | | language | 100 images | 1 image | metadata | dictionary | web text |
15 | | -------- | -------- | -------- | -------- | -------- | -------- |
16 | | afrikaans | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-afrikaans-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-afrikaans-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-afrikaans-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-afrikaans-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/afrikaans-text-warcs.tgz) |
17 | | albanian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-albanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-albanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-albanian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-albanian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/albanian-text-warcs.tgz) |
18 | | amharic | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-amharic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-amharic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-amharic-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-amharic-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/amharic-text-warcs.tgz) |
19 | | arabic | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-arabic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-arabic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-arabic-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-arabic-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/arabic-text-warcs.tgz) |
20 | | aragonese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-aragonese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-aragonese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-aragonese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-aragonese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/aragonese-text-warcs.tgz) |
21 | | armenian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-armenian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-armenian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-armenian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-armenian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/armenian-text-warcs.tgz) |
22 | | asturian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-asturian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-asturian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-asturian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-asturian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/asturian-text-warcs.tgz) |
23 | | azerbaijani | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-azerbaijani-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-azerbaijani-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-azerbaijani-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-azerbaijani-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/azerbaijani-text-warcs.tgz) |
24 | | basque | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-basque-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-basque-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-basque-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-basque-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/basque-text-warcs.tgz) |
25 | | belarusian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-belarusian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-belarusian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-belarusian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-belarusian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/belarusian-text-warcs.tgz) |
26 | | bengali | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-bengali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-bengali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-bengali-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-bengali-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/bengali-text-warcs.tgz) |
27 | | bishnupriya-manipuri | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-bishnupriya-manipuri-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-bishnupriya-manipuri-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-bishnupriya-manipuri-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-bishnupriya-manipuri-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/bishnupriya-manipuri-text-warcs.tgz) |
28 | | bosnian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-bosnian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-bosnian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-bosnian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-bosnian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/bosnian-text-warcs.tgz) |
29 | | breton | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-breton-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-breton-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-breton-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-breton-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/breton-text-warcs.tgz) |
30 | | bulgarian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-bulgarian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-bulgarian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-bulgarian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-bulgarian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/bulgarian-text-warcs.tgz) |
31 | | catalan | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-catalan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-catalan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-catalan-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-catalan-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/catalan-text-warcs.tgz) |
32 | | cebuano | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-cebuano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-cebuano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-cebuano-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-cebuano-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/cebuano-text-warcs.tgz) |
33 | | central-bicolano | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-central-bicolano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-central-bicolano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-central-bicolano-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-central-bicolano-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/central-bicolano-text-warcs.tgz) |
34 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
35 | | chinese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-chinese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-chinese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-chinese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-chinese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/chinese-text-warcs.tgz) |
36 | | croatian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-croatian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-croatian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-croatian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-croatian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/croatian-text-warcs.tgz) |
37 | | czech | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-czech-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-czech-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-czech-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-czech-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/czech-text-warcs.tgz) |
38 | | danish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-danish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-danish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-danish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-danish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/danish-text-warcs.tgz) |
39 | | dutch | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-dutch-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-dutch-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-dutch-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-dutch-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/dutch-text-warcs.tgz) |
40 | | english-01 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-01-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-01-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-01-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-01-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-01-text-warcs.tgz) |
41 | | english-02 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-02-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-02-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-02-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-02-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-02-text-warcs.tgz) |
42 | | english-03 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-03-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-03-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-03-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-03-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-03-text-warcs.tgz) |
43 | | english-04 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-04-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-04-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-04-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-04-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-04-text-warcs.tgz) |
44 | | english-05 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-05-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-05-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-05-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-05-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-05-text-warcs.tgz) |
45 | | english-06 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-06-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-06-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-06-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-06-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-06-text-warcs.tgz) |
46 | | english-07 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-07-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-07-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-07-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-07-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-07-text-warcs.tgz) |
47 | | english-08 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-08-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-08-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-08-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-08-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-08-text-warcs.tgz) |
48 | | english-09 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-09-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-09-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-09-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-09-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-09-text-warcs.tgz) |
49 | | english-10 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-10-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-10-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-10-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-10-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-10-text-warcs.tgz) |
50 | | english-11 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-11-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-11-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-11-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-11-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-11-text-warcs.tgz) |
51 | | english-12 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-12-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-12-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-12-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-12-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-12-text-warcs.tgz) |
52 | | english-13 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-13-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-13-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-13-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-13-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-13-text-warcs.tgz) |
53 | | english-14 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-14-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-14-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-14-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-14-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-14-text-warcs.tgz) |
54 | | english-15 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-15-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-15-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-15-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-15-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-15-text-warcs.tgz) |
55 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
56 | | english-16 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-16-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-16-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-16-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-16-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-16-text-warcs.tgz) |
57 | | english-17 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-17-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-17-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-17-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-17-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-17-text-warcs.tgz) |
58 | | english-18 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-18-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-18-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-18-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-18-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-18-text-warcs.tgz) |
59 | | english-19 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-19-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-19-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-19-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-19-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-19-text-warcs.tgz) |
60 | | english-20 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-20-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-20-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-20-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-20-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-20-text-warcs.tgz) |
61 | | english-21 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-21-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-21-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-21-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-21-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-21-text-warcs.tgz) |
62 | | english-22 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-22-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-22-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-22-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-22-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-22-text-warcs.tgz) |
63 | | english-23 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-23-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-23-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-23-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-23-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-23-text-warcs.tgz) |
64 | | english-24 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-24-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-24-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-24-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-24-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-24-text-warcs.tgz) |
65 | | english-25 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-25-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-25-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-25-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-25-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-25-text-warcs.tgz) |
66 | | english-26 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-26-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-26-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-26-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-26-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-26-text-warcs.tgz) |
67 | | english-27 | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-english-27-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-english-27-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-english-27-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-english-27-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/english-27-text-warcs.tgz) |
68 | | esperanto | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-esperanto-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-esperanto-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-esperanto-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-esperanto-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/esperanto-text-warcs.tgz) |
69 | | filipino | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-filipino-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-filipino-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-filipino-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-filipino-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/filipino-text-warcs.tgz) |
70 | | finnish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-finnish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-finnish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-finnish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-finnish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/finnish-text-warcs.tgz) |
71 | | french | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-french-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-french-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-french-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-french-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/french-text-warcs.tgz) |
72 | | frisian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-frisian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-frisian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-frisian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-frisian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/frisian-text-warcs.tgz) |
73 | | galician | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-galician-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-galician-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-galician-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-galician-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/galician-text-warcs.tgz) |
74 | | georgian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-georgian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-georgian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-georgian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-georgian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/georgian-text-warcs.tgz) |
75 | | german | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-german-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-german-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-german-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-german-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/german-text-warcs.tgz) |
76 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
77 | | greek | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-greek-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-greek-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-greek-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-greek-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/greek-text-warcs.tgz) |
78 | | gujarati | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-gujarati-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-gujarati-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-gujarati-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-gujarati-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/gujarati-text-warcs.tgz) |
79 | | haitian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-haitian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-haitian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-haitian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-haitian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/haitian-text-warcs.tgz) |
80 | | hebrew | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-hebrew-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-hebrew-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-hebrew-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-hebrew-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/hebrew-text-warcs.tgz) |
81 | | hindi | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-hindi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-hindi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-hindi-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-hindi-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/hindi-text-warcs.tgz) |
82 | | hungarian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-hungarian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-hungarian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-hungarian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-hungarian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/hungarian-text-warcs.tgz) |
83 | | icelandic | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-icelandic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-icelandic-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-icelandic-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-icelandic-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/icelandic-text-warcs.tgz) |
84 | | ido | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-ido-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-ido-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-ido-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-ido-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/ido-text-warcs.tgz) |
85 | | ilokano | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-ilokano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-ilokano-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-ilokano-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-ilokano-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/ilokano-text-warcs.tgz) |
86 | | indonesian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-indonesian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-indonesian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-indonesian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-indonesian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/indonesian-text-warcs.tgz) |
87 | | irish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-irish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-irish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-irish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-irish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/irish-text-warcs.tgz) |
88 | | italian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-italian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-italian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-italian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-italian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/italian-text-warcs.tgz) |
89 | | japanese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-japanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-japanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-japanese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-japanese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/japanese-text-warcs.tgz) |
90 | | javanese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-javanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-javanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-javanese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-javanese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/javanese-text-warcs.tgz) |
91 | | kannada | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-kannada-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-kannada-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-kannada-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-kannada-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/kannada-text-warcs.tgz) |
92 | | kapampangan | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-kapampangan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-kapampangan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-kapampangan-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-kapampangan-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/kapampangan-text-warcs.tgz) |
93 | | kazakh | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-kazakh-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-kazakh-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-kazakh-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-kazakh-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/kazakh-text-warcs.tgz) |
94 | | korean | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-korean-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-korean-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-korean-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-korean-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/korean-text-warcs.tgz) |
95 | | kurdish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-kurdish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-kurdish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-kurdish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-kurdish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/kurdish-text-warcs.tgz) |
96 | | latvian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-latvian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-latvian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-latvian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-latvian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/latvian-text-warcs.tgz) |
97 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
98 | | lithuanian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-lithuanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-lithuanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-lithuanian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-lithuanian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/lithuanian-text-warcs.tgz) |
99 | | low-saxon | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-low-saxon-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-low-saxon-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-low-saxon-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-low-saxon-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/low-saxon-text-warcs.tgz) |
100 | | luxembourgish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-luxembourgish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-luxembourgish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-luxembourgish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-luxembourgish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/luxembourgish-text-warcs.tgz) |
101 | | macedonian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-macedonian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-macedonian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-macedonian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-macedonian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/macedonian-text-warcs.tgz) |
102 | | malagasy | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-malagasy-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-malagasy-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-malagasy-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-malagasy-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/malagasy-text-warcs.tgz) |
103 | | malay | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-malay-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-malay-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-malay-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-malay-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/malay-text-warcs.tgz) |
104 | | malayalam | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-malayalam-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-malayalam-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-malayalam-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-malayalam-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/malayalam-text-warcs.tgz) |
105 | | marathi | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-marathi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-marathi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-marathi-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-marathi-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/marathi-text-warcs.tgz) |
106 | | neapolitan | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-neapolitan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-neapolitan-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-neapolitan-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-neapolitan-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/neapolitan-text-warcs.tgz) |
107 | | nepali | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-nepali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-nepali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-nepali-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-nepali-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/nepali-text-warcs.tgz) |
108 | | newar | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-newar-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-newar-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-newar-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-newar-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/newar-text-warcs.tgz) |
109 | | norwegian-nynorsk | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-norwegian-nynorsk-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-norwegian-nynorsk-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-norwegian-nynorsk-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-norwegian-nynorsk-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/norwegian-nynorsk-text-warcs.tgz) |
110 | | norwegian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-norwegian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-norwegian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-norwegian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-norwegian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/norwegian-text-warcs.tgz) |
111 | | pashto | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-pashto-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-pashto-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-pashto-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-pashto-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/pashto-text-warcs.tgz) |
112 | | persian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-persian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-persian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-persian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-persian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/persian-text-warcs.tgz) |
113 | | piedmontese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-piedmontese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-piedmontese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-piedmontese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-piedmontese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/piedmontese-text-warcs.tgz) |
114 | | polish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-polish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-polish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-polish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-polish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/polish-text-warcs.tgz) |
115 | | portuguese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-portuguese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-portuguese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-portuguese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-portuguese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/portuguese-text-warcs.tgz) |
116 | | punjabi | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-punjabi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-punjabi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-punjabi-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-punjabi-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/punjabi-text-warcs.tgz) |
117 | | romanian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-romanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-romanian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-romanian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-romanian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/romanian-text-warcs.tgz) |
118 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
119 | | russian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-russian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-russian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-russian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-russian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/russian-text-warcs.tgz) |
120 | | serbian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-serbian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-serbian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-serbian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-serbian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/serbian-text-warcs.tgz) |
121 | | serbo-croatian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-serbo-croatian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-serbo-croatian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-serbo-croatian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-serbo-croatian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/serbo-croatian-text-warcs.tgz) |
122 | | sicilian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-sicilian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-sicilian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-sicilian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-sicilian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/sicilian-text-warcs.tgz) |
123 | | sindhi | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-sindhi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-sindhi-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-sindhi-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-sindhi-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/sindhi-text-warcs.tgz) |
124 | | slovak | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-slovak-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-slovak-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-slovak-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-slovak-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/slovak-text-warcs.tgz) |
125 | | slovenian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-slovenian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-slovenian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-slovenian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-slovenian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/slovenian-text-warcs.tgz) |
126 | | somali | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-somali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-somali-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-somali-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-somali-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/somali-text-warcs.tgz) |
127 | | spanish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-spanish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-spanish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-spanish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-spanish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/spanish-text-warcs.tgz) |
128 | | sundanese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-sundanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-sundanese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-sundanese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-sundanese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/sundanese-text-warcs.tgz) |
129 | | swahili | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-swahili-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-swahili-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-swahili-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-swahili-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/swahili-text-warcs.tgz) |
130 | | swedish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-swedish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-swedish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-swedish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-swedish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/swedish-text-warcs.tgz) |
131 | | tamil | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-tamil-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-tamil-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-tamil-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-tamil-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/tamil-text-warcs.tgz) |
132 | | telugu | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-telugu-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-telugu-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-telugu-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-telugu-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/telugu-text-warcs.tgz) |
133 | | thai | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-thai-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-thai-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-thai-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-thai-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/thai-text-warcs.tgz) |
134 | | turkish-august | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-turkish-august-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-turkish-august-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-turkish-august-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-turkish-august-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/turkish-august-text-warcs.tgz) |
135 | | turkish | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-turkish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-turkish-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-turkish-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-turkish-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/turkish-text-warcs.tgz) |
136 | | uighur | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-uighur-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-uighur-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-uighur-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-uighur-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/uighur-text-warcs.tgz) |
137 | | ukrainian | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-ukrainian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-ukrainian-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-ukrainian-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-ukrainian-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/ukrainian-text-warcs.tgz) |
138 | | urdu | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-urdu-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-urdu-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-urdu-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-urdu-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/urdu-text-warcs.tgz) |
139 | | **language** | **100 images** | **1 image** | **metadata** | **dictionary** | **web text** |
140 | | uzbek | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-uzbek-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-uzbek-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-uzbek-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-uzbek-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/uzbek-text-warcs.tgz) |
141 | | vietnamese | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-vietnamese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-vietnamese-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-vietnamese-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-vietnamese-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/vietnamese-text-warcs.tgz) |
142 | | waray | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-waray-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-waray-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-waray-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-waray-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/waray-text-warcs.tgz) |
143 | | welsh | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-welsh-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-welsh-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-welsh-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-welsh-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/welsh-text-warcs.tgz) |
144 | | wolof | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-wolof-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-wolof-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-wolof-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-wolof-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/wolof-text-warcs.tgz) |
145 | | yoruba | [link](https://s3.amazonaws.com/mmid-pds/language_image_packages/scale-yoruba-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/mini_language_image_packages/mini-yoruba-package.tgz) | [link](https://s3.amazonaws.com/mmid-pds/language_metadata_files/metadata-yoruba-package.jsonl) | [link](https://s3.amazonaws.com/mmid-pds/language_index_files/index-yoruba-package.tsv) | [link](https://s3.amazonaws.com/mmid-pds/language_text_warcs/yoruba-text-warcs.tgz) |
146 |
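The links in the table above appear to follow a fixed naming pattern per language, so the five URLs for a row can be assembled programmatically. The sketch below assumes that pattern holds for every language in the table; it is inferred only from the visible links, and the function name `mmid_urls` and the dictionary keys are illustrative, not part of any MMID tooling.

```python
from typing import Dict

# Assumed bucket layout, inferred from the links in the table above.
BASE = "https://s3.amazonaws.com/mmid-pds"


def mmid_urls(language: str) -> Dict[str, str]:
    """Return the five download URLs for one language row (hypothetical helper)."""
    return {
        "100 images": f"{BASE}/language_image_packages/scale-{language}-package.tgz",
        "1 image": f"{BASE}/mini_language_image_packages/mini-{language}-package.tgz",
        "metadata": f"{BASE}/language_metadata_files/metadata-{language}-package.jsonl",
        "dictionary": f"{BASE}/language_index_files/index-{language}-package.tsv",
        "web text": f"{BASE}/language_text_warcs/{language}-text-warcs.tgz",
    }


# Print every URL for one language, using the name exactly as written in the table.
for column, url in mmid_urls("welsh").items():
    print(f"{column}: {url}")
```

Language names should be passed exactly as they appear in the first column (for example `serbo-croatian` or `turkish-august`), since the pattern is assumed to embed them verbatim.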
147 |
148 | ## Code
149 |
150 | To replicate the experiments in _Learning Translations via Images_, you'll need the code in this [GitHub repo](https://github.com/john-hewitt/mmid-tools).
151 | It contains scripts for reading in CNN image feature files and predicting translations as described in the paper.
152 |
153 |
154 | ## **CNN package** Downloads
155 |
156 | For these 30 languages, we extracted CNN features and plaintext for every word in each language. Using these, you can recreate or improve on the translation results of our ACL paper. Be warned: each download can be as large as **11 GB** per language!
157 | The metadata files relate images to their URLs.
158 |
159 | | Language | Dataset | | Language | Dataset |
160 | | ------------- | ------------- | ---- | ------------- | ------------- |
161 | | Albanian | [download](http://nlpgrid.seas.upenn.edu/MMID/albanian.tar.gz) | | Latvian | [download](http://nlpgrid.seas.upenn.edu/MMID/latvian.tar.gz) |
162 | | Arabic | [download](http://nlpgrid.seas.upenn.edu/MMID/arabic.tar.gz) | | Romanian | [download](http://nlpgrid.seas.upenn.edu/MMID/romanian.tar.gz) |
163 | | Azerbaijani | [download](http://nlpgrid.seas.upenn.edu/MMID/azerbaijani.tar.gz) | | Serbian | [download](http://nlpgrid.seas.upenn.edu/MMID/serbian.tar.gz) |
164 | | Bengali | [download](http://nlpgrid.seas.upenn.edu/MMID/bengali.tar.gz) | | Slovak | [download](http://nlpgrid.seas.upenn.edu/MMID/slovak.tar.gz) |
165 | | Bosnian | [download](http://nlpgrid.seas.upenn.edu/MMID/bosnian.tar.gz) | | Somali | [download](http://nlpgrid.seas.upenn.edu/MMID/somali.tar.gz) |
166 | | Bulgarian | [download](http://nlpgrid.seas.upenn.edu/MMID/bulgarian.tar.gz) | | Spanish | [download](http://nlpgrid.seas.upenn.edu/MMID/spanish.tar.gz) |
167 | | Cebuano | [download](http://nlpgrid.seas.upenn.edu/MMID/cebuano.tar.gz) | | Swedish | [download](http://nlpgrid.seas.upenn.edu/MMID/swedish.tar.gz) |
168 | | Chinese | [download](http://nlpgrid.seas.upenn.edu/MMID/chinese.tar.gz) | | Tamil | [download](http://nlpgrid.seas.upenn.edu/MMID/tamil.tar.gz) |
169 | | Dutch | [download](http://nlpgrid.seas.upenn.edu/MMID/dutch.tar.gz) | | Telugu | [download](http://nlpgrid.seas.upenn.edu/MMID/telugu.tar.gz) |
170 | | Filipino | [download](http://nlpgrid.seas.upenn.edu/MMID/filipino.tar.gz) | | Thai | [download](http://nlpgrid.seas.upenn.edu/MMID/thai.tar.gz) |
171 | | French | [download](http://nlpgrid.seas.upenn.edu/MMID/french.tar.gz) | | Turkish | [download](http://nlpgrid.seas.upenn.edu/MMID/turkish.tar.gz) |
172 | | German | [download](http://nlpgrid.seas.upenn.edu/MMID/german.tar.gz) | | Ukrainian | [download](http://nlpgrid.seas.upenn.edu/MMID/ukrainian.tar.gz) |
173 | | Gujarati | [download](http://nlpgrid.seas.upenn.edu/MMID/gujarati.tar.gz) | | Urdu | [download](http://nlpgrid.seas.upenn.edu/MMID/urdu.tar.gz) |
174 | | Hindi | [download](http://nlpgrid.seas.upenn.edu/MMID/hindi.tar.gz) | | Uzbek | [download](http://nlpgrid.seas.upenn.edu/MMID/uzbek.tar.gz) |
175 | | Hungarian | [download](http://nlpgrid.seas.upenn.edu/MMID/hungarian.tar.gz) | | Vietnamese | [download](http://nlpgrid.seas.upenn.edu/MMID/vietnamese.tar.gz) |
176 | | Indonesian | [download](http://nlpgrid.seas.upenn.edu/MMID/indonesian.tar.gz) | | Welsh | [download](http://nlpgrid.seas.upenn.edu/MMID/welsh.tar.gz) |
177 | | Italian | [download](http://nlpgrid.seas.upenn.edu/MMID/italian.tar.gz) | | Yoruba | [download](http://nlpgrid.seas.upenn.edu/MMID/yoruba.tar.gz) |
178 |
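Because each CNN package can be as large as 11 GB, it may be worth streaming the archive to disk rather than reading it into memory. The following is a minimal sketch using only the Python standard library. It assumes the URL pattern seen in the table above (lowercased language name plus `.tar.gz`), and the helper names `cnn_package_url` and `download_cnn_package` are hypothetical.

```python
import shutil
import urllib.request

# Assumed URL layout, inferred from the download links in the table above.
CNN_BASE = "http://nlpgrid.seas.upenn.edu/MMID"


def cnn_package_url(language: str) -> str:
    """Build the CNN package URL for a language name from the table."""
    return f"{CNN_BASE}/{language.strip().lower()}.tar.gz"


def download_cnn_package(language: str, dest_path: str) -> str:
    """Stream one CNN package to dest_path without holding it in memory."""
    url = cnn_package_url(language)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in 1 MiB chunks; a single package can be as large as 11 GB.
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest_path
```

With these helpers, `download_cnn_package("Welsh", "welsh.tar.gz")` would fetch the Welsh package into the current directory, provided the server is reachable and the pattern holds for that language.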
179 |
--------------------------------------------------------------------------------