├── .gitignore
├── LICENSE
├── README.md
├── data
│   ├── w00000
│   │   ├── w00000000.jpg
│   │   ├── w00000000.json
│   │   ├── w00000000.txt
│   │   ├── w00000001.jpg
│   │   ├── w00000001.txt
│   │   ├── w00000002.jpg
│   │   ├── w00000002.txt
│   │   ├── w00000003.jpg
│   │   ├── w00000003.txt
│   │   ├── w00000004.jpg
│   │   └── w00000004.txt
│   ├── w00001
│   │   ├── w00000005.jpg
│   │   ├── w00000005.txt
│   │   ├── w00000006.jpg
│   │   ├── w00000006.txt
│   │   ├── w00000007.jpg
│   │   ├── w00000007.txt
│   │   ├── w00000008.jpg
│   │   └── w00000008.txt
│   ├── w00002
│   │   ├── w00000009.jpg
│   │   ├── w00000009.txt
│   │   ├── w00000010.jpg
│   │   ├── w00000010.txt
│   │   ├── w00000011.jpg
│   │   └── w00000011.txt
│   └── w00003
│       ├── w00000013.jpg
│       ├── w00000013.txt
│       ├── w00000014.jpg
│       ├── w00000014.txt
│       ├── w00000015.jpg
│       └── w00000015.txt
├── download_open_images.txt
├── general
│   ├── cc12m.py
│   ├── cc3m.py
│   ├── filtered_yfcc100m.py
│   ├── helper_scripts
│   │   ├── wit_clip_class.py
│   │   ├── wit_dtype.py
│   │   ├── wit_image_downloader.py
│   │   └── wit_url_downloader.py
│   ├── openimages_labels.py
│   ├── openimages_narrative.py
│   ├── wit.py
│   ├── wit_clip.py
│   └── wit_old.py
├── setup.py
└── utilities
    ├── clip_wit.py
    ├── dataset_sanitycheck.py
    ├── tokenizer_from_wds_or_text.py
    ├── wds_create_legacy.py
    ├── wds_create_shards.py
    ├── wds_from_tfrecords.py
    ├── wds_from_tfrecords_alternative.py
    ├── wds_pytorchread.py
    └── wds_read.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode/
2 | witurls/
3 | datasets
4 | general/helper_scripts/__pycache__
5 | general/wit_old.py
6 | general/clip_wit.py
7 | general/wit.py
8 | general/wit_clip copy.py
9 | utilities/clip_wit.py
10 | general/wit_old.py
11 | general/wit_clip copy2.py
12 | tfrecords
13 | tfr
14 | tartest.py
15 | test.py
16 | skips
17 | shards
18 | output
19 | openimages
20 | openimages_old.py
21 | .DS_Store
22 | .gitignore
23 | build
24 | dist
25 | .ipynb_checkpoints
26 | dalle_datasets.egg-info
27 | captions_train.json
28 | openimages-train-000000.tar
29 | downsampled-open-images-v4
30 | downsampled-open-images-v4-9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent
31 | downsampled-open-images-v4.aria2
32 | wit_urls
33 | wds_create_shards_backup.py
34 | testfolder
35 | testfolder_backup
36 | dataset.tar.gz
37 | dataset_sanitycheck_backup.py
38 | incomplete_files.csv
39 | wit
40 | mytest.py
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 robvanvolt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## DALLE-datasets
2 | This is a summary of easily available, high-quality datasets consisting of captioned image files for generalized DALLE-pytorch training (https://github.com/lucidrains/DALLE-pytorch).
3 |
4 | The scripts help you download and resize the files from the given sources.
5 |
6 | * general datasets
7 |   * Conceptual 12M (CC12M)
8 |   * Wikipedia image-text (WIT)
9 |   * Filtered YFCC100M
10 |   * Open Images
11 | * specific datasets
12 |   * None yet
13 |
14 |
15 | ## Helper scripts
16 |
17 | All helper scripts can now be found in the utilities folder:
18 | * TFrecords to WebDataset converter
19 | * Image-Text-Folder to WebDataset converter
20 | * Dataset sanitycheck for image-text-files
21 | * Example reader for WebDataset files
22 |
23 |
24 | ### Sanitycheck for downloaded datasets
25 |
26 | The following command looks for image-text pairs (.jpg / .png / .bmp) and returns a CSV table listing incomplete data.
27 | When you add the optional argument -DEL, the incomplete files are deleted. The Python script checks one folder and its first-level subdirectories.
28 |
29 | ```python dataset_sanitycheck.py --dataset_folder my-dataset-folder```
30 |
31 |
32 | ## Pretrained models
33 |
34 | If you want to continue training on pretrained models or even upload your own Dall-E model, head over to https://github.com/robvanvolt/DALLE-models
35 |
36 | ## Credits
37 |
38 | Special thanks go to Romaine, who improved the download scripts and made the great WebDataset format more accessible with his continuous coding efforts! 🙏
39 |
40 | A lot of inspiration was taken from https://github.com/yashbonde/dall-e-baby - unfortunately, that repo is no longer updated.
41 | Also, the shard creator was inspired by https://github.com/tmbdev-archive/webdataset-examples/blob/master/makeshards.py.
42 | The custom tokenizer was inspired by afiaka87, who showed a simple way to generate custom tokenizers with youtokentome.
43 |
--------------------------------------------------------------------------------
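Note: the "Example reader for WebDataset files" mentioned in the README refers to the reader scripts under utilities/ (wds_read.py, wds_pytorchread.py), which are not reproduced in this section. As a rough, minimal sketch of what reading such a shard looks like with the webdataset package (the shard name mydataset-000000.tar is a placeholder):

```python
# Minimal sketch (not the repository's own reader): iterate image-text pairs from a WebDataset shard.
import webdataset as wds

dataset = (
    wds.WebDataset("mydataset-000000.tar")  # placeholder shard name
    .decode("pil")                          # decode .jpg/.png entries to PIL images
    .to_tuple("jpg;png", "txt")             # pair each image with its caption
)

for image, caption in dataset:
    print(image.size, caption[:60])
    break
```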
/data/w00000/w00000000.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00000/w00000000.jpg
--------------------------------------------------------------------------------
/data/w00000/w00000000.json:
--------------------------------------------------------------------------------
1 | {
2 | "A": 12,
3 | "B": "Test"
4 | }
--------------------------------------------------------------------------------
/data/w00000/w00000000.txt:
--------------------------------------------------------------------------------
1 | Galego: Logo do Movemento Galego ao Socialismo
--------------------------------------------------------------------------------
/data/w00000/w00000001.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00000/w00000001.jpg
--------------------------------------------------------------------------------
/data/w00000/w00000001.txt:
--------------------------------------------------------------------------------
1 | Lesser bulldog bat (Noctilio albiventris)
--------------------------------------------------------------------------------
/data/w00000/w00000002.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00000/w00000002.jpg
--------------------------------------------------------------------------------
/data/w00000/w00000002.txt:
--------------------------------------------------------------------------------
1 | Coin of Ukraine Русский: Юбилейная монета Украины
--------------------------------------------------------------------------------
/data/w00000/w00000003.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00000/w00000003.jpg
--------------------------------------------------------------------------------
/data/w00000/w00000003.txt:
--------------------------------------------------------------------------------
1 | mendeleevo
--------------------------------------------------------------------------------
/data/w00000/w00000004.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00000/w00000004.jpg
--------------------------------------------------------------------------------
/data/w00000/w00000004.txt:
--------------------------------------------------------------------------------
1 | Sehemu za Mji wa Brookline, Massachusetts
2 | Brookline MA August 2015 Photo Collage 2
--------------------------------------------------------------------------------
/data/w00001/w00000005.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00001/w00000005.jpg
--------------------------------------------------------------------------------
/data/w00001/w00000005.txt:
--------------------------------------------------------------------------------
1 | Hay Street 中文(繁體): 禧街
--------------------------------------------------------------------------------
/data/w00001/w00000006.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00001/w00000006.jpg
--------------------------------------------------------------------------------
/data/w00001/w00000006.txt:
--------------------------------------------------------------------------------
1 | Jayson Musson on October 29, 2007
2 | Jayson Scott Musson on October 29, 2007
--------------------------------------------------------------------------------
/data/w00001/w00000007.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00001/w00000007.jpg
--------------------------------------------------------------------------------
/data/w00001/w00000007.txt:
--------------------------------------------------------------------------------
1 | Չիբո կապելլա
2 | Photo of the Cybo Chapel of Santa Maria del Popolo, Rome, Italy.
--------------------------------------------------------------------------------
/data/w00001/w00000008.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00001/w00000008.jpg
--------------------------------------------------------------------------------
/data/w00001/w00000008.txt:
--------------------------------------------------------------------------------
1 | Euodynerus megaera
--------------------------------------------------------------------------------
/data/w00002/w00000009.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00002/w00000009.jpg
--------------------------------------------------------------------------------
/data/w00002/w00000009.txt:
--------------------------------------------------------------------------------
1 | Simon Wolfe Rosendale (June 23, 1842 - April 22, 1937) was an American lawyer and politician. Rosendale was the first Jew elected to a statewide elective office in New York
--------------------------------------------------------------------------------
/data/w00002/w00000010.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00002/w00000010.jpg
--------------------------------------------------------------------------------
/data/w00002/w00000010.txt:
--------------------------------------------------------------------------------
1 | ქართული: არქიმანდრიტი ადამი (ერისკაცობაში ვახტანგ მიხეილის ძე ახალაძე)
--------------------------------------------------------------------------------
/data/w00002/w00000011.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00002/w00000011.jpg
--------------------------------------------------------------------------------
/data/w00002/w00000011.txt:
--------------------------------------------------------------------------------
1 | Photograph of Rainbow Springs in Marion County, Florida (2005).
--------------------------------------------------------------------------------
/data/w00003/w00000013.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00003/w00000013.jpg
--------------------------------------------------------------------------------
/data/w00003/w00000013.txt:
--------------------------------------------------------------------------------
1 | Die California Clipper in der Bucht von Manila, ca. 1940
2 | The Boeing 314 California Clipper (civil registration NC18602) off the Cavite Navy Yard, Philippines, 1939-1941. Delivered on 27 January 1939, it went to the USAAF as C-98 42-88632, 18 December 1941; then to the US Navy as BuNo 99084. It was sold to Universal Airways in 1946, and to American International in 1948. It was finally scrapped in 1950.
--------------------------------------------------------------------------------
/data/w00003/w00000014.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00003/w00000014.jpg
--------------------------------------------------------------------------------
/data/w00003/w00000014.txt:
--------------------------------------------------------------------------------
1 | Rozdělení kantonu Basilej v letech 1832-1833
2 | Deutsch: Karte zur Basler Kantonstrennung 1832/33
--------------------------------------------------------------------------------
/data/w00003/w00000015.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/robvanvolt/DALLE-datasets/bb983e3abe99d76a6fefbb173dcb640a6d8deb17/data/w00003/w00000015.jpg
--------------------------------------------------------------------------------
/data/w00003/w00000015.txt:
--------------------------------------------------------------------------------
1 | 展示在芝加哥菲爾德自然史博物館的兩頭食人獅
2 | The maneless male Lions of Tsavo.
--------------------------------------------------------------------------------
/download_open_images.txt:
--------------------------------------------------------------------------------
1 | aria2c --bt-metadata-only=true --bt-save-metadata=true https://academictorrents.com/download/9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;
2 | aria2c --show-files downsampled-open-images-v4-9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;
3 | aria2c --select-file=9,11,15 downsampled-open-images-v4-9208d33aceb2ca3eb2beb70a192600c9c41efba1.torrent;
4 |
5 | # next step is to go to each folder and unzip the files
6 | echo "Gathering downsampled-open-images-v4";
7 | rm downsampled-open-images-v4*;
8 | cd downsampled-open-images-v4/;
9 | rm -rf 512px/;
10 | cd 256px/;
11 | for i in test-256.tar.gz test_challenge_2018-256.tar.gz train-256.tar.gz validation-256.tar.gz
12 | do
13 | echo "Untarring: $i";
14 | tar -xf $i;
15 | done
16 |
--------------------------------------------------------------------------------
/general/cc12m.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 | import requests
4 | from pathlib import Path
5 | from PIL import Image
6 | from tqdm import tqdm
7 | from multiprocessing import Pool
8 | import gc
9 | import glob
10 |
11 | cc_url = 'https://storage.googleapis.com/conceptual_12m/cc12m.tsv'
12 | root_folder = './'
13 | total = 12423374
14 | maxwidth = 256
15 | maxheight = 256
16 | thread_count = 16
17 | batch = 10000
18 |
19 | def load_caption(x):
20 | name, caption, text_folder = x
21 | fid = str(int(int(name) / 10000 ))
22 | subdir = "0"*(5-len(fid)) + fid
23 | os.makedirs(Path(text_folder+"/"+subdir), exist_ok=True)
24 | fp = text_folder + '/' + subdir + "/" + "0"*(9-len(str(name))) + str(name) + '.txt'
25 | with open(fp, 'w') as f:
26 | f.write(caption)
27 |
28 | def download_file(url):
29 | response = requests.get(url, stream=True)
30 | total_size_in_bytes= int(response.headers.get('content-length', 0))
31 | block_size = 1024
32 | progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
33 | with open(Path(root_folder + '/cc12m.tsv'), 'wb') as file:
34 | for data in response.iter_content(block_size):
35 | progress_bar.update(len(data))
36 | file.write(data)
37 | progress_bar.close()
38 | if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
39 | print("Error, something went wrong...")
40 |
41 | def load_image(x):
42 | name, url, image_folder, skip_folder = x
43 | fid = str(int(int(name) / 10000 ))
44 | subdir = "0"*(5-len(fid)) + fid
45 | os.makedirs(Path(image_folder+"/"+subdir), exist_ok=True)
46 | id = subdir + "/" + "0"*(9-len(str(name))) + str(name)
47 | try:
48 | with Image.open(requests.get(url,
49 | headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'},
50 | stream=True, timeout=3).raw) as foo:
51 | a = max(maxwidth/foo.size[0], maxheight/foo.size[1])
52 | foo = foo.resize((int(foo.size[0] * a), int(foo.size[1] * a)), Image.ANTIALIAS)
53 | with open(Path(image_folder + "/" + id + '.jpg'), 'wb') as file:
54 | foo.save(file, optimize=True, quality=85)
55 | except Exception:
56 | os.makedirs(Path(skip_folder+"/"+subdir), exist_ok=True)
57 | open(Path(skip_folder + '/' + id), 'a').close()
58 | pass
59 |
60 | if __name__ == '__main__':
61 | if not os.path.isfile(Path(root_folder + '/cc12m.tsv')):
62 | print('Missing cc12m url-caption-dataset. Downloading...')
63 | download_file(cc_url)
64 | else:
65 | print('cc12m.tsv already downloaded. Proceeding with downloading images!')
66 |
67 | dfc = pd.read_csv(root_folder + "cc12m.tsv", sep='\t', names=["url", "caption"])
68 |
69 | image_folder = root_folder + '/images'
70 | text_folder = root_folder + '/texts'
71 | skip_folder = root_folder + '/skip'
72 |
73 | paths = [image_folder, text_folder, skip_folder]
74 |
75 | for path in paths:
76 | os.makedirs(path, exist_ok=True)
77 |
78 | def list_ids(path):
79 | return [int(os.path.splitext(os.path.basename(a))[0]) for a in glob.glob(path+"/**/*")]
80 |
81 | skiplist = list_ids(text_folder)
82 | remaining = total - len(skiplist)
83 | percent_remaining = 100 * (total - remaining) / total
84 | df = dfc.loc[~dfc.index.isin(skiplist)]
85 |
86 | print('Remaining {} captions to be written - {} ({:.5f} %) already written.'.format(remaining, len(skiplist), percent_remaining))
87 |
88 | if len(df) > 0:
89 | captions = zip(df.index, df["caption"], [text_folder]*len(df))
90 | pool = Pool(thread_count)
91 | for _ in tqdm(pool.imap_unordered(load_caption, captions), total=len(df)):
92 | pass
93 | pool.close()
94 | print('Done with captions!')
95 |
96 | skiplist = list_ids(skip_folder) + list_ids(image_folder)
97 | remaining = total - len(skiplist)
98 | percent_remaining = 100 * (total - remaining) / total
99 |
100 | df = dfc.loc[~dfc.index.isin(skiplist)]
101 | print('Remaining {} images to be downloaded - {} ({:.5f} %) already downloaded.'.format(remaining, len(skiplist), percent_remaining))
102 | images = list(zip(df.index, df["url"], [image_folder]*len(df), [skip_folder]*len(df)))
103 |
104 | for i in tqdm(range(0, len(df), batch)):
105 | pool = Pool(thread_count)
106 | for _ in tqdm(pool.imap_unordered(load_image, images[i:i+batch]), total=batch):
107 | pass
108 | pool.terminate()
109 | pool.join()
110 | del pool
111 | gc.collect()
112 |
113 | print('Finished downloading available images from conceptual images!')
114 |
--------------------------------------------------------------------------------
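For reference, the folder/filename scheme used by load_caption() and load_image() above maps each row index to a zero-padded subfolder of 10,000 entries and a nine-digit filename. A small illustration of the same arithmetic:

```python
# Illustration of the subfolder/filename scheme used in cc12m.py above (same arithmetic, for reference only).
def cc12m_relative_path(name: int, ext: str = "txt") -> str:
    subdir = str(name // 10000).zfill(5)          # 10,000 items per subfolder, e.g. 123456 -> "00012"
    filename = str(name).zfill(9) + "." + ext     # nine-digit id, e.g. 123456 -> "000123456.txt"
    return subdir + "/" + filename

assert cc12m_relative_path(123456) == "00012/000123456.txt"
assert cc12m_relative_path(7, "jpg") == "00000/000000007.jpg"
```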
/general/cc3m.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from pathlib import Path
3 | from PIL import Image
4 | from tqdm import tqdm
5 | import requests
6 | import os
7 | from pandarallel import pandarallel
8 |
9 | ### separator = |
10 |
11 | ##### https://ai.google.com/research/ConceptualCaptions/download
12 | ##### download url-caption dataset from
13 | ##### https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250
14 |
15 | DATASETFOLDER = 'content'
16 | DATASET = 'Train_GCC-training.tsv'
17 | FILEID = '1edNr-GEYz69RWcsSgskNzjtM--Qxepdz'
18 | URL = 'https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250'
19 |
20 | ##### download location of image-caption pairs
21 | PARENTPATH = 'output'
22 | TEXTFOLDER = 'texts'
23 | IMAGEFOLDER = 'images'
24 | PREFIX = ""
25 | CHECKALLFOLDERS = True
26 |
27 | KEEPTHESECOLS = ['caption', 'url']
28 | IMAGEFORMATS = ['jpg', 'jpeg', 'bmp', 'png']
29 | MAXWIDTH = 320
30 | MAXHEIGHT = 320
31 | CHUNKS = 500000
32 | THREAD_COUNT = 16
33 | HIDE_ERRORS = False
34 |
35 | os.makedirs(Path(DATASETFOLDER), exist_ok=True)
36 |
37 | #### Helper scripts to download url-caption dataset
38 | def download_file_from_google_drive(id, destination):
39 | URL = "https://docs.google.com/uc?export=download"
40 | session = requests.Session()
41 | response = session.get(URL, params = { 'id' : id }, stream = True)
42 | token = get_confirm_token(response)
43 | if token:
44 | params = { 'id' : id, 'confirm' : token }
45 | response = session.get(URL, params = params, stream = True)
46 |
47 | save_response_content(response, destination)
48 |
49 | def download_file(url, root_folder):
50 | response = requests.get(url, stream=True)
51 | total_size_in_bytes= int(response.headers.get('content-length', 0))
52 | block_size = 1024
53 | progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
54 | with open(Path(root_folder + '/cc3.tsv'), 'wb') as file:
55 | for data in response.iter_content(block_size):
56 | progress_bar.update(len(data))
57 | file.write(data)
58 | progress_bar.close()
59 | if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
60 | print("Error, something went wrong...")
61 |
62 | def get_confirm_token(response):
63 | for key, value in response.cookies.items():
64 | if key.startswith('download_warning'):
65 | return value
66 | return None
67 |
68 | def save_response_content(response, destination):
69 | CHUNK_SIZE = 32768
70 | with open(destination, "wb") as f:
71 | for chunk in response.iter_content(CHUNK_SIZE):
72 | if chunk: # filter out keep-alive new chunks
73 | f.write(chunk)
74 |
75 | if __name__ == '__main__':
76 | assert os.path.isfile(Path(DATASETFOLDER + '/' + DATASET)), '''
77 | #################################################################################################################
78 | Missing cc3m url-caption-dataset. Automatic downloading not supported yet.
79 | Download https://storage.cloud.google.com/gcc-data/Train/GCC-training.tsv?_ga=2.191230122.-1896153081.1529438250
80 | And put it into following folder: {}
81 | #################################################################################################################
82 | '''.format(DATASETFOLDER)
83 |
84 | pandarallel.initialize(nb_workers=THREAD_COUNT)
85 |
86 | ### downloading dataset and resizing images in parallel
87 | def write_files(x, folderpath):
88 | id = PREFIX + "0"*(8-len(str(x.name))) + str(x.name)
89 | try:
90 | foo = Image.open(requests.get(x.url, stream=True, timeout=4).raw)
91 | a = max(MAXWIDTH/foo.size[0], MAXHEIGHT/foo.size[1])
92 | foo = foo.resize((int(foo.size[0] * a), int(foo.size[1] * a)), Image.ANTIALIAS)
93 | foo.save(Path(folderpath + '/' + id + '.jpg'), optimize=True, quality=85)
94 | except Exception as exc:
95 | if not HIDE_ERRORS:
96 | print('Failed downloading {} with url {}'.format(id, x.url))
97 | print(exc)
98 | pass
99 | else:
100 | with open(Path(folderpath + '/' + id + '.txt'), 'w') as f:
101 | f.write(x.caption)
102 |
103 | os.makedirs(Path(PARENTPATH), exist_ok=True)
104 |
105 | keep_downloading = True
106 | if CHECKALLFOLDERS:
107 | batch = 0
108 | else:
109 | batch = len(os.listdir(Path(PARENTPATH))) - 1
110 | batch = 0 if batch == -1 else batch
111 |
112 | while keep_downloading:
113 | try:
114 | df = pd.read_csv(Path(DATASETFOLDER + '/' + DATASET), sep="\t", skiprows=range(0, batch * CHUNKS), nrows=CHUNKS, names=KEEPTHESECOLS)
115 | # df = pd.read_csv(Path(DATASETFOLDER + '/' + DATASET), sep="\t", skiprows=range(0, batch * CHUNKS), nrows=CHUNKS, names=KEEPTHESECOLS)
116 | df.index = [x + batch * CHUNKS for x in list(df.index)]
117 | folderid = str(PREFIX) + "0"*(4-len(str(batch))) + str(batch)
118 | folderpath = PARENTPATH + '/' + folderid
119 | os.makedirs(folderpath, exist_ok=True)
120 | skip = list(set([int(x[1:-4]) for x in os.listdir(folderpath)]))
121 | df = df[~df.index.isin(skip)]
122 | print('Saving {} images to {}.'.format(len(df), folderpath))
123 | print('Skipping {} already downloaded urls.'.format(len(skip)))
124 | df.apply(lambda x: write_files(x, folderpath), axis=1)
125 | # df.parallel_apply(lambda x: write_files(x, folderpath), axis=1)
126 | except Exception as excp:
127 | print('An error occurred trying to download the filtered dataframe.')
128 | print(excp)
129 | keep_downloading = False
130 | pass
131 | else:
132 | if len(df) == 0:
133 | print('Already finished downloading images of batch {}!'.format(batch))
134 | batch += 1
135 |
136 | print('Finished downloading dataset to {}.'.format(PARENTPATH))
137 |
--------------------------------------------------------------------------------
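The resize step in write_files() above uses the factor a = max(MAXWIDTH/width, MAXHEIGHT/height), i.e. the shorter side is scaled to the 320-pixel bound while the aspect ratio is preserved. A quick worked example of that factor (illustration only):

```python
# Worked example of the resize factor used in write_files() above.
MAXWIDTH, MAXHEIGHT = 320, 320

def resized_size(width: int, height: int) -> tuple:
    a = max(MAXWIDTH / width, MAXHEIGHT / height)  # scale so the shorter side reaches 320
    return int(width * a), int(height * a)

print(resized_size(640, 480))   # (426, 320): landscape image, height becomes 320
print(resized_size(480, 960))   # (320, 640): portrait image, width becomes 320
```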
/general/filtered_yfcc100m.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from pathlib import Path
3 | from PIL import Image
4 | import requests
5 | import zipfile
6 | import os
7 | from pandarallel import pandarallel
8 |
9 | ##### url-caption dataset from https://github.com/christophschuhmann/4MC-4M-Image-Text-Pairs-with-CLIP-embeddings
10 | DATASETFOLDER = 'content'
11 | DATASETZIP = 'yfcc_filtered.zip'
12 | DATASET = 'yfcc_filtered.csv'
13 | FILEID = '1edNr-GEYz69RWcsSgskNzjtM--Qxepdz'
14 |
15 | ##### download location of image-caption pairs
16 | PARENTPATH = 'output'
17 | TEXTFOLDER = 'texts'
18 | IMAGEFOLDER = 'images'
19 | PREFIX = "F"
20 | CHECKALLFOLDERS = True
21 |
22 | KEEPTHESECOLS = ['final_caption', 'url']
23 | IMAGEFORMATS = ['jpg', 'jpeg', 'bmp', 'png']
24 | MAXWIDTH = 320
25 | MAXHEIGHT = 320
26 | CHUNKS = 100000
27 |
28 | os.makedirs(Path(DATASETFOLDER), exist_ok=True)
29 |
30 | #### Helper scripts to download url-caption dataset
31 | def download_file_from_google_drive(id, destination):
32 | URL = "https://docs.google.com/uc?export=download"
33 | session = requests.Session()
34 | response = session.get(URL, params = { 'id' : id }, stream = True)
35 | token = get_confirm_token(response)
36 | if token:
37 | params = { 'id' : id, 'confirm' : token }
38 | response = session.get(URL, params = params, stream = True)
39 |
40 | save_response_content(response, destination)
41 |
42 | def get_confirm_token(response):
43 | for key, value in response.cookies.items():
44 | if key.startswith('download_warning'):
45 | return value
46 | return None
47 |
48 | def save_response_content(response, destination):
49 | CHUNK_SIZE = 32768
50 | with open(destination, "wb") as f:
51 | for chunk in response.iter_content(CHUNK_SIZE):
52 | if chunk: # filter out keep-alive new chunks
53 | f.write(chunk)
54 |
55 | if not os.path.isfile(Path(DATASETFOLDER + '/' + DATASET)):
56 | if not os.path.isfile(Path(DATASETFOLDER + '/' + DATASETZIP)):
57 | download_file_from_google_drive(FILEID, Path(DATASETFOLDER + '/' + DATASETZIP))
58 |
59 | with zipfile.ZipFile(Path(DATASETFOLDER + '/' + DATASETZIP), 'r') as zip_ref:
60 | zipname = zip_ref.namelist()[0].split('/')[-1]
61 |
62 | with zipfile.ZipFile(Path(DATASETFOLDER + '/' + DATASETZIP), 'r') as zip_ref:
63 | zip_ref.extractall()
64 | os.rename(Path(DATASETFOLDER + '/' + zipname), Path(DATASETFOLDER + '/' + DATASET))
65 |
66 | pandarallel.initialize()
67 |
68 | ### downloading dataset and resizing images in parallel
69 | def write_files(x, folderpath):
70 | id = PREFIX + "0"*(8-len(str(x.name))) + str(x.name)
71 | try:
72 | foo = Image.open(requests.get(x.url, stream=True, timeout=4).raw)
73 | a = max(MAXWIDTH/foo.size[0], MAXHEIGHT/foo.size[1])
74 | foo = foo.resize((int(foo.size[0] * a), int(foo.size[1] * a)), Image.ANTIALIAS)
75 | foo.save(Path(folderpath + '/' + id + '.jpg'), optimize=True, quality=85)
76 | except Exception as exc:
77 | print('Failed downloading {} with url {}'.format(id, x.url))
78 | print(exc)
79 | pass
80 | else:
81 | with open(Path(folderpath + '/' + id + '.txt'), 'w') as f:
82 | f.write(x.final_caption)
83 |
84 | os.makedirs(Path(PARENTPATH), exist_ok=True)
85 |
86 | keep_downloading = True
87 | if CHECKALLFOLDERS:
88 | batch = 0
89 | else:
90 | batch = len(os.listdir(Path(PARENTPATH))) - 1
91 | batch = 0 if batch == -1 else batch
92 |
93 | while keep_downloading:
94 | try:
95 | df = pd.read_csv(Path(DATASETFOLDER + '/' + DATASET), sep="|", skiprows=range(1, batch * CHUNKS + 1), nrows=CHUNKS, header=0, usecols=KEEPTHESECOLS)
96 | df.index = [x + batch * CHUNKS for x in list(df.index)]
97 | folderid = PREFIX + "0"*(4-len(str(batch))) + str(batch)
98 | folderpath = PARENTPATH + '/' + folderid
99 | os.makedirs(folderpath, exist_ok=True)
100 | skip = list(set([int(x[1:-4]) for x in os.listdir(folderpath)]))
101 | df = df[~df.index.isin(skip)]
102 | print('Saving {} images to {}.'.format(len(df), folderpath))
103 | print('Skipping {} already downloaded urls.'.format(len(skip)))
104 | df.parallel_apply(lambda x: write_files(x, folderpath), axis=1)
105 | except Exception as excp:
106 | print('An error occurred trying to download the filtered dataframe.')
107 | print(excp)
108 | keep_downloading = False
109 | pass
110 | else:
111 | if len(df) == 0:
112 | print('Already finished downloading images of batch {}!'.format(batch))
113 | batch += 1
114 |
115 | print('Finished downloading dataset to {}.'.format(PARENTPATH))
116 |
--------------------------------------------------------------------------------
/general/helper_scripts/wit_clip_class.py:
--------------------------------------------------------------------------------
1 | import os
2 | import clip
3 | import torch
4 | from PIL import Image
5 | from multiprocessing import cpu_count
6 | from multiprocessing.queues import JoinableQueue
7 | from svglib.svglib import svg2rlg
8 | from reportlab.graphics import renderPM
9 |
10 | device = "cpu" # "cuda" if torch.cuda.is_available() else "cpu"
11 | use_jit = False # torch.cuda.is_available()
12 |
13 | class CLIP:
14 | def __init__(self):
15 | self.model, self.preprocess = clip.load("ViT-B/32", device=device, jit=use_jit)
16 | self.tokenizer = clip.tokenize
17 |
18 | def return_similarities(self, image, captions, image_url):
19 | if '.svg' in image_url:
20 | svgname = image_url.split('/')[-1]
21 | pngname = svgname[:-4] + '.png'
22 | with open(svgname, 'wb') as f:
23 | f.write(image.content)
24 | svg_image = svg2rlg(svgname)
25 | renderPM.drawToFile(svg_image, pngname, fmt="PNG")
26 | openedImage = Image.open(pngname)
27 | image_tokens = self.preprocess(openedImage).unsqueeze(0).to(device)
28 | os.remove(svgname)
29 | os.remove(pngname)
30 | else:
31 | openedImage = Image.open(image.raw)
32 | image_tokens = self.preprocess(openedImage).unsqueeze(0).to(device)
33 | openedImage.close()
34 | logits = []
35 | for caption in captions:
36 | text_tokens = self.tokenizer(caption, context_length=77, truncate=True).to(device)
37 | with torch.no_grad():
38 | logits_per_image, _ = self.model(image_tokens, text_tokens)
39 | logits.append(list(torch.flatten(logits_per_image))[0].item())
40 | return logits, image_tokens
--------------------------------------------------------------------------------
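A minimal usage sketch for the CLIP wrapper above; the url and captions below are placeholders. return_similarities() expects the streamed requests response together with the URL (the URL is only inspected to special-case .svg files):

```python
# Hypothetical usage of the CLIP class defined in wit_clip_class.py above.
import requests
from wit_clip_class import CLIP

url = "https://example.com/some_image.jpg"           # placeholder image url
captions = ["a dog on a beach", "a diagram of a molecule"]

scorer = CLIP()
response = requests.get(url, stream=True, timeout=3)
logits, image_tokens = scorer.return_similarities(response, captions, url)
print(captions[logits.index(max(logits))])            # caption with the highest CLIP logit
```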
/general/helper_scripts/wit_dtype.py:
--------------------------------------------------------------------------------
1 | DTYPE = {
2 | 'language': str,
3 | 'page_url': str,
4 | 'image_url': str,
5 | 'page_title': str,
6 | 'section_title': str,
7 | 'hierarchical_section_title': str,
8 | 'caption_reference_description': str,
9 | 'caption_attribution_description': str,
10 | 'caption_alt_text_description': str,
11 | 'mime_type': str,
12 | 'original_height': int,
13 | 'original_width': int,
14 | 'is_main_image': bool,
15 | 'attribution_passes_lang_id': bool,
16 | 'page_changed_recently': str,
17 | 'context_page_description': str,
18 | 'context_section_description': str
19 | }
20 |
21 | DFLENGTH = {
22 | 'wit_v1.train.all-00004-of-00010.tsv.gz': 3701161,
23 | 'wit_v1.train.all-00001-of-00010.tsv.gz': 3702075,
24 | 'wit_v1.train.all-00005-of-00010.tsv.gz': 3708106,
25 | 'wit_v1.train.all-00006-of-00010.tsv.gz': 3704684,
26 | 'wit_v1.train.all-00002-of-00010.tsv.gz': 3701785,
27 | 'wit_v1.train.all-00007-of-00010.tsv.gz': 3703736,
28 | 'wit_v1.train.all-00008-of-00010.tsv.gz': 3705646,
29 | 'wit_v1.train.all-00000-of-00010.tsv.gz': 3708026,
30 | 'wit_v1.train.all-1percent_sample.tsv.gz': 370373,
31 | 'wit_v1.train.all-00003-of-00010.tsv.gz': 3706924
32 | }
33 |
34 | DFLENGTH_ENGLISH = {
35 | 'wit_v1.train.all-00004-of-00010.tsv.gz': 540463,
36 | 'wit_v1.train.all-00001-of-00010.tsv.gz': 542006,
37 | 'wit_v1.train.all-00005-of-00010.tsv.gz': 540982,
38 | 'wit_v1.train.all-00006-of-00010.tsv.gz': 540387,
39 | 'wit_v1.train.all-00002-of-00010.tsv.gz': 540499,
40 | 'wit_v1.train.all-00007-of-00010.tsv.gz': 541728,
41 | 'wit_v1.train.all-00008-of-00010.tsv.gz': 540557,
42 | 'wit_v1.train.all-00000-of-00010.tsv.gz': 542593,
43 | 'wit_v1.train.all-1percent_sample.tsv.gz': 54071,
44 | 'wit_v1.train.all-00003-of-00010.tsv.gz': 541391
45 | }
--------------------------------------------------------------------------------
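DTYPE, DFLENGTH and DFLENGTH_ENGLISH appear to be lookup tables for the WIT shards: column dtypes and precomputed row counts (total and English-only). A rough sketch of how the row counts could drive a progress bar while streaming one shard in chunks (the shard path is a placeholder):

```python
# Sketch only: stream one WIT shard in chunks, using DFLENGTH as the expected total for tqdm.
import pandas as pd
from tqdm import tqdm
from wit_dtype import DFLENGTH

shard = "wit_v1.train.all-1percent_sample.tsv.gz"   # placeholder path
english_rows = 0
with tqdm(total=DFLENGTH[shard]) as pbar:
    for chunk in pd.read_csv(shard, sep="\t", compression="gzip", chunksize=100000):
        english_rows += (chunk["language"] == "en").sum()
        pbar.update(len(chunk))

print(english_rows)   # should roughly match DFLENGTH_ENGLISH[shard]
```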
/general/helper_scripts/wit_image_downloader.py:
--------------------------------------------------------------------------------
1 | import os
2 | import requests
3 | from PIL import Image
4 |
5 | maxwidth = 256
6 | maxheight = 256
7 |
8 | def wit_download_image(url, saveimages=False):
9 | foo = requests.get(
10 | url,
11 | headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'},
12 | stream=True,
13 | timeout=3)
14 | if saveimages:
15 | with Image.open(foo.raw) as fooimage:
16 | a = max(maxwidth/fooimage.size[0], maxheight/fooimage.size[1])
17 | fooimage = fooimage.resize((int(fooimage.size[0] * a), int(fooimage.size[1] * a)), Image.ANTIALIAS)
18 | with open(os.path.join('./wit_images/', os.path.splitext(os.path.basename(url))[0] + '.jpg'), 'wb') as file:  # filename derived from the url basename (assumed, since no id is passed in)
19 | fooimage.save(file, optimize=True, quality=85)
20 | return foo
--------------------------------------------------------------------------------
/general/helper_scripts/wit_url_downloader.py:
--------------------------------------------------------------------------------
1 | import urllib, os
2 | from tqdm import tqdm
3 | import urllib.request
4 |
5 | def download_wit_urls(urlfolder='../wit_urls', onepercentsample=True):
6 | links = ["https://storage.googleapis.com/gresearch/wit/wit_v1.train.all-0000{}-of-00010.tsv.gz".format(i) for i in range(9)]
7 | if onepercentsample:
8 | links = ["https://storage.googleapis.com/gresearch/wit/wit_v1.train.all-1percent_sample.tsv.gz"]
9 | filenames = [link.split('/')[-1] for link in links]
10 | os.makedirs(urlfolder, exist_ok=True)
11 |
12 | class TqdmUpTo(tqdm):
13 | def update_to(self, b=1, bsize=1, tsize=None):
14 | if tsize is not None:
15 | self.total = tsize
16 | return self.update(b * bsize - self.n)
17 |
18 | for witurl, filename in zip(links, filenames):
19 | filepath = os.path.join(urlfolder, filename)
20 | if not os.path.exists(filepath):
21 | with TqdmUpTo(unit='B', unit_scale=True, unit_divisor=1024, miniters=1,
22 | desc=witurl.split('/')[-1]) as t: # all optional kwargs
23 | urllib.request.urlretrieve(witurl, filename=filepath,
24 | reporthook=t.update_to, data=None)
25 | t.total = t.n
26 | else:
27 | print('{} already downloaded.'.format(filename))
--------------------------------------------------------------------------------
/general/openimages_labels.py:
--------------------------------------------------------------------------------
1 |
2 | import os
3 | import re
4 | import h5py
5 | import json
6 | from tqdm import trange
7 | import numpy as np
8 | import pandas as pd
9 | from tabulate import tabulate
10 |
11 | #############################################################################################
12 | ###### ATTENTION ############################################################################
13 | ###### You need to download class-descriptions-boxable.csv from the following website #######
14 | ###### https://storage.googleapis.com/openimages/v5/class-descriptions-boxable.csv ##########
15 | #############################################################################################
16 |
17 | def get_open_images_label_names():
18 | with open("./downsampled-open-images-v4/class-descriptions-boxable.csv", "r") as f:
19 | open_image_labels = {x.split(",")[0]: x.split(",")[1] for x in f.read().split("\n") if len(x)}
20 | return open_image_labels
21 |
22 | def get_open_images_labels(annotations_path):
23 | open_image_labels = get_open_images_label_names()
24 | df = pd.read_csv(annotations_path)
25 | image_to_labels = {}
26 | dropped = []
27 | pbar = trange(len(df.ImageID.unique()))
28 | path_f = "./downsampled-open-images-v4/256px/"
29 | if "validation" in annotations_path:
30 | path_f += "validation/"
31 | elif "train" in annotations_path:
32 | path_f += "train-256/"
33 | elif "test" in annotations_path:
34 | path_f += "test/"
35 | for _, (img_id, df_sub) in zip(pbar, df.groupby("ImageID")):
36 | path = f"{path_f}{img_id}.jpg"
37 | pbar.set_description(f"Loading {path[::-1][:40][::-1]}")
38 | high_conf = df_sub[df_sub.Confidence == 1].LabelName.values.tolist()
39 | low_conf = df_sub[df_sub.Confidence != 1].LabelName.values.tolist()
40 | if not high_conf or not os.path.exists(path):
41 | dropped.append(img_id)
42 | image_to_labels["open_images_" + img_id] = {
43 | "label": [
44 | [open_image_labels[x] for x in high_conf],
45 | [open_image_labels[x] for x in low_conf]
46 | ],
47 | "path": path
48 | }
49 | return image_to_labels, dropped
50 |
51 | # ---- Captions are generated using CaptionsGenerator
52 |
53 | class CaptionGenerator():
54 | templates_labels = [
55 | "a picture of {}",
56 | "a photo that has {}",
57 | "photo consisting of {}",
58 | "a low resolution photo of {}",
59 | "small photo of {}",
60 | "high resolution picture of {}",
61 | "low resolution picture of {}",
62 | "high res photo that has {}",
63 | "low res photo of {}",
64 | "{} in a photo",
65 | "{} in a picture",
66 | "rendered picture of {}",
67 | "jpeg photo of {}",
68 | "a cool photo of {}",
69 | "{} rendered in a picture",
70 | ]
71 |
72 | templates_maybe = [
73 | *[x + " and maybe containing {}" for x in templates_labels],
74 | *[x + " and possibly containing {}" for x in templates_labels],
75 | *[x + " and {} but not sure" for x in templates_labels],
76 | *[x + " also roughly {}" for x in templates_labels],
77 | ]
78 |
79 | captions_templates = {
80 | "open_images": [templates_labels, templates_maybe],
81 | }
82 |
83 | def __init__(self):
84 | self.ds_names = list(self.captions_templates.keys())
85 |
86 | def generate_open_images_caption(self, ds):
87 | temps_high, temps_low = self.captions_templates["open_images"]
88 | captions = {}
89 | for i,k in enumerate(ds):
90 | high_conf = ", ".join(ds[k]["label"][0])
91 | if np.random.random() > 0.5:
92 | low_conf = ", ".join(ds[k]["label"][1])
93 | temp = np.random.choice(temps_low, size=1)[0]
94 | cap = temp.format(high_conf, low_conf)
95 | else:
96 | temp = np.random.choice(temps_high, size = 1)[0]
97 | cap = temp.format(high_conf)
98 | cap = re.sub(r"\s+", " ", cap).strip().lower()
99 | captions["open_images_" + str(k)] = {
100 | "path": ds[k]["path"],
101 | "caption": cap
102 | }
103 | return captions
104 |
105 | def generate_captions(self, ds, ds_name):
106 | print("Generating captions for", ds_name)
107 | if ds_name not in self.ds_names:
108 | raise ValueError(f"{ds_name} not in {self.ds_names}")
109 |
110 | if ds_name == "open_images":
111 | return self.generate_open_images_caption(ds)
112 |
113 | temps = []
114 | for temp in self.captions_templates[ds_name]:
115 | temps.extend(temp)
116 |
117 | # each ds: {: {"path": , "label": [