├── requirements.txt
├── README.md
├── prepare_stats.py
└── downloader.py

/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.16.2
matplotlib==3.0.3
requests==2.21.0

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ImageNet Downloader

This is an ImageNet dataset downloader. **You can create new datasets from subsets of ImageNet by specifying how many
classes you need and how many images per class you need.**
This is achieved by using the image URLs provided by the ImageNet API.


[In this blog post](https://towardsdatascience.com/how-to-scrape-the-imagenet-f309e02de1f4) I describe in a bit more detail how and why I wrote the tool. The post also includes a short analysis of the current state of the ImageNet URLs.

This software is written in Python 3.

## Usage


The following command will randomly select 100 ImageNet classes with at least 200 images each and start downloading:
```
python ./downloader.py \
-data_root /data_root_folder/imagenet \
-number_of_classes 100 \
-images_per_class 200
```


The following command will download 500 images from each of the selected classes:
```
python ./downloader.py \
-data_root /data_root_folder/imagenet \
-use_class_list True \
-class_list n09858165 n01539573 n03405111 \
-images_per_class 500
```
You can find the class list in [this csv](https://github.com/mf1024/ImageNet-datasets-downloader/blob/master/classes_in_imagenet.csv), where I list every class that appears in ImageNet together with the total number of urls and the number of flickr urls in that class.

## Multiprocessing workers

I've implemented parallel request processing and added the **multiprocessing_workers parameter**, which is 8 by default. You can turn it higher, but I haven't yet tested the limits of the allowed flickr bandwidth myself, so use it with care; you will have to find the limits yourself if you want to go for maximum speed.

You can do something like this:

```
python ./downloader.py \
-data_root /data_root_folder/imagenet \
-number_of_classes 1000 \
-images_per_class 500 \
-multiprocessing_workers 24
```

--------------------------------------------------------------------------------
/prepare_stats.py:
--------------------------------------------------------------------------------
import os
import requests
import csv
import codecs
import matplotlib.pyplot as plt
import json

DATA_ROOT = '/Users/martinsf/ai/deep_learning_projects/data'
URL_WORDNET = 'http://image-net.org/archive/words.txt'
IMAGENET_API_WNID_TO_URLS = lambda wnid: f'http://www.image-net.org/api/text/imagenet.synset.geturls?wnid={wnid}'

current_folder = os.path.dirname(os.path.realpath(__file__))

wordnet_filename = URL_WORDNET.split('/')[-1]
wordnet_file_path = os.path.join(current_folder, wordnet_filename)
print(wordnet_file_path)
if not os.path.exists(wordnet_file_path):

    print(f'Downloading {URL_WORDNET}')
    resp = requests.get(URL_WORDNET)

    with open(wordnet_file_path, "wb") as file:
        file.write(resp.content)

# Downloaded from http://image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz
url_list_filepath = '/Users/martinsf/ai/datasets/imagenet/fall11_urls.txt'
img_url_dict = dict()

total_urls = 0
flickr_urls = 0

# Go through the url list and count urls per class and flickr urls per class, store the info in a csv
with codecs.open(url_list_filepath, 'r', encoding='utf-8', errors='ignore') as f:
    it = 0
    for line in f:
        it += 1
        if it % 10000 == 0:
            print(it)
        row = line.split('\t')

        if len(row) != 2:
            continue
        id = row[0].split('_')[0]
        url = row[1]

        if id not in img_url_dict:
            img_url_dict[id] = dict(urls=0, flickr_urls=0)

        img_url_dict[id]['urls'] += 1
        total_urls += 1
        if 'flickr' in url:
            flickr_urls += 1
            img_url_dict[id]['flickr_urls'] += 1


wnid_to_class_dict = dict()
with open(wordnet_file_path, "r") as word_list_file:
    csv_reader_word_list = csv.reader(word_list_file, delimiter='\t')
    for row in csv_reader_word_list:
        wnid = row[0]
        keywords = row[1]
        wnid_to_class_dict[wnid] = keywords

class_info_json_filename = 'imagenet_class_info.json'
class_info_json_filepath = os.path.join(current_folder, class_info_json_filename)

img_counts = []
total_url_counts = []
flickr_url_counts = []

class_info_dict = dict()

with open("classes_in_imagenet.csv", "w") as csv_f:
    csv_writer = csv.writer(csv_f, delimiter=",")
    csv_writer.writerow(["synid", "class_name", "urls", "flickr_urls"])

    for key, val in img_url_dict.items():
        class_info_dict[key] = dict(
            img_url_count=val['urls'],
            flickr_img_url_count=val['flickr_urls'],
            class_name=wnid_to_class_dict[key].split(',')[0]
        )
        print(f"{wnid_to_class_dict[key]} {val['urls']}")
        total_url_counts.append(val['urls'])
        csv_writer.writerow([key, wnid_to_class_dict[key].split(',')[0], val['urls'], val['flickr_urls']])

        flickr_url_counts.append(val['flickr_urls'])


with open(class_info_json_filepath, "w") as class_info_json_f:
    json.dump(class_info_dict, class_info_json_f)
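
# For reference, a rough sketch of what a single entry in the generated
# imagenet_class_info.json looks like (the synset id and counts below are
# made-up illustrative values, not real statistics):
#
#   "n02084071": {
#       "img_url_count": 1500,
#       "flickr_img_url_count": 1200,
#       "class_name": "dog"
#   }
#
# downloader.py loads this file and uses img_url_count / flickr_img_url_count
# to decide which classes have enough urls to scrape.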

print(f'In total there are {total_urls} img urls and {flickr_urls} flickr urls')

plt.style.use('seaborn')
fig, axs = plt.subplots(3, 1)

plt.subplots_adjust(hspace=0.5)

axs[0].hist(total_url_counts, range=(500, 2000), bins=50, rwidth=0.8)
axs[0].set_title('All ImageNet urls')
axs[0].set_xticks([x for x in range(500, 2000, 150)])
axs[0].set_xlabel("Images per class")
axs[0].set_ylabel("Number of classes")

axs[1].set_title('Flickr ImageNet urls')
axs[1].hist(flickr_url_counts, range=(500, 2000), bins=50, rwidth=0.8)
axs[1].set_xticks([x for x in range(500, 2000, 150)])
axs[1].set_xlabel("Images per class")
axs[1].set_ylabel("Number of classes")

axs[2].set_title('Flickr ImageNet urls (reverse cumulative)')
axs[2].hist(flickr_url_counts, range=(500, 2000), bins=50, rwidth=0.8, cumulative=-1)
axs[2].set_xticks([x for x in range(500, 2000, 150)])
axs[2].set_xlabel("Images per class")
axs[2].set_ylabel("Number of classes")

plt.show()
--------------------------------------------------------------------------------
/downloader.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import os
import numpy as np
import requests
import argparse
import json
import time
import logging
import csv

from multiprocessing import Pool, Process, Value, Lock

from requests.exceptions import ConnectionError, ReadTimeout, TooManyRedirects, MissingSchema, InvalidURL

parser = argparse.ArgumentParser(description='ImageNet image scraper')
parser.add_argument('-scrape_only_flickr', default=True, type=lambda x: (str(x).lower() == 'true'))
parser.add_argument('-number_of_classes', default=10, type=int)
parser.add_argument('-images_per_class', default=10, type=int)
parser.add_argument('-data_root', default='', type=str)
parser.add_argument('-use_class_list', default=False, type=lambda x: (str(x).lower() == 'true'))
parser.add_argument('-class_list', default=[], nargs='*')
parser.add_argument('-debug', default=False, type=lambda x: (str(x).lower() == 'true'))

parser.add_argument('-multiprocessing_workers', default=8, type=int)

args, args_other = parser.parse_known_args()

if args.debug:
    logging.basicConfig(filename='imagenet_scraper.log', level=logging.DEBUG)

if len(args.data_root) == 0:
    logging.error("-data_root is required to run downloader!")
    exit()

if not os.path.isdir(args.data_root):
    logging.error(f'folder {args.data_root} does not exist! please provide an existing folder in the -data_root arg!')
    exit()


IMAGENET_API_WNID_TO_URLS = lambda wnid: f'http://www.image-net.org/api/imagenet.synset.geturls?wnid={wnid}'

current_folder = os.path.dirname(os.path.realpath(__file__))

class_info_json_filename = 'imagenet_class_info.json'
class_info_json_filepath = os.path.join(current_folder, class_info_json_filename)

class_info_dict = dict()

with open(class_info_json_filepath) as class_info_json_f:
    class_info_dict = json.load(class_info_json_f)

classes_to_scrape = []

if args.use_class_list:
    for item in args.class_list:
        if item not in class_info_dict:
            logging.error(f'Class {item} not found in ImageNet')
            exit()
        classes_to_scrape.append(item)

else:
    potential_class_pool = []
    for key, val in class_info_dict.items():

        if args.scrape_only_flickr:
            if int(val['flickr_img_url_count']) * 0.9 > args.images_per_class:
                potential_class_pool.append(key)
        else:
            if int(val['img_url_count']) * 0.8 > args.images_per_class:
                potential_class_pool.append(key)

    if len(potential_class_pool) < args.number_of_classes:
        logging.error(f"With {args.images_per_class} images per class there are only {len(potential_class_pool)} classes to choose from.")
        logging.error(f"Decrease the number of classes or the number of images per class.")
        exit()

    picked_classes_idxes = np.random.choice(len(potential_class_pool), args.number_of_classes, replace=False)

    for idx in picked_classes_idxes:
        classes_to_scrape.append(potential_class_pool[idx])


print("Picked the following classes:")
print([class_info_dict[class_wnid]['class_name'] for class_wnid in classes_to_scrape])

imagenet_images_folder = os.path.join(args.data_root, 'imagenet_images')
if not os.path.isdir(imagenet_images_folder):
    os.mkdir(imagenet_images_folder)


scraping_stats = dict(
    all=dict(
        tried=0,
        success=0,
        time_spent=0,
    ),
    is_flickr=dict(
        tried=0,
        success=0,
        time_spent=0,
    ),
    not_flickr=dict(
        tried=0,
        success=0,
        time_spent=0,
    )
)

def add_debug_csv_row(row):
    with open('stats.csv', "a") as csv_f:
        csv_writer = csv.writer(csv_f, delimiter=",")
        csv_writer.writerow(row)

class MultiStats():
    def __init__(self):

        self.lock = Lock()

        self.stats = dict(
            all=dict(
                tried=Value('d', 0),
                success=Value('d', 0),
                time_spent=Value('d', 0),
            ),
            is_flickr=dict(
                tried=Value('d', 0),
                success=Value('d', 0),
                time_spent=Value('d', 0),
            ),
            not_flickr=dict(
                tried=Value('d', 0),
                success=Value('d', 0),
                time_spent=Value('d', 0),
            )
        )

    def inc(self, cls, stat, val):
        with self.lock:
            self.stats[cls][stat].value += val

    def get(self, cls, stat):
        with self.lock:
            ret = self.stats[cls][stat].value
        return ret

multi_stats = MultiStats()


if args.debug:
    row = [
        "all_tried",
        "all_success",
        "all_time_spent",
        "is_flickr_tried",
        "is_flickr_success",
        "is_flickr_time_spent",
        "not_flickr_tried",
        "not_flickr_success",
        "not_flickr_time_spent"
    ]
    add_debug_csv_row(row)

def add_stats_to_debug_csv():
    row = [
        multi_stats.get('all', 'tried'),
        multi_stats.get('all', 'success'),
        multi_stats.get('all', 'time_spent'),
        multi_stats.get('is_flickr', 'tried'),
        multi_stats.get('is_flickr', 'success'),
        multi_stats.get('is_flickr', 'time_spent'),
        multi_stats.get('not_flickr', 'tried'),
        multi_stats.get('not_flickr', 'success'),
        multi_stats.get('not_flickr', 'time_spent'),
    ]
    add_debug_csv_row(row)

def print_stats(cls, print_func):

    actual_all_time_spent = time.time() - scraping_t_start.value
    processes_all_time_spent = multi_stats.get('all', 'time_spent')

    if processes_all_time_spent == 0:
        actual_processes_ratio = 1.0
    else:
        actual_processes_ratio = actual_all_time_spent / processes_all_time_spent

    #print(f"actual all time: {actual_all_time_spent} proc all time {processes_all_time_spent}")

    print_func(f'STATS For class {cls}:')
    print_func(f' tried {multi_stats.get(cls, "tried")} urls with'
               f' {multi_stats.get(cls, "success")} successes')

    if multi_stats.get(cls, "tried") > 0:
        print_func(f'{100.0 * multi_stats.get(cls, "success")/multi_stats.get(cls, "tried")}% success rate for {cls} urls ')
    if multi_stats.get(cls, "success") > 0:
        print_func(f'{multi_stats.get(cls,"time_spent") * actual_processes_ratio / multi_stats.get(cls,"success")} seconds spent per {cls} successful image download')



lock = Lock()
url_tries = Value('d', 0)
scraping_t_start = Value('d', time.time())
class_folder = ''
class_images = Value('d', 0)

def get_image(img_url):

    #print(f'Processing {img_url}')

    #time.sleep(3)

    if len(img_url) <= 1:
        return


    cls_imgs = 0
    with lock:
        cls_imgs = class_images.value

    if cls_imgs >= args.images_per_class:
        return

    logging.debug(img_url)

    cls = ''

    if 'flickr' in img_url:
        cls = 'is_flickr'
    else:
        cls = 'not_flickr'
        if args.scrape_only_flickr:
            return

    t_start = time.time()

    def finish(status):
        t_spent = time.time() - t_start
        multi_stats.inc(cls, 'time_spent', t_spent)
        multi_stats.inc('all', 'time_spent', t_spent)

        multi_stats.inc(cls, 'tried', 1)
        multi_stats.inc('all', 'tried', 1)

        if status == 'success':
            multi_stats.inc(cls, 'success', 1)
            multi_stats.inc('all', 'success', 1)

        elif status == 'failure':
            pass
        else:
            logging.error(f'No such status {status}!!')
            exit()
        return


    with lock:
        url_tries.value += 1
        if url_tries.value % 250 == 0:
            print(f'\nScraping stats:')
            print_stats('is_flickr', print)
            print_stats('not_flickr', print)
            print_stats('all', print)
            if args.debug:
                add_stats_to_debug_csv()

    try:
        img_resp = requests.get(img_url, timeout=1)
    except ConnectionError:
        logging.debug(f"Connection Error for url {img_url}")
        return finish('failure')
    except ReadTimeout:
        logging.debug(f"Read Timeout for url {img_url}")
        return finish('failure')
    except TooManyRedirects:
        logging.debug(f"Too many redirects {img_url}")
        return finish('failure')
    except MissingSchema:
        return finish('failure')
    except InvalidURL:
        return finish('failure')

    if 'content-type' not in img_resp.headers:
        return finish('failure')

    if 'image' not in img_resp.headers['content-type']:
        logging.debug("Not an image")
        return finish('failure')

    if len(img_resp.content) < 1000:
        return finish('failure')

    logging.debug(img_resp.headers['content-type'])
    logging.debug(f'image size {len(img_resp.content)}')

    img_name = img_url.split('/')[-1]
    img_name = img_name.split("?")[0]

    if len(img_name) <= 1:
        return finish('failure')

    img_file_path = os.path.join(class_folder, img_name)
    logging.debug(f'Saving image in {img_file_path}')

    with open(img_file_path, 'wb') as img_f:
        img_f.write(img_resp.content)

    with lock:
        class_images.value += 1

    logging.debug(f'Scraping stats')
    print_stats('is_flickr', logging.debug)
    print_stats('not_flickr', logging.debug)
    print_stats('all', logging.debug)

    return finish('success')


for class_wnid in classes_to_scrape:

    class_name = class_info_dict[class_wnid]["class_name"]
    print(f'Scraping images for class "{class_name}"')
    url_urls = IMAGENET_API_WNID_TO_URLS(class_wnid)

    time.sleep(0.05)
    resp = requests.get(url_urls)

    class_folder = os.path.join(imagenet_images_folder, class_name)
    if not os.path.exists(class_folder):
        os.mkdir(class_folder)

    class_images.value = 0

    urls = [url.decode('utf-8') for url in resp.content.splitlines()]

    #for url in urls:
    #    get_image(url)

    print(f"Multiprocessing workers: {args.multiprocessing_workers}")
    with Pool(processes=args.multiprocessing_workers) as p:
        p.map(get_image, urls)

--------------------------------------------------------------------------------