├── README.md
└── download.py


/README.md:
--------------------------------------------------------------------------------
# Open Image Dataset Maker

How to find a label and download images from the [Open Images dataset](https://github.com/openimages/dataset).

The download script uses the multiprocessing library to speed up the download process.

### Step 1 - Find a label

Go to [Google BigQuery](https://bigquery.cloud.google.com) and run the following query to find a label name.

This snippet finds the "desert" label, but you could also search for just a few letters to broaden the search.

```
#standardsql
SELECT
  *
FROM
  `bigquery-public-data.open_images.dict`
WHERE
  label_display_name LIKE '%desert%'
LIMIT
  200;
```

### Step 2 - Get JSON File

Once you find a label, take note of its `label_name` value, then run the following in a separate query, substituting the label name from the first step.

You can also lower the confidence threshold if you aren't turning up enough results, or increase/decrease the limit to suit your needs.

When the query finishes, hit the "Download JSON" button to get a table of all the links.

```
#standardsql
SELECT
  i.image_id AS image_id,
  original_url,
  confidence
FROM
  `bigquery-public-data.open_images.labels` l
INNER JOIN
  `bigquery-public-data.open_images.images` i
ON
  l.image_id = i.image_id
WHERE
  label_name = '/m/0284w'
  AND confidence >= 0.8
  AND Subset = 'train'
LIMIT
  2000;
```

### Step 3 - Download the images

Run `download.py` with arguments for a destination folder and the location of the JSON file.

`python download.py --dest myNewDataset --json myJSONFile.json`

That's it! All the files will be saved to the folder you specified.
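Note: the exported file is newline-delimited JSON (one object per line), which is the format `download.py` parses. For illustration only (the values below are made up, not taken from the dataset), a line looks roughly like:

```
{"image_id": "0123456789abcdef", "original_url": "https://example.com/some-image.jpg", "confidence": 1.0}
```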
These SQL queries were pulled from [Google's Open Images page](https://cloud.google.com/bigquery/public-data/openimages).
--------------------------------------------------------------------------------
/download.py:
--------------------------------------------------------------------------------
import argparse
import json
import os
from functools import partial
from multiprocessing import Pool

import requests
from progress.bar import Bar


def download(url, dest):
    """Download a single image URL into the destination folder."""
    filename = os.path.basename(url)
    path = os.path.join(dest, filename)
    try:
        # Redirects are not followed; for these image hosts a redirect often
        # means the original photo is no longer available.
        response = requests.get(url, stream=True, timeout=5.0, allow_redirects=False)
        if response.status_code == 200 and response.headers.get('Content-Type') == 'image/jpeg':
            # Open the file only after a successful JPEG response, so failed
            # requests do not leave empty files behind.
            with open(path, 'wb') as f:
                for block in response.iter_content(1024):
                    f.write(block)
    except requests.exceptions.RequestException as e:
        print(e)


def main(args):
    os.makedirs(args.dest, exist_ok=True)

    # The BigQuery export is newline-delimited JSON: one object per line.
    with open(args.json, 'r') as f:
        urls = [json.loads(line)['original_url'] for line in f]

    bar = Bar('Downloading...', max=len(urls))
    with Pool() as pool:
        # Pass the destination explicitly so worker processes do not depend on globals.
        for _ in pool.imap(partial(download, dest=args.dest), urls):
            bar.next()
    bar.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Downloads images from the Open Images dataset.")
    parser.add_argument('-d', '--dest', required=True, help='output folder name')
    parser.add_argument('-j', '--json', required=True, help='JSON file with the image URLs')
    args = parser.parse_args()

    main(args)
--------------------------------------------------------------------------------
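A possible follow-up step, not part of this repository: a few downloads can still end up truncated if a connection drops mid-transfer, so a short Pillow-based pass can weed out files that are not readable images. This is a minimal sketch under that assumption; the folder name is the hypothetical one from the README example.

```
import os

from PIL import Image


def remove_unreadable_images(folder):
    """Delete any file in `folder` that Pillow cannot parse as an image."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        try:
            with Image.open(path) as image:
                image.verify()  # integrity check; does not decode the full image
        except Exception:
            print('Removing unreadable file:', path)
            os.remove(path)


if __name__ == '__main__':
    # 'myNewDataset' is the hypothetical folder name from the README example.
    remove_unreadable_images('myNewDataset')
```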