├── README.md
└── download.py


/README.md:
--------------------------------------------------------------------------------
# Open Image Dataset Maker

How to find a label and download images from the [Open Images dataset](https://github.com/openimages/dataset).

The download script uses the multiprocessing library to speed up the download process.

### Step 1 - Find a label

Go to [Google BigQuery](https://bigquery.cloud.google.com) and run the following query to find a label name.

This snippet finds the "desert" label, but you could also search for just a few letters to broaden the search.

```
#standardsql
SELECT
  *
FROM
  `bigquery-public-data.open_images.dict`
WHERE
  label_display_name LIKE '%desert%'
LIMIT
  200;
```

### Step 2 - Get JSON File

Once you find a label, take note of its `label_name` value, then run the following in a separate query, substituting the label name from the first step.

You can also lower the confidence threshold if you aren't turning up enough results, or increase/decrease the limit to suit your needs.

When the query finishes, hit the "Download JSON" button to get a table of all the links.

```
#standardsql
SELECT
  i.image_id AS image_id,
  original_url,
  confidence
FROM
  `bigquery-public-data.open_images.labels` l
INNER JOIN
  `bigquery-public-data.open_images.images` i
ON
  l.image_id = i.image_id
WHERE
  label_name = '/m/0284w'
  AND confidence >= 0.8
  AND Subset = 'train'
LIMIT
  2000;
```

### Step 3 - Download the images

Run `download.py` with arguments for a destination folder and the location of the JSON file.

`python download.py --dest myNewDataset --json myJSONFile.json`

That's it! All the files will be saved to the folder you specified.
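Note: the exported file is newline-delimited JSON (one object per line), which is the format `download.py` parses. For illustration only (the values below are made up, not taken from the dataset), a line looks roughly like:

```
{"image_id": "0123456789abcdef", "original_url": "https://example.com/some-image.jpg", "confidence": 1.0}
```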
These SQL queries were pulled from [Google's Open Images page](https://cloud.google.com/bigquery/public-data/openimages).
--------------------------------------------------------------------------------
/download.py:
--------------------------------------------------------------------------------
import argparse
import json
import os
from functools import partial
from multiprocessing import Pool

import requests
from progress.bar import Bar


def download(url, dest):
    """Download a single image URL into the destination folder."""
    filename = os.path.basename(url)
    path = os.path.join(dest, filename)
    try:
        # Redirects are not followed; for these image hosts a redirect often
        # means the original photo is no longer available.
        response = requests.get(url, stream=True, timeout=5.0, allow_redirects=False)
        if response.status_code == 200 and response.headers.get('Content-Type') == 'image/jpeg':
            # Open the file only after a successful JPEG response, so failed
            # requests do not leave empty files behind.
            with open(path, 'wb') as f:
                for block in response.iter_content(1024):
                    f.write(block)
    except requests.exceptions.RequestException as e:
        print(e)


def main(args):
    os.makedirs(args.dest, exist_ok=True)

    # The BigQuery export is newline-delimited JSON: one object per line.
    with open(args.json, 'r') as f:
        urls = [json.loads(line)['original_url'] for line in f]

    bar = Bar('Downloading...', max=len(urls))
    with Pool() as pool:
        # Pass the destination explicitly so worker processes do not depend on globals.
        for _ in pool.imap(partial(download, dest=args.dest), urls):
            bar.next()
    bar.finish()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Downloads images from the Open Images dataset.")
    parser.add_argument('-d', '--dest', required=True, help='output folder name')
    parser.add_argument('-j', '--json', required=True, help='JSON file with the image URLs')
    args = parser.parse_args()

    main(args)
--------------------------------------------------------------------------------
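A possible follow-up step, not part of this repository: a few downloads can still end up truncated if a connection drops mid-transfer, so a short Pillow-based pass can weed out files that are not readable images. This is a minimal sketch under that assumption; the folder name is the hypothetical one from the README example.

```
import os

from PIL import Image


def remove_unreadable_images(folder):
    """Delete any file in `folder` that Pillow cannot parse as an image."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        try:
            with Image.open(path) as image:
                image.verify()  # integrity check; does not decode the full image
        except Exception:
            print('Removing unreadable file:', path)
            os.remove(path)


if __name__ == '__main__':
    # 'myNewDataset' is the hypothetical folder name from the README example.
    remove_unreadable_images('myNewDataset')
```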