├── .gitignore
├── LICENSE
├── README.md
├── Reddit_image_scraper.py
├── config.ini
└── howtoscrape.gif
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/.idea
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Dev Daksan P S

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reddit Image Scraper

## Description

Reliably scrape multiple subreddits and users for multiple file formats.

## Original

https://github.com/D3vd/Reddit_Image_Scraper

## New Features

This version supersedes the original template with many new features:

* Auto-blacklisting of low-quality images
* Auto-blacklisting of dead links
* User-defined query timeout (how long to wait between each query)
* User-defined API error timeout (this seems to help overall speed)
* User-defined query quantity (how many queries per category per sub)
* User-defined minimum file size (smaller files are deleted and blacklisted after download)
* De-duplication of downloaded files (the same file is never downloaded twice)
* Files are sorted into their respective folders
* Logging of progress and of every file downloaded
* Logs the time taken per sub, per category

Best of all, it's very easy to set up.

## Prerequisites / Packages Used

Make sure these libraries are installed before running the program. The third-party ones can be installed with `python3 -m pip install praw blake3`.

* [PRAW](https://pypi.org/project/praw/)
* [blake3](https://pypi.org/project/blake3/)
* [configparser](https://pypi.org/project/configparser/) (part of the Python 3 standard library)
* [urllib.request](https://docs.python.org/3/library/urllib.request.html) (part of the Python 3 standard library)

## First time running

### Run it once

1. Run the program once. It will create the starter files you need (`config.ini`, `subs.txt`, `users.txt`).

### Get an API key by "creating an app"

2. Go to this [link](https://www.reddit.com/prefs/apps/).
3. Press the **Create an app** button at the bottom.
4. Give your app a name and a description.
5. Choose 'script' in the app type section.

### Back in the program

6. Put the client ID and secret in `config.ini` (see the example below).
7. Add some subreddits to your `subs.txt` (and, optionally, redditors to `users.txt`).
8. Run `python3 Reddit_image_scraper.py`.
9. Check the `./result` directory for your images (user downloads land in `./users`).
10. Check the `./logs` folder for history / troubleshooting on your recent runs.
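
A filled-in `config.ini` looks like the template the script generates. The ID and secret below are placeholders; the other values are the generated defaults:

```
[ALPHA]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
query_limit=2000
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0
```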

## Warnings

Some best practices:

* Don't run more than one instance at a time. Your API key will get rate-limited and both runs may end up even slower.
* DO NOT SHARE your API keys or upload them anywhere public, GitHub included. Treat them like a username and password.

## Automating the script

Example **crontab** entry, if you'd like one.

Runs once a day at 00:00 (the machine's local time).

```00 00 * * * cd /path/to/script/Reddit_Image_Scraper-master && python3 Reddit_image_scraper.py```


# Gif Demo

![](./howtoscrape.gif)

--------------------------------------------------------------------------------
/Reddit_image_scraper.py:
--------------------------------------------------------------------------------
# Reddit Image Sync
# https://github.com/crawsome/

# Scrapes Reddit for the subs defined in subs.txt and the users defined in users.txt for all possible images, videos, etc.
# Downloads them, de-duplicates them, and badlists any duplicates.
# Separates users and subs by folder.
# Creates a log file for each run.
# Maintains a badlist.txt of files that were removed from the download.


# Reminder: this is not meant to be a fast operation. Each download waits 2 seconds to avoid rate limiting / premature disconnects.
# Set it and forget it overnight.

# install "praw" and "blake3" with pip
import praw
import configparser
import urllib.request
import re
import os
from blake3 import blake3
import threading

from prawcore.exceptions import Redirect
from prawcore.exceptions import ResponseException
from urllib.error import HTTPError
import http.client
import time
from time import sleep
from datetime import datetime

os.environ["PYTHONHASHSEED"] = "0"  # note: has no effect once the interpreter is already running

now = datetime.now()
logfile_date = now.strftime("%d-%m-%Y-%H.%M.%S")

# shared state for the multithreaded dedupe below
our_hashes = []
our_hashes_lock = threading.Lock()


def create_directories():
    if not os.path.exists('./users'):
        print('users directory created. Is this your first time running?')
        os.mkdir('./users/')

    if not os.path.exists('./logs'):
        print('log directory created. Is this your first time running?')
        os.mkdir('./logs')

    if not os.path.exists('./result'):
        print('result directory created. Is this your first time running?')
        os.mkdir('./result/')

    if not os.path.exists('./subs.txt'):
        with open('./subs.txt', 'w') as f:
            log('subs.txt created. You need to add subreddits to it now.')
            f.write(
                '# add your subs here. you can comment out subs like this line, as well, if you\'d prefer to not download some subs.')
    if not os.path.exists('./users.txt'):
        with open('./users.txt', 'w') as f:
            log('users.txt created. You need to add redditors to it now.')
            f.write(
                '# add your redditors here. you can comment out redditors like this line, as well, if you\'d prefer to not download some redditors.')
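

# Example of what subs.txt might look like once edited (editor's illustration; the names are arbitrary):
#   # lines containing '#' are treated as comments and skipped
#   earthporn
#   #pics   <- commented out, will not be downloaded
#   wallpapers
# users.txt follows the same format, one redditor name per line.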


def sort_text_file(file):
    with open(file, 'r') as f:
        lines = f.readlines()
    lines.sort(key=lambda x: x.lower())
    with open(file, 'w') as f:
        f.writelines(lines)


# not called from __main__ yet
def delete_duplicates_by_hash_multithreaded(directory):
    threads = []
    for root, subdirs, files in os.walk(directory):
        for file in files:
            fullpath = os.path.join(root, file)
            if 'gitignore' in fullpath:
                continue
            t = threading.Thread(target=delete_duplicates_by_hash_thread, args=(fullpath,))
            threads.append(t)
            t.start()
    for t in threads:
        t.join()


# not called from __main__ yet; uses the module-level our_hashes list guarded by our_hashes_lock
def delete_duplicates_by_hash_thread(fullpath):
    hasher = blake3()
    with open(fullpath, 'rb') as f:
        hasher.update(f.read())
    next_digest = hasher.digest()
    with our_hashes_lock:
        if next_digest in our_hashes:
            duplicate = True
        else:
            our_hashes.append(next_digest)
            duplicate = False
    if duplicate:
        os.remove(fullpath)
        log('removed duplicate file: {}'.format(fullpath))
        # extract just the filename from fullpath
        filename = os.path.basename(fullpath)
        add_to_badlist(filename)


# single-folder de-duplication
def delete_duplicates_by_hash(directory):
    seen_hashes = []
    for root, subdirs, files in os.walk(directory):
        for file in files:
            fullpath = os.path.join(root, file)
            hasher = blake3()
            with open(fullpath, 'rb') as f:
                hasher.update(f.read())
            next_digest = hasher.digest()
            if next_digest in seen_hashes:
                os.remove(fullpath)
                log('removed duplicate file: {}'.format(fullpath))
                # extract just the filename from fullpath
                filename = os.path.basename(fullpath)
                add_to_badlist(filename)
            else:
                seen_hashes.append(next_digest)


# Does two levels of directory traversal (e.g. ./result/<sub>/<files>).
def delete_duplicates_by_hash_2Deep(directory):
    for root, subdirs, files in os.walk(directory):
        dir_count = 0
        subdirs = sorted(subdirs)
        for subdir in subdirs:
            log('checking subdir: {}: {}/{}'.format(subdir, dir_count, len(subdirs)))
            seen_hashes = []
            dupe_count = 0
            for subroot, subsubdirs, subfiles in os.walk(os.path.join(directory, subdir)):
                log('starting: {} dedupe'.format(subdir))
                for file in sorted(subfiles, reverse=True):
                    hasher = blake3()
                    with open(os.path.join(directory, subdir, file), 'rb') as f:
                        hasher.update(f.read())
                    next_digest = hasher.digest()
                    if next_digest in seen_hashes:
                        os.remove(os.path.join(directory, subdir, file))
                        log('removed duplicate file: {}'.format(os.path.join(directory, subdir, file)))
                        add_to_badlist(file)
                        dupe_count += 1
                    else:
                        seen_hashes.append(next_digest)
            log('finished folder: {}'.format(subdir))
            log('duplicates removed: {}'.format(dupe_count))
            dir_count += 1
        return


def log(text):
    print(text)
    now = datetime.now()
    event_time = now.strftime("%d-%m-%Y-%H.%M.%S")
    with open('./logs/{}.txt'.format(logfile_date), 'a') as logfile:
        logfile.write('{}: {}\n'.format(event_time, text))


class ClientInfo:
    id = ''
    secret = ''
    user_agent = 'Reddit_Image_Scraper'


def badlist_cleanup(minimum_file_size_kb):
    """Removes undersized files, then cleans and de-dupes the badlist."""

    # crawl the result tree recursively, find files smaller than minimum_file_size_kb,
    # add them to the badlist, then delete them.
    for root, subdirs, files in os.walk('./result'):
        for file in files:
            fullpath = os.path.join(root, file)
            filesize = os.path.getsize(fullpath) / 1000
            if 'gitignore' in fullpath:
                continue
            if filesize < minimum_file_size_kb:
                # remove anything before " - " in the filename
                file = re.sub(r'^.* - ', '', file)
                add_to_badlist(file)
                print(str(fullpath) + " \t" + str(filesize) + "KB")
                os.remove(fullpath)

    # create the badlist if it's not there
    if not os.path.exists('badlist.txt'):
        with open('badlist.txt', 'w') as f:
            log('badlist.txt created. Is this your first time running?')
            f.write('')

    # de-duplicate the badlist.
    with open('badlist.txt', 'r') as f:
        in_list = set(f.readlines())

    # re-write the badlist back to itself.
    with open('badlist.txt', 'w') as f:
        f.writelines(in_list)


def add_to_badlist(filename):
    with open('badlist.txt', 'a') as f:
        f.writelines(filename + '\n')
    log("added {} to badlist".format(filename))


def get_client_info():
    if not os.path.exists('config.ini'):
        with open('config.ini', 'w') as f:
            log('config.ini template created. Please paste in your client ID and secret. (And read the README.)')
            f.write("""[ALPHA]
client_id=PASTE ID HERE
client_secret=PASTE SECRET HERE
query_limit=2000
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0""")
    config = configparser.ConfigParser()
    config.read("config.ini")
    id = config["ALPHA"]["client_id"]
    secret = config["ALPHA"]["client_secret"]
    query_limit = config["ALPHA"]["query_limit"]
    ratelimit_sleep = config["ALPHA"]["ratelimit_sleep"]
    failure_sleep = config["ALPHA"]["failure_sleep"]
    minimum_file_size_kb = config["ALPHA"]["minimum_file_size_kb"]
    return id, secret, int(query_limit), int(ratelimit_sleep), int(failure_sleep), float(minimum_file_size_kb)


def is_media_file(uri):
    # print('Original Link: ' + uri)  # enable this if you want to log the literal URLs it sees
    regex = r'([.][\w]+)$'
    t = re.search(regex, uri)

    # fall back to the last 4 characters unless the regex finds a real extension
    ext = uri[-4:]
    if t:
        ext = t.group()
    if ext in ('.webm', '.gif', '.avi', '.mp4', '.jpg', '.jpeg', '.png', '.mov', '.ogg', '.wmv', '.mp2', '.mp3', '.mkv'):
        return True
    else:
        return False
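

# Example behaviour (editor's illustration; the URLs are made up):
#   is_media_file('https://i.redd.it/abcd1234.jpg')          -> True  (extension is in the allow-list)
#   is_media_file('https://www.reddit.com/gallery/abcd1234') -> False (no recognised media extension)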


def get_redditor_urls(redditor, limit):
    # TODO how do I automate these generators? (see the sketch after get_img_urls)
    time_filters = ['hour', 'month', 'all', 'week', 'year', 'day']
    try:
        all_start = time.time()
        if ClientInfo.id == 'PASTE ID HERE' or ClientInfo.secret == 'PASTE SECRET HERE':
            log('Error: Please enter your client ID and client secret into config.ini')
            return 0
        r = praw.Reddit(client_id=ClientInfo.id, client_secret=ClientInfo.secret, user_agent=ClientInfo.user_agent)

        start = time.time()
        submissions1 = [submission.url for submission in r.redditor(redditor).submissions.top(time_filter='all', limit=limit)]
        end = time.time()
        log('Query return time for Top/All:{},\nTotal Found: {}'.format(str(end - start), len(submissions1)))

        start = time.time()
        submissions2 = [submission.url for submission in r.redditor(redditor).submissions.new(limit=limit)]
        end = time.time()
        log('Query return time for new:{},\nTotal Found: {}'.format(str(end - start), len(submissions2)))

        start = time.time()
        submissions3 = [submission.url for submission in r.redditor(redditor).submissions.controversial(limit=limit)]
        end = time.time()
        log('Query return time for controversial:{},\nTotal Found: {}'.format(str(end - start), len(submissions3)))

        # combine them all, and de-duplicate them
        submissions = list(set(submissions1 + submissions2 + submissions3))

        log('total unique submissions: {}'.format(len(submissions)))

        all_end = time.time()
        log('Query return time for :{}: {}'.format(str(redditor), str(all_end - all_start)))
        return submissions

    except Redirect:
        log("get_redditor_urls() Redirect. Invalid redditor?")
        return 0

    except HTTPError:
        log("get_redditor_urls() HTTPError in last query")
        sleep(10)
        return 0

    except ResponseException:
        log("get_redditor_urls() ResponseException.")
        return 0
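

# Note (editor's comment): Reddit listings return at most roughly 1000 items each, regardless of the
# limit you request. That is why this script merges several listings per user (top/new/controversial
# above) and several time filters per subreddit (in get_img_urls below) before de-duplicating.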


def get_img_urls(sub, limit):
    # TODO how do I automate these generators? (see the sketch below this function)
    time_filters = ['hour', 'month', 'all', 'week', 'year', 'day']
    try:
        all_start = time.time()
        if ClientInfo.id == 'PASTE ID HERE' or ClientInfo.secret == 'PASTE SECRET HERE':
            log('Error: Please enter your client ID and client secret into config.ini')
            return 0
        r = praw.Reddit(client_id=ClientInfo.id, client_secret=ClientInfo.secret, user_agent=ClientInfo.user_agent)

        start = time.time()
        submissions = [submission.url for submission in r.subreddit(sub).top(time_filter='all', limit=limit)]
        end = time.time()
        log('Query return time for ALL:{},\nTotal Found: {}'.format(str(end - start), len(submissions)))

        start = time.time()
        submissions2 = [submission.url for submission in r.subreddit(sub).top(time_filter='year', limit=limit)]
        end = time.time()
        log('Query return time for year:{},\nTotal Found: {}'.format(str(end - start), len(submissions2)))

        start = time.time()
        submissions3 = [submission.url for submission in r.subreddit(sub).top(time_filter='month', limit=limit)]
        end = time.time()
        log('Query return time for month:{},\nTotal Found: {}'.format(str(end - start), len(submissions3)))

        start = time.time()
        submissions4 = [submission.url for submission in r.subreddit(sub).top(time_filter='week', limit=limit)]
        end = time.time()
        log('Query return time for week:{},\nTotal Found: {}'.format(str(end - start), len(submissions4)))

        start = time.time()
        submissions5 = [submission.url for submission in r.subreddit(sub).top(time_filter='hour', limit=limit)]
        end = time.time()
        log('Query return time for hour:{},\nTotal Found: {}'.format(str(end - start), len(submissions5)))

        start = time.time()
        submissions6 = [submission.url for submission in r.subreddit(sub).top(time_filter='day', limit=limit)]
        end = time.time()
        log('Query return time for day:{},\nTotal Found: {}'.format(str(end - start), len(submissions6)))

        start = time.time()
        submissions7 = [submission.url for submission in r.subreddit(sub).hot(limit=limit)]
        end = time.time()
        log('Query return time for HOT:{},\nTotal Found: {}'.format(str(end - start), len(submissions7)))

        start = time.time()
        submissions8 = [submission.url for submission in r.subreddit(sub).new(limit=limit)]
        end = time.time()
        log('Query return time for NEW:{},\nTotal Found: {}'.format(str(end - start), len(submissions8)))

        start = time.time()
        submissions9 = [submission.url for submission in r.subreddit(sub).rising(limit=limit)]
        end = time.time()
        log('Query return time for RISING:{},\nTotal Found: {}'.format(str(end - start), len(submissions9)))

        # combine them all, and de-duplicate them
        submissions = list(set(
            submissions + submissions2 + submissions3 + submissions4 + submissions5 + submissions6 + submissions7 + submissions8 + submissions9))

        log('total unique submissions: {}'.format(len(submissions)))

        all_end = time.time()
        log('Query return time for :{}: {}'.format(str(sub), str(all_end - all_start)))
        return submissions

    except Redirect:
        log("get_img_urls() Redirect. Invalid Subreddit?")
        return 0

    except HTTPError:
        log("get_img_urls() HTTPError in last query")
        sleep(10)
        return 0

    except ResponseException:
        log("get_img_urls() ResponseException.")
        return 0
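

# --- Editor's sketch, not part of the original script and not called anywhere. ---
# One possible answer to the TODO above: describe the listings as (label, generator) pairs
# and loop over them instead of copy-pasting one block per listing. The helper name and the
# usage shown below are hypothetical.
def collect_unique_urls(labelled_listings):
    """labelled_listings: iterable of (label, PRAW ListingGenerator) pairs. Returns de-duplicated URLs."""
    urls = []
    for label, listing in labelled_listings:
        start = time.time()
        found = [submission.url for submission in listing]
        log('Query return time for {}:{},\nTotal Found: {}'.format(label, time.time() - start, len(found)))
        urls.extend(found)
    return list(set(urls))
# Hypothetical usage inside get_img_urls, mirroring the hard-coded blocks above:
#   listings = [(tf, r.subreddit(sub).top(time_filter=tf, limit=limit)) for tf in time_filters]
#   listings += [('hot', r.subreddit(sub).hot(limit=limit)),
#                ('new', r.subreddit(sub).new(limit=limit)),
#                ('rising', r.subreddit(sub).rising(limit=limit))]
#   submissions = collect_unique_urls(listings)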


def download_img(img_url, img_title, file_loc, sub, ratelimit_sleep: int, failure_sleep: int):
    # print(img_url + ' ' + img_title + ' ' + file_loc)
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib.request.install_opener(opener)
    try:
        log('DL From: {} - Filename: {} - URL:{}'.format(sub, file_loc, img_url))
        # u = urllib.request.urlopen(img_url)
        # u_metadata = u.info()
        # size = int(u_metadata.getheaders("Content-Length")[0])
        # print(size)

        urllib.request.urlretrieve(img_url, file_loc)
        sleep(ratelimit_sleep)  # remove this at your own risk. This is what lets you download a whole sub without being rate-limited.
        return 0

    except HTTPError as e:
        log("download_img() HTTPError in last query (file might not exist anymore, or malformed URL)")
        add_to_badlist(img_title)
        log(e)
        sleep(failure_sleep)
        return 1

    except urllib.error.URLError as e:
        log("download_img() URLError! Site might be offline")
        log(e)
        add_to_badlist(img_title)
        sleep(ratelimit_sleep)
        return 1

    except http.client.InvalidURL as e:
        log('download_img() invalid URL')
        log(e)
        add_to_badlist(img_title)
        return 1
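

# Example call (editor's illustration; the URL and paths are made up):
#   download_img('https://i.redd.it/abcd1234.jpg', 'abcd1234.jpg', 'result/pics/abcd1234.jpg', 'pics', 2, 10)
# Returns 0 on success and 1 on any handled failure, in which case the filename is also badlisted.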
Invalid Subreddit?") 354 | return 0 355 | 356 | except HTTPError: 357 | log("get_img_urls() HTTPError in last query") 358 | sleep(10) 359 | return 0 360 | 361 | except ResponseException: 362 | log("get_img_urls() ResponseException.") 363 | return 0 364 | 365 | 366 | def download_img(img_url, img_title, file_loc, sub, ratelimit_sleep: int, failure_sleep: int): 367 | # print(img_url + ' ' + img_title + ' ' + file_loc) 368 | opener = urllib.request.build_opener() 369 | opener.addheaders = [('User-agent', 'Mozilla/5.0')] 370 | urllib.request.install_opener(opener) 371 | try: 372 | log('DL From: {} - Filename: {} - URL:{}'.format(sub, file_loc, img_url)) 373 | # u = urllib.request.urlopen(img_url) 374 | # u_metadata = u.info() 375 | # size = int(u_metadata.getheaders("Content-Length")[0]) 376 | # print(size) 377 | 378 | urllib.request.urlretrieve(img_url, file_loc) 379 | sleep(ratelimit_sleep) # remove this at your own risk. This is necessary so you can download the whole sub! 380 | return 0 381 | 382 | except HTTPError as e: 383 | log("download_img() HTTPError in last query (file might not exist anymore, or malformed URL)") 384 | add_to_badlist(img_title) 385 | log(e) 386 | sleep(failure_sleep) 387 | return 1 388 | 389 | except urllib.error.URLError as e: 390 | log("download_img() URLError! Site might be offline") 391 | log(e) 392 | add_to_badlist(img_title) 393 | sleep(ratelimit_sleep) 394 | return 1 395 | 396 | except http.client.InvalidURL as e: 397 | log('Crappy URL') 398 | log(e) 399 | add_to_badlist(img_title) 400 | return 1 401 | 402 | 403 | def read_img_links(submissions, url_list, user_submissions): 404 | sub = submissions.lower() 405 | if user_submissions: 406 | if not os.path.exists('./users/{}'.format(sub)): 407 | os.mkdir('./users/{}'.format(sub)) 408 | else: 409 | if not os.path.exists('./result'): 410 | os.mkdir('./result') 411 | if not os.path.exists('./result/{}'.format(sub)): 412 | os.mkdir('./result/{}'.format(sub)) 413 | 414 | url_list = [x.strip() for x in url_list] 415 | url_list.sort() 416 | download_count = 0 417 | exist_count = 0 418 | download_status = 0 419 | 420 | with open('badlist.txt', 'r') as f: 421 | badlist = f.readlines() 422 | badlist = [x.strip() for x in set(badlist)] 423 | 424 | for link in url_list: 425 | if 'gfycat.com' in link and '.gif' not in link[-4:] and '.webm' not in link[-4:]: 426 | # print(link[-4:]) 427 | # print('gfycat found:{}'.format(link)) 428 | link = link + '.gif' 429 | if not is_media_file(link): 430 | continue 431 | 432 | file_name = link.split('/')[-1] 433 | if file_name in badlist: 434 | # log('{} found in badlist, skipping'.format(file_name)) 435 | continue 436 | 437 | if user_submissions: 438 | file_loc = 'users/{}/{}'.format(sub, file_name) 439 | base = './users/' 440 | else: 441 | file_loc = 'result/{}/{}'.format(sub, file_name) 442 | base = './result/' 443 | if os.path.exists(file_name) or os.path.exists(file_loc): 444 | # log(file_name + ' already exists') 445 | exist_count += 1 446 | continue 447 | 448 | if not file_name: 449 | log(file_name + ' cannot download') 450 | continue 451 | 452 | download_status = download_img(link, file_name, file_loc, sub, ratelimit_sleep, failure_sleep) 453 | 454 | download_count += 1 455 | return download_count, download_status, exist_count 456 | 457 | 458 | if __name__ == '__main__': 459 | # Get client info 460 | ClientInfo.id, ClientInfo.secret, query_lookup_limit, ratelimit_sleep, failure_sleep, minimum_file_size_kb = get_client_info() 461 | 462 | # Create project directories 463 | 


if __name__ == '__main__':
    # Get client info
    ClientInfo.id, ClientInfo.secret, query_lookup_limit, ratelimit_sleep, failure_sleep, minimum_file_size_kb = get_client_info()

    # Create project directories
    create_directories()

    # Clean up undersized files into the badlist
    badlist_cleanup(minimum_file_size_kb)

    # Sort all our text files.
    sort_text_file('./subs.txt')
    sort_text_file('./users.txt')
    sort_text_file('./badlist.txt')

    # THIS TAKES A VERY VERY LONG TIME
    # delete_duplicates_by_hash('./users')
    # delete_duplicates_by_hash('./result')

    with open('./users.txt', 'r') as redditors_file:
        redditors = redditors_file.readlines()
    with open('./subs.txt', 'r') as subs_file:
        subreddits = subs_file.readlines()

    for redditor in redditors:
        if '#' in redditor or not redditor.strip():
            continue
        redditor = redditor.strip('\n').lower()
        log('Starting Retrieval from: ' + redditor)

        # redditor = input('Enter redditor: ')
        # query_lookup_limit = int(input('Enter the max amount of queries: '))
        url_list = get_redditor_urls(redditor, query_lookup_limit)

        if url_list:
            try:
                log('{} images found on {}'.format(len(url_list), redditor))
                count, status, already_here = read_img_links(redditor, url_list, True)
                log('Download Complete from {}\n{} - Images Downloaded\nQuery limit: {} \nAlready downloaded: {} \n'.format(
                    redditor, count, query_lookup_limit, already_here))
            except UnicodeEncodeError:
                log('UnicodeEncodeError:{}\n'.format(redditor))
            except OSError as e:
                log('OSError:{}\nVerbose:{}'.format(redditor, e))
        # confirm = input('confirm next redditor? CTRL+C to cancel.')
        delete_duplicates_by_hash('./users/{}'.format(redditor))

    for subreddit in subreddits:
        if '#' in subreddit or not subreddit.strip():
            continue
        subreddit = subreddit.strip('\n').lower()
        log('Starting Retrieval from: /r/' + subreddit)

        # subreddit = input('Enter Subreddit: ')
        # query_lookup_limit = int(input('Enter the max amount of queries: '))
        url_list = get_img_urls(subreddit, query_lookup_limit)

        if url_list:
            try:
                log('{} images found on {}'.format(len(url_list), subreddit))
                count, status, already_here = read_img_links(subreddit, url_list, False)
                log('Download Complete from {}\n{} - Images Downloaded\nQuery limit: {} \nAlready downloaded: {} \n'.format(
                    subreddit, count, query_lookup_limit, already_here))
            except UnicodeEncodeError:
                log('UnicodeEncodeError:{}\n'.format(subreddit))
            except OSError as e:
                log('OSError:{}\nVerbose:{}'.format(subreddit, e))
        # confirm = input('confirm next sub? CTRL+C to cancel.')
        delete_duplicates_by_hash('./result/{}'.format(subreddit))
--------------------------------------------------------------------------------
/config.ini:
--------------------------------------------------------------------------------
[ALPHA]
client_id=
client_secret=
query_limit=10
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0
--------------------------------------------------------------------------------
/howtoscrape.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/crawsome/Reddit_Image_Scraper/c0e6780f14689375f1b66b6a8d52a6458017e8e9/howtoscrape.gif
--------------------------------------------------------------------------------