├── .gitignore
├── LICENSE
├── README.md
├── Reddit_image_scraper.py
├── config.ini
└── howtoscrape.gif
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/.idea
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Dev Daksan P S

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Reddit Image Scraper

## Description

Reliably scrape multiple subreddits and users for multiple file formats.

## Original

https://github.com/D3vd/Reddit_Image_Scraper

## New Features

This version supersedes the original template with many new features:

* Auto-blacklisting of low-quality images
* Auto-blacklisting of dead links
* User-defined query timeout (how long to wait between each query)
* User-defined API error timeout (this seems to help overall speed)
* User-defined query quantity (how many queries per category per sub)
* User-defined minimum file size (smaller files are deleted and blacklisted after download)
* De-duplication of downloaded files (the same file is never downloaded twice)
* Files are sorted into their respective folders
* Logging of progress and of every file downloaded
* Logs the time taken per sub, per category

Best of all, it's very easy to set up.

## Prerequisites / Packages Used

Make sure these libraries are installed before running the program. The third-party ones can be installed with `python3 -m pip install praw blake3`.

* [PRAW](https://pypi.org/project/praw/)
* [blake3](https://pypi.org/project/blake3/)
* [configparser](https://pypi.org/project/configparser/) (part of the Python 3 standard library)
* [urllib.request](https://docs.python.org/3/library/urllib.request.html) (part of the Python 3 standard library)

## First time running

### Run it once

1. Run the program once. It will create the starter files you need (`config.ini`, `subs.txt`, `users.txt`).

### Get an API key by "creating an app"

2. Go to this [link](https://www.reddit.com/prefs/apps/).
3. Press the **Create an app** button at the bottom.
4. Give your app a name and a description.
5. Choose 'script' in the app type section.

### Back in the program

6. Put the client ID and secret in `config.ini` (see the example below).
7. Add some subreddits to your `subs.txt` (and, optionally, redditors to `users.txt`).
8. Run `python3 Reddit_image_scraper.py`.
9. Check the `./result` directory for your images (user downloads land in `./users`).
10. Check the `./logs` folder for history / troubleshooting on your recent runs.
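
A filled-in `config.ini` looks like the template the script generates. The ID and secret below are placeholders; the other values are the generated defaults:

```
[ALPHA]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
query_limit=2000
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0
```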

## Warnings

Some best practices:

* Don't run more than one instance at a time. Your API key will get rate-limited and both runs may end up even slower.
* DO NOT SHARE your API keys or upload them anywhere public, GitHub included. Treat them like a username and password.

## Automating the script

Example **crontab** entry, if you'd like one.

Runs once a day at 00:00 (the machine's local time).

```00 00 * * * cd /path/to/script/Reddit_Image_Scraper-master && python3 Reddit_image_scraper.py```


# Gif Demo

![](./howtoscrape.gif)

--------------------------------------------------------------------------------
/Reddit_image_scraper.py:
--------------------------------------------------------------------------------
# Reddit Image Sync
# https://github.com/crawsome/

# Scrapes Reddit for the subs defined in subs.txt and the users defined in users.txt for all possible images, videos, etc.
# Downloads them, de-duplicates them, and badlists any duplicates.
# Separates users and subs by folder.
# Creates a log file for each run.
# Maintains a badlist.txt of files that were removed from the download.


# Reminder: this is not meant to be a fast operation. Each download waits 2 seconds to avoid rate limiting / premature disconnects.
# Set it and forget it overnight.

# install "praw" and "blake3" with pip
import praw
import configparser
import urllib.request
import re
import os
from blake3 import blake3
import threading

from prawcore.exceptions import Redirect
from prawcore.exceptions import ResponseException
from urllib.error import HTTPError
import http.client
import time
from time import sleep
from datetime import datetime

os.environ["PYTHONHASHSEED"] = "0"  # note: has no effect once the interpreter is already running

now = datetime.now()
logfile_date = now.strftime("%d-%m-%Y-%H.%M.%S")

# shared state for the multithreaded dedupe below
our_hashes = []
our_hashes_lock = threading.Lock()


def create_directories():
    if not os.path.exists('./users'):
        print('users directory created. Is this your first time running?')
        os.mkdir('./users/')

    if not os.path.exists('./logs'):
        print('log directory created. Is this your first time running?')
        os.mkdir('./logs')

    if not os.path.exists('./result'):
        print('result directory created. Is this your first time running?')
        os.mkdir('./result/')

    if not os.path.exists('./subs.txt'):
        with open('./subs.txt', 'w') as f:
            log('subs.txt created. You need to add subreddits to it now.')
            f.write(
                '# add your subs here. you can comment out subs like this line, as well, if you\'d prefer to not download some subs.')
    if not os.path.exists('./users.txt'):
        with open('./users.txt', 'w') as f:
            log('users.txt created. You need to add redditors to it now.')
            f.write(
                '# add your redditors here. you can comment out redditors like this line, as well, if you\'d prefer to not download some redditors.')
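

# Example of what subs.txt might look like once edited (editor's illustration; the names are arbitrary):
#   # lines containing '#' are treated as comments and skipped
#   earthporn
#   #pics   <- commented out, will not be downloaded
#   wallpapers
# users.txt follows the same format, one redditor name per line.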


def sort_text_file(file):
    with open(file, 'r') as f:
        lines = f.readlines()
    lines.sort(key=lambda x: x.lower())
    with open(file, 'w') as f:
        f.writelines(lines)


# not called from __main__ yet
def delete_duplicates_by_hash_multithreaded(directory):
    threads = []
    for root, subdirs, files in os.walk(directory):
        for file in files:
            fullpath = os.path.join(root, file)
            if 'gitignore' in fullpath:
                continue
            t = threading.Thread(target=delete_duplicates_by_hash_thread, args=(fullpath,))
            threads.append(t)
            t.start()
    for t in threads:
        t.join()


# not called from __main__ yet; uses the module-level our_hashes list guarded by our_hashes_lock
def delete_duplicates_by_hash_thread(fullpath):
    hasher = blake3()
    with open(fullpath, 'rb') as f:
        hasher.update(f.read())
    next_digest = hasher.digest()
    with our_hashes_lock:
        if next_digest in our_hashes:
            duplicate = True
        else:
            our_hashes.append(next_digest)
            duplicate = False
    if duplicate:
        os.remove(fullpath)
        log('removed duplicate file: {}'.format(fullpath))
        # extract just the filename from fullpath
        filename = os.path.basename(fullpath)
        add_to_badlist(filename)


# single-folder de-duplication
def delete_duplicates_by_hash(directory):
    seen_hashes = []
    for root, subdirs, files in os.walk(directory):
        for file in files:
            fullpath = os.path.join(root, file)
            hasher = blake3()
            with open(fullpath, 'rb') as f:
                hasher.update(f.read())
            next_digest = hasher.digest()
            if next_digest in seen_hashes:
                os.remove(fullpath)
                log('removed duplicate file: {}'.format(fullpath))
                # extract just the filename from fullpath
                filename = os.path.basename(fullpath)
                add_to_badlist(filename)
            else:
                seen_hashes.append(next_digest)


# Does two levels of directory traversal (e.g. ./result/<sub>/<files>).
def delete_duplicates_by_hash_2Deep(directory):
    for root, subdirs, files in os.walk(directory):
        dir_count = 0
        subdirs = sorted(subdirs)
        for subdir in subdirs:
            log('checking subdir: {}: {}/{}'.format(subdir, dir_count, len(subdirs)))
            seen_hashes = []
            dupe_count = 0
            for subroot, subsubdirs, subfiles in os.walk(os.path.join(directory, subdir)):
                log('starting: {} dedupe'.format(subdir))
                for file in sorted(subfiles, reverse=True):
                    hasher = blake3()
                    with open(os.path.join(directory, subdir, file), 'rb') as f:
                        hasher.update(f.read())
                    next_digest = hasher.digest()
                    if next_digest in seen_hashes:
                        os.remove(os.path.join(directory, subdir, file))
                        log('removed duplicate file: {}'.format(os.path.join(directory, subdir, file)))
                        add_to_badlist(file)
                        dupe_count += 1
                    else:
                        seen_hashes.append(next_digest)
            log('finished folder: {}'.format(subdir))
            log('duplicates removed: {}'.format(dupe_count))
            dir_count += 1
        return


def log(text):
    print(text)
    now = datetime.now()
    event_time = now.strftime("%d-%m-%Y-%H.%M.%S")
    with open('./logs/{}.txt'.format(logfile_date), 'a') as logfile:
        logfile.write('{}: {}\n'.format(event_time, text))


class ClientInfo:
    id = ''
    secret = ''
    user_agent = 'Reddit_Image_Scraper'


def badlist_cleanup(minimum_file_size_kb):
    """Removes undersized files, then cleans and de-dupes the badlist."""

    # crawl the result tree recursively, find files smaller than minimum_file_size_kb,
    # add them to the badlist, then delete them.
    for root, subdirs, files in os.walk('./result'):
        for file in files:
            fullpath = os.path.join(root, file)
            filesize = os.path.getsize(fullpath) / 1000
            if 'gitignore' in fullpath:
                continue
            if filesize < minimum_file_size_kb:
                # remove anything before " - " in the filename
                file = re.sub(r'^.* - ', '', file)
                add_to_badlist(file)
                print(str(fullpath) + " \t" + str(filesize) + "KB")
                os.remove(fullpath)

    # create the badlist if it's not there
    if not os.path.exists('badlist.txt'):
        with open('badlist.txt', 'w') as f:
            log('badlist.txt created. Is this your first time running?')
            f.write('')

    # de-duplicate the badlist.
    with open('badlist.txt', 'r') as f:
        in_list = set(f.readlines())

    # re-write the badlist back to itself.
    with open('badlist.txt', 'w') as f:
        f.writelines(in_list)


def add_to_badlist(filename):
    with open('badlist.txt', 'a') as f:
        f.writelines(filename + '\n')
    log("added {} to badlist".format(filename))


def get_client_info():
    if not os.path.exists('config.ini'):
        with open('config.ini', 'w') as f:
            log('config.ini template created. Please paste in your client ID and secret. (And read the README.)')
            f.write("""[ALPHA]
client_id=PASTE ID HERE
client_secret=PASTE SECRET HERE
query_limit=2000
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0""")
    config = configparser.ConfigParser()
    config.read("config.ini")
    id = config["ALPHA"]["client_id"]
    secret = config["ALPHA"]["client_secret"]
    query_limit = config["ALPHA"]["query_limit"]
    ratelimit_sleep = config["ALPHA"]["ratelimit_sleep"]
    failure_sleep = config["ALPHA"]["failure_sleep"]
    minimum_file_size_kb = config["ALPHA"]["minimum_file_size_kb"]
    return id, secret, int(query_limit), int(ratelimit_sleep), int(failure_sleep), float(minimum_file_size_kb)


def is_media_file(uri):
    # print('Original Link: ' + uri)  # enable this if you want to log the literal URLs it sees
    regex = r'([.][\w]+)$'
    t = re.search(regex, uri)

    # fall back to the last 4 characters unless the regex finds a real extension
    ext = uri[-4:]
    if t:
        ext = t.group()
    if ext in ('.webm', '.gif', '.avi', '.mp4', '.jpg', '.jpeg', '.png', '.mov', '.ogg', '.wmv', '.mp2', '.mp3', '.mkv'):
        return True
    else:
        return False
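

# Example behaviour (editor's illustration; the URLs are made up):
#   is_media_file('https://i.redd.it/abcd1234.jpg')          -> True  (extension is in the allow-list)
#   is_media_file('https://www.reddit.com/gallery/abcd1234') -> False (no recognised media extension)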


def get_redditor_urls(redditor, limit):
    # TODO how do I automate these generators? (see the sketch after get_img_urls)
    time_filters = ['hour', 'month', 'all', 'week', 'year', 'day']
    try:
        all_start = time.time()
        if ClientInfo.id == 'PASTE ID HERE' or ClientInfo.secret == 'PASTE SECRET HERE':
            log('Error: Please enter your client ID and client secret into config.ini')
            return 0
        r = praw.Reddit(client_id=ClientInfo.id, client_secret=ClientInfo.secret, user_agent=ClientInfo.user_agent)

        start = time.time()
        submissions1 = [submission.url for submission in r.redditor(redditor).submissions.top(time_filter='all', limit=limit)]
        end = time.time()
        log('Query return time for Top/All:{},\nTotal Found: {}'.format(str(end - start), len(submissions1)))

        start = time.time()
        submissions2 = [submission.url for submission in r.redditor(redditor).submissions.new(limit=limit)]
        end = time.time()
        log('Query return time for new:{},\nTotal Found: {}'.format(str(end - start), len(submissions2)))

        start = time.time()
        submissions3 = [submission.url for submission in r.redditor(redditor).submissions.controversial(limit=limit)]
        end = time.time()
        log('Query return time for controversial:{},\nTotal Found: {}'.format(str(end - start), len(submissions3)))

        # combine them all, and de-duplicate them
        submissions = list(set(submissions1 + submissions2 + submissions3))

        log('total unique submissions: {}'.format(len(submissions)))

        all_end = time.time()
        log('Query return time for :{}: {}'.format(str(redditor), str(all_end - all_start)))
        return submissions

    except Redirect:
        log("get_redditor_urls() Redirect. Invalid redditor?")
        return 0

    except HTTPError:
        log("get_redditor_urls() HTTPError in last query")
        sleep(10)
        return 0

    except ResponseException:
        log("get_redditor_urls() ResponseException.")
        return 0
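

# Note (editor's comment): Reddit listings return at most roughly 1000 items each, regardless of the
# limit you request. That is why this script merges several listings per user (top/new/controversial
# above) and several time filters per subreddit (in get_img_urls below) before de-duplicating.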


def get_img_urls(sub, limit):
    # TODO how do I automate these generators? (see the sketch below this function)
    time_filters = ['hour', 'month', 'all', 'week', 'year', 'day']
    try:
        all_start = time.time()
        if ClientInfo.id == 'PASTE ID HERE' or ClientInfo.secret == 'PASTE SECRET HERE':
            log('Error: Please enter your client ID and client secret into config.ini')
            return 0
        r = praw.Reddit(client_id=ClientInfo.id, client_secret=ClientInfo.secret, user_agent=ClientInfo.user_agent)

        start = time.time()
        submissions = [submission.url for submission in r.subreddit(sub).top(time_filter='all', limit=limit)]
        end = time.time()
        log('Query return time for ALL:{},\nTotal Found: {}'.format(str(end - start), len(submissions)))

        start = time.time()
        submissions2 = [submission.url for submission in r.subreddit(sub).top(time_filter='year', limit=limit)]
        end = time.time()
        log('Query return time for year:{},\nTotal Found: {}'.format(str(end - start), len(submissions2)))

        start = time.time()
        submissions3 = [submission.url for submission in r.subreddit(sub).top(time_filter='month', limit=limit)]
        end = time.time()
        log('Query return time for month:{},\nTotal Found: {}'.format(str(end - start), len(submissions3)))

        start = time.time()
        submissions4 = [submission.url for submission in r.subreddit(sub).top(time_filter='week', limit=limit)]
        end = time.time()
        log('Query return time for week:{},\nTotal Found: {}'.format(str(end - start), len(submissions4)))

        start = time.time()
        submissions5 = [submission.url for submission in r.subreddit(sub).top(time_filter='hour', limit=limit)]
        end = time.time()
        log('Query return time for hour:{},\nTotal Found: {}'.format(str(end - start), len(submissions5)))

        start = time.time()
        submissions6 = [submission.url for submission in r.subreddit(sub).top(time_filter='day', limit=limit)]
        end = time.time()
        log('Query return time for day:{},\nTotal Found: {}'.format(str(end - start), len(submissions6)))

        start = time.time()
        submissions7 = [submission.url for submission in r.subreddit(sub).hot(limit=limit)]
        end = time.time()
        log('Query return time for HOT:{},\nTotal Found: {}'.format(str(end - start), len(submissions7)))

        start = time.time()
        submissions8 = [submission.url for submission in r.subreddit(sub).new(limit=limit)]
        end = time.time()
        log('Query return time for NEW:{},\nTotal Found: {}'.format(str(end - start), len(submissions8)))

        start = time.time()
        submissions9 = [submission.url for submission in r.subreddit(sub).rising(limit=limit)]
        end = time.time()
        log('Query return time for RISING:{},\nTotal Found: {}'.format(str(end - start), len(submissions9)))

        # combine them all, and de-duplicate them
        submissions = list(set(
            submissions + submissions2 + submissions3 + submissions4 + submissions5 + submissions6 + submissions7 + submissions8 + submissions9))

        log('total unique submissions: {}'.format(len(submissions)))

        all_end = time.time()
        log('Query return time for :{}: {}'.format(str(sub), str(all_end - all_start)))
        return submissions

    except Redirect:
        log("get_img_urls() Redirect. Invalid Subreddit?")
        return 0

    except HTTPError:
        log("get_img_urls() HTTPError in last query")
        sleep(10)
        return 0

    except ResponseException:
        log("get_img_urls() ResponseException.")
        return 0
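

# --- Editor's sketch, not part of the original script and not called anywhere. ---
# One possible answer to the TODO above: describe the listings as (label, generator) pairs
# and loop over them instead of copy-pasting one block per listing. The helper name and the
# usage shown below are hypothetical.
def collect_unique_urls(labelled_listings):
    """labelled_listings: iterable of (label, PRAW ListingGenerator) pairs. Returns de-duplicated URLs."""
    urls = []
    for label, listing in labelled_listings:
        start = time.time()
        found = [submission.url for submission in listing]
        log('Query return time for {}:{},\nTotal Found: {}'.format(label, time.time() - start, len(found)))
        urls.extend(found)
    return list(set(urls))
# Hypothetical usage inside get_img_urls, mirroring the hard-coded blocks above:
#   listings = [(tf, r.subreddit(sub).top(time_filter=tf, limit=limit)) for tf in time_filters]
#   listings += [('hot', r.subreddit(sub).hot(limit=limit)),
#                ('new', r.subreddit(sub).new(limit=limit)),
#                ('rising', r.subreddit(sub).rising(limit=limit))]
#   submissions = collect_unique_urls(listings)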


def download_img(img_url, img_title, file_loc, sub, ratelimit_sleep: int, failure_sleep: int):
    # print(img_url + ' ' + img_title + ' ' + file_loc)
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib.request.install_opener(opener)
    try:
        log('DL From: {} - Filename: {} - URL:{}'.format(sub, file_loc, img_url))
        # u = urllib.request.urlopen(img_url)
        # u_metadata = u.info()
        # size = int(u_metadata.getheaders("Content-Length")[0])
        # print(size)

        urllib.request.urlretrieve(img_url, file_loc)
        sleep(ratelimit_sleep)  # remove this at your own risk. This is what lets you download a whole sub without being rate-limited.
        return 0

    except HTTPError as e:
        log("download_img() HTTPError in last query (file might not exist anymore, or malformed URL)")
        add_to_badlist(img_title)
        log(e)
        sleep(failure_sleep)
        return 1

    except urllib.error.URLError as e:
        log("download_img() URLError! Site might be offline")
        log(e)
        add_to_badlist(img_title)
        sleep(ratelimit_sleep)
        return 1

    except http.client.InvalidURL as e:
        log('download_img() invalid URL')
        log(e)
        add_to_badlist(img_title)
        return 1
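

# Example call (editor's illustration; the URL and paths are made up):
#   download_img('https://i.redd.it/abcd1234.jpg', 'abcd1234.jpg', 'result/pics/abcd1234.jpg', 'pics', 2, 10)
# Returns 0 on success and 1 on any handled failure, in which case the filename is also badlisted.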
Invalid Subreddit?") 354 | return 0 355 | 356 | except HTTPError: 357 | log("get_img_urls() HTTPError in last query") 358 | sleep(10) 359 | return 0 360 | 361 | except ResponseException: 362 | log("get_img_urls() ResponseException.") 363 | return 0 364 | 365 | 366 | def download_img(img_url, img_title, file_loc, sub, ratelimit_sleep: int, failure_sleep: int): 367 | # print(img_url + ' ' + img_title + ' ' + file_loc) 368 | opener = urllib.request.build_opener() 369 | opener.addheaders = [('User-agent', 'Mozilla/5.0')] 370 | urllib.request.install_opener(opener) 371 | try: 372 | log('DL From: {} - Filename: {} - URL:{}'.format(sub, file_loc, img_url)) 373 | # u = urllib.request.urlopen(img_url) 374 | # u_metadata = u.info() 375 | # size = int(u_metadata.getheaders("Content-Length")[0]) 376 | # print(size) 377 | 378 | urllib.request.urlretrieve(img_url, file_loc) 379 | sleep(ratelimit_sleep) # remove this at your own risk. This is necessary so you can download the whole sub! 380 | return 0 381 | 382 | except HTTPError as e: 383 | log("download_img() HTTPError in last query (file might not exist anymore, or malformed URL)") 384 | add_to_badlist(img_title) 385 | log(e) 386 | sleep(failure_sleep) 387 | return 1 388 | 389 | except urllib.error.URLError as e: 390 | log("download_img() URLError! Site might be offline") 391 | log(e) 392 | add_to_badlist(img_title) 393 | sleep(ratelimit_sleep) 394 | return 1 395 | 396 | except http.client.InvalidURL as e: 397 | log('Crappy URL') 398 | log(e) 399 | add_to_badlist(img_title) 400 | return 1 401 | 402 | 403 | def read_img_links(submissions, url_list, user_submissions): 404 | sub = submissions.lower() 405 | if user_submissions: 406 | if not os.path.exists('./users/{}'.format(sub)): 407 | os.mkdir('./users/{}'.format(sub)) 408 | else: 409 | if not os.path.exists('./result'): 410 | os.mkdir('./result') 411 | if not os.path.exists('./result/{}'.format(sub)): 412 | os.mkdir('./result/{}'.format(sub)) 413 | 414 | url_list = [x.strip() for x in url_list] 415 | url_list.sort() 416 | download_count = 0 417 | exist_count = 0 418 | download_status = 0 419 | 420 | with open('badlist.txt', 'r') as f: 421 | badlist = f.readlines() 422 | badlist = [x.strip() for x in set(badlist)] 423 | 424 | for link in url_list: 425 | if 'gfycat.com' in link and '.gif' not in link[-4:] and '.webm' not in link[-4:]: 426 | # print(link[-4:]) 427 | # print('gfycat found:{}'.format(link)) 428 | link = link + '.gif' 429 | if not is_media_file(link): 430 | continue 431 | 432 | file_name = link.split('/')[-1] 433 | if file_name in badlist: 434 | # log('{} found in badlist, skipping'.format(file_name)) 435 | continue 436 | 437 | if user_submissions: 438 | file_loc = 'users/{}/{}'.format(sub, file_name) 439 | base = './users/' 440 | else: 441 | file_loc = 'result/{}/{}'.format(sub, file_name) 442 | base = './result/' 443 | if os.path.exists(file_name) or os.path.exists(file_loc): 444 | # log(file_name + ' already exists') 445 | exist_count += 1 446 | continue 447 | 448 | if not file_name: 449 | log(file_name + ' cannot download') 450 | continue 451 | 452 | download_status = download_img(link, file_name, file_loc, sub, ratelimit_sleep, failure_sleep) 453 | 454 | download_count += 1 455 | return download_count, download_status, exist_count 456 | 457 | 458 | if __name__ == '__main__': 459 | # Get client info 460 | ClientInfo.id, ClientInfo.secret, query_lookup_limit, ratelimit_sleep, failure_sleep, minimum_file_size_kb = get_client_info() 461 | 462 | # Create project directories 463 | 


if __name__ == '__main__':
    # Get client info
    ClientInfo.id, ClientInfo.secret, query_lookup_limit, ratelimit_sleep, failure_sleep, minimum_file_size_kb = get_client_info()

    # Create project directories
    create_directories()

    # Clean up undersized files into the badlist
    badlist_cleanup(minimum_file_size_kb)

    # Sort all our text files.
    sort_text_file('./subs.txt')
    sort_text_file('./users.txt')
    sort_text_file('./badlist.txt')

    # THIS TAKES A VERY VERY LONG TIME
    # delete_duplicates_by_hash('./users')
    # delete_duplicates_by_hash('./result')

    with open('./users.txt', 'r') as redditors_file:
        redditors = redditors_file.readlines()
    with open('./subs.txt', 'r') as subs_file:
        subreddits = subs_file.readlines()

    for redditor in redditors:
        if '#' in redditor or not redditor.strip():
            continue
        redditor = redditor.strip('\n').lower()
        log('Starting Retrieval from: ' + redditor)

        # redditor = input('Enter redditor: ')
        # query_lookup_limit = int(input('Enter the max amount of queries: '))
        url_list = get_redditor_urls(redditor, query_lookup_limit)

        if url_list:
            try:
                log('{} images found on {}'.format(len(url_list), redditor))
                count, status, already_here = read_img_links(redditor, url_list, True)
                log('Download Complete from {}\n{} - Images Downloaded\nQuery limit: {} \nAlready downloaded: {} \n'.format(
                    redditor, count, query_lookup_limit, already_here))
            except UnicodeEncodeError:
                log('UnicodeEncodeError:{}\n'.format(redditor))
            except OSError as e:
                log('OSError:{}\nVerbose:{}'.format(redditor, e))
        # confirm = input('confirm next redditor? CTRL+C to cancel.')
        delete_duplicates_by_hash('./users/{}'.format(redditor))

    for subreddit in subreddits:
        if '#' in subreddit or not subreddit.strip():
            continue
        subreddit = subreddit.strip('\n').lower()
        log('Starting Retrieval from: /r/' + subreddit)

        # subreddit = input('Enter Subreddit: ')
        # query_lookup_limit = int(input('Enter the max amount of queries: '))
        url_list = get_img_urls(subreddit, query_lookup_limit)

        if url_list:
            try:
                log('{} images found on {}'.format(len(url_list), subreddit))
                count, status, already_here = read_img_links(subreddit, url_list, False)
                log('Download Complete from {}\n{} - Images Downloaded\nQuery limit: {} \nAlready downloaded: {} \n'.format(
                    subreddit, count, query_lookup_limit, already_here))
            except UnicodeEncodeError:
                log('UnicodeEncodeError:{}\n'.format(subreddit))
            except OSError as e:
                log('OSError:{}\nVerbose:{}'.format(subreddit, e))
        # confirm = input('confirm next sub? CTRL+C to cancel.')
        delete_duplicates_by_hash('./result/{}'.format(subreddit))
--------------------------------------------------------------------------------
/config.ini:
--------------------------------------------------------------------------------
[ALPHA]
client_id=
client_secret=
query_limit=10
ratelimit_sleep=2
failure_sleep=10
minimum_file_size_kb=12.0
--------------------------------------------------------------------------------
/howtoscrape.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/crawsome/Reddit_Image_Scraper/c0e6780f14689375f1b66b6a8d52a6458017e8e9/howtoscrape.gif
--------------------------------------------------------------------------------