├── .gitignore ├── Dockerfile ├── LICENSE ├── README.md ├── cookies.txt ├── ignore-list ├── pipeline.py ├── reddit.lua └── user-agents /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.pyc 3 | wget-lua 4 | wget-at 5 | STOP 6 | BANNED 7 | data/ 8 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM atdr.meo.ws/archiveteam/grab-base:gnutls 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This is free and unencumbered software released into the public domain. 2 | 3 | Anyone is free to copy, modify, publish, use, compile, sell, or 4 | distribute this software, either in source code form or as a compiled 5 | binary, for any purpose, commercial or non-commercial, and by any 6 | means. 7 | 8 | In jurisdictions that recognize copyright laws, the author or authors 9 | of this software dedicate any and all copyright interest in the 10 | software to the public domain. We make this dedication for the benefit 11 | of the public at large and to the detriment of our heirs and 12 | successors. We intend this dedication to be an overt act of 13 | relinquishment in perpetuity of all present and future rights to this 14 | software under copyright law. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 19 | IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR 20 | OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, 21 | ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR 22 | OTHER DEALINGS IN THE SOFTWARE. 23 | 24 | For more information, please refer to <http://unlicense.org/> 25 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # reddit-grab 2 | 3 | More information about the archiving project can be found on the ArchiveTeam wiki: [Reddit](https://wiki.archiveteam.org/index.php?title=Reddit) 4 | 5 | ## Setup instructions 6 | 7 | ### General instructions 8 | 9 | Data integrity is very important in Archive Team projects. Please note the following important rules: 10 | 11 | * [Do not use proxies or VPNs](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Can_I_use_whatever_internet_access_for_the_Warrior?). 12 | * Run the project using either the Warrior or the project-specific Docker container as listed below. [Do not modify project code](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#I'd_like_to_help_write_code_or_I_want_to_tweak_the_scripts_to_run_to_my_liking._Where_can_I_find_more_info?_Where_is_the_source_code_and_repository?). Compiling the project dependencies yourself is no longer supported. 13 | * You can share your tracker nickname(s) across machine(s) you personally operate, but not with machines operated by other users. Nickname sharing makes it harder to inspect data if a problem arises. 14 | * [Use clean internet connections](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Can_I_use_whatever_internet_access_for_the_Warrior?). 15 | * Only x64-based machines are supported. 
[ARM (used on Raspberry Pi and Apple Silicon Macs) is not currently supported](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Can_I_run_the_Warrior_on_ARM_or_some_other_unusual_architecture?). 16 | * See the [Archive Team Wiki](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior#Warrior_FAQ) for additional information. 17 | 18 | We strongly encourage you to join the IRC channel associated with this project in order to be informed about project updates and other important announcements, as well as to be reachable in the event of an issue. The Archive Team Wiki has [more information about IRC](https://wiki.archiveteam.org/index.php/Archiveteam:IRC). We can be found at hackint IRC [#shreddit](https://webirc.hackint.org/#irc://irc.hackint.org/#shreddit). 19 | 20 | **If you have any questions or issues during setup, please review the wiki pages or contact us on IRC for troubleshooting information.** 21 | 22 | ### Running the project 23 | 24 | #### Archive Team Warrior (recommended for most users) 25 | 26 | This and other archiving projects can easily be run using the [Archive Team Warrior](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior) virtual machine. Follow the [instructions on the Archive Team wiki](https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior) for installing the Warrior, and from the web interface running at `http://localhost:8001/`, enter the nickname that you want to be shown as on the tracker. There is no registration, just pick a nickname you like. Then, select the `Reddit` project in the Warrior interface. 27 | 28 | #### Project-specific Docker container (for more advanced users) 29 | 30 | Alternatively, more advanced users can also run projects using Docker. While users of the Warrior can switch between projects using a web interface, Docker containers are specific to each project. However, while the Warrior supports a maximum of 6 concurrent items, a Docker container supports a maximum of 20 concurrent items. The instructions below are a short overview. For more information and detailed explanations of the commands, follow the [Docker instructions on the Archive Team wiki](https://wiki.archiveteam.org/index.php/Running_Archive_Team_Projects_with_Docker). 31 | 32 | It is advised to use [Watchtower](https://github.com/containrrr/watchtower) to automatically update the project container: 33 | 34 | docker run -d --name watchtower --restart=unless-stopped -v /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --label-enable --cleanup --interval 3600 --include-restarting 35 | 36 | after which the project container can be run: 37 | 38 | docker run -d --name archiveteam --label=com.centurylinklabs.watchtower.enable=true --log-driver json-file --log-opt max-size=50m --restart=unless-stopped atdr.meo.ws/archiveteam/reddit-grab --concurrent 1 YOURNICKHERE 39 | 40 | Be sure to replace `YOURNICKHERE` with the nickname that you want to be shown as on the tracker. There is no registration, just pick a nickname you like. 41 | 42 | ### Supporting Archive Team 43 | 44 | Behind the scenes, Archive Team has infrastructure to run the projects and process the data. If you would like to help out with the costs of our infrastructure, a donation on our [Open Collective](https://opencollective.com/archiveteam) would be very welcome. 45 | 46 | ### Issues in the code 47 | 48 | If you notice a bug and want to file a bug report, please use the GitHub issues tracker. 49 | 50 | Are you a developer? Help write code for us! 
Look at our [developer documentation](https://wiki.archiveteam.org/index.php?title=Dev) for details. 51 | 52 | ### Other problems 53 | 54 | Have an issue not listed here? Join us on IRC and ask! We can be found at hackint IRC [#shreddit](https://webirc.hackint.org/#irc://irc.hackint.org/#shreddit). 55 | 56 | 57 | -------------------------------------------------------------------------------- /cookies.txt: -------------------------------------------------------------------------------- 1 | .reddit.com TRUE / FALSE 0 eu_cookie_v2 3 2 | .reddit.com TRUE / FALSE 0 over18 1 3 | .reddit.com TRUE / FALSE 0 _options %7B%22pref_quarantine_optin%22%3A%20true%2C%20%22pref_gated_sr_optin%22%3A%20true%7D 4 | -------------------------------------------------------------------------------- /ignore-list: -------------------------------------------------------------------------------- 1 | https://old.reddit.com/static/opensearch.xml 2 | https://reddit.com/static/pixel.png 3 | -------------------------------------------------------------------------------- /pipeline.py: -------------------------------------------------------------------------------- 1 | # encoding=utf8 2 | import datetime 3 | from distutils.version import StrictVersion 4 | import hashlib 5 | import os.path 6 | import random 7 | import re 8 | from seesaw.config import realize, NumberConfigValue 9 | from seesaw.externalprocess import ExternalProcess 10 | from seesaw.item import ItemInterpolation, ItemValue 11 | from seesaw.task import SimpleTask, LimitConcurrent 12 | from seesaw.tracker import GetItemFromTracker, PrepareStatsForTracker, \ 13 | UploadWithTracker, SendDoneToTracker 14 | import shutil 15 | import socket 16 | import subprocess 17 | import sys 18 | import time 19 | import string 20 | 21 | import seesaw 22 | from seesaw.externalprocess import WgetDownload 23 | from seesaw.pipeline import Pipeline 24 | from seesaw.project import Project 25 | from seesaw.util import find_executable 26 | 27 | from tornado import httpclient 28 | 29 | import requests 30 | import zstandard 31 | 32 | if StrictVersion(seesaw.__version__) < StrictVersion('0.8.5'): 33 | raise Exception('This pipeline needs seesaw version 0.8.5 or higher.') 34 | 35 | 36 | ########################################################################### 37 | # Find a useful Wget+Lua executable. 38 | # 39 | # WGET_AT will be set to the first path that 40 | # 1. does not crash with --version, and 41 | # 2. prints the required version string 42 | 43 | class HigherVersion: 44 | def __init__(self, expression, min_version): 45 | self._expression = re.compile(expression) 46 | self._min_version = min_version 47 | 48 | def search(self, text): 49 | for result in self._expression.findall(text): 50 | if result >= self._min_version: 51 | print('Found version {}.'.format(result)) 52 | return True 53 | 54 | WGET_AT = find_executable( 55 | 'Wget+AT', 56 | HigherVersion( 57 | r'(GNU Wget 1\.[0-9]{2}\.[0-9]{1}-at\.[0-9]{8}\.[0-9]{2})[^0-9a-zA-Z\.-_]', 58 | 'GNU Wget 1.21.3-at.20231213.03' 59 | ), 60 | [ 61 | './wget-at', 62 | '/home/warrior/data/wget-at-gnutls' 63 | ] 64 | ) 65 | 66 | if not WGET_AT: 67 | raise Exception('No usable Wget+At found.') 68 | 69 | 70 | ########################################################################### 71 | # The version number of this pipeline definition. 72 | # 73 | # Update this each time you make a non-cosmetic change. 74 | # It will be added to the WARC files and reported to the tracker. 
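# Note on item batching (illustrative annotation; the example values below are
# made up): the tracker hands out work in batches of up to MULTI_ITEM_SIZE
# items (see the 'multi=' tracker URL in the pipeline definition further down),
# and a batch arrives as one item_name with the individual names joined by NUL
# bytes. WgetArgs.realize() splits that string back into 'type:value' pairs:
#
#   example_item_name = 'post:abc123\0comment:def456\0url:https://i.redd.it/example.png'
#   for name in example_item_name.split('\0'):
#       item_type, item_value = name.split(':', 1)
#       # 'post' and 'comment' items are fetched via
#       # https://www.reddit.com/api/info.json?id=t3_<value> and ...?id=t1_<value>;
#       # 'url' items are fetched directly.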
75 | VERSION = '20240216.01' 76 | TRACKER_ID = 'reddit' 77 | TRACKER_HOST = 'legacy-api.arpa.li' 78 | MULTI_ITEM_SIZE = 100 79 | 80 | 81 | ########################################################################### 82 | # This section defines project-specific tasks. 83 | # 84 | # Simple tasks (tasks that do not need any concurrency) are based on the 85 | # SimpleTask class and have a process(item) method that is called for 86 | # each item. 87 | class CheckIP(SimpleTask): 88 | def __init__(self): 89 | SimpleTask.__init__(self, 'CheckIP') 90 | self._counter = 0 91 | 92 | def process(self, item): 93 | # NEW for 2014! Check if we are behind firewall/proxy 94 | 95 | if self._counter <= 0: 96 | item.log_output('Checking IP address.') 97 | ip_set = set() 98 | 99 | ip_set.add(socket.gethostbyname('twitter.com')) 100 | #ip_set.add(socket.gethostbyname('facebook.com')) 101 | ip_set.add(socket.gethostbyname('youtube.com')) 102 | ip_set.add(socket.gethostbyname('microsoft.com')) 103 | ip_set.add(socket.gethostbyname('icanhas.cheezburger.com')) 104 | ip_set.add(socket.gethostbyname('archiveteam.org')) 105 | 106 | if len(ip_set) != 5: 107 | item.log_output('Got IP addresses: {0}'.format(ip_set)) 108 | item.log_output( 109 | 'Are you behind a firewall/proxy? That is a big no-no!') 110 | raise Exception( 111 | 'Are you behind a firewall/proxy? That is a big no-no!') 112 | 113 | # Check only occasionally 114 | if self._counter <= 0: 115 | self._counter = 10 116 | else: 117 | self._counter -= 1 118 | 119 | 120 | class PrepareDirectories(SimpleTask): 121 | def __init__(self, warc_prefix): 122 | SimpleTask.__init__(self, 'PrepareDirectories') 123 | self.warc_prefix = warc_prefix 124 | 125 | def process(self, item): 126 | item_name = item['item_name'] 127 | item_name_hash = hashlib.sha1(item_name.encode('utf8')).hexdigest() 128 | escaped_item_name = item_name_hash 129 | dirname = '/'.join((item['data_dir'], escaped_item_name)) 130 | 131 | if os.path.isdir(dirname): 132 | shutil.rmtree(dirname) 133 | 134 | os.makedirs(dirname) 135 | 136 | item['item_dir'] = dirname 137 | item['warc_file_base'] = '-'.join([ 138 | self.warc_prefix, 139 | item_name_hash, 140 | time.strftime('%Y%m%d-%H%M%S') 141 | ]) 142 | 143 | open('%(item_dir)s/%(warc_file_base)s.warc.zst' % item, 'w').close() 144 | open('%(item_dir)s/%(warc_file_base)s_data.txt' % item, 'w').close() 145 | 146 | class MoveFiles(SimpleTask): 147 | def __init__(self): 148 | SimpleTask.__init__(self, 'MoveFiles') 149 | 150 | def process(self, item): 151 | os.rename('%(item_dir)s/%(warc_file_base)s.warc.zst' % item, 152 | '%(data_dir)s/%(warc_file_base)s.%(dict_project)s.%(dict_id)s.warc.zst' % item) 153 | os.rename('%(item_dir)s/%(warc_file_base)s_data.txt' % item, 154 | '%(data_dir)s/%(warc_file_base)s_data.txt' % item) 155 | 156 | shutil.rmtree('%(item_dir)s' % item) 157 | 158 | 159 | class SetBadUrls(SimpleTask): 160 | def __init__(self): 161 | SimpleTask.__init__(self, 'SetBadUrls') 162 | 163 | def process(self, item): 164 | item['item_name_original'] = item['item_name'] 165 | items = item['item_name'].split('\0') 166 | items_lower = [s.lower() for s in items] 167 | with open('%(item_dir)s/%(warc_file_base)s_bad-items.txt' % item, 'r') as f: 168 | for aborted_item in f: 169 | aborted_item = aborted_item.strip().lower() 170 | index = items_lower.index(aborted_item) 171 | item.log_output('Item {} is aborted.'.format(aborted_item)) 172 | items.pop(index) 173 | items_lower.pop(index) 174 | item['item_name'] = '\0'.join(items) 175 | 176 | 177 | class 
MaybeSendDoneToTracker(SendDoneToTracker): 178 | def enqueue(self, item): 179 | if len(item['item_name']) == 0: 180 | return self.complete_item(item) 181 | return super(MaybeSendDoneToTracker, self).enqueue(item) 182 | 183 | 184 | def get_hash(filename): 185 | with open(filename, 'rb') as in_file: 186 | return hashlib.sha1(in_file.read()).hexdigest() 187 | 188 | CWD = os.getcwd() 189 | PIPELINE_SHA1 = get_hash(os.path.join(CWD, 'pipeline.py')) 190 | LUA_SHA1 = get_hash(os.path.join(CWD, 'reddit.lua')) 191 | 192 | def stats_id_function(item): 193 | d = { 194 | 'pipeline_hash': PIPELINE_SHA1, 195 | 'lua_hash': LUA_SHA1, 196 | 'python_version': sys.version, 197 | } 198 | 199 | return d 200 | 201 | 202 | class ZstdDict(object): 203 | created = 0 204 | data = None 205 | 206 | @classmethod 207 | def get_dict(cls): 208 | if cls.data is not None and time.time() - cls.created < 1800: 209 | return cls.data 210 | response = requests.get( 211 | 'https://legacy-api.arpa.li/dictionary', 212 | params={ 213 | 'project': 'reddit' 214 | } 215 | ) 216 | response.raise_for_status() 217 | response = response.json() 218 | if cls.data is not None and response['id'] == cls.data['id']: 219 | cls.created = time.time() 220 | return cls.data 221 | print('Downloading latest dictionary.') 222 | response_dict = requests.get(response['url']) 223 | response_dict.raise_for_status() 224 | raw_data = response_dict.content 225 | if hashlib.sha256(raw_data).hexdigest() != response['sha256']: 226 | raise ValueError('Hash of downloaded dictionary does not match.') 227 | if raw_data[:4] == b'\x28\xB5\x2F\xFD': 228 | raw_data = zstandard.ZstdDecompressor().decompress(raw_data) 229 | cls.data = { 230 | 'id': response['id'], 231 | 'dict': raw_data 232 | } 233 | cls.created = time.time() 234 | return cls.data 235 | 236 | 237 | class WgetArgs(object): 238 | post_chars = string.digits + string.ascii_lowercase 239 | 240 | def int_to_str(self, i): 241 | d, m = divmod(i, 36) 242 | if d > 0: 243 | return self.int_to_str(d) + self.post_chars[m] 244 | return self.post_chars[m] 245 | 246 | def realize(self, item): 247 | with open('user-agents', 'r') as f: 248 | user_agent = random.choice(list(f)).strip() 249 | wget_args = [ 250 | WGET_AT, 251 | '-U', user_agent, 252 | '-nv', 253 | '--host-lookups', 'dns', 254 | '--hosts-file', '/dev/null', 255 | '--resolvconf-file', '/dev/null', 256 | '--dns-servers', '9.9.9.10,149.112.112.10,2620:fe::10,2620:fe::fe:10', 257 | '--reject-reserved-subnets', 258 | '--load-cookies', 'cookies.txt', 259 | '--content-on-error', 260 | '--no-http-keep-alive', 261 | '--lua-script', 'reddit.lua', 262 | '-o', ItemInterpolation('%(item_dir)s/wget.log'), 263 | '--no-check-certificate', 264 | '--output-document', ItemInterpolation('%(item_dir)s/wget.tmp'), 265 | '--truncate-output', 266 | '-e', 'robots=off', 267 | '--rotate-dns', 268 | '--recursive', '--level=inf', 269 | '--no-parent', 270 | '--page-requisites', 271 | '--timeout', '30', 272 | '--tries', 'inf', 273 | '--domains', 'reddit.com', 274 | '--span-hosts', 275 | '--waitretry', '30', 276 | '--warc-file', ItemInterpolation('%(item_dir)s/%(warc_file_base)s'), 277 | '--warc-header', 'operator: Archive Team', 278 | '--warc-header', 'x-wget-at-project-version: ' + VERSION, 279 | '--warc-header', 'x-wget-at-project-name: ' + TRACKER_ID, 280 | '--warc-dedup-url-agnostic', 281 | '--warc-compression-use-zstd', 282 | '--warc-zstd-dict-no-include', 283 | '--header', 'Accept-Language: en-US;q=0.9, en;q=0.8', 284 | '--secure-protocol', 'TLSv1_2', 285 | #'--ciphers', 
'+ECDHE-RSA:+AES-256-CBC:+SHA384' 286 | ] 287 | dict_data = ZstdDict.get_dict() 288 | with open(os.path.join(item['item_dir'], 'zstdict'), 'wb') as f: 289 | f.write(dict_data['dict']) 290 | item['dict_id'] = dict_data['id'] 291 | item['dict_project'] = 'reddit' 292 | wget_args.extend([ 293 | '--warc-zstd-dict', ItemInterpolation('%(item_dir)s/zstdict'), 294 | ]) 295 | 296 | for item_name in item['item_name'].split('\0'): 297 | wget_args.extend(['--warc-header', 'x-wget-at-project-item-name: '+item_name]) 298 | wget_args.append('item-name://'+item_name) 299 | item_type, item_value = item_name.split(':', 1) 300 | if item_type == 'post': 301 | wget_args.extend(['--warc-header', 'reddit-post: '+item_value]) 302 | wget_args.append('https://www.reddit.com/api/info.json?id=t3_'+item_value) 303 | elif item_type == 'comment': 304 | wget_args.extend(['--warc-header', 'reddit-comment: '+item_value]) 305 | wget_args.append('https://www.reddit.com/api/info.json?id=t1_'+item_value) 306 | elif item_type == 'url': 307 | wget_args.extend(['--warc-header', 'reddit-media-url: '+item_value]) 308 | wget_args.append(item_value) 309 | else: 310 | raise Exception('Unknown item') 311 | 312 | item['item_name_newline'] = item['item_name'].replace('\0', '\n') 313 | 314 | if 'bind_address' in globals(): 315 | wget_args.extend(['--bind-address', globals()['bind_address']]) 316 | print('') 317 | print('*** Wget will bind address at {0} ***'.format( 318 | globals()['bind_address'])) 319 | print('') 320 | 321 | return realize(wget_args, item) 322 | 323 | ########################################################################### 324 | # Initialize the project. 325 | # 326 | # This will be shown in the warrior management panel. The logo should not 327 | # be too big. The deadline is optional. 328 | project = Project( 329 | title='reddit', 330 | project_html=''' 331 | 332 |

reddit.com Website · Leaderboard
333 | Archiving everything from reddit.

334 | ''' 335 | ) 336 | 337 | pipeline = Pipeline( 338 | CheckIP(), 339 | GetItemFromTracker('http://{}/{}/multi={}/' 340 | .format(TRACKER_HOST, TRACKER_ID, MULTI_ITEM_SIZE), 341 | downloader, VERSION), 342 | PrepareDirectories(warc_prefix='reddit'), 343 | WgetDownload( 344 | WgetArgs(), 345 | max_tries=2, 346 | accept_on_exit_code=[0, 4, 8], 347 | env={ 348 | 'item_dir': ItemValue('item_dir'), 349 | 'item_names': ItemValue('item_name_newline'), 350 | 'warc_file_base': ItemValue('warc_file_base'), 351 | } 352 | ), 353 | SetBadUrls(), 354 | PrepareStatsForTracker( 355 | defaults={'downloader': downloader, 'version': VERSION}, 356 | file_groups={ 357 | 'data': [ 358 | ItemInterpolation('%(item_dir)s/%(warc_file_base)s.warc.zst') 359 | ] 360 | }, 361 | id_function=stats_id_function, 362 | ), 363 | MoveFiles(), 364 | LimitConcurrent(NumberConfigValue(min=1, max=20, default='20', 365 | name='shared:rsync_threads', title='Rsync threads', 366 | description='The maximum number of concurrent uploads.'), 367 | UploadWithTracker( 368 | 'http://%s/%s' % (TRACKER_HOST, TRACKER_ID), 369 | downloader=downloader, 370 | version=VERSION, 371 | files=[ 372 | ItemInterpolation('%(data_dir)s/%(warc_file_base)s.%(dict_project)s.%(dict_id)s.warc.zst'), 373 | ItemInterpolation('%(data_dir)s/%(warc_file_base)s_data.txt') 374 | ], 375 | rsync_target_source_path=ItemInterpolation('%(data_dir)s/'), 376 | rsync_extra_args=[ 377 | '--recursive', 378 | '--min-size', '1', 379 | '--no-compress', 380 | '--compress-level', '0' 381 | ] 382 | ), 383 | ), 384 | MaybeSendDoneToTracker( 385 | tracker_url='http://%s/%s' % (TRACKER_HOST, TRACKER_ID), 386 | stats=ItemValue('stats') 387 | ) 388 | ) 389 | -------------------------------------------------------------------------------- /reddit.lua: -------------------------------------------------------------------------------- 1 | local urlparse = require("socket.url") 2 | local http = require("socket.http") 3 | local cjson = require("cjson") 4 | local utf8 = require("utf8") 5 | 6 | local item_names = os.getenv('item_names') 7 | local item_dir = os.getenv('item_dir') 8 | local warc_file_base = os.getenv('warc_file_base') 9 | local item_type = nil 10 | local item_name = nil 11 | local item_value = nil 12 | 13 | local selftext = nil 14 | local retry_url = true 15 | 16 | local item_types = {} 17 | for s in string.gmatch(item_names, "([^\n]+)") do 18 | local t, n = string.match(s, "^([^:]+):(.+)$") 19 | item_types[n] = t 20 | end 21 | 22 | if urlparse == nil or http == nil then 23 | io.stdout:write("socket not correctly installed.\n") 24 | io.stdout:flush() 25 | abortgrab = true 26 | end 27 | 28 | local url_count = 0 29 | local tries = 0 30 | local downloaded = {} 31 | local addedtolist = {} 32 | local abortgrab = false 33 | local killgrab = false 34 | 35 | local posts = {} 36 | local requested_children = {} 37 | local is_crosspost = false 38 | 39 | local outlinks = {} 40 | local reddit_media_urls = {} 41 | 42 | local bad_items = {} 43 | 44 | for ignore in io.open("ignore-list", "r"):lines() do 45 | downloaded[ignore] = true 46 | end 47 | 48 | abort_item = function(item) 49 | abortgrab = true 50 | if not item then 51 | item = item_name 52 | end 53 | if not bad_items[item] then 54 | io.stdout:write("Aborting item " .. item .. 
".\n") 55 | io.stdout:flush() 56 | bad_items[item] = true 57 | end 58 | end 59 | 60 | kill_grab = function(item) 61 | io.stdout:write("Aborting crawling.\n") 62 | killgrab = true 63 | end 64 | 65 | read_file = function(file) 66 | if file then 67 | local f = assert(io.open(file)) 68 | local data = f:read("*all") 69 | f:close() 70 | return data 71 | else 72 | return "" 73 | end 74 | end 75 | 76 | processed = function(url) 77 | if downloaded[url] or addedtolist[url] then 78 | return true 79 | end 80 | return false 81 | end 82 | 83 | allowed = function(url, parenturl) 84 | if item_type == "url" then 85 | if url ~= item_value then 86 | reddit_media_urls["url:" .. url] = true 87 | return false 88 | end 89 | return true 90 | end 91 | 92 | --[[if string.match(url, "^https?://www%.reddit%.com/svc/") then 93 | return true 94 | end]] 95 | 96 | if string.match(url, "'+") 97 | or string.match(urlparse.unescape(url), "[<>\\%$%^%[%]%(%){}]") 98 | or string.match(url, "^https?://[^/]*reddit%.com/[^%?]+%?context=[0-9]+&depth=[0-9]+") 99 | or string.match(url, "^https?://[^/]*reddit%.com/[^%?]+%?depth=[0-9]+&context=[0-9]+") 100 | or string.match(url, "^https?://[^/]*reddit%.com/login") 101 | or string.match(url, "^https?://[^/]*reddit%.com/register") 102 | or string.match(url, "^https?://[^/]*reddit%.com/r/undefined/") 103 | or ( 104 | string.match(url, "%?sort=") 105 | and not string.match(url, "/svc/") 106 | ) 107 | or string.match(url, "%?limit=500$") 108 | or string.match(url, "%?ref=readnext$") 109 | or string.match(url, "/tailwind%-build%.css$") 110 | or string.match(url, "^https?://v%.redd%.it/.+%?source=fallback$") 111 | or string.match(url, "^https?://[^/]*reddit%.app%.link/") 112 | or string.match(url, "^https?://out%.reddit%.com/r/") 113 | or string.match(url, "^https?://old%.reddit%.com/gallery/") 114 | or string.match(url, "^https?://old%.reddit%.com/gold%?") 115 | or string.match(url, "^https?://[^/]+/over18.+dest=https%%3A%%2F%%2Fold%.reddit%.com") 116 | or string.match(url, "^https?://old%.[^%?]+%?utm_source=reddit") 117 | or string.match(url, "/%?context=1$") 118 | or string.match(url, '/"$') 119 | or string.match(url, "^https?://[^/]+/message/compose") 120 | or string.match(url, "www%.reddit%.com/avatar[/]?$") 121 | or ( 122 | string.match(url, "^https?://gateway%.reddit%.com/") 123 | and not string.match(url, "/morecomments/") 124 | ) 125 | or string.match(url, "/%.rss$") 126 | or ( 127 | parenturl 128 | and string.match(url, "^https?://amp%.reddit%.com/") 129 | ) 130 | or ( 131 | parenturl 132 | and string.match(url, "^https?://v%.redd%.it/[^/]+/HLSPlaylist%.m3u8") 133 | ) 134 | or ( 135 | item_type == "post" 136 | and ( 137 | string.match(url, "^https?://[^/]*reddit%.com/r/[^/]+/comments/[0-9a-z]+/[^/]+/[0-9a-z]+/?$") 138 | or string.match(url, "^https?://[^/]*reddit%.com/r/[^/]+/comments/[0-9a-z]+/[^/]+/[0-9a-z]+/?%?utm_source=") 139 | ) 140 | ) 141 | or ( 142 | parenturl 143 | and string.match(parenturl, "^https?://[^/]*reddit%.com/r/[^/]+/duplicates/") 144 | and string.match(url, "^https?://[^/]*reddit%.com/r/[^/]+/duplicates/") 145 | ) 146 | or ( 147 | parenturl 148 | and string.match(parenturl, "^https?://[^/]*reddit%.com/user/[^/]+/duplicates/") 149 | and string.match(url, "^https?://[^/]*reddit%.com/user/[^/]+/duplicates/") 150 | ) 151 | or ( 152 | parenturl 153 | and string.match(parenturl, "^https?://[^/]+/r/EASportsFC/") 154 | and string.match(url, "^https?://[^/]+/r/FIFA/") 155 | ) then 156 | return false 157 | end 158 | 159 | local tested = {} 160 | for s in 
string.gmatch(url, "([^/]+)") do 161 | if tested[s] == nil then 162 | tested[s] = 0 163 | end 164 | if tested[s] == 6 then 165 | return false 166 | end 167 | tested[s] = tested[s] + 1 168 | end 169 | 170 | if not ( 171 | string.match(url, "^https?://[^/]*redd%.it/") 172 | or string.match(url, "^https?://[^/]*reddit%.com/") 173 | or string.match(url, "^https?://[^/]*redditmedia%.com/") 174 | or string.match(url, "^https?://[^/]*redditstatic%.com/") 175 | ) then 176 | local temp = "" 177 | for c in string.gmatch(url, "(.)") do 178 | local b = string.byte(c) 179 | if b < 32 or b > 126 then 180 | c = string.format("%%%02X", b) 181 | end 182 | temp = temp .. c 183 | end 184 | url = temp 185 | outlinks[url] = true 186 | return false 187 | end 188 | 189 | if url .. "/" == parenturl then 190 | return false 191 | end 192 | 193 | if string.match(url, "^https?://gateway%.reddit%.com/desktopapi/v1/morecomments/") 194 | or string.match(url, "^https?://old%.reddit%.com/api/morechildren$") 195 | or string.match(url, "^https?://[^/]*reddit%.com/video/") then 196 | return true 197 | end 198 | 199 | if ( 200 | string.match(url, "^https?://[^/]*redditmedia%.com/") 201 | or string.match(url, "^https?://v%.redd%.it/") 202 | or string.match(url, "^https?://[^/]*reddit%.com/video/") 203 | or string.match(url, "^https?://i%.redd%.it/") 204 | or string.match(url, "^https?://[^%.]*preview%.redd%.it/.") 205 | ) 206 | and not string.match(item_type, "comment") 207 | and not string.match(url, "^https?://[^/]*redditmedia%.com/mediaembed/") 208 | and not is_crosspost then 209 | if parenturl 210 | and string.match(parenturl, "^https?://www%.reddit.com/api/info%.json%?id=t") 211 | and not string.match(url, "^https?://v%.redd%.it/") 212 | and not string.match(url, "^https?://[^/]*reddit%.com/video/") 213 | and not string.find(url, "thumbs.") then 214 | return false 215 | end 216 | if not string.match(url, "^https?://v%.redd%.it/") 217 | or string.match(url, "%.mp4$") 218 | or string.match(url, "%.ts$") then 219 | reddit_media_urls["url:" .. 
url] = true 220 | return false 221 | end 222 | return true 223 | end 224 | 225 | for s in string.gmatch(url, "([a-z0-9]+)") do 226 | if posts[s] then 227 | return true 228 | end 229 | end 230 | 231 | return false 232 | end 233 | 234 | wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason) 235 | local url = urlpos["url"]["url"] 236 | local html = urlpos["link_expect_html"] 237 | 238 | if item_type == "comment" or item_type == "url" then 239 | return false 240 | end 241 | 242 | if string.match(url, "[<>\\%*%$;%^%[%],%(%){}]") 243 | or string.match(url, "^https?://[^/]*redditstatic%.com/") 244 | or string.match(url, "^https?://old%.reddit%.com/static/") 245 | or string.match(url, "^https?://www%.reddit%.com/static/") 246 | or string.match(url, "^https?://styles%.redditmedia%.com/") 247 | or string.match(url, "^https?://emoji%.redditmedia%.com/") 248 | or string.match(url, "/%.rss$") then 249 | return false 250 | end 251 | 252 | if string.match(parent["url"], "^https?://old%.reddit%.com/comments/[a-z0-9]+") then 253 | return true 254 | end 255 | 256 | url = string.gsub(url, "&amp;", "&") 257 | 258 | if not processed(url) 259 | and (allowed(url, parent["url"]) or (allowed(parent["url"]) and html == 0)) then 260 | addedtolist[url] = true 261 | return true 262 | end 263 | 264 | return false 265 | end 266 | 267 | wget.callbacks.get_urls = function(file, url, is_css, iri) 268 | local urls = {} 269 | local html = nil 270 | local no_more_svc = false 271 | 272 | downloaded[url] = true 273 | 274 | if abortgrab then 275 | return {} 276 | end 277 | 278 | local function check(urla) 279 | if no_more_svc 280 | and string.match(urla, "^https?://[^/]+/svc/") then 281 | return nil 282 | end 283 | local origurl = url 284 | local url = string.match(urla, "^([^#]+)") 285 | local url_ = string.match(url, "^(.-)%.?$") 286 | if not string.find(url, "old.reddit.com") then 287 | url_ = string.gsub( 288 | url_, "\\[uU]([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F])", 289 | function (s) 290 | return utf8.char(tonumber(s, 16)) 291 | end 292 | ) 293 | end 294 | while string.find(url_, "&amp;") do 295 | url_ = string.gsub(url_, "&amp;", "&") 296 | end 297 | if not processed(url_) 298 | and string.match(url_, "^https?://.+") 299 | and allowed(url_, origurl) 300 | and not (string.match(url_, "[^/]$") and processed(url_ .. 
"/")) then 301 | table.insert(urls, { url=url_ }) 302 | addedtolist[url_] = true 303 | addedtolist[url] = true 304 | end 305 | end 306 | 307 | local function checknewurl(newurl) 308 | if string.match(newurl, "^https?:////") then 309 | check(string.gsub(newurl, ":////", "://")) 310 | elseif string.match(newurl, "^https?://") then 311 | check(newurl) 312 | elseif string.match(newurl, "^https?:\\/\\?/") then 313 | check(string.gsub(newurl, "\\", "")) 314 | elseif string.match(newurl, "^\\/\\/") then 315 | checknewurl(string.gsub(newurl, "\\", "")) 316 | elseif string.match(newurl, "^//") then 317 | check(urlparse.absolute(url, newurl)) 318 | elseif string.match(newurl, "^\\/") then 319 | checknewurl(string.gsub(newurl, "\\", "")) 320 | elseif string.match(newurl, "^/") then 321 | check(urlparse.absolute(url, newurl)) 322 | elseif string.match(newurl, "^%.%./") then 323 | if string.match(url, "^https?://[^/]+/[^/]+/") then 324 | check(urlparse.absolute(url, newurl)) 325 | else 326 | checknewurl(string.match(newurl, "^%.%.(/.+)$")) 327 | end 328 | elseif string.match(newurl, "^%./") then 329 | check(urlparse.absolute(url, newurl)) 330 | end 331 | end 332 | 333 | local function checknewshorturl(newurl) 334 | if string.match(newurl, "^%?") then 335 | check(urlparse.absolute(url, newurl)) 336 | elseif not ( 337 | string.match(newurl, "^https?:\\?/\\?//?/?") 338 | or string.match(newurl, "^[/\\]") 339 | or string.match(newurl, "^%./") 340 | or string.match(newurl, "^[jJ]ava[sS]cript:") 341 | or string.match(newurl, "^[mM]ail[tT]o:") 342 | or string.match(newurl, "^vine:") 343 | or string.match(newurl, "^android%-app:") 344 | or string.match(newurl, "^ios%-app:") 345 | or string.match(newurl, "^data:") 346 | or string.match(newurl, "^irc:") 347 | or string.match(newurl, "^%${") 348 | ) then 349 | check(urlparse.absolute(url, newurl)) 350 | end 351 | end 352 | 353 | if string.match(url, "^https?://www%.reddit%.com/") 354 | and not string.match(url, "/api/") 355 | and not string.match(url, "^https?://[^/]+/svc/") then 356 | check(string.gsub(url, "^https?://www%.reddit%.com/", "https://old.reddit.com/")) 357 | end 358 | 359 | local match = string.match(url, "^https?://preview%.redd%.it/([a-zA-Z0-9]+%.[a-zA-Z0-9]+)") 360 | if match then 361 | check("https://i.redd.it/" .. match) 362 | end 363 | 364 | if string.match(url, "is_lit_ssr=") 365 | and not string.match(url, "/svc/shreddit/more%-comments/") then 366 | check(string.gsub(url, "([%?&]is_lit_ssr=)[a-z]+", "%1true")) 367 | check(string.gsub(url, "([%?&]is_lit_ssr=)[a-z]+", "%1false")) 368 | end 369 | 370 | if allowed(url) 371 | and status_code < 300 372 | and item_type ~= "url" 373 | and not string.match(url, "^https?://[^/]*redditmedia%.com/") 374 | and not string.match(url, "^https?://[^/]*redditstatic%.com/") 375 | and not string.match(url, "^https?://out%.reddit%.com/") 376 | and not string.match(url, "^https?://[^%.]*preview%.redd%.it/") 377 | and not string.match(url, "^https?://i%.redd%.it/") 378 | and not ( 379 | string.match(url, "^https?://v%.redd%.it/") 380 | and not string.match(url, "%.m3u8") 381 | and not string.match(url, "%.mpd") 382 | ) then 383 | html = read_file(file) 384 | --[[if string.match(url, "^https?://www%.reddit%.com/[^/]+/[^/]+/comments/[0-9a-z]+/[^/]+/[0-9a-z]*/?$") then 385 | check(url .. 
"?utm_source=reddit&utm_medium=web2x&context=3") 386 | end]] 387 | if string.match(url, "^https?://old%.reddit%.com/api/morechildren$") then 388 | html = string.gsub(html, '\\"', '"') 389 | elseif string.match(url, "^https?://old%.reddit%.com/r/[^/]+/comments/") 390 | or string.match(url, "^https?://old%.reddit%.com/r/[^/]+/duplicates/") then 391 | html = string.gsub(html, "%s*.-%s*%s*%s*", "") 392 | end 393 | if string.match(url, "^https?://old%.reddit%.com/") then 394 | for s in string.gmatch(html, "(return%s+morechildren%(this,%s*'[^']+',%s*'[^']+',%s*'[^']+',%s*'[^']+'%))") do 395 | local link_id, sort, children, limit_children = string.match(s, "%(this,%s*'([^']+)',%s*'([^']+)',%s*'([^']+)',%s*'([^']+)'%)$") 396 | local id = string.match(children, "^([^,]+)") 397 | local subreddit = string.match(html, 'data%-subreddit="([^"]+)"') 398 | local post_data = 399 | "link_id=" .. link_id .. 400 | "&sort=" .. sort .. 401 | "&children=" .. string.gsub(children, ",", "%%2C") .. 402 | "&id=t1_" .. id .. 403 | "&limit_children=" .. limit_children .. 404 | "&r=" .. subreddit .. 405 | "&renderstyle=html" 406 | if not requested_children[post_data] then 407 | requested_children[post_data] = true 408 | print("posting for modechildren with", post_data) 409 | table.insert(urls, { 410 | url="https://old.reddit.com/api/morechildren", 411 | post_data=post_data, 412 | headers={ 413 | ["Content-Type"]="application/x-www-form-urlencoded; charset=UTF-8", 414 | ["X-Requested-With"]="XMLHttpRequest" 415 | } 416 | }) 417 | end 418 | end 419 | --[[elseif string.match(url, "^https?://www%.reddit%.com/r/[^/]+/comments/[^/]") 420 | or string.match(url, "^https?://www%.reddit%.com/user/[^/]+/comments/[^/]") 421 | or string.match(url, "^https?://www%.reddit%.com/comments/[^/]") 422 | or string.match(url, "^https?://gateway%.reddit%.com/desktopapi/v1/morecomments/t3_[^%?]") then 423 | local comments_data = nil 424 | if string.match(url, "^https?://www%.reddit%.com/") then 425 | comments_data = string.match(html, '%s*window%.___r%s*=%s*({.+});%s*%s*