├── README.md
└── getfrontend.py


/README.md:
--------------------------------------------------------------------------------

# getfrontend

> [!IMPORTANT]
> This project is in the **experimental** stage.. take the following usefulness claims with a M☉ of salt...

## Why
Let's say we want to scan the frontend code (of a SPA/PWA) for some API keys that shouldn't be exposed to the client (or just explore the whole frontend code), but the app code is split into many chunks with unpredictable names.. and most of these aren't loaded by the browser (you probably only see the login page).. additionally, for each of these chunks there's a corresponding source map file with the `.map` extension, which happens not to be referenced via `sourceMappingURL` (so the browser doesn't even see it). How do we "get" that frontend?

Well, that's exactly what getfrontend was designed for - to "get" the frontend code, to get all these chunks, to get all these source maps, and to recreate the original files as accurately as possible.

In theory this task is not achievable (let me mention the famous "halting problem"); in practice, however, since many apps are bundled with webpack/vite, it's often possible to enumerate all these chunks even with some simple static analysis.
And that's what the program does - it attempts to recognize:
- webpack chunks in various configurations (including basic federated module support)
- vite chunks
- next.js chunks from a build manifest
- remix chunks from a manifest
- ES6 imports / dynamic imports
- scripts specified in import maps

There's also an "aggressive mode" which - as the name suggests - attempts to find more possible paths (in string literals, for example), but this mode isn't "smart" - smart detection is only possible when the stack in use is known.

And of course there are many false positives, but getfrontend assumes these are irrelevant - if something isn't there, a 404 will be returned and the file simply won't be saved.

By default JS/CSS files are fetched (plus the initial HTML page), but you can specify other extensions if you want them saved too. There are shortcuts for the most common asset/image/media file extensions.

## What this tool doesn't do
Notably, it:
- doesn't try to unpack minified webpack chunks without source maps
- doesn't crawl html pages - it assumes the specified url points to a SPA
- doesn't do any dynamic analysis to discover chunks
- doesn't attempt to deobfuscate obfuscated (not just minified) JS files
- doesn't even try to defend against any kind of targeted DoS (like infinitely many JS files and so on)

## Basic usage

First you obviously need the target url. If you're running getfrontend against a multi-page app with more than one entry, you need to specify all of these entries.
If you know that a particular url should be included but getfrontend can't find it, you can also specify it on the command line.
```
python getfrontend.py [options]... [root url] [optional additional urls]...
```
Note that only the first url is treated as the "root" url.. in practice this means multi-page apps are supported only within the same origin.
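For example, a run against a (hypothetical) app with several entry points, saving everything to a directory, could look like this:
```
python getfrontend.py -o ./dump https://app.example.com/ https://app.example.com/admin/ https://app.example.com/static/js/extra.chunk.js
```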

### Need custom headers? Cookies?
You can specify a custom header to be added to each request using the `--add-header`/`-H` option (it works similarly to `curl`).
Additionally, there's the `--add-cookie`/`-c` convenience argument to add a cookie. Both options may be used multiple times.
```
python getfrontend.py -H'X-Is-Admin: of-course' -c'is_admin=sure' -c'is_a_bot=nope' https://securesite.com/
```

### Choose the output method
By default everything is **dumped to stdout**.. since this might not necessarily be what you want, you can specify the `--output`/`-o` argument:
```
python getfrontend.py -o /tmp/antarctica_realestate.zip https://realestate.aq/
```
**If it ends with the `.zip` suffix**, the files are written to the specified file as a zip archive; otherwise the argument value is treated as the target directory.

### Choose what you want saved
By default, only JS and CSS files (plus the fetched HTML pages) are saved. This is.. admittedly a somewhat arbitrary default...

If you don't want to save CSS files, the `--no-css`/`-nc` argument is your friend :)

If you want to save more, you can either specify extensions manually using the `--asset-extensions`/`-ae` option (comma-separated, or use the argument multiple times), or use these shortcuts:

|Option|Extensions|
| --- | --- |
| `--save-common-assets`/`-sa` | `svg,json,webmanifest,ico,eot,woff,woff2,otf,ttf` |
| `--save-images`/`-si` | `jpg,jpeg,png,gif,webp` |
| `--save-media`/`-sm` | `mp3,ogg,wav,m4a,opus,mp4,mov,mkv,webm` |

### Scripts are being fetched from the wrong path?
While getfrontend attempts to detect the correct "public path" for dynamically loaded chunks, this detection may sometimes yield wrong results, for example when the path is generated in an unusual way.. In that case you might want to supply the path manually.

First, run getfrontend with the `-v` option and look for strings like "public path for":
```
python getfrontend.py -v https://somesite.com/ |& grep 'public path for'
```
You might see something like this:
```
webpack public path for https://somesite.com/js/main.somehash.js is https://somesite.com/
```
Then, if you know what the actual prefix should be (for instance by finding it out in devtools), like `https://somesite.com/js_chunks/`, you can use the `-ppm` option to add the mapping:
```
python getfrontend.py -ppm "https://somesite.com/js/main.somehash.js=https://somesite.com/js_chunks/" https://somesite.com/
```
and then it should work as desired :)


### Chunks/scripts not found?
Try the aforementioned "aggressive mode" by specifying the `--aggressive-mode`/`-a` option.
It might work.. otherwise - if that's a common configuration - consider filing an issue.


### But there's more..
There are more options, check them out by running:
```
python getfrontend.py --help
```
You can also read the code, though be prepared for the worst..
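If you'd rather drive getfrontend from Python than from the command line, the core pieces are the `GetFrontend` class and a plain config dict - see `get_config_from_args()` and `main()` at the bottom of `getfrontend.py`. Here's a minimal sketch (the config keys mirror what `main()` builds and are an internal detail, not a stable API; the target url is a placeholder):
```python
# Illustrative only - this mirrors what main()/get_config_from_args() do below.
from getfrontend import GetFrontend

config = {
    'root': 'https://app.example.com/',        # placeholder target
    'headers': {'User-Agent': 'Mozilla/5.0'},  # main() always sets a User-Agent
    'cookies': {},
    'other_urls': [],
    'origin_whitelist': [],
    'other_asset_extensions': set(),           # e.g. {'svg', 'json'}
    'aggressive_mode': False,
    'public_path_map': {},                     # main() always passes a dict here
}

gf = GetFrontend(config)
gf.run()                        # crawl the app and collect chunks + source maps
gf.export_to_directory('out')   # or export_to_file(...) for a zip, dump_to_stdout()
```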
99 | -------------------------------------------------------------------------------- /getfrontend.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import base64 3 | import hashlib 4 | import html 5 | import sys 6 | import re 7 | import time 8 | import zipfile 9 | import requests 10 | import os 11 | import json 12 | import threading 13 | import urllib3 14 | 15 | from urllib.parse import urljoin, urlparse 16 | from queue import Queue 17 | 18 | urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) 19 | 20 | 21 | class Logger: 22 | LOG = 1 23 | DEBUG = 2 24 | VERBOSE_DEBUG = 3 25 | 26 | def __init__(self): 27 | self.level = self.LOG 28 | 29 | def write(self, level, *args, **kwargs): 30 | if level <= self.level: 31 | print(*args, **kwargs, file=sys.stderr) 32 | 33 | def log(self, *args, **kwargs): 34 | self.write(self.LOG, *args, **kwargs) 35 | 36 | def debug(self, *args, **kwargs): 37 | self.write(self.DEBUG, *args, **kwargs) 38 | 39 | def vdebug(self, *args, **kwargs): 40 | self.write(self.VERBOSE_DEBUG, *args, **kwargs) 41 | 42 | 43 | log = Logger() 44 | 45 | 46 | class SourceMapError(Exception): 47 | pass 48 | 49 | 50 | def normpath(path): 51 | # os.path.normpath allows //, we can't since these are protocol-relative urls 52 | path = os.path.normpath(path) 53 | 54 | if path.startswith('//'): 55 | path = path[1:] 56 | 57 | return path 58 | 59 | 60 | class FArchive: 61 | def __init__(self): 62 | self.files = {} 63 | self.names = set() 64 | self.effective_names = {} 65 | 66 | def normalize_name(self, name): 67 | name = normpath('/' + name)[1:] 68 | return name 69 | 70 | def get_unique_name(self, name): 71 | base_name, ext = os.path.splitext(name) 72 | counter = 2 73 | while name in self.names: 74 | name = f"{base_name}_{counter}{ext}" 75 | counter += 1 76 | self.names.add(name) 77 | 78 | return name 79 | 80 | def add_file(self, name, content): 81 | name = self.normalize_name(name) 82 | 83 | content_hash = hashlib.md5(content).hexdigest() 84 | name_key = name + '//' + content_hash 85 | 86 | if self.effective_names.get(name + '//' + content_hash): 87 | # same name + existing content has means we can skip it 88 | return 89 | 90 | name = self.get_unique_name(name) 91 | self.effective_names[name_key] = name 92 | 93 | assert name not in self.files 94 | 95 | self.files[name] = content 96 | 97 | def write_to_file(self, wf): 98 | with zipfile.ZipFile(wf, 'w', compression=zipfile.ZIP_DEFLATED) as zf: 99 | for name, content in self.files.items(): 100 | zf.writestr(name, content) 101 | 102 | def save_to_directory(self, path): 103 | os.makedirs(path, exist_ok=True) 104 | 105 | for name, content in self.files.items(): 106 | file_path = os.path.join(path, name) 107 | os.makedirs(os.path.dirname(file_path), exist_ok=True) 108 | 109 | with open(file_path, 'wb') as f: 110 | f.write(content) 111 | 112 | def dump_to_stdout(self): 113 | for name, content in self.files.items(): 114 | print(f"// {name}\n// [{len(content)}] bytes\n") 115 | print(content.decode()) 116 | 117 | 118 | class Client: 119 | def __init__(self, timeout: None | float | tuple[float, float] = (8, 16), cookies: dict | None = None, headers: dict | None = None): 120 | self.timeout = timeout 121 | self.sleep = 5 122 | self.rs = requests.Session() 123 | self.rs.verify = False 124 | 125 | if cookies: 126 | self.rs.cookies.update(cookies) 127 | 128 | if headers: 129 | self.rs.headers.update(headers) 130 | 131 | def request(self, method, url, *args, **kwargs): 132 | try: 133 | 
while True: 134 | try: 135 | res = self.rs.request(method, url, *args, timeout=self.timeout, **kwargs) 136 | 137 | if res.status_code in (408, 429, 500, 502, 503, 504): 138 | log.log('retrying', url) 139 | time.sleep(self.sleep) 140 | continue 141 | 142 | return res 143 | 144 | except requests.exceptions.RequestException as e: 145 | log.log('non-status request error, retrying', url, e) 146 | time.sleep(self.sleep) 147 | continue 148 | 149 | except Exception as e: 150 | log.log('non-status request exception', url, e) 151 | return None 152 | 153 | def get(self, *args, **kwargs): 154 | return self.request('GET', *args, **kwargs) 155 | 156 | 157 | class Fetcher: 158 | def __init__(self, client: Client, n_workers): 159 | self.input_queue = Queue() 160 | self.output_queue = Queue() 161 | self.client = client 162 | self.queued_urls = set() 163 | self.lock = threading.Lock() 164 | self.pending_urls = 0 165 | 166 | self.n_workers = n_workers 167 | for _ in range(n_workers): 168 | threading.Thread(target=self.worker).start() 169 | 170 | def queue(self, url, *args): 171 | # called from the main thread only 172 | 173 | if url not in self.queued_urls: 174 | self.pending_urls += 1 175 | self.queued_urls.add(url) 176 | self.input_queue.put([url, *args]) 177 | 178 | def get_response(self): 179 | # called from the main thread only 180 | 181 | if not self.pending_urls: 182 | self.shutdown_workers() 183 | return None 184 | 185 | ret = self.output_queue.get() 186 | self.pending_urls -= 1 187 | return ret 188 | 189 | def worker(self): 190 | while True: 191 | arg = self.input_queue.get() 192 | if arg is None: 193 | break 194 | 195 | [url, *args] = arg 196 | 197 | response = self.client.get(url) # doesn't throw 198 | self.output_queue.put([url, response, *args]) 199 | 200 | def shutdown_workers(self): 201 | for _ in range(self.n_workers): 202 | self.input_queue.put(None) 203 | 204 | 205 | def prepare_link(link): 206 | if '?' 
in link: 207 | link = link.split('?', 1)[0] 208 | 209 | if '#' in link: 210 | link = link.split('#', 1)[0] 211 | 212 | return link 213 | 214 | 215 | class Crawler: 216 | def __init__(self, gf, root, save_prefix=''): 217 | self.gf: GetFrontend = gf 218 | 219 | self.save_prefix = save_prefix # this also affects url assets 220 | self.root = root 221 | 222 | self.webpack_chunk_formats = [] 223 | self.possible_webpack_chunk_ids = set() 224 | 225 | def check_prefix(self, link): 226 | if not self.gf.prefix_whitelist: 227 | return True 228 | 229 | for prefix in self.gf.prefix_whitelist: 230 | if link.startswith(prefix): 231 | return True 232 | 233 | return False 234 | 235 | def save_fetched_asset(self, path, content): 236 | # path is an url 237 | # this is meant for original assets 238 | 239 | path = re.sub(r'^(https?:)//', r'\1', path) 240 | return self.gf.save_asset(self.save_prefix + path, content) 241 | 242 | def save_mapped_asset(self, path, content, origin_path): 243 | return self.save_generated_asset(path, content, origin_path, 'mapped') 244 | 245 | def save_unpacked_asset(self, path, content, origin_path): 246 | return self.save_generated_asset(path, content, origin_path, 'unpacked') 247 | 248 | def save_generated_asset(self, path, content, origin_path, label): 249 | origin = re.search(r'https?://([^/]+)', origin_path).group(1) 250 | self.gf.save_asset(f'{self.save_prefix}{label}@{origin}/{path}', content) 251 | 252 | def queue(self, link, tag, *args): 253 | self.gf.fetcher.queue(link, self, tag, *args) 254 | 255 | def queue_link(self, link, tag=None, fallback=None): 256 | if not self.check_prefix(link) and not fallback: 257 | return 258 | 259 | link = prepare_link(link) 260 | 261 | if tag is None: 262 | if link.endswith('.css') and not self.gf.skip_css: 263 | tag = 'css' 264 | elif re.search(r'\.m?[jt]sx?$', link): 265 | tag = 'js' 266 | elif self.gf.other_asset_extensions and link.endswith('.webmanifest'): 267 | tag = 'webmanifest' 268 | elif self.gf.other_asset_extensions and re.search(rf'{self.gf.asset_ext_pat}$', link, flags=re.IGNORECASE): 269 | tag = 'asset' 270 | 271 | if not tag and fallback: 272 | tag = fallback 273 | 274 | if tag: 275 | self.queue(link, tag) 276 | 277 | def handle_result(self, url, response, mode, *args): 278 | response = self.gf.check_response(url, response) 279 | if response: 280 | if mode == 'dynamic': 281 | content_type = response.headers.get('content-type', '') 282 | if ';' in content_type: 283 | content_type = content_type.split(';', 1)[0] 284 | 285 | match content_type: 286 | case "text/javascript" | "application/javascript" | "application/x-javascript": 287 | mode = 'js' 288 | case "text/css": 289 | mode = 'css' 290 | case "text/html" | "application/xhtml+xml" | "application/xml": 291 | mode = 'page' 292 | case _: 293 | mode = 'asset' 294 | 295 | log.debug('dynamic mode detected as', mode, 'for url', url) 296 | 297 | match mode: 298 | case "page": 299 | self.handle_html_response(url, response) 300 | case "nextjs": 301 | self.handle_nextjs_manifest(url, response) 302 | case "js": 303 | self.handle_js(url, response) 304 | case "css": 305 | self.handle_css(url, response) 306 | case "remote_entry_js": 307 | self.handle_remote_module(url, response, *args) 308 | case "webmanifest": 309 | self.handle_webmanifest(url, response) 310 | case "asset": 311 | self.handle_asset(url, response) 312 | 313 | def handle_js_data(self, content, path, skip_sourcemaps=False): 314 | self.find_webpack_chunk_info(content, path) 315 | self.find_federated_modules(content, path) 316 
| self.find_webpack_chunk_refs(content, path) 317 | self.find_vite_chunks(content, path) 318 | self.unpack_webpack_eval_sources(content, path) 319 | 320 | if not skip_sourcemaps: 321 | self.handle_content_sourcemaps(content, path) 322 | 323 | # module scan, here we might encounter absolute links 324 | for module_link in find_import_references(content, path): 325 | self.queue_link(module_link) 326 | 327 | self.find_imported_scripts(content, path) 328 | self.find_workers(content, path) 329 | self.find_manifests(content, path) 330 | 331 | if self.gf.aggressive_mode: 332 | self.run_aggressive_scan(content, path) 333 | 334 | def handle_js(self, url, res): 335 | res_headers = res.headers 336 | res = self.gf.decode_response(res) 337 | 338 | skip_sourcemaps = False 339 | 340 | if self.gf.ignore_vendor: 341 | last_part = url.rsplit('/', 1)[-1] 342 | if re.match(r'(chunk[-.])?vendors?[-.]', last_part): 343 | skip_sourcemaps = True 344 | log.debug('skipping maps for', url, 'due to vendor detection') 345 | 346 | self.handle_js_data(res, url, skip_sourcemaps=skip_sourcemaps) 347 | 348 | should_save = True 349 | 350 | if not skip_sourcemaps: 351 | if self.handle_header_sourcemaps(res_headers, url): 352 | should_save = False 353 | 354 | # it's often the case that there wasn't a comment but a sourcemap exists 355 | # we don't queue because we need the result here 356 | if self.fetch_and_handle_srcmap(url + '.map'): 357 | should_save = False 358 | 359 | if should_save or self.gf.save_original_assets: 360 | self.save_fetched_asset(url, res.encode()) 361 | 362 | def handle_css_data(self, content, path): 363 | self.handle_content_sourcemaps(content, path) 364 | 365 | url_pat = r'''(?\s*import\({vite_file_pat}\),\s*__vite__mapDeps\(\[(\d+)''', content): 651 | proper_path = urljoin(current_path, m.group(1)) 652 | dep_path = vite_deps[int(m.group(2))] 653 | 654 | assert proper_path.endswith(dep_path) 655 | 656 | vite_base = proper_path[:-len(dep_path)] 657 | log.debug('vite base', vite_base) 658 | else: 659 | log.debug('failed to find vite base path') 660 | vite_base = self.root 661 | 662 | if vite_base: 663 | for dep in vite_deps: 664 | chunk_path = urljoin(vite_base, dep) 665 | log.log('adding vite', chunk_path) 666 | self.queue_link(chunk_path) 667 | 668 | elif not had_vite_deps: 669 | # the older variant with no __vite__mapDeps 670 | # here we use the first import to derive the base path 671 | 672 | for m in re.finditer(rf'''\(\s*\(\)\s*=>\s*import\({vite_file_pat}\),\s*\[\s*(({vite_file_pat},?\s*)+)''', content): 673 | deps = [] 674 | for dm in re.finditer(vite_file_pat, m.group(2)): 675 | chk = dm.group(1) 676 | if chk.startswith('.'): 677 | chunk_path = urljoin(current_path, chk) 678 | log.log('adding vite1 rel', chunk_path) 679 | self.queue_link(chunk_path) 680 | else: 681 | deps.append(chk) 682 | 683 | if not deps: 684 | continue 685 | 686 | # now we have those that require a base 687 | 688 | if not vite_base: 689 | proper_path = urljoin(current_path, m.group(1)) 690 | dep_path = deps[0] 691 | 692 | if not proper_path.endswith(dep_path): 693 | # that's not vite... 
694 | continue 695 | 696 | vite_base = proper_path[:-len(dep_path)] 697 | log.debug('vite base2', vite_base) 698 | 699 | for dep in deps: 700 | chunk_path = urljoin(vite_base, dep) 701 | log.log('adding vite2', chunk_path) 702 | self.queue_link(chunk_path) 703 | 704 | def unpack_webpack_eval_sources(self, content, current_path): 705 | for m in re.finditer(r'''[\n{]eval\s*\(\s*(?:"((?:\\.|[^"\\])+)"|'((?:\\.|[^'\\])+)')\s*\)''', content): 706 | src = '' 707 | 708 | if src := m.group(2): # transform single quotes so we can decode as json 709 | src = src.replace("\\'", "'").replace('"', '\\"') 710 | else: 711 | src = m.group(1) 712 | 713 | if '//# sourceURL=' not in src: 714 | continue 715 | 716 | src = json.loads('"' + src + '"') 717 | 718 | if m := re.search(r'\n//# sourceURL=([^\n?]+)[?]?', src): 719 | name = m.group(1) 720 | content = src[:m.start()] 721 | 722 | name, content = self._prepare_mapped_asset(name, content) 723 | log.debug('unpacking eval asset', name, 'from', current_path) 724 | self.save_unpacked_asset(name, content.encode(), current_path) 725 | 726 | def add_webpack_chunk_format(self, fmt): 727 | self.webpack_chunk_formats.append([fmt, set()]) 728 | 729 | def add_possible_webpack_chunk_id(self, chunk_id): 730 | self.possible_webpack_chunk_ids.add(chunk_id) 731 | 732 | def queue_possible_webpack_chunks(self): 733 | for resolve, queued in self.webpack_chunk_formats: 734 | for chunk_id in self.possible_webpack_chunk_ids - queued: 735 | self.queue_link(resolve(chunk_id)) 736 | 737 | queued.update(self.possible_webpack_chunk_ids) 738 | 739 | def find_webpack_chunk_info(self, res, current_path): 740 | # todo: this vs remote? what if we encounter remoteEntry.js 741 | 742 | # this works since 2015 743 | is_webpack_chunk_runtime = 'ChunkLoadError' in res or "'Loading chunk '" in res or '"Loading chunk "' in res or 'Automatic publicPath is not supported in this browser' in res 744 | 745 | if not is_webpack_chunk_runtime: 746 | return 747 | 748 | res = res.replace('\\u002F', '/') 749 | 750 | if current_path in self.gf.public_path_map: 751 | public_path = self.gf.public_path_map[current_path] 752 | 753 | else: 754 | public_path = '' 755 | 756 | # note for paths like someVariable + sth, we assume someVariable is empty 757 | for m in re.finditer(r'''(?:\w|__webpack_require__)\.p\s*=(\s*[\w.]+\s*\+)?\s*(?P['"])(?P[^'"]*)(?P=quot)\s*[,;})]''', res): 758 | # we pick the last one 759 | public_path = m.group('path') 760 | 761 | if 'Automatic publicPath is not supported in this browser' in res: 762 | # in one case it was relative to the script.. is it always true for automatic publicpath? 763 | # EDIT: well, no... need more data 764 | public_path = urljoin(current_path, 'abc')[:-3] 765 | 766 | # public_path is sometimes empty.. 
in that case it won't work with urljoin, we assume the root folder is used 767 | if public_path == '': 768 | public_path = urljoin(self.root, 'abc')[:-3] 769 | 770 | # relative to root, not the script 771 | public_path = urljoin(self.root, public_path) 772 | 773 | log.debug('webpack public path for', current_path, 'is', public_path) 774 | 775 | # first we need some cleanup, clean /******/ then clean // comments 776 | wr = re.sub(r'/\*{3,}/', ' ', res) 777 | # be careful not to trip strings like https:// 778 | wr = re.sub(r'\n\s*//.*', ' ', wr) 779 | 780 | # resolve full hashes 781 | def make_hash_repl(target): 782 | def hash_repl(m): 783 | ret = target 784 | if maxlen := m.group('maxlen'): 785 | log.debug('maxlen', maxlen) 786 | ret = target[:int(maxlen)] 787 | 788 | return '"' + ret + '"' 789 | 790 | return hash_repl 791 | 792 | hash_maxlen_pat = r'(?:\.(?:slice|substr(?:ing)?)\(\s*0,\s*(?P\d+)\))?' 793 | 794 | if 'hotCurrentHash' in wr and (full_hash := re.search(r'[^a-zA-Z0-9$_]hotCurrentHash\s*=\s*"(?P[a-fA-F0-9]+)"', wr)): 795 | full_hash = full_hash.group('hash') 796 | log.debug('hotcurrenthash', full_hash) 797 | wr = re.sub(rf'hotCurrentHash{hash_maxlen_pat}', make_hash_repl(full_hash), wr) 798 | 799 | last_match = None 800 | for m in re.finditer(r'''(__webpack_require__|\w)\.h\s*=\s*(?:function\s*\(\s*\)\s*\{\s*return(?![a-zA-Z0-9$_])|\(\s*\)\s*=>(?:\s*\{\s*return(?![a-zA-Z0-9$_]))?)\s*(?:\(\s*)?['"](?P[^'"]+)['"]''', wr): 801 | last_match = m 802 | 803 | if m := last_match: 804 | full_hash = m.group('hash') 805 | log.debug('replacing full hash', full_hash) 806 | wr = re.sub(rf'(?\s*(?:\{\s*|\(\s*)?)' 813 | r1v_func_start = r'(?:function(?:\s+\w+|\s*)\(\s*\w+\s*\)\s*\{\s*|=>\s*(?:\{\s*|\(\s*)?)' # optimized 814 | 815 | static_path_param = r'''['"]\s*\+\s*\w+\s*\+\s*['"]''' 816 | static_path_inner_pat = rf'''[^'"]+(?:{static_path_param}[^'"]+)?''' 817 | static_multi_ids_pat = r'''\{(?:(?:\d+(?:e\d+)?|['"][^'"]+['"]):1,?)+\}\s*\[\w+\]''' 818 | static_chunk_pat1 = rf'''if\s*\((?:\w+\s*===\s*(?P\d+(?:e\d+)?|['"][^'"]+['"])|(?P{static_multi_ids_pat}))\)\s*return\s*['"](?P{static_path_inner_pat})['"]\s*;\s*''' 819 | static_chunk_pat2 = rf'''(?:(?P\d+(?:e\d+)?|['"][^'"]+['"])===\w+|(?P{static_multi_ids_pat}))\?['"](?P{static_path_inner_pat})['"]:''' 820 | 821 | start_v1 = rf'(?:{r1v_func_start}(return(?![a-zA-Z0-9$_])\s*\(?\s*)?|\.src\s*=\s*(?:\([^;]{"{,5}"})?)(?:\w|__webpack_require__)\.p\s*\+' 822 | # return is possible in two locations depending on static_chunks variant 823 | start_v2 = rf'\.u\s*=\s*{r1v_func}(return(?![a-zA-Z0-9$_])\s*\(?\s*)?(?P(?:{static_chunk_pat1}|{static_chunk_pat2})+)?(return(?![a-zA-Z0-9$_])\s*\(?\s*)?' 824 | 825 | prefix_pat = r'''['"](?P[^"' ]*)['"]''' 826 | 827 | # premap can be identity or in a compact form 828 | # but... there might be no premap.. 
we saw this with federated modules 829 | premap_pat = r'''(?:\(\(?\s*\{(?P[^{}]*)\}\)?\s*\[(?:\s*\w+\s*=)?\s*\w+\s*\]\s*\|\|\s*\w+\s*\)|\{(?P[^{}]*)\}\s*\[(?:\s*\w+\s*=)?\s*\w+\s*\]|\((?:(?P\d+)\s*===\s*\w+|\w+\s*===\s*(?P\d+))\s*\?\s*"(?P[^"]+)"\s*:\s*\w+\)|(?P\w+))''' 830 | 831 | # exhaustive maps 832 | map_pat = r'''(?:['"](?P[^"' ]*)['"]\s*\+\s*)?\(?\{(?P[^{}]*)\}\)?\s*\[(?:\w+\s*=\s*)?\w+\]''' 833 | 834 | qmap_pat_common = r'''\?\w+=)['"]\s*\+\s*\{(?P[^{}]*)\}\s*\[(?:\w+\s*=\s*)?\w+\]\s*[,;]''' 835 | qmap_pat = r'''['"](?P[^"' ]*\.m?jsx?''' + qmap_pat_common 836 | qmap_css_pat = r'''['"](?P[^"' ]*\.css''' + qmap_pat_common 837 | 838 | suffix_pat = r'''(?:['"](?P[^"']*\.m?jsx?)(?:\?t=\d+)?['"]\s*[^+]|(?P(?<=:)void\(?\s*0\s*\)?|undefined))''' 839 | 840 | def parse_chunk_match(m, search_static=False): 841 | prefix = m.group('prefix') or '' 842 | suffix = m.group('suffix') or '' 843 | 844 | known_ids = set() 845 | exhaustive = False 846 | 847 | # premap should be constructed for the chunk format... 848 | # either from parse chunkmap if a dict, or from cm map 849 | # or identity 850 | # empty premap is not truthy but it's not "None", can't use "or" 851 | if m.group('premap_e') is not None: 852 | pm = m.group('premap_e') 853 | exhaustive = True 854 | else: 855 | pm = m.group('premap') 856 | 857 | if pm is not None: 858 | premap = parse_chunkmap(pm) 859 | known_ids.update(premap.keys()) 860 | elif cid := m.group('cpm_id') or m.group('cpm_id_2'): 861 | premap = {cid: m.group('cpm_value')} 862 | known_ids.add(cid) 863 | elif m.group('identity'): 864 | premap = {} 865 | else: 866 | premap = None 867 | 868 | cmap = m.group('map') or m.group('qmap') 869 | if cmap: 870 | cmap = parse_chunkmap(cmap) 871 | known_ids.update(cmap.keys()) 872 | exhaustive = True 873 | 874 | if m.group('qmap'): 875 | sep = m.group('qmap_sep') 876 | else: 877 | sep = m.group('sep') or '' 878 | 879 | static_map = {} 880 | if search_static: 881 | if sp := m.group('static_chunks'): 882 | for sm in re.finditer(rf'{static_chunk_pat1}|{static_chunk_pat2}', sp): 883 | chunk_ids = [] 884 | 885 | if single_id := sm.group('static1_id') or sm.group('static2_id'): 886 | chunk_ids.append(parse_chunk_id(single_id)) 887 | else: 888 | ids = sm.group('static1_ids') or sm.group('static2_ids') 889 | for scm in re.finditer(r'''(\d+(?:e\d+)?|['"][^'"]+['"]):1''', ids): 890 | chunk_ids.append(parse_chunk_id(scm.group(1))) 891 | 892 | path_src = sm.group('static1_path') or sm.group('static2_path') 893 | 894 | for chunk_id in chunk_ids: 895 | chunk_path = re.sub(static_path_param, chunk_id, path_src) 896 | log.debug('static path', chunk_path) 897 | static_map[chunk_id] = chunk_path 898 | known_ids.add(chunk_id) 899 | 900 | log.debug('static path map', static_map) 901 | 902 | def resolve(chunk_id): 903 | if chunk_id in static_map: 904 | chunk_path = static_map[chunk_id] 905 | else: 906 | chunk_path = premap.get(chunk_id, chunk_id) if premap is not None else '' 907 | chunk_path += sep + (cmap[chunk_id] if cmap else '') 908 | chunk_path = prefix + chunk_path + suffix 909 | 910 | return urljoin(public_path, chunk_path) 911 | 912 | depends_on_id = static_map or premap is not None or cmap 913 | 914 | return depends_on_id, known_ids, exhaustive, resolve 915 | 916 | has_exhaustive_chunks = False 917 | 918 | pattern = rf'(?:{start_v1}|{start_v2})\s*(?:{prefix_pat}\s*\+\s*)?(?:{premap_pat}\s*\+\s*)?(?:{qmap_pat}|(?:{map_pat}\s*\+\s*)?{suffix_pat})' 919 | 920 | last_match = None 921 | for m in re.finditer(pattern, wr): 922 | last_match = m 923 | 924 | if 
m := last_match: 925 | depends_on_id, known_ids, exhaustive, resolve = parse_chunk_match(m, True) 926 | log.debug('webpack match result', current_path, known_ids) 927 | 928 | # the js version should depend on the id somehow.. 929 | if not depends_on_id: 930 | log.log('webpack: no premap and no map', current_path) 931 | 932 | if exhaustive: 933 | has_exhaustive_chunks = True 934 | # here we don't add the chunk format deliberately 935 | 936 | for chunk_id in known_ids: 937 | self.queue_link(resolve(chunk_id)) 938 | 939 | else: 940 | self.add_webpack_chunk_format(resolve) 941 | for chunk_id in known_ids: 942 | self.add_possible_webpack_chunk_id(chunk_id) 943 | 944 | self.queue_possible_webpack_chunks() 945 | 946 | if not self.gf.skip_css: 947 | suffix_css_pat = r'''['"](?P[^"']*\.css)(?:\?t=\d+)?['"]\s*[^+]''' 948 | css_pattern = rf'(?P\.miniCssF\s*=\s*{r1v_func}(return(?![a-zA-Z0-9$_])\s*)?|(?:for\s*\(|\{"{"})\s*var \w+\s*=)\s*(?:{prefix_pat}\s*\+\s*)?(?:{premap_pat}\s*\+\s*)?(?:{qmap_css_pat}|(?:{map_pat}\s*\+\s*)?{suffix_css_pat})' 949 | 950 | last_match = None 951 | for m in re.finditer(css_pattern, wr): 952 | last_match = m 953 | 954 | # css chunks.. they're always exhaustive (a subset of js chunks?) 955 | if m := last_match: 956 | has_css_map = None 957 | 958 | # try to find the 01 map, a subset of emap 959 | if 'var ' in m.group('prelude'): 960 | # in this case we match the map.. backwards 961 | if m2 := re.search(r'''(?:[;,]|\]\s*\w+\s*\[)\}\s*((?:,?1\s*:\s*(?:\d+(?:e\d+)?|["'][^'"]*['"])\s*)+)\{''', wr[:m.start()][::-1]): 962 | has_css_map = m2.group(1)[::-1] 963 | 964 | else: 965 | # the map is inside the minicss 966 | if m2 := re.search(r'''\.miniCss\s*=\s*(?:function|\().*?\{(\s*((?:\d+(?:e\d+)?|["'][^'"]*['"])\s*:\s*1,?\s*)+)\}''', wr, flags=re.DOTALL): 967 | has_css_map = m2.group(1) 968 | 969 | if has_css_map is not None: 970 | cstr = has_css_map 971 | has_css_map = set() 972 | 973 | for cid in re.findall(r'''([a-zA-Z0-9_$]+|['"][^'"]+['"])\s*:\s*1,?''', cstr): 974 | has_css_map.add(parse_chunk_id(cid)) 975 | 976 | depends_on_id, known_ids, exhaustive, resolve = parse_chunk_match(m) 977 | log.debug('css chunks', has_css_map, known_ids) 978 | 979 | if not depends_on_id and not has_css_map: 980 | # corner case: only one chunk... :D 981 | has_css_map = set(['']) 982 | 983 | if has_css_map is None: 984 | # basically this.. 
"should not happen" 985 | # might happen if css chunks not used 986 | log.log('webpack: no css bitmap', current_path) 987 | 988 | has_css_map = set() 989 | for chunk_id in known_ids: 990 | has_css_map.add(chunk_id) 991 | 992 | for chunk_id in has_css_map: 993 | self.queue_link(resolve(chunk_id)) 994 | 995 | if not has_exhaustive_chunks: 996 | # these are all inside the webpack runtime, not other chunks 997 | # preload/prefetch maps 998 | 999 | chunk_id_pat = r'\d+(?:e\d+)?|"[^"]+"' # map keys might be strings 1000 | 1001 | for pm in re.finditer(r'var \w+\s*=\s*\{(?P[^{}]+)\};\s*(__webpack_require__|\w)\.f\.pre(?:load|fetch)\s*=', wr): 1002 | for pcm in re.finditer(rf'({chunk_id_pat})\s*:\s*\[([^\[\]]+)\]', pm.group('map')): 1003 | chunk_id = parse_chunk_id(pcm.group(1)) 1004 | log.debug('adding dephead', chunk_id) 1005 | self.add_possible_webpack_chunk_id(chunk_id) 1006 | 1007 | for pcmc in re.finditer(chunk_id_pat, pcm.group(2)): 1008 | chunk_id = parse_chunk_id(pcmc.group(0)) 1009 | log.debug('adding depchild', chunk_id) 1010 | 1011 | self.add_possible_webpack_chunk_id(chunk_id) 1012 | 1013 | # startup prefetch 1014 | for pm in re.finditer(r'\[([^\[\]]+)\]\.map\((?:__webpack_require__|\w)\.E\)', wr): 1015 | for pmc in re.finditer(chunk_id_pat, pm.group(1)): 1016 | chunk_id = parse_chunk_id(pmc.group(0)) 1017 | log.debug('adding startup', chunk_id) 1018 | self.add_possible_webpack_chunk_id(chunk_id) 1019 | 1020 | self.queue_possible_webpack_chunks() 1021 | 1022 | def find_webpack_chunk_refs(self, res, current_path): 1023 | # first we need some cleanup, clean /******/ clean // comments, clean escapes 1024 | # note these refs can be inside eval 1025 | wr = re.sub(r'/\*{3,}/|\\[nt]', ' ', res) 1026 | wr = re.sub(r'\n\s*//.*', ' ', wr) # be careful not to trip strings like https:// 1027 | wr = wr.replace(r'\"', '"') 1028 | 1029 | req_pat = r'(?:__webpack_require__|__nested_webpack_require_\d+__|\w)\.e' 1030 | 1031 | # search for context maps (for dynamic require support) 1032 | # these probably aren't exhaustive 1033 | # we simply look for all integers and strings except map keys and first items 1034 | chunk_id_pat = r'[{,]\s*(\d+(?:e\d+)?|"[^"]+")(?!\s*:)' 1035 | 1036 | added = False 1037 | 1038 | if re.search(rf'Promise\.all\(\w+\.slice\(\d+\)\.map\({req_pat}\)|return\s+(?:\w\s*\?\s*)?{req_pat}\(\w+\[[1-3]\]\)\.then', wr): 1039 | # we need to find a map, but we don't really know which variable is correct 1040 | # keys in this context map are strings 1041 | 1042 | map_found = False 1043 | 1044 | for m in re.finditer(r'''var \w+\s*=\s*\{((['"][^'"]*['"]|[\[\],\s:]|[0-9e])+)\}''', wr): 1045 | log.debug('async context: found map variable', m.group(1)) 1046 | map_found = True 1047 | 1048 | for md in re.finditer(chunk_id_pat, m.group(1)): 1049 | chunk_id = parse_chunk_id(md.group(1)) 1050 | log.debug('adding chunk from context map', chunk_id) 1051 | self.add_possible_webpack_chunk_id(chunk_id) 1052 | added = True 1053 | 1054 | assert map_found 1055 | 1056 | # search for possible ensure references, they might contain comments 1057 | # chunk ids can also be strings.. 
1058 | # now we try to avoid false positives, so we assume no spaces inside 1059 | 1060 | chunk_id_pat = r'''\d+(?:e\d+)?|['"][^"'\s]+['"]''' 1061 | chunk_ref_pat = rf'(?:__webpack_require__|__nested_webpack_require_\d+__|\W\w)\.e\((?:/\*.*?\*/\s*)?({chunk_id_pat})\)' 1062 | 1063 | for m in re.finditer(chunk_ref_pat, wr): 1064 | chunk_id = parse_chunk_id(m.group(1)) 1065 | log.debug('adding manual chunk', chunk_id) 1066 | self.add_possible_webpack_chunk_id(chunk_id) 1067 | added = True 1068 | 1069 | if added: 1070 | self.queue_possible_webpack_chunks() 1071 | 1072 | def find_federated_modules(self, res, current_path): 1073 | for m in re.finditer(r'''new Promise\([^&}]*?['"](https?://[^'"?#]+.js)(?:[?#][^'"]*)?['"][^{]+\{[^{}]*?['"]ScriptExternalLoadError['"][^{'"]+['"]([^'"]+)['"]''', res): 1074 | url, app_name = m.group(1), m.group(2) 1075 | 1076 | log.log('adding federated module', m.group(1), m.group(2)) 1077 | self.queue(url, 'remote_entry_js', app_name) 1078 | 1079 | def handle_remote_module(self, url, res, app_name): 1080 | res = self.gf.decode_response(res) 1081 | 1082 | # it's not mapped, so save it as is 1083 | self.save_fetched_asset(url, res.encode()) 1084 | 1085 | # we need public path, but it needs to be absolute 1086 | public_path = re.search(r'''(?:\w|__webpack_require__)\.p\s*=\s*['"](https?://[^'"?#]+/)''', res) 1087 | assert public_path 1088 | public_path = public_path.group(1) 1089 | 1090 | log.debug('remote module', app_name, 'path', public_path) 1091 | rm_crawler = Crawler(self.gf, public_path, f'module:{app_name}/') 1092 | 1093 | # just send this js file to the new crawler, it should handle it all 1094 | rm_crawler.handle_js_data(res, url) 1095 | 1096 | def run_aggressive_scan(self, content, path): 1097 | # todo: should we scan css url here? 
would only matter for non-quoted 1098 | 1099 | content = re.sub(r'\\u002f', '/', content, flags=re.IGNORECASE) 1100 | content = re.sub(r'\\u005c', '\\\\', content, flags=re.IGNORECASE) 1101 | content = re.sub(r'''\\(['"])''', r'\1', content) 1102 | content = re.sub(r'\\[nrt]', ' ', content) 1103 | content = re.sub(r'\\/', '/', content) 1104 | 1105 | for m in re.finditer(self.gf.aggressive_rel_pat, content, flags=re.IGNORECASE): 1106 | # two variants - relative to the script or relative to the document 1107 | 1108 | lpart = m.group(1).replace('\\', '') 1109 | link = urljoin(path, lpart) 1110 | link2 = urljoin(self.root, lpart) 1111 | 1112 | log.log('aggressive rel match', path, link) 1113 | self.queue_link(link) 1114 | 1115 | if link != link2: 1116 | log.log('aggressive rel doc match', path, link2) 1117 | self.queue_link(link2) 1118 | 1119 | for m in re.finditer(self.gf.aggressive_abs_pat, content, flags=re.IGNORECASE): 1120 | link = urljoin(path, m.group(0).replace('\\', '')) 1121 | log.log('aggressive abs match', path, link) 1122 | self.queue_link(link) 1123 | 1124 | def find_nextjs_chunks(self, content, links): 1125 | has_manifest = False 1126 | next_path = None 1127 | 1128 | for link in links: 1129 | if '/_next/static/' not in link: 1130 | continue 1131 | 1132 | if not next_path: 1133 | next_path = urljoin(self.root, link[:link.index('/_next/static/') + len('/_next/')]) 1134 | 1135 | if '/_next/static/chunks/' not in link and link.endswith('/_buildManifest.js'): 1136 | has_manifest = True 1137 | log.log('adding next.js manifest', link) 1138 | self.queue(link, 'nextjs') 1139 | 1140 | if has_manifest or not next_path: 1141 | return 1142 | 1143 | content = content.replace('\\u002F', '/') 1144 | content = content.replace(r'\"', '"') 1145 | 1146 | for m in re.finditer(r'"(?:_next/|\d+:)?(static/(?:chunks|css)/[^"]+\.(?:m?jsx?|css))(?:[#?][^"]*)?"', content): 1147 | link = next_path + m.group(1) 1148 | log.log('adding nextjs chunk from non-manifest', link) 1149 | self.queue_link(next_path + m.group(1)) 1150 | 1151 | def handle_html_response(self, url, response): 1152 | links = [] 1153 | 1154 | if link_header := response.headers.get('link'): # multiple are joined by comma 1155 | for m in re.finditer(r'<([^>?#]+)', link_header): 1156 | log.debug('preload header', m.group(1)) 1157 | links.append(m.group(1)) 1158 | 1159 | html_content = self.gf.decode_response(response) 1160 | 1161 | # generally should guess as much as we can from html 1162 | # we assume there's only one html source 1163 | 1164 | # we intentionally want to catch attributes like data-href 1165 | for m in re.finditer(r'''(?:href|src)\s*=\s*(?:['"]|(?=/))(?!data:)([^"'?#> ]+)''', html_content): 1166 | log.debug('link', m.group(1)) 1167 | links.append(m.group(1)) 1168 | 1169 | # here we actually want to avoid false positives 1170 | if not self.gf.ignore_base_tag and (m := re.search(r''']+ )?href\s*=\s*['"]([^'">]+)['"]''', html_content)): 1171 | url = urljoin(url, m.group(1)) 1172 | log.log('base url changed to', url) 1173 | 1174 | links = [urljoin(url, prepare_link(link)) for link in links] 1175 | links = [link for link in links if self.check_prefix(link)] 1176 | 1177 | # todo maybe optional, maybe log that the script was filtered? 
1178 | links = filter_es_variants(links) 1179 | 1180 | # special handlers run first since only the first tag for an url is queued 1181 | self.find_nextjs_chunks(html_content, links) 1182 | 1183 | if self.gf.other_asset_extensions: 1184 | for m in re.finditer(r''']*?)rel=['"]?manifest['"]?([^>]*)/?>''', html_content): 1185 | if hm := re.search(r'''href=['"]?([^'"\s]+)''', m.group(1) + m.group(2)): 1186 | link = urljoin(url, hm.group(1)) 1187 | log.log('adding link manifest', link) 1188 | self.queue_link(link, 'webmanifest') 1189 | 1190 | for link in links: 1191 | self.queue_link(link) 1192 | 1193 | # save this file 1194 | this_file_path = url + 'index.html' if url.endswith('/') else url 1195 | if not re.search(r'\.html?', this_file_path, flags=re.IGNORECASE): 1196 | this_file_path += '.html' 1197 | 1198 | self.save_fetched_asset(urljoin(url, this_file_path), html_content.encode()) 1199 | 1200 | # handle inline scripts 1201 | inline_script_content = '' 1202 | for m in re.finditer(r''']*)?>(.*?)''', html_content, flags=re.DOTALL): 1203 | script_content = html.unescape(m.group(1)).strip() 1204 | if script_content: 1205 | inline_script_content += '\n//=====\n' + script_content 1206 | self.handle_js_data(script_content, url) 1207 | 1208 | if inline_script_content: 1209 | self.save_fetched_asset(this_file_path + '.__inline.js', inline_script_content.encode()) 1210 | 1211 | # handle inline styles 1212 | inline_style_content = '' 1213 | for m in re.finditer(r''']*)?>(.*?)''', html_content, flags=re.DOTALL): 1214 | style_content = html.unescape(m.group(1)).strip() 1215 | if style_content: 1216 | inline_style_content += '\n/*=====*/\n' + style_content 1217 | self.handle_css_data(style_content, url) 1218 | 1219 | if inline_style_content and not self.gf.skip_css: 1220 | self.save_fetched_asset(this_file_path + '.__inline.css', inline_style_content.encode()) 1221 | 1222 | # scan for importmaps 1223 | for m in re.finditer(r'''''', html_content, flags=re.DOTALL): 1224 | imap = None 1225 | try: 1226 | imap = json.loads(html.unescape(m.group(1))) 1227 | except Exception: 1228 | log.debug('warn: invalid importmap') 1229 | continue 1230 | 1231 | if not imap: 1232 | continue 1233 | 1234 | for name, src in imap.get('imports', {}).items(): 1235 | if not name.endswith('/'): 1236 | self.queue_link(urljoin(url, src)) 1237 | else: 1238 | log.log('NOT IMPLEMENTED: importmap entry ending with a slash') 1239 | 1240 | # todo: support self.import_paths lookup 1241 | # but here some might be already imported so we'd need to add all to lpath 1242 | 1243 | if self.gf.other_asset_extensions: 1244 | # attribute styles 1245 | for m in re.finditer(r'''style\s*=\s*(?:['"])([^"']+)''', html_content): 1246 | self.handle_css_data(m.group(1), url) 1247 | 1248 | # srcset 1249 | for m in re.finditer(r'''srcset\s*=\s*(?:['"])([^"']+)''', html_content, flags=re.IGNORECASE): 1250 | for m2 in re.finditer(r'(?:^|,)\s*(?!data:)([^,\s]+)', m.group(1)): 1251 | link = urljoin(url, m2.group(1)) 1252 | log.log('adding link from srcset', link) 1253 | self.queue_link(link) 1254 | 1255 | if self.gf.aggressive_mode: 1256 | # aggressive mode is also for html since inline handlers can contain js 1257 | # this also looks into html comments 1258 | content = html_content.replace('"', '"').replace(''', "'").replace(''', "'") 1259 | self.run_aggressive_scan(content, url) 1260 | 1261 | 1262 | class GetFrontend: 1263 | def __init__(self, config): 1264 | self.root = config['root'] 1265 | if self.root.count('/') < 3 and not self.root.endswith('/'): 1266 | # 
normalize it so that we know that if it doesn't end with a slash 1267 | # it's a particular file, not a directory index 1268 | self.root += '/' 1269 | 1270 | self.prefix_whitelist = self.make_prefix_whitelist(config.get('origin_whitelist', [])) 1271 | 1272 | self.ignore_vendor = config.get('ignore_vendor') 1273 | self.all_srcmap_urls = config.get('all_srcmap_urls') 1274 | self.save_original_assets = config.get('save_original_assets') 1275 | self.skip_css = config.get('skip_css') 1276 | self.extract_nested_sourcemaps = config.get('extract_nested_sourcemaps') 1277 | self.no_header_sourcemaps = config.get('no_header_sourcemaps') 1278 | self.client = Client(cookies=config.get('cookies'), headers=config.get('headers')) 1279 | 1280 | self.other_asset_extensions: set[str] = config.get('other_asset_extensions', set()) 1281 | self.aggressive_mode = config.get('aggressive_mode') 1282 | self.make_scan_patterns() 1283 | 1284 | self.other_urls = config.get('other_urls', []) 1285 | self.use_original_base = config.get('use_original_base') 1286 | self.ignore_base_tag = config.get('ignore_base_tag') 1287 | self.public_path_map = config.get('public_path_map') 1288 | 1289 | self.fetcher = Fetcher(self.client, 3) 1290 | self.fetched_sourcemaps: dict[bool] = {} # so we don't fetch .map twice, maps are not fetched via queue 1291 | 1292 | self.archive = FArchive() 1293 | 1294 | self.root_crawler = Crawler(self, self.root) 1295 | 1296 | def make_prefix_whitelist(self, allowed_origins): 1297 | ret = set() 1298 | 1299 | if allowed_origins: 1300 | ret.add(extract_origin(self.root)) 1301 | for origin in allowed_origins: 1302 | if origin.startswith('http://') or origin.startswith('https://'): 1303 | ret.add(extract_origin(origin)) 1304 | else: 1305 | ret.add(extract_origin(f'http://{origin}')) 1306 | ret.add(extract_origin(f'https://{origin}')) 1307 | 1308 | return ret 1309 | 1310 | def make_scan_patterns(self): 1311 | asset_ext_part = None 1312 | 1313 | if self.other_asset_extensions: 1314 | asset_ext_part = '|'.join([re.escape(ext) for ext in self.other_asset_extensions]) 1315 | self.asset_ext_pat = rf'\.(?:{asset_ext_part})' 1316 | 1317 | parts = [r'm?[jt]sx?'] 1318 | 1319 | if not self.skip_css: 1320 | parts.append('css') 1321 | 1322 | if self.other_asset_extensions: 1323 | parts.append(asset_ext_part) 1324 | 1325 | ext_pat = r'\.(?:' + '|'.join(parts) + r')' 1326 | 1327 | # aggressive scan can be invoked by handlers, even without self.aggressive_mode 1328 | self.aggressive_rel_pat = rf'''["'`]([^"'`?#<>:]+{ext_pat})(?:[?#][^'"`]+)?['"`]''' 1329 | self.aggressive_abs_pat = rf'''https?://[^"'`?#\s(){"{}"}\[\]<>|!,;]+{ext_pat}(?![a-zA-Z0-9._%/-])''' 1330 | 1331 | def check_response(self, url, response): 1332 | if response is None: 1333 | # here the exception has already been logged 1334 | return 1335 | 1336 | if response.status_code == 404: 1337 | log.debug('not found', url) 1338 | return None 1339 | 1340 | if response.status_code != 200: 1341 | log.log('warning: bad response code', url, response.status_code) 1342 | 1343 | return response 1344 | 1345 | def decode_response(self, response): 1346 | # response.text might be too slow because it tries to guess the encoding 1347 | # we don't bother 1348 | text = None 1349 | 1350 | try: 1351 | text = response.content.decode('utf-8') 1352 | except UnicodeDecodeError: 1353 | text = response.content.decode('iso-8859-1') 1354 | 1355 | return text 1356 | 1357 | def get_url(self, url): 1358 | response = self.check_response(url, self.client.get(url)) 1359 | return response 1360 | 
1361 | def run(self): 1362 | # root is handled specially because we update the root if the previous 1363 | # one caused a redirect 1364 | root_response = self.get_url(self.root) 1365 | if not root_response: 1366 | raise Exception("can't fetch target") 1367 | 1368 | if not self.use_original_base and root_response.url != self.root: 1369 | self.root = root_response.url 1370 | log.log('target redirected, new root url', self.root) 1371 | 1372 | self.root_crawler.handle_html_response(self.root, root_response) 1373 | 1374 | # self.root_crawler.queue_link(self.root, tag='page') 1375 | 1376 | if 'ico' in self.other_asset_extensions: 1377 | self.root_crawler.queue_link(urljoin(self.root, '/favicon.ico')) 1378 | 1379 | for url in self.other_urls: 1380 | self.root_crawler.queue_link(urljoin(self.root, url), fallback='dynamic') 1381 | 1382 | self.loop() 1383 | 1384 | def save_asset(self, path, content): 1385 | log.vdebug('saving asset', path) 1386 | self.archive.add_file(path, content) 1387 | 1388 | def export_to_file(self, fileobj): 1389 | self.archive.write_to_file(fileobj) 1390 | 1391 | def export_to_directory(self, path): 1392 | self.archive.save_to_directory(path) 1393 | 1394 | def dump_to_stdout(self): 1395 | self.archive.dump_to_stdout() 1396 | 1397 | def loop(self): 1398 | while True: 1399 | res = self.fetcher.get_response() 1400 | if res is None: 1401 | break 1402 | 1403 | url, response, crawler, mode, *args = res 1404 | crawler: Crawler 1405 | log.log('handling', url, mode, *args) 1406 | if response and response.url != url: 1407 | log.log('it redirected to', response.url) 1408 | 1409 | if not self.use_original_base: 1410 | url = response.url 1411 | 1412 | crawler.handle_result(url, response, mode, *args) 1413 | 1414 | 1415 | def extract_origin(url): 1416 | parsed_url = urlparse(url) 1417 | origin = f"{parsed_url.scheme}://{parsed_url.netloc}/" 1418 | return origin 1419 | 1420 | 1421 | def find_import_references(res, current_path): 1422 | value_pat = r'''['"]([^'"]+)['"]''' 1423 | 1424 | for m in re.finditer(rf'''(?:(?:^|[^.\s])\s*|[^a-zA-Z0-9_$.\s])import\s*\({value_pat}\)|(?:^|[\n;])\s*import\s*{value_pat}|(?:[{"}"}]|[a-zA-Z0-9$_]\s|\*/)\s*from\s*{value_pat}''', res): 1425 | link = m.group(1) or m.group(2) or m.group(3) 1426 | 1427 | if '.js' not in link and '.mjs' not in link and '.ts' not in link: 1428 | # we get... 
many false positives 1429 | continue 1430 | 1431 | if re.search(r'[<>(){}\[\]]', link): # things like ?v=sth might be there 1432 | continue 1433 | 1434 | if link.startswith('./') or link.startswith('../') or link.startswith('/'): 1435 | yield urljoin(current_path, link) 1436 | 1437 | elif link.startswith('http://') or link.startswith('https://'): 1438 | yield link 1439 | 1440 | else: 1441 | log.log('NOT IMPLEMENTED: using mapped import', link) 1442 | pass 1443 | 1444 | 1445 | def parse_chunkmap(val): 1446 | chunkmap_entry_pat = r'''([a-zA-Z0-9_$]+|['"][^'"]+['"]):['"]([^"']+)['"]''' 1447 | ret = {} 1448 | 1449 | if val: 1450 | for k, v in re.findall(chunkmap_entry_pat, val): 1451 | ret[parse_chunk_id(k)] = v 1452 | 1453 | return ret 1454 | 1455 | 1456 | def parse_chunk_id(v): 1457 | if v.startswith('"') or v.startswith("'"): # todo not exact, should be json but nothing is exact here 1458 | v = v[1:-1] 1459 | elif 'e' in v and (m := re.match(r'^(\d+)(?:e(\d+))', v)): 1460 | v = m.group(1) + '0'*int(m.group(2)) 1461 | 1462 | return v 1463 | 1464 | 1465 | def filter_es_variants(links): 1466 | ret = [] 1467 | best_dict = {} 1468 | 1469 | for link in links: 1470 | parts = re.split(r'(?<=[-.])es(\d+|next)(?=[.-])', link, maxsplit=1) 1471 | 1472 | if len(parts) == 1: 1473 | ret.append(link) 1474 | else: 1475 | key = parts[0] + '?' + parts[2] 1476 | nval = parts[1] 1477 | curr = best_dict.get(key, 0) 1478 | if nval == 'next' or (curr != 'next' and int(nval) > int(curr)): 1479 | best_dict[key] = nval 1480 | 1481 | for k, v in best_dict.items(): 1482 | ret.append(k.replace('?', 'es' + v)) 1483 | 1484 | return ret 1485 | 1486 | 1487 | def get_config_from_args(): 1488 | tpl_common = 'svg,json,webmanifest,ico,eot,woff,woff2,otf,ttf' 1489 | tpl_images = 'jpg,jpeg,png,gif,webp' 1490 | tpl_media = 'mp3,ogg,wav,m4a,opus,mp4,mov,mkv,webm' 1491 | 1492 | default_ua = 'Mozilla/5.0 (Windows NT 10.0; rv:124.0) Gecko/20100101 Firefox/124.0' 1493 | 1494 | parser = argparse.ArgumentParser() 1495 | 1496 | parser.add_argument('url', help='The root url.') 1497 | parser.add_argument('other_urls', nargs='*', help='Other urls that should be scanned.') 1498 | parser.add_argument('-o', '--output', default='-', help='Output can be a zip file (needs to end with .zip), a directory or stdout specified via the - character (default)') 1499 | parser.add_argument('-v', '--verbose', action='count', default=0, help='Increase verbosity, use -vv for even more verbosity.') 1500 | 1501 | scope_group = parser.add_argument_group('scope options') 1502 | scope_group.add_argument('-wo', '--whitelist-origin', action='append', default=[], help='Make requests to this origin/domain only. May be specified multiple times.') 1503 | scope_group.add_argument('-ob', '--use-original-base', action='store_true', help="Don't change the base url if the original request has redirected. 
By default, the base url is updated.") 1504 | scope_group.add_argument('-ib', '--ignore-base-tag', action='store_true', help="Ignore html <base> tags.") 1505 | scope_group.add_argument('-c', '--add-cookie', action='append', default=[], help='Add a cookie in the form name=value to all requests.') 1506 | scope_group.add_argument('-H', '--add-header', action='append', default=[], help='Add a given HTTP header (name: value) to all requests.') 1507 | scope_group.add_argument('-ppm', '--public-path-map', action='append', default=[], help='Add a custom public path for a given chunk index file (indexfile=publicpath), see "public path for" in the debug output to find out which index files were found.') 1508 | 1509 | scan_group = parser.add_argument_group('scan options') 1510 | scan_group.add_argument('-a', '--aggressive-mode', action='store_true', help='Scan JS/HTML files for possible script paths more aggressively.') 1511 | scan_group.add_argument('-i', '--ignore-vendor', action='store_true', help='Do not fetch source maps for scripts starting with vendor.') 1512 | scan_group.add_argument('-nn', '--no-nested-sourcemaps', action='store_true', help='Do not unpack inline sourcemaps found inside mapped content.') 1513 | scan_group.add_argument('-so', '--save-original-assets', action='store_true', help='Save original asset files even if a source map exists.') 1514 | scan_group.add_argument('-as', '--all-srcmap-urls', action='store_true', help='By default only one map specified by sourceMappingURL is fetched for a given script - this option overrides that. Use with caution, might generate many additional requests which are usually unsuccessful.') 1515 | scan_group.add_argument('-nh', '--no-header-sourcemaps', action='store_true', help='Do not detect source maps from SourceMap and X-SourceMap headers.') 1516 | 1517 | asset_group = parser.add_argument_group('asset options') 1518 | 1519 | asset_group.add_argument('-ae', '--asset-extensions', action='append', default=[], help='Specify comma-separated list of extensions for additional asset files to be saved. By default only HTML/JS/CSS files are saved. 
May be specified multiple times.') 1520 | asset_group.add_argument('-sa', '--save-common-assets', action='store_true', help=f'Shortcut for --asset-extensions={tpl_common}') 1521 | asset_group.add_argument('-si', '--save-images', action='store_true', help=f'Shortcut for --asset-extensions={tpl_images}') 1522 | asset_group.add_argument('-sm', '--save-media', action='store_true', help=f'Shortcut for --asset-extensions={tpl_media}') 1523 | asset_group.add_argument('-nc', '--no-css', action='store_true', help='Do not save CSS files.') 1524 | 1525 | args = parser.parse_args() 1526 | 1527 | config = { 1528 | 'root': args.url, 1529 | 'origin_whitelist': args.whitelist_origin, 1530 | 'cookies': {k: v for val in args.add_cookie for k, v in [val.split('=', 1)]}, 1531 | 'headers': {k: v for val in args.add_header for k, v in [val.split(': ', 1)]}, 1532 | 'other_urls': args.other_urls, 1533 | 'aggressive_mode': args.aggressive_mode, 1534 | 'ignore_vendor': args.ignore_vendor, 1535 | 'extract_nested_sourcemaps': not args.no_nested_sourcemaps, 1536 | 'save_original_assets': args.save_original_assets, 1537 | 'all_srcmap_urls': args.all_srcmap_urls, 1538 | 'no_header_sourcemaps': args.no_header_sourcemaps, 1539 | 'skip_css': args.no_css, 1540 | 'other_asset_extensions': set([e for exts in args.asset_extensions for e in exts.split(',')]), 1541 | 'use_original_base': args.use_original_base, 1542 | 'ignore_base_tag': args.ignore_base_tag, 1543 | 'public_path_map': {k: v for val in args.public_path_map for k, v in [val.split('=', 1)]}, 1544 | } 1545 | 1546 | if args.save_common_assets: 1547 | config['other_asset_extensions'].update(tpl_common.split(',')) 1548 | 1549 | if args.save_images: 1550 | config['other_asset_extensions'].update(tpl_images.split(',')) 1551 | 1552 | if args.save_media: 1553 | config['other_asset_extensions'].update(tpl_media.split(',')) 1554 | 1555 | if 'User-Agent' not in config['headers'] and 'user-agent' not in config['headers']: 1556 | config['headers']['User-Agent'] = default_ua 1557 | 1558 | return config, args.output, args.verbose 1559 | 1560 | 1561 | def main(): 1562 | config, output, verbosity = get_config_from_args() 1563 | 1564 | log.level += verbosity 1565 | 1566 | gf = GetFrontend(config) 1567 | gf.run() 1568 | 1569 | if output.endswith('.zip'): 1570 | with open(output, 'wb') as wf: 1571 | gf.export_to_file(wf) 1572 | 1573 | elif output != '-': 1574 | gf.export_to_directory(output) 1575 | 1576 | else: 1577 | gf.dump_to_stdout() 1578 | 1579 | 1580 | if __name__ == '__main__': 1581 | main() 1582 | --------------------------------------------------------------------------------