├── README.md
└── getfrontend.py


/README.md:
--------------------------------------------------------------------------------

# getfrontend

> [!IMPORTANT]
> This project is in the **experimental** stage.. take the following usefulness claims with a M☉ of salt...

## Why
Let's say we want to scan the frontend code (of a SPA/PWA) for some API keys that shouldn't be exposed to the client (or just explore the whole frontend code), but the app code is split into many chunks with unpredictable names.. and most of these aren't loaded by the browser (you probably only see the login page).. additionally, for each of these chunks there's a corresponding source map file with the `.map` extension, which happens not to be referenced via `sourceMappingURL` (so the browser doesn't even see it). How do we "get" that frontend?

Well, that's exactly what getfrontend was designed for - to "get" the frontend code, to get all these chunks, to get all these source maps, and to recreate the original files as accurately as possible.

In theory this task is not achievable (let me mention the famous "halting problem"); in practice, however, since many apps are bundled with webpack/vite, it's often possible to enumerate all these chunks even with some simple static analysis.
And that's what the program does - it attempts to recognize:
- webpack chunks in various configurations (including basic federated module support)
- vite chunks
- next.js chunks from a build manifest
- remix chunks from a manifest
- ES6 imports / dynamic imports
- scripts specified in import maps

There's also an "aggressive mode" which - as the name suggests - attempts to find more possible paths (in string literals, for example), but this mode isn't "smart" - smart detection is only possible when the stack in use is known.

And of course there are many false positives, but getfrontend assumes these are irrelevant - if something isn't there, a 404 will be returned and the file simply won't be saved.

By default JS/CSS files are fetched (plus the initial HTML page), but you can specify other extensions if you want them saved too. There are shortcuts for the most common asset/image/media file extensions.

## What this tool doesn't do
Notably, it:
- doesn't try to unpack minified webpack chunks without source maps
- doesn't crawl html pages - it assumes the specified url points to a SPA
- doesn't do any dynamic analysis to discover chunks
- doesn't attempt to deobfuscate obfuscated (not just minified) JS files
- doesn't even try to defend against any kind of targeted DoS (like infinitely many JS files and so on)

## Basic usage

First you obviously need the target url. If you're running getfrontend against a multi-page app with more than one entry, you need to specify all of these entries.
If you know that a particular url should be included but getfrontend can't find it, you can also specify it on the command line.
```
python getfrontend.py [options]... [root url] [optional additional urls]...
```
Note that only the first url is treated as the "root" url.. in practice this means multi-page apps are supported only within the same origin.
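For example, a run against a (hypothetical) app with several entry points, saving everything to a directory, could look like this:
```
python getfrontend.py -o ./dump https://app.example.com/ https://app.example.com/admin/ https://app.example.com/static/js/extra.chunk.js
```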

### Need custom headers? Cookies?
You can specify a custom header to be added to each request using the `--add-header`/`-H` option (it works similarly to `curl`).
Additionally, there's the `--add-cookie`/`-c` convenience argument to add a cookie. Both options may be used multiple times.
```
python getfrontend.py -H'X-Is-Admin: of-course' -c'is_admin=sure' -c'is_a_bot=nope' https://securesite.com/
```

### Choose the output method
By default everything is **dumped to stdout**.. since this might not necessarily be what you want, you can specify the `--output`/`-o` argument:
```
python getfrontend.py -o /tmp/antarctica_realestate.zip https://realestate.aq/
```
**If it ends with the `.zip` suffix**, the files are written to the specified file as a zip archive; otherwise the argument value is treated as the target directory.

### Choose what you want saved
By default, only JS and CSS files (plus the fetched HTML pages) are saved. This is.. admittedly a somewhat arbitrary default...

If you don't want to save CSS files, the `--no-css`/`-nc` argument is your friend :)

If you want to save more, you can either specify extensions manually using the `--asset-extensions`/`-ae` option (comma-separated, or use the argument multiple times), or use these shortcuts:

|Option|Extensions|
| --- | --- |
| `--save-common-assets`/`-sa` | `svg,json,webmanifest,ico,eot,woff,woff2,otf,ttf` |
| `--save-images`/`-si` | `jpg,jpeg,png,gif,webp` |
| `--save-media`/`-sm` | `mp3,ogg,wav,m4a,opus,mp4,mov,mkv,webm` |

### Scripts are being fetched from the wrong path?
While getfrontend attempts to detect the correct "public path" for dynamically loaded chunks, this detection may sometimes yield wrong results, for example when the path is generated in an unusual way.. In that case you might want to supply the path manually.

First, run getfrontend with the `-v` option and look for strings like "public path for":
```
python getfrontend.py -v https://somesite.com/ |& grep 'public path for'
```
You might see something like this:
```
webpack public path for https://somesite.com/js/main.somehash.js is https://somesite.com/
```
Then, if you know what the actual prefix should be (for instance by finding it out in devtools), like `https://somesite.com/js_chunks/`, you can use the `-ppm` option to add the mapping:
```
python getfrontend.py -ppm "https://somesite.com/js/main.somehash.js=https://somesite.com/js_chunks/" https://somesite.com/
```
and then it should work as desired :)


### Chunks/scripts not found?
Try the aforementioned "aggressive mode" by specifying the `--aggressive-mode`/`-a` option.
It might work.. otherwise - if that's a common configuration - consider filing an issue.


### But there's more..
There are more options, check them out by running:
```
python getfrontend.py --help
```
You can also read the code, though be prepared for the worst..
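If you'd rather drive getfrontend from Python than from the command line, the core pieces are the `GetFrontend` class and a plain config dict - see `get_config_from_args()` and `main()` at the bottom of `getfrontend.py`. Here's a minimal sketch (the config keys mirror what `main()` builds and are an internal detail, not a stable API; the target url is a placeholder):
```python
# Illustrative only - this mirrors what main()/get_config_from_args() do below.
from getfrontend import GetFrontend

config = {
    'root': 'https://app.example.com/',        # placeholder target
    'headers': {'User-Agent': 'Mozilla/5.0'},  # main() always sets a User-Agent
    'cookies': {},
    'other_urls': [],
    'origin_whitelist': [],
    'other_asset_extensions': set(),           # e.g. {'svg', 'json'}
    'aggressive_mode': False,
    'public_path_map': {},                     # main() always passes a dict here
}

gf = GetFrontend(config)
gf.run()                        # crawl the app and collect chunks + source maps
gf.export_to_directory('out')   # or export_to_file(...) for a zip, dump_to_stdout()
```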
99 | -------------------------------------------------------------------------------- /getfrontend.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import base64 3 | import hashlib 4 | import html 5 | import sys 6 | import re 7 | import time 8 | import zipfile 9 | import requests 10 | import os 11 | import json 12 | import threading 13 | import urllib3 14 | 15 | from urllib.parse import urljoin, urlparse 16 | from queue import Queue 17 | 18 | urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) 19 | 20 | 21 | class Logger: 22 | LOG = 1 23 | DEBUG = 2 24 | VERBOSE_DEBUG = 3 25 | 26 | def __init__(self): 27 | self.level = self.LOG 28 | 29 | def write(self, level, *args, **kwargs): 30 | if level <= self.level: 31 | print(*args, **kwargs, file=sys.stderr) 32 | 33 | def log(self, *args, **kwargs): 34 | self.write(self.LOG, *args, **kwargs) 35 | 36 | def debug(self, *args, **kwargs): 37 | self.write(self.DEBUG, *args, **kwargs) 38 | 39 | def vdebug(self, *args, **kwargs): 40 | self.write(self.VERBOSE_DEBUG, *args, **kwargs) 41 | 42 | 43 | log = Logger() 44 | 45 | 46 | class SourceMapError(Exception): 47 | pass 48 | 49 | 50 | def normpath(path): 51 | # os.path.normpath allows //, we can't since these are protocol-relative urls 52 | path = os.path.normpath(path) 53 | 54 | if path.startswith('//'): 55 | path = path[1:] 56 | 57 | return path 58 | 59 | 60 | class FArchive: 61 | def __init__(self): 62 | self.files = {} 63 | self.names = set() 64 | self.effective_names = {} 65 | 66 | def normalize_name(self, name): 67 | name = normpath('/' + name)[1:] 68 | return name 69 | 70 | def get_unique_name(self, name): 71 | base_name, ext = os.path.splitext(name) 72 | counter = 2 73 | while name in self.names: 74 | name = f"{base_name}_{counter}{ext}" 75 | counter += 1 76 | self.names.add(name) 77 | 78 | return name 79 | 80 | def add_file(self, name, content): 81 | name = self.normalize_name(name) 82 | 83 | content_hash = hashlib.md5(content).hexdigest() 84 | name_key = name + '//' + content_hash 85 | 86 | if self.effective_names.get(name + '//' + content_hash): 87 | # same name + existing content has means we can skip it 88 | return 89 | 90 | name = self.get_unique_name(name) 91 | self.effective_names[name_key] = name 92 | 93 | assert name not in self.files 94 | 95 | self.files[name] = content 96 | 97 | def write_to_file(self, wf): 98 | with zipfile.ZipFile(wf, 'w', compression=zipfile.ZIP_DEFLATED) as zf: 99 | for name, content in self.files.items(): 100 | zf.writestr(name, content) 101 | 102 | def save_to_directory(self, path): 103 | os.makedirs(path, exist_ok=True) 104 | 105 | for name, content in self.files.items(): 106 | file_path = os.path.join(path, name) 107 | os.makedirs(os.path.dirname(file_path), exist_ok=True) 108 | 109 | with open(file_path, 'wb') as f: 110 | f.write(content) 111 | 112 | def dump_to_stdout(self): 113 | for name, content in self.files.items(): 114 | print(f"// {name}\n// [{len(content)}] bytes\n") 115 | print(content.decode()) 116 | 117 | 118 | class Client: 119 | def __init__(self, timeout: None | float | tuple[float, float] = (8, 16), cookies: dict | None = None, headers: dict | None = None): 120 | self.timeout = timeout 121 | self.sleep = 5 122 | self.rs = requests.Session() 123 | self.rs.verify = False 124 | 125 | if cookies: 126 | self.rs.cookies.update(cookies) 127 | 128 | if headers: 129 | self.rs.headers.update(headers) 130 | 131 | def request(self, method, url, *args, **kwargs): 132 | try: 133 | 
while True: 134 | try: 135 | res = self.rs.request(method, url, *args, timeout=self.timeout, **kwargs) 136 | 137 | if res.status_code in (408, 429, 500, 502, 503, 504): 138 | log.log('retrying', url) 139 | time.sleep(self.sleep) 140 | continue 141 | 142 | return res 143 | 144 | except requests.exceptions.RequestException as e: 145 | log.log('non-status request error, retrying', url, e) 146 | time.sleep(self.sleep) 147 | continue 148 | 149 | except Exception as e: 150 | log.log('non-status request exception', url, e) 151 | return None 152 | 153 | def get(self, *args, **kwargs): 154 | return self.request('GET', *args, **kwargs) 155 | 156 | 157 | class Fetcher: 158 | def __init__(self, client: Client, n_workers): 159 | self.input_queue = Queue() 160 | self.output_queue = Queue() 161 | self.client = client 162 | self.queued_urls = set() 163 | self.lock = threading.Lock() 164 | self.pending_urls = 0 165 | 166 | self.n_workers = n_workers 167 | for _ in range(n_workers): 168 | threading.Thread(target=self.worker).start() 169 | 170 | def queue(self, url, *args): 171 | # called from the main thread only 172 | 173 | if url not in self.queued_urls: 174 | self.pending_urls += 1 175 | self.queued_urls.add(url) 176 | self.input_queue.put([url, *args]) 177 | 178 | def get_response(self): 179 | # called from the main thread only 180 | 181 | if not self.pending_urls: 182 | self.shutdown_workers() 183 | return None 184 | 185 | ret = self.output_queue.get() 186 | self.pending_urls -= 1 187 | return ret 188 | 189 | def worker(self): 190 | while True: 191 | arg = self.input_queue.get() 192 | if arg is None: 193 | break 194 | 195 | [url, *args] = arg 196 | 197 | response = self.client.get(url) # doesn't throw 198 | self.output_queue.put([url, response, *args]) 199 | 200 | def shutdown_workers(self): 201 | for _ in range(self.n_workers): 202 | self.input_queue.put(None) 203 | 204 | 205 | def prepare_link(link): 206 | if '?' 
in link: 207 | link = link.split('?', 1)[0] 208 | 209 | if '#' in link: 210 | link = link.split('#', 1)[0] 211 | 212 | return link 213 | 214 | 215 | class Crawler: 216 | def __init__(self, gf, root, save_prefix=''): 217 | self.gf: GetFrontend = gf 218 | 219 | self.save_prefix = save_prefix # this also affects url assets 220 | self.root = root 221 | 222 | self.webpack_chunk_formats = [] 223 | self.possible_webpack_chunk_ids = set() 224 | 225 | def check_prefix(self, link): 226 | if not self.gf.prefix_whitelist: 227 | return True 228 | 229 | for prefix in self.gf.prefix_whitelist: 230 | if link.startswith(prefix): 231 | return True 232 | 233 | return False 234 | 235 | def save_fetched_asset(self, path, content): 236 | # path is an url 237 | # this is meant for original assets 238 | 239 | path = re.sub(r'^(https?:)//', r'\1', path) 240 | return self.gf.save_asset(self.save_prefix + path, content) 241 | 242 | def save_mapped_asset(self, path, content, origin_path): 243 | return self.save_generated_asset(path, content, origin_path, 'mapped') 244 | 245 | def save_unpacked_asset(self, path, content, origin_path): 246 | return self.save_generated_asset(path, content, origin_path, 'unpacked') 247 | 248 | def save_generated_asset(self, path, content, origin_path, label): 249 | origin = re.search(r'https?://([^/]+)', origin_path).group(1) 250 | self.gf.save_asset(f'{self.save_prefix}{label}@{origin}/{path}', content) 251 | 252 | def queue(self, link, tag, *args): 253 | self.gf.fetcher.queue(link, self, tag, *args) 254 | 255 | def queue_link(self, link, tag=None, fallback=None): 256 | if not self.check_prefix(link) and not fallback: 257 | return 258 | 259 | link = prepare_link(link) 260 | 261 | if tag is None: 262 | if link.endswith('.css') and not self.gf.skip_css: 263 | tag = 'css' 264 | elif re.search(r'\.m?[jt]sx?$', link): 265 | tag = 'js' 266 | elif self.gf.other_asset_extensions and link.endswith('.webmanifest'): 267 | tag = 'webmanifest' 268 | elif self.gf.other_asset_extensions and re.search(rf'{self.gf.asset_ext_pat}$', link, flags=re.IGNORECASE): 269 | tag = 'asset' 270 | 271 | if not tag and fallback: 272 | tag = fallback 273 | 274 | if tag: 275 | self.queue(link, tag) 276 | 277 | def handle_result(self, url, response, mode, *args): 278 | response = self.gf.check_response(url, response) 279 | if response: 280 | if mode == 'dynamic': 281 | content_type = response.headers.get('content-type', '') 282 | if ';' in content_type: 283 | content_type = content_type.split(';', 1)[0] 284 | 285 | match content_type: 286 | case "text/javascript" | "application/javascript" | "application/x-javascript": 287 | mode = 'js' 288 | case "text/css": 289 | mode = 'css' 290 | case "text/html" | "application/xhtml+xml" | "application/xml": 291 | mode = 'page' 292 | case _: 293 | mode = 'asset' 294 | 295 | log.debug('dynamic mode detected as', mode, 'for url', url) 296 | 297 | match mode: 298 | case "page": 299 | self.handle_html_response(url, response) 300 | case "nextjs": 301 | self.handle_nextjs_manifest(url, response) 302 | case "js": 303 | self.handle_js(url, response) 304 | case "css": 305 | self.handle_css(url, response) 306 | case "remote_entry_js": 307 | self.handle_remote_module(url, response, *args) 308 | case "webmanifest": 309 | self.handle_webmanifest(url, response) 310 | case "asset": 311 | self.handle_asset(url, response) 312 | 313 | def handle_js_data(self, content, path, skip_sourcemaps=False): 314 | self.find_webpack_chunk_info(content, path) 315 | self.find_federated_modules(content, path) 316 
| self.find_webpack_chunk_refs(content, path) 317 | self.find_vite_chunks(content, path) 318 | self.unpack_webpack_eval_sources(content, path) 319 | 320 | if not skip_sourcemaps: 321 | self.handle_content_sourcemaps(content, path) 322 | 323 | # module scan, here we might encounter absolute links 324 | for module_link in find_import_references(content, path): 325 | self.queue_link(module_link) 326 | 327 | self.find_imported_scripts(content, path) 328 | self.find_workers(content, path) 329 | self.find_manifests(content, path) 330 | 331 | if self.gf.aggressive_mode: 332 | self.run_aggressive_scan(content, path) 333 | 334 | def handle_js(self, url, res): 335 | res_headers = res.headers 336 | res = self.gf.decode_response(res) 337 | 338 | skip_sourcemaps = False 339 | 340 | if self.gf.ignore_vendor: 341 | last_part = url.rsplit('/', 1)[-1] 342 | if re.match(r'(chunk[-.])?vendors?[-.]', last_part): 343 | skip_sourcemaps = True 344 | log.debug('skipping maps for', url, 'due to vendor detection') 345 | 346 | self.handle_js_data(res, url, skip_sourcemaps=skip_sourcemaps) 347 | 348 | should_save = True 349 | 350 | if not skip_sourcemaps: 351 | if self.handle_header_sourcemaps(res_headers, url): 352 | should_save = False 353 | 354 | # it's often the case that there wasn't a comment but a sourcemap exists 355 | # we don't queue because we need the result here 356 | if self.fetch_and_handle_srcmap(url + '.map'): 357 | should_save = False 358 | 359 | if should_save or self.gf.save_original_assets: 360 | self.save_fetched_asset(url, res.encode()) 361 | 362 | def handle_css_data(self, content, path): 363 | self.handle_content_sourcemaps(content, path) 364 | 365 | url_pat = r'''(?\s*import\({vite_file_pat}\),\s*__vite__mapDeps\(\[(\d+)''', content): 651 | proper_path = urljoin(current_path, m.group(1)) 652 | dep_path = vite_deps[int(m.group(2))] 653 | 654 | assert proper_path.endswith(dep_path) 655 | 656 | vite_base = proper_path[:-len(dep_path)] 657 | log.debug('vite base', vite_base) 658 | else: 659 | log.debug('failed to find vite base path') 660 | vite_base = self.root 661 | 662 | if vite_base: 663 | for dep in vite_deps: 664 | chunk_path = urljoin(vite_base, dep) 665 | log.log('adding vite', chunk_path) 666 | self.queue_link(chunk_path) 667 | 668 | elif not had_vite_deps: 669 | # the older variant with no __vite__mapDeps 670 | # here we use the first import to derive the base path 671 | 672 | for m in re.finditer(rf'''\(\s*\(\)\s*=>\s*import\({vite_file_pat}\),\s*\[\s*(({vite_file_pat},?\s*)+)''', content): 673 | deps = [] 674 | for dm in re.finditer(vite_file_pat, m.group(2)): 675 | chk = dm.group(1) 676 | if chk.startswith('.'): 677 | chunk_path = urljoin(current_path, chk) 678 | log.log('adding vite1 rel', chunk_path) 679 | self.queue_link(chunk_path) 680 | else: 681 | deps.append(chk) 682 | 683 | if not deps: 684 | continue 685 | 686 | # now we have those that require a base 687 | 688 | if not vite_base: 689 | proper_path = urljoin(current_path, m.group(1)) 690 | dep_path = deps[0] 691 | 692 | if not proper_path.endswith(dep_path): 693 | # that's not vite... 
694 | continue 695 | 696 | vite_base = proper_path[:-len(dep_path)] 697 | log.debug('vite base2', vite_base) 698 | 699 | for dep in deps: 700 | chunk_path = urljoin(vite_base, dep) 701 | log.log('adding vite2', chunk_path) 702 | self.queue_link(chunk_path) 703 | 704 | def unpack_webpack_eval_sources(self, content, current_path): 705 | for m in re.finditer(r'''[\n{]eval\s*\(\s*(?:"((?:\\.|[^"\\])+)"|'((?:\\.|[^'\\])+)')\s*\)''', content): 706 | src = '' 707 | 708 | if src := m.group(2): # transform single quotes so we can decode as json 709 | src = src.replace("\\'", "'").replace('"', '\\"') 710 | else: 711 | src = m.group(1) 712 | 713 | if '//# sourceURL=' not in src: 714 | continue 715 | 716 | src = json.loads('"' + src + '"') 717 | 718 | if m := re.search(r'\n//# sourceURL=([^\n?]+)[?]?', src): 719 | name = m.group(1) 720 | content = src[:m.start()] 721 | 722 | name, content = self._prepare_mapped_asset(name, content) 723 | log.debug('unpacking eval asset', name, 'from', current_path) 724 | self.save_unpacked_asset(name, content.encode(), current_path) 725 | 726 | def add_webpack_chunk_format(self, fmt): 727 | self.webpack_chunk_formats.append([fmt, set()]) 728 | 729 | def add_possible_webpack_chunk_id(self, chunk_id): 730 | self.possible_webpack_chunk_ids.add(chunk_id) 731 | 732 | def queue_possible_webpack_chunks(self): 733 | for resolve, queued in self.webpack_chunk_formats: 734 | for chunk_id in self.possible_webpack_chunk_ids - queued: 735 | self.queue_link(resolve(chunk_id)) 736 | 737 | queued.update(self.possible_webpack_chunk_ids) 738 | 739 | def find_webpack_chunk_info(self, res, current_path): 740 | # todo: this vs remote? what if we encounter remoteEntry.js 741 | 742 | # this works since 2015 743 | is_webpack_chunk_runtime = 'ChunkLoadError' in res or "'Loading chunk '" in res or '"Loading chunk "' in res or 'Automatic publicPath is not supported in this browser' in res 744 | 745 | if not is_webpack_chunk_runtime: 746 | return 747 | 748 | res = res.replace('\\u002F', '/') 749 | 750 | if current_path in self.gf.public_path_map: 751 | public_path = self.gf.public_path_map[current_path] 752 | 753 | else: 754 | public_path = '' 755 | 756 | # note for paths like someVariable + sth, we assume someVariable is empty 757 | for m in re.finditer(r'''(?:\w|__webpack_require__)\.p\s*=(\s*[\w.]+\s*\+)?\s*(?P['"])(?P[^'"]*)(?P=quot)\s*[,;})]''', res): 758 | # we pick the last one 759 | public_path = m.group('path') 760 | 761 | if 'Automatic publicPath is not supported in this browser' in res: 762 | # in one case it was relative to the script.. is it always true for automatic publicpath? 763 | # EDIT: well, no... need more data 764 | public_path = urljoin(current_path, 'abc')[:-3] 765 | 766 | # public_path is sometimes empty.. 
in that case it won't work with urljoin, we assume the root folder is used 767 | if public_path == '': 768 | public_path = urljoin(self.root, 'abc')[:-3] 769 | 770 | # relative to root, not the script 771 | public_path = urljoin(self.root, public_path) 772 | 773 | log.debug('webpack public path for', current_path, 'is', public_path) 774 | 775 | # first we need some cleanup, clean /******/ then clean // comments 776 | wr = re.sub(r'/\*{3,}/', ' ', res) 777 | # be careful not to trip strings like https:// 778 | wr = re.sub(r'\n\s*//.*', ' ', wr) 779 | 780 | # resolve full hashes 781 | def make_hash_repl(target): 782 | def hash_repl(m): 783 | ret = target 784 | if maxlen := m.group('maxlen'): 785 | log.debug('maxlen', maxlen) 786 | ret = target[:int(maxlen)] 787 | 788 | return '"' + ret + '"' 789 | 790 | return hash_repl 791 | 792 | hash_maxlen_pat = r'(?:\.(?:slice|substr(?:ing)?)\(\s*0,\s*(?P\d+)\))?' 793 | 794 | if 'hotCurrentHash' in wr and (full_hash := re.search(r'[^a-zA-Z0-9$_]hotCurrentHash\s*=\s*"(?P[a-fA-F0-9]+)"', wr)): 795 | full_hash = full_hash.group('hash') 796 | log.debug('hotcurrenthash', full_hash) 797 | wr = re.sub(rf'hotCurrentHash{hash_maxlen_pat}', make_hash_repl(full_hash), wr) 798 | 799 | last_match = None 800 | for m in re.finditer(r'''(__webpack_require__|\w)\.h\s*=\s*(?:function\s*\(\s*\)\s*\{\s*return(?![a-zA-Z0-9$_])|\(\s*\)\s*=>(?:\s*\{\s*return(?![a-zA-Z0-9$_]))?)\s*(?:\(\s*)?['"](?P[^'"]+)['"]''', wr): 801 | last_match = m 802 | 803 | if m := last_match: 804 | full_hash = m.group('hash') 805 | log.debug('replacing full hash', full_hash) 806 | wr = re.sub(rf'(?\s*(?:\{\s*|\(\s*)?)' 813 | r1v_func_start = r'(?:function(?:\s+\w+|\s*)\(\s*\w+\s*\)\s*\{\s*|=>\s*(?:\{\s*|\(\s*)?)' # optimized 814 | 815 | static_path_param = r'''['"]\s*\+\s*\w+\s*\+\s*['"]''' 816 | static_path_inner_pat = rf'''[^'"]+(?:{static_path_param}[^'"]+)?''' 817 | static_multi_ids_pat = r'''\{(?:(?:\d+(?:e\d+)?|['"][^'"]+['"]):1,?)+\}\s*\[\w+\]''' 818 | static_chunk_pat1 = rf'''if\s*\((?:\w+\s*===\s*(?P\d+(?:e\d+)?|['"][^'"]+['"])|(?P{static_multi_ids_pat}))\)\s*return\s*['"](?P{static_path_inner_pat})['"]\s*;\s*''' 819 | static_chunk_pat2 = rf'''(?:(?P\d+(?:e\d+)?|['"][^'"]+['"])===\w+|(?P{static_multi_ids_pat}))\?['"](?P{static_path_inner_pat})['"]:''' 820 | 821 | start_v1 = rf'(?:{r1v_func_start}(return(?![a-zA-Z0-9$_])\s*\(?\s*)?|\.src\s*=\s*(?:\([^;]{"{,5}"})?)(?:\w|__webpack_require__)\.p\s*\+' 822 | # return is possible in two locations depending on static_chunks variant 823 | start_v2 = rf'\.u\s*=\s*{r1v_func}(return(?![a-zA-Z0-9$_])\s*\(?\s*)?(?P(?:{static_chunk_pat1}|{static_chunk_pat2})+)?(return(?![a-zA-Z0-9$_])\s*\(?\s*)?' 824 | 825 | prefix_pat = r'''['"](?P[^"' ]*)['"]''' 826 | 827 | # premap can be identity or in a compact form 828 | # but... there might be no premap.. 
we saw this with federated modules 829 | premap_pat = r'''(?:\(\(?\s*\{(?P[^{}]*)\}\)?\s*\[(?:\s*\w+\s*=)?\s*\w+\s*\]\s*\|\|\s*\w+\s*\)|\{(?P[^{}]*)\}\s*\[(?:\s*\w+\s*=)?\s*\w+\s*\]|\((?:(?P\d+)\s*===\s*\w+|\w+\s*===\s*(?P\d+))\s*\?\s*"(?P[^"]+)"\s*:\s*\w+\)|(?P\w+))''' 830 | 831 | # exhaustive maps 832 | map_pat = r'''(?:['"](?P[^"' ]*)['"]\s*\+\s*)?\(?\{(?P[^{}]*)\}\)?\s*\[(?:\w+\s*=\s*)?\w+\]''' 833 | 834 | qmap_pat_common = r'''\?\w+=)['"]\s*\+\s*\{(?P[^{}]*)\}\s*\[(?:\w+\s*=\s*)?\w+\]\s*[,;]''' 835 | qmap_pat = r'''['"](?P[^"' ]*\.m?jsx?''' + qmap_pat_common 836 | qmap_css_pat = r'''['"](?P[^"' ]*\.css''' + qmap_pat_common 837 | 838 | suffix_pat = r'''(?:['"](?P[^"']*\.m?jsx?)(?:\?t=\d+)?['"]\s*[^+]|(?P(?<=:)void\(?\s*0\s*\)?|undefined))''' 839 | 840 | def parse_chunk_match(m, search_static=False): 841 | prefix = m.group('prefix') or '' 842 | suffix = m.group('suffix') or '' 843 | 844 | known_ids = set() 845 | exhaustive = False 846 | 847 | # premap should be constructed for the chunk format... 848 | # either from parse chunkmap if a dict, or from cm map 849 | # or identity 850 | # empty premap is not truthy but it's not "None", can't use "or" 851 | if m.group('premap_e') is not None: 852 | pm = m.group('premap_e') 853 | exhaustive = True 854 | else: 855 | pm = m.group('premap') 856 | 857 | if pm is not None: 858 | premap = parse_chunkmap(pm) 859 | known_ids.update(premap.keys()) 860 | elif cid := m.group('cpm_id') or m.group('cpm_id_2'): 861 | premap = {cid: m.group('cpm_value')} 862 | known_ids.add(cid) 863 | elif m.group('identity'): 864 | premap = {} 865 | else: 866 | premap = None 867 | 868 | cmap = m.group('map') or m.group('qmap') 869 | if cmap: 870 | cmap = parse_chunkmap(cmap) 871 | known_ids.update(cmap.keys()) 872 | exhaustive = True 873 | 874 | if m.group('qmap'): 875 | sep = m.group('qmap_sep') 876 | else: 877 | sep = m.group('sep') or '' 878 | 879 | static_map = {} 880 | if search_static: 881 | if sp := m.group('static_chunks'): 882 | for sm in re.finditer(rf'{static_chunk_pat1}|{static_chunk_pat2}', sp): 883 | chunk_ids = [] 884 | 885 | if single_id := sm.group('static1_id') or sm.group('static2_id'): 886 | chunk_ids.append(parse_chunk_id(single_id)) 887 | else: 888 | ids = sm.group('static1_ids') or sm.group('static2_ids') 889 | for scm in re.finditer(r'''(\d+(?:e\d+)?|['"][^'"]+['"]):1''', ids): 890 | chunk_ids.append(parse_chunk_id(scm.group(1))) 891 | 892 | path_src = sm.group('static1_path') or sm.group('static2_path') 893 | 894 | for chunk_id in chunk_ids: 895 | chunk_path = re.sub(static_path_param, chunk_id, path_src) 896 | log.debug('static path', chunk_path) 897 | static_map[chunk_id] = chunk_path 898 | known_ids.add(chunk_id) 899 | 900 | log.debug('static path map', static_map) 901 | 902 | def resolve(chunk_id): 903 | if chunk_id in static_map: 904 | chunk_path = static_map[chunk_id] 905 | else: 906 | chunk_path = premap.get(chunk_id, chunk_id) if premap is not None else '' 907 | chunk_path += sep + (cmap[chunk_id] if cmap else '') 908 | chunk_path = prefix + chunk_path + suffix 909 | 910 | return urljoin(public_path, chunk_path) 911 | 912 | depends_on_id = static_map or premap is not None or cmap 913 | 914 | return depends_on_id, known_ids, exhaustive, resolve 915 | 916 | has_exhaustive_chunks = False 917 | 918 | pattern = rf'(?:{start_v1}|{start_v2})\s*(?:{prefix_pat}\s*\+\s*)?(?:{premap_pat}\s*\+\s*)?(?:{qmap_pat}|(?:{map_pat}\s*\+\s*)?{suffix_pat})' 919 | 920 | last_match = None 921 | for m in re.finditer(pattern, wr): 922 | last_match = m 923 | 924 | if 
m := last_match: 925 | depends_on_id, known_ids, exhaustive, resolve = parse_chunk_match(m, True) 926 | log.debug('webpack match result', current_path, known_ids) 927 | 928 | # the js version should depend on the id somehow.. 929 | if not depends_on_id: 930 | log.log('webpack: no premap and no map', current_path) 931 | 932 | if exhaustive: 933 | has_exhaustive_chunks = True 934 | # here we don't add the chunk format deliberately 935 | 936 | for chunk_id in known_ids: 937 | self.queue_link(resolve(chunk_id)) 938 | 939 | else: 940 | self.add_webpack_chunk_format(resolve) 941 | for chunk_id in known_ids: 942 | self.add_possible_webpack_chunk_id(chunk_id) 943 | 944 | self.queue_possible_webpack_chunks() 945 | 946 | if not self.gf.skip_css: 947 | suffix_css_pat = r'''['"](?P[^"']*\.css)(?:\?t=\d+)?['"]\s*[^+]''' 948 | css_pattern = rf'(?P\.miniCssF\s*=\s*{r1v_func}(return(?![a-zA-Z0-9$_])\s*)?|(?:for\s*\(|\{"{"})\s*var \w+\s*=)\s*(?:{prefix_pat}\s*\+\s*)?(?:{premap_pat}\s*\+\s*)?(?:{qmap_css_pat}|(?:{map_pat}\s*\+\s*)?{suffix_css_pat})' 949 | 950 | last_match = None 951 | for m in re.finditer(css_pattern, wr): 952 | last_match = m 953 | 954 | # css chunks.. they're always exhaustive (a subset of js chunks?) 955 | if m := last_match: 956 | has_css_map = None 957 | 958 | # try to find the 01 map, a subset of emap 959 | if 'var ' in m.group('prelude'): 960 | # in this case we match the map.. backwards 961 | if m2 := re.search(r'''(?:[;,]|\]\s*\w+\s*\[)\}\s*((?:,?1\s*:\s*(?:\d+(?:e\d+)?|["'][^'"]*['"])\s*)+)\{''', wr[:m.start()][::-1]): 962 | has_css_map = m2.group(1)[::-1] 963 | 964 | else: 965 | # the map is inside the minicss 966 | if m2 := re.search(r'''\.miniCss\s*=\s*(?:function|\().*?\{(\s*((?:\d+(?:e\d+)?|["'][^'"]*['"])\s*:\s*1,?\s*)+)\}''', wr, flags=re.DOTALL): 967 | has_css_map = m2.group(1) 968 | 969 | if has_css_map is not None: 970 | cstr = has_css_map 971 | has_css_map = set() 972 | 973 | for cid in re.findall(r'''([a-zA-Z0-9_$]+|['"][^'"]+['"])\s*:\s*1,?''', cstr): 974 | has_css_map.add(parse_chunk_id(cid)) 975 | 976 | depends_on_id, known_ids, exhaustive, resolve = parse_chunk_match(m) 977 | log.debug('css chunks', has_css_map, known_ids) 978 | 979 | if not depends_on_id and not has_css_map: 980 | # corner case: only one chunk... :D 981 | has_css_map = set(['']) 982 | 983 | if has_css_map is None: 984 | # basically this.. 
"should not happen" 985 | # might happen if css chunks not used 986 | log.log('webpack: no css bitmap', current_path) 987 | 988 | has_css_map = set() 989 | for chunk_id in known_ids: 990 | has_css_map.add(chunk_id) 991 | 992 | for chunk_id in has_css_map: 993 | self.queue_link(resolve(chunk_id)) 994 | 995 | if not has_exhaustive_chunks: 996 | # these are all inside the webpack runtime, not other chunks 997 | # preload/prefetch maps 998 | 999 | chunk_id_pat = r'\d+(?:e\d+)?|"[^"]+"' # map keys might be strings 1000 | 1001 | for pm in re.finditer(r'var \w+\s*=\s*\{(?P[^{}]+)\};\s*(__webpack_require__|\w)\.f\.pre(?:load|fetch)\s*=', wr): 1002 | for pcm in re.finditer(rf'({chunk_id_pat})\s*:\s*\[([^\[\]]+)\]', pm.group('map')): 1003 | chunk_id = parse_chunk_id(pcm.group(1)) 1004 | log.debug('adding dephead', chunk_id) 1005 | self.add_possible_webpack_chunk_id(chunk_id) 1006 | 1007 | for pcmc in re.finditer(chunk_id_pat, pcm.group(2)): 1008 | chunk_id = parse_chunk_id(pcmc.group(0)) 1009 | log.debug('adding depchild', chunk_id) 1010 | 1011 | self.add_possible_webpack_chunk_id(chunk_id) 1012 | 1013 | # startup prefetch 1014 | for pm in re.finditer(r'\[([^\[\]]+)\]\.map\((?:__webpack_require__|\w)\.E\)', wr): 1015 | for pmc in re.finditer(chunk_id_pat, pm.group(1)): 1016 | chunk_id = parse_chunk_id(pmc.group(0)) 1017 | log.debug('adding startup', chunk_id) 1018 | self.add_possible_webpack_chunk_id(chunk_id) 1019 | 1020 | self.queue_possible_webpack_chunks() 1021 | 1022 | def find_webpack_chunk_refs(self, res, current_path): 1023 | # first we need some cleanup, clean /******/ clean // comments, clean escapes 1024 | # note these refs can be inside eval 1025 | wr = re.sub(r'/\*{3,}/|\\[nt]', ' ', res) 1026 | wr = re.sub(r'\n\s*//.*', ' ', wr) # be careful not to trip strings like https:// 1027 | wr = wr.replace(r'\"', '"') 1028 | 1029 | req_pat = r'(?:__webpack_require__|__nested_webpack_require_\d+__|\w)\.e' 1030 | 1031 | # search for context maps (for dynamic require support) 1032 | # these probably aren't exhaustive 1033 | # we simply look for all integers and strings except map keys and first items 1034 | chunk_id_pat = r'[{,]\s*(\d+(?:e\d+)?|"[^"]+")(?!\s*:)' 1035 | 1036 | added = False 1037 | 1038 | if re.search(rf'Promise\.all\(\w+\.slice\(\d+\)\.map\({req_pat}\)|return\s+(?:\w\s*\?\s*)?{req_pat}\(\w+\[[1-3]\]\)\.then', wr): 1039 | # we need to find a map, but we don't really know which variable is correct 1040 | # keys in this context map are strings 1041 | 1042 | map_found = False 1043 | 1044 | for m in re.finditer(r'''var \w+\s*=\s*\{((['"][^'"]*['"]|[\[\],\s:]|[0-9e])+)\}''', wr): 1045 | log.debug('async context: found map variable', m.group(1)) 1046 | map_found = True 1047 | 1048 | for md in re.finditer(chunk_id_pat, m.group(1)): 1049 | chunk_id = parse_chunk_id(md.group(1)) 1050 | log.debug('adding chunk from context map', chunk_id) 1051 | self.add_possible_webpack_chunk_id(chunk_id) 1052 | added = True 1053 | 1054 | assert map_found 1055 | 1056 | # search for possible ensure references, they might contain comments 1057 | # chunk ids can also be strings.. 
1058 | # now we try to avoid false positives, so we assume no spaces inside 1059 | 1060 | chunk_id_pat = r'''\d+(?:e\d+)?|['"][^"'\s]+['"]''' 1061 | chunk_ref_pat = rf'(?:__webpack_require__|__nested_webpack_require_\d+__|\W\w)\.e\((?:/\*.*?\*/\s*)?({chunk_id_pat})\)' 1062 | 1063 | for m in re.finditer(chunk_ref_pat, wr): 1064 | chunk_id = parse_chunk_id(m.group(1)) 1065 | log.debug('adding manual chunk', chunk_id) 1066 | self.add_possible_webpack_chunk_id(chunk_id) 1067 | added = True 1068 | 1069 | if added: 1070 | self.queue_possible_webpack_chunks() 1071 | 1072 | def find_federated_modules(self, res, current_path): 1073 | for m in re.finditer(r'''new Promise\([^&}]*?['"](https?://[^'"?#]+.js)(?:[?#][^'"]*)?['"][^{]+\{[^{}]*?['"]ScriptExternalLoadError['"][^{'"]+['"]([^'"]+)['"]''', res): 1074 | url, app_name = m.group(1), m.group(2) 1075 | 1076 | log.log('adding federated module', m.group(1), m.group(2)) 1077 | self.queue(url, 'remote_entry_js', app_name) 1078 | 1079 | def handle_remote_module(self, url, res, app_name): 1080 | res = self.gf.decode_response(res) 1081 | 1082 | # it's not mapped, so save it as is 1083 | self.save_fetched_asset(url, res.encode()) 1084 | 1085 | # we need public path, but it needs to be absolute 1086 | public_path = re.search(r'''(?:\w|__webpack_require__)\.p\s*=\s*['"](https?://[^'"?#]+/)''', res) 1087 | assert public_path 1088 | public_path = public_path.group(1) 1089 | 1090 | log.debug('remote module', app_name, 'path', public_path) 1091 | rm_crawler = Crawler(self.gf, public_path, f'module:{app_name}/') 1092 | 1093 | # just send this js file to the new crawler, it should handle it all 1094 | rm_crawler.handle_js_data(res, url) 1095 | 1096 | def run_aggressive_scan(self, content, path): 1097 | # todo: should we scan css url here? 
would only matter for non-quoted 1098 | 1099 | content = re.sub(r'\\u002f', '/', content, flags=re.IGNORECASE) 1100 | content = re.sub(r'\\u005c', '\\\\', content, flags=re.IGNORECASE) 1101 | content = re.sub(r'''\\(['"])''', r'\1', content) 1102 | content = re.sub(r'\\[nrt]', ' ', content) 1103 | content = re.sub(r'\\/', '/', content) 1104 | 1105 | for m in re.finditer(self.gf.aggressive_rel_pat, content, flags=re.IGNORECASE): 1106 | # two variants - relative to the script or relative to the document 1107 | 1108 | lpart = m.group(1).replace('\\', '') 1109 | link = urljoin(path, lpart) 1110 | link2 = urljoin(self.root, lpart) 1111 | 1112 | log.log('aggressive rel match', path, link) 1113 | self.queue_link(link) 1114 | 1115 | if link != link2: 1116 | log.log('aggressive rel doc match', path, link2) 1117 | self.queue_link(link2) 1118 | 1119 | for m in re.finditer(self.gf.aggressive_abs_pat, content, flags=re.IGNORECASE): 1120 | link = urljoin(path, m.group(0).replace('\\', '')) 1121 | log.log('aggressive abs match', path, link) 1122 | self.queue_link(link) 1123 | 1124 | def find_nextjs_chunks(self, content, links): 1125 | has_manifest = False 1126 | next_path = None 1127 | 1128 | for link in links: 1129 | if '/_next/static/' not in link: 1130 | continue 1131 | 1132 | if not next_path: 1133 | next_path = urljoin(self.root, link[:link.index('/_next/static/') + len('/_next/')]) 1134 | 1135 | if '/_next/static/chunks/' not in link and link.endswith('/_buildManifest.js'): 1136 | has_manifest = True 1137 | log.log('adding next.js manifest', link) 1138 | self.queue(link, 'nextjs') 1139 | 1140 | if has_manifest or not next_path: 1141 | return 1142 | 1143 | content = content.replace('\\u002F', '/') 1144 | content = content.replace(r'\"', '"') 1145 | 1146 | for m in re.finditer(r'"(?:_next/|\d+:)?(static/(?:chunks|css)/[^"]+\.(?:m?jsx?|css))(?:[#?][^"]*)?"', content): 1147 | link = next_path + m.group(1) 1148 | log.log('adding nextjs chunk from non-manifest', link) 1149 | self.queue_link(next_path + m.group(1)) 1150 | 1151 | def handle_html_response(self, url, response): 1152 | links = [] 1153 | 1154 | if link_header := response.headers.get('link'): # multiple are joined by comma 1155 | for m in re.finditer(r'<([^>?#]+)', link_header): 1156 | log.debug('preload header', m.group(1)) 1157 | links.append(m.group(1)) 1158 | 1159 | html_content = self.gf.decode_response(response) 1160 | 1161 | # generally should guess as much as we can from html 1162 | # we assume there's only one html source 1163 | 1164 | # we intentionally want to catch attributes like data-href 1165 | for m in re.finditer(r'''(?:href|src)\s*=\s*(?:['"]|(?=/))(?!data:)([^"'?#> ]+)''', html_content): 1166 | log.debug('link', m.group(1)) 1167 | links.append(m.group(1)) 1168 | 1169 | # here we actually want to avoid false positives 1170 | if not self.gf.ignore_base_tag and (m := re.search(r''']+ )?href\s*=\s*['"]([^'">]+)['"]''', html_content)): 1171 | url = urljoin(url, m.group(1)) 1172 | log.log('base url changed to', url) 1173 | 1174 | links = [urljoin(url, prepare_link(link)) for link in links] 1175 | links = [link for link in links if self.check_prefix(link)] 1176 | 1177 | # todo maybe optional, maybe log that the script was filtered? 
1178 | links = filter_es_variants(links) 1179 | 1180 | # special handlers run first since only the first tag for an url is queued 1181 | self.find_nextjs_chunks(html_content, links) 1182 | 1183 | if self.gf.other_asset_extensions: 1184 | for m in re.finditer(r''']*?)rel=['"]?manifest['"]?([^>]*)/?>''', html_content): 1185 | if hm := re.search(r'''href=['"]?([^'"\s]+)''', m.group(1) + m.group(2)): 1186 | link = urljoin(url, hm.group(1)) 1187 | log.log('adding link manifest', link) 1188 | self.queue_link(link, 'webmanifest') 1189 | 1190 | for link in links: 1191 | self.queue_link(link) 1192 | 1193 | # save this file 1194 | this_file_path = url + 'index.html' if url.endswith('/') else url 1195 | if not re.search(r'\.html?', this_file_path, flags=re.IGNORECASE): 1196 | this_file_path += '.html' 1197 | 1198 | self.save_fetched_asset(urljoin(url, this_file_path), html_content.encode()) 1199 | 1200 | # handle inline scripts 1201 | inline_script_content = '' 1202 | for m in re.finditer(r''']*)?>(.*?)''', html_content, flags=re.DOTALL): 1203 | script_content = html.unescape(m.group(1)).strip() 1204 | if script_content: 1205 | inline_script_content += '\n//=====\n' + script_content 1206 | self.handle_js_data(script_content, url) 1207 | 1208 | if inline_script_content: 1209 | self.save_fetched_asset(this_file_path + '.__inline.js', inline_script_content.encode()) 1210 | 1211 | # handle inline styles 1212 | inline_style_content = '' 1213 | for m in re.finditer(r''']*)?>(.*?)''', html_content, flags=re.DOTALL): 1214 | style_content = html.unescape(m.group(1)).strip() 1215 | if style_content: 1216 | inline_style_content += '\n/*=====*/\n' + style_content 1217 | self.handle_css_data(style_content, url) 1218 | 1219 | if inline_style_content and not self.gf.skip_css: 1220 | self.save_fetched_asset(this_file_path + '.__inline.css', inline_style_content.encode()) 1221 | 1222 | # scan for importmaps 1223 | for m in re.finditer(r'''''', html_content, flags=re.DOTALL): 1224 | imap = None 1225 | try: 1226 | imap = json.loads(html.unescape(m.group(1))) 1227 | except Exception: 1228 | log.debug('warn: invalid importmap') 1229 | continue 1230 | 1231 | if not imap: 1232 | continue 1233 | 1234 | for name, src in imap.get('imports', {}).items(): 1235 | if not name.endswith('/'): 1236 | self.queue_link(urljoin(url, src)) 1237 | else: 1238 | log.log('NOT IMPLEMENTED: importmap entry ending with a slash') 1239 | 1240 | # todo: support self.import_paths lookup 1241 | # but here some might be already imported so we'd need to add all to lpath 1242 | 1243 | if self.gf.other_asset_extensions: 1244 | # attribute styles 1245 | for m in re.finditer(r'''style\s*=\s*(?:['"])([^"']+)''', html_content): 1246 | self.handle_css_data(m.group(1), url) 1247 | 1248 | # srcset 1249 | for m in re.finditer(r'''srcset\s*=\s*(?:['"])([^"']+)''', html_content, flags=re.IGNORECASE): 1250 | for m2 in re.finditer(r'(?:^|,)\s*(?!data:)([^,\s]+)', m.group(1)): 1251 | link = urljoin(url, m2.group(1)) 1252 | log.log('adding link from srcset', link) 1253 | self.queue_link(link) 1254 | 1255 | if self.gf.aggressive_mode: 1256 | # aggressive mode is also for html since inline handlers can contain js 1257 | # this also looks into html comments 1258 | content = html_content.replace('"', '"').replace(''', "'").replace(''', "'") 1259 | self.run_aggressive_scan(content, url) 1260 | 1261 | 1262 | class GetFrontend: 1263 | def __init__(self, config): 1264 | self.root = config['root'] 1265 | if self.root.count('/') < 3 and not self.root.endswith('/'): 1266 | # 
normalize it so that we know that if it doesn't end with a slash 1267 | # it's a particular file, not a directory index 1268 | self.root += '/' 1269 | 1270 | self.prefix_whitelist = self.make_prefix_whitelist(config.get('origin_whitelist', [])) 1271 | 1272 | self.ignore_vendor = config.get('ignore_vendor') 1273 | self.all_srcmap_urls = config.get('all_srcmap_urls') 1274 | self.save_original_assets = config.get('save_original_assets') 1275 | self.skip_css = config.get('skip_css') 1276 | self.extract_nested_sourcemaps = config.get('extract_nested_sourcemaps') 1277 | self.no_header_sourcemaps = config.get('no_header_sourcemaps') 1278 | self.client = Client(cookies=config.get('cookies'), headers=config.get('headers')) 1279 | 1280 | self.other_asset_extensions: set[str] = config.get('other_asset_extensions', set()) 1281 | self.aggressive_mode = config.get('aggressive_mode') 1282 | self.make_scan_patterns() 1283 | 1284 | self.other_urls = config.get('other_urls', []) 1285 | self.use_original_base = config.get('use_original_base') 1286 | self.ignore_base_tag = config.get('ignore_base_tag') 1287 | self.public_path_map = config.get('public_path_map') 1288 | 1289 | self.fetcher = Fetcher(self.client, 3) 1290 | self.fetched_sourcemaps: dict[bool] = {} # so we don't fetch .map twice, maps are not fetched via queue 1291 | 1292 | self.archive = FArchive() 1293 | 1294 | self.root_crawler = Crawler(self, self.root) 1295 | 1296 | def make_prefix_whitelist(self, allowed_origins): 1297 | ret = set() 1298 | 1299 | if allowed_origins: 1300 | ret.add(extract_origin(self.root)) 1301 | for origin in allowed_origins: 1302 | if origin.startswith('http://') or origin.startswith('https://'): 1303 | ret.add(extract_origin(origin)) 1304 | else: 1305 | ret.add(extract_origin(f'http://{origin}')) 1306 | ret.add(extract_origin(f'https://{origin}')) 1307 | 1308 | return ret 1309 | 1310 | def make_scan_patterns(self): 1311 | asset_ext_part = None 1312 | 1313 | if self.other_asset_extensions: 1314 | asset_ext_part = '|'.join([re.escape(ext) for ext in self.other_asset_extensions]) 1315 | self.asset_ext_pat = rf'\.(?:{asset_ext_part})' 1316 | 1317 | parts = [r'm?[jt]sx?'] 1318 | 1319 | if not self.skip_css: 1320 | parts.append('css') 1321 | 1322 | if self.other_asset_extensions: 1323 | parts.append(asset_ext_part) 1324 | 1325 | ext_pat = r'\.(?:' + '|'.join(parts) + r')' 1326 | 1327 | # aggressive scan can be invoked by handlers, even without self.aggressive_mode 1328 | self.aggressive_rel_pat = rf'''["'`]([^"'`?#<>:]+{ext_pat})(?:[?#][^'"`]+)?['"`]''' 1329 | self.aggressive_abs_pat = rf'''https?://[^"'`?#\s(){"{}"}\[\]<>|!,;]+{ext_pat}(?![a-zA-Z0-9._%/-])''' 1330 | 1331 | def check_response(self, url, response): 1332 | if response is None: 1333 | # here the exception has already been logged 1334 | return 1335 | 1336 | if response.status_code == 404: 1337 | log.debug('not found', url) 1338 | return None 1339 | 1340 | if response.status_code != 200: 1341 | log.log('warning: bad response code', url, response.status_code) 1342 | 1343 | return response 1344 | 1345 | def decode_response(self, response): 1346 | # response.text might be too slow because it tries to guess the encoding 1347 | # we don't bother 1348 | text = None 1349 | 1350 | try: 1351 | text = response.content.decode('utf-8') 1352 | except UnicodeDecodeError: 1353 | text = response.content.decode('iso-8859-1') 1354 | 1355 | return text 1356 | 1357 | def get_url(self, url): 1358 | response = self.check_response(url, self.client.get(url)) 1359 | return response 1360 | 
1361 | def run(self): 1362 | # root is handled specially because we update the root if the previous 1363 | # one caused a redirect 1364 | root_response = self.get_url(self.root) 1365 | if not root_response: 1366 | raise Exception("can't fetch target") 1367 | 1368 | if not self.use_original_base and root_response.url != self.root: 1369 | self.root = root_response.url 1370 | log.log('target redirected, new root url', self.root) 1371 | 1372 | self.root_crawler.handle_html_response(self.root, root_response) 1373 | 1374 | # self.root_crawler.queue_link(self.root, tag='page') 1375 | 1376 | if 'ico' in self.other_asset_extensions: 1377 | self.root_crawler.queue_link(urljoin(self.root, '/favicon.ico')) 1378 | 1379 | for url in self.other_urls: 1380 | self.root_crawler.queue_link(urljoin(self.root, url), fallback='dynamic') 1381 | 1382 | self.loop() 1383 | 1384 | def save_asset(self, path, content): 1385 | log.vdebug('saving asset', path) 1386 | self.archive.add_file(path, content) 1387 | 1388 | def export_to_file(self, fileobj): 1389 | self.archive.write_to_file(fileobj) 1390 | 1391 | def export_to_directory(self, path): 1392 | self.archive.save_to_directory(path) 1393 | 1394 | def dump_to_stdout(self): 1395 | self.archive.dump_to_stdout() 1396 | 1397 | def loop(self): 1398 | while True: 1399 | res = self.fetcher.get_response() 1400 | if res is None: 1401 | break 1402 | 1403 | url, response, crawler, mode, *args = res 1404 | crawler: Crawler 1405 | log.log('handling', url, mode, *args) 1406 | if response and response.url != url: 1407 | log.log('it redirected to', response.url) 1408 | 1409 | if not self.use_original_base: 1410 | url = response.url 1411 | 1412 | crawler.handle_result(url, response, mode, *args) 1413 | 1414 | 1415 | def extract_origin(url): 1416 | parsed_url = urlparse(url) 1417 | origin = f"{parsed_url.scheme}://{parsed_url.netloc}/" 1418 | return origin 1419 | 1420 | 1421 | def find_import_references(res, current_path): 1422 | value_pat = r'''['"]([^'"]+)['"]''' 1423 | 1424 | for m in re.finditer(rf'''(?:(?:^|[^.\s])\s*|[^a-zA-Z0-9_$.\s])import\s*\({value_pat}\)|(?:^|[\n;])\s*import\s*{value_pat}|(?:[{"}"}]|[a-zA-Z0-9$_]\s|\*/)\s*from\s*{value_pat}''', res): 1425 | link = m.group(1) or m.group(2) or m.group(3) 1426 | 1427 | if '.js' not in link and '.mjs' not in link and '.ts' not in link: 1428 | # we get... 
many false positives 1429 | continue 1430 | 1431 | if re.search(r'[<>(){}\[\]]', link): # things like ?v=sth might be there 1432 | continue 1433 | 1434 | if link.startswith('./') or link.startswith('../') or link.startswith('/'): 1435 | yield urljoin(current_path, link) 1436 | 1437 | elif link.startswith('http://') or link.startswith('https://'): 1438 | yield link 1439 | 1440 | else: 1441 | log.log('NOT IMPLEMENTED: using mapped import', link) 1442 | pass 1443 | 1444 | 1445 | def parse_chunkmap(val): 1446 | chunkmap_entry_pat = r'''([a-zA-Z0-9_$]+|['"][^'"]+['"]):['"]([^"']+)['"]''' 1447 | ret = {} 1448 | 1449 | if val: 1450 | for k, v in re.findall(chunkmap_entry_pat, val): 1451 | ret[parse_chunk_id(k)] = v 1452 | 1453 | return ret 1454 | 1455 | 1456 | def parse_chunk_id(v): 1457 | if v.startswith('"') or v.startswith("'"): # todo not exact, should be json but nothing is exact here 1458 | v = v[1:-1] 1459 | elif 'e' in v and (m := re.match(r'^(\d+)(?:e(\d+))', v)): 1460 | v = m.group(1) + '0'*int(m.group(2)) 1461 | 1462 | return v 1463 | 1464 | 1465 | def filter_es_variants(links): 1466 | ret = [] 1467 | best_dict = {} 1468 | 1469 | for link in links: 1470 | parts = re.split(r'(?<=[-.])es(\d+|next)(?=[.-])', link, maxsplit=1) 1471 | 1472 | if len(parts) == 1: 1473 | ret.append(link) 1474 | else: 1475 | key = parts[0] + '?' + parts[2] 1476 | nval = parts[1] 1477 | curr = best_dict.get(key, 0) 1478 | if nval == 'next' or (curr != 'next' and int(nval) > int(curr)): 1479 | best_dict[key] = nval 1480 | 1481 | for k, v in best_dict.items(): 1482 | ret.append(k.replace('?', 'es' + v)) 1483 | 1484 | return ret 1485 | 1486 | 1487 | def get_config_from_args(): 1488 | tpl_common = 'svg,json,webmanifest,ico,eot,woff,woff2,otf,ttf' 1489 | tpl_images = 'jpg,jpeg,png,gif,webp' 1490 | tpl_media = 'mp3,ogg,wav,m4a,opus,mp4,mov,mkv,webm' 1491 | 1492 | default_ua = 'Mozilla/5.0 (Windows NT 10.0; rv:124.0) Gecko/20100101 Firefox/124.0' 1493 | 1494 | parser = argparse.ArgumentParser() 1495 | 1496 | parser.add_argument('url', help='The root url.') 1497 | parser.add_argument('other_urls', nargs='*', help='Other urls that should be scanned.') 1498 | parser.add_argument('-o', '--output', default='-', help='Output can be a zip file (needs to end with .zip), a directory or stdout specified via the - character (default)') 1499 | parser.add_argument('-v', '--verbose', action='count', default=0, help='Increase verbosity, use -vv for even more verbosity.') 1500 | 1501 | scope_group = parser.add_argument_group('scope options') 1502 | scope_group.add_argument('-wo', '--whitelist-origin', action='append', default=[], help='Make requests to this origin/domain only. May be specified multiple times.') 1503 | scope_group.add_argument('-ob', '--use-original-base', action='store_true', help="Don't change the base url if the original request has redirected. 
By default, the base url is updated.") 1504 | scope_group.add_argument('-ib', '--ignore-base-tag', action='store_true', help="Ignore html <base> tags.") 1505 | scope_group.add_argument('-c', '--add-cookie', action='append', default=[], help='Add a cookie in the form name=value to all requests.') 1506 | scope_group.add_argument('-H', '--add-header', action='append', default=[], help='Add a given HTTP header (name: value) to all requests.') 1507 | scope_group.add_argument('-ppm', '--public-path-map', action='append', default=[], help='Add a custom public path for a given chunk index file (indexfile=publicpath), see "public path for" in the debug output to find out which index files were found.') 1508 | 1509 | scan_group = parser.add_argument_group('scan options') 1510 | scan_group.add_argument('-a', '--aggressive-mode', action='store_true', help='Scan JS/HTML files for possible script paths more aggressively.') 1511 | scan_group.add_argument('-i', '--ignore-vendor', action='store_true', help='Do not fetch source maps for scripts starting with vendor.') 1512 | scan_group.add_argument('-nn', '--no-nested-sourcemaps', action='store_true', help='Do not unpack inline sourcemaps found inside mapped content.') 1513 | scan_group.add_argument('-so', '--save-original-assets', action='store_true', help='Save original asset files even if a source map exists.') 1514 | scan_group.add_argument('-as', '--all-srcmap-urls', action='store_true', help='By default only one map specified by sourceMappingURL is fetched for a given script - this option overrides that. Use with caution, might generate many additional requests which are usually unsuccessful.') 1515 | scan_group.add_argument('-nh', '--no-header-sourcemaps', action='store_true', help='Do not detect source maps from SourceMap and X-SourceMap headers.') 1516 | 1517 | asset_group = parser.add_argument_group('asset options') 1518 | 1519 | asset_group.add_argument('-ae', '--asset-extensions', action='append', default=[], help='Specify comma-separated list of extensions for additional asset files to be saved. By default only HTML/JS/CSS files are saved. 
May be specified multiple times.') 1520 | asset_group.add_argument('-sa', '--save-common-assets', action='store_true', help=f'Shortcut for --asset-extensions={tpl_common}') 1521 | asset_group.add_argument('-si', '--save-images', action='store_true', help=f'Shortcut for --asset-extensions={tpl_images}') 1522 | asset_group.add_argument('-sm', '--save-media', action='store_true', help=f'Shortcut for --asset-extensions={tpl_media}') 1523 | asset_group.add_argument('-nc', '--no-css', action='store_true', help='Do not save CSS files.') 1524 | 1525 | args = parser.parse_args() 1526 | 1527 | config = { 1528 | 'root': args.url, 1529 | 'origin_whitelist': args.whitelist_origin, 1530 | 'cookies': {k: v for val in args.add_cookie for k, v in [val.split('=', 1)]}, 1531 | 'headers': {k: v for val in args.add_header for k, v in [val.split(': ', 1)]}, 1532 | 'other_urls': args.other_urls, 1533 | 'aggressive_mode': args.aggressive_mode, 1534 | 'ignore_vendor': args.ignore_vendor, 1535 | 'extract_nested_sourcemaps': not args.no_nested_sourcemaps, 1536 | 'save_original_assets': args.save_original_assets, 1537 | 'all_srcmap_urls': args.all_srcmap_urls, 1538 | 'no_header_sourcemaps': args.no_header_sourcemaps, 1539 | 'skip_css': args.no_css, 1540 | 'other_asset_extensions': set([e for exts in args.asset_extensions for e in exts.split(',')]), 1541 | 'use_original_base': args.use_original_base, 1542 | 'ignore_base_tag': args.ignore_base_tag, 1543 | 'public_path_map': {k: v for val in args.public_path_map for k, v in [val.split('=', 1)]}, 1544 | } 1545 | 1546 | if args.save_common_assets: 1547 | config['other_asset_extensions'].update(tpl_common.split(',')) 1548 | 1549 | if args.save_images: 1550 | config['other_asset_extensions'].update(tpl_images.split(',')) 1551 | 1552 | if args.save_media: 1553 | config['other_asset_extensions'].update(tpl_media.split(',')) 1554 | 1555 | if 'User-Agent' not in config['headers'] and 'user-agent' not in config['headers']: 1556 | config['headers']['User-Agent'] = default_ua 1557 | 1558 | return config, args.output, args.verbose 1559 | 1560 | 1561 | def main(): 1562 | config, output, verbosity = get_config_from_args() 1563 | 1564 | log.level += verbosity 1565 | 1566 | gf = GetFrontend(config) 1567 | gf.run() 1568 | 1569 | if output.endswith('.zip'): 1570 | with open(output, 'wb') as wf: 1571 | gf.export_to_file(wf) 1572 | 1573 | elif output != '-': 1574 | gf.export_to_directory(output) 1575 | 1576 | else: 1577 | gf.dump_to_stdout() 1578 | 1579 | 1580 | if __name__ == '__main__': 1581 | main() 1582 | --------------------------------------------------------------------------------