├── .gitignore ├── LICENSE ├── README.md ├── git_dumper.py ├── pyproject.toml ├── requirements.txt └── setup.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Maxime Arthaud 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # git-dumper 2 | 3 | A tool to dump a git repository from a website. 4 | 5 | ## Install 6 | 7 | This can be installed easily with pip: 8 | ``` 9 | pip install git-dumper 10 | ``` 11 | 12 | ## Usage 13 | 14 | ``` 15 | usage: git-dumper [options] URL DIR 16 | 17 | Dump a git repository from a website. 
18 |
19 | positional arguments:
20 |   URL                   url
21 |   DIR                   output directory
22 |
23 | optional arguments:
24 |   -h, --help            show this help message and exit
25 |   --proxy PROXY         use the specified proxy
26 |   -j JOBS, --jobs JOBS  number of simultaneous requests
27 |   -r RETRY, --retry RETRY
28 |                         number of request attempts before giving up
29 |   -t TIMEOUT, --timeout TIMEOUT
30 |                         maximum time in seconds before giving up
31 |   -u USER_AGENT, --user-agent USER_AGENT
32 |                         user-agent to use for requests
33 |   -H HEADER, --header HEADER
34 |                         additional http headers, e.g `NAME=VALUE`
35 |   --client-cert-p12 CLIENT_CERT_P12
36 |                         client certificate in PKCS#12 format
37 |   --client-cert-p12-password CLIENT_CERT_P12_PASSWORD
38 |                         password for the client certificate
39 | ```
40 |
41 | ### Example
42 |
43 | ```
44 | git-dumper http://website.com/.git ~/website
45 | ```
46 |
47 | ### Disclaimer
48 |
49 | **Use this software at your own risk!**
50 |
51 | Be aware that if the repository you are downloading is controlled by an attacker,
52 | this could lead to remote code execution on your machine.
53 |
54 | ## Build from source
55 |
56 | Install the dependencies with pip:
57 | ```
58 | pip install -r requirements.txt
59 | ```
60 |
61 | Then run:
62 | ```
63 | ./git_dumper.py http://website.com/.git ~/website
64 | ```
65 |
66 | ## How does it work?
67 |
68 | The tool first checks whether directory listing is available. If it is, it simply downloads the `.git` directory recursively (essentially what you would do with a recursive `wget`).
69 |
70 | If directory listing is not available, it uses several methods to find as many files as possible. Step by step, git-dumper will:
71 | * Fetch all common files (`.gitignore`, `.git/HEAD`, `.git/index`, etc.);
72 | * Find as many refs as possible (such as `refs/heads/master`, `refs/remotes/origin/HEAD`, etc.) by analyzing `.git/HEAD`, `.git/logs/HEAD`, `.git/config`, `.git/packed-refs` and so on;
73 | * Find as many objects (sha1) as possible by analyzing `.git/packed-refs`, `.git/index`, `.git/refs/*` and `.git/logs/*`;
74 | * Fetch all objects recursively, analyzing each commit to find its parents;
75 | * Run `git checkout .` to recover the current working tree.
76 |
--------------------------------------------------------------------------------
/git_dumper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | from contextlib import closing
3 | import argparse
4 | import multiprocessing
5 | import os
6 | import os.path
7 | import re
8 | import socket
9 | import subprocess
10 | import sys
11 | import traceback
12 | import urllib.parse
13 |
14 | import urllib3
15 |
16 | import bs4
17 | import dulwich.index
18 | import dulwich.objects
19 | import dulwich.pack
20 | import requests
21 | import socks
22 | from requests_pkcs12 import Pkcs12Adapter
23 |
24 |
25 | def printf(fmt, *args, file=sys.stdout):
26 |     if args:
27 |         fmt = fmt % args
28 |
29 |     file.write(fmt)
30 |     file.flush()
31 |
32 |
33 | def is_html(response):
34 |     """ Return True if the response is an HTML webpage """
35 |     return (
36 |         "Content-Type" in response.headers
37 |         and "text/html" in response.headers["Content-Type"]
38 |     )
39 |
40 |
41 | def is_safe_path(path):
42 |     """ Prevent directory traversal attacks """
43 |     if path.startswith("/"):
44 |         return False
45 |
46 |     safe_path = os.path.expanduser("~")
47 |     return (
48 |         os.path.commonpath(
49 |             (os.path.realpath(os.path.join(safe_path, path)), safe_path)
50 |         )
51 |         == safe_path
52 |     )
53 |
54 |
55 | def get_indexed_files(response):
56 |     """ Return all the files in the directory index webpage """
57 |     html = bs4.BeautifulSoup(response.text, "html.parser")
58 |     files = []
59 |
60 |     for link in html.find_all("a"):
61 |         url = urllib.parse.urlparse(link.get("href"))
62 |
63 |         if (
64 |             url.path
65 |             and is_safe_path(url.path)
66 |             and not url.scheme
67 |             and not url.netloc
68 |         ):
69 |             files.append(url.path)
70 |
71 |     return files
72 |
73 |
74 | def verify_response(response):
75 |     if response.status_code != 200:
76 |         return (
77 |             False,
78 |             "[-] %s/%s responded with status code {code}\n".format(
79 |                 code=response.status_code
80 |             ),
81 |         )
82 |     elif (
83 |         "Content-Length" in response.headers
84 |         and response.headers["Content-Length"] == "0"  # header values are strings
85 |     ):
86 |         return False, "[-] %s/%s responded with a zero-length body\n"
87 |     elif (
88 |         "Content-Type" in response.headers
89 |         and "text/html" in response.headers["Content-Type"]
90 |     ):
91 |         return False, "[-] %s/%s responded with HTML\n"
92 |     else:
93 |         return True, True
94 |
95 |
96 | def create_intermediate_dirs(path):
97 |     """ Create intermediate directories, if necessary """
98 |
99 |     dirname, basename = os.path.split(path)
100 |
101 |     if dirname and not os.path.exists(dirname):
102 |         try:
103 |             os.makedirs(dirname)
104 |         except FileExistsError:
105 |             pass  # race condition
106 |
107 |
108 | def get_referenced_sha1(obj_file):
109 |     """ Return all the referenced SHA1 in the given object file """
110 |     objs = []
111 |
112 |     if isinstance(obj_file, dulwich.objects.Commit):
113 |         objs.append(obj_file.tree.decode())
114 |
115 |         for parent in obj_file.parents:
116 |             objs.append(parent.decode())
117 |     elif isinstance(obj_file, dulwich.objects.Tree):
118 |         for item in obj_file.iteritems():
119 |             objs.append(item.sha.decode())
120 |     elif isinstance(obj_file, dulwich.objects.Blob):
121 | pass 122 | elif isinstance(obj_file, dulwich.objects.Tag): 123 | pass 124 | else: 125 | printf( 126 | "error: unexpected object type: %r\n" % obj_file, file=sys.stderr 127 | ) 128 | sys.exit(1) 129 | 130 | return objs 131 | 132 | 133 | class Worker(multiprocessing.Process): 134 | """ Worker for process_tasks """ 135 | 136 | def __init__(self, pending_tasks, tasks_done, args): 137 | super().__init__() 138 | self.daemon = True 139 | self.pending_tasks = pending_tasks 140 | self.tasks_done = tasks_done 141 | self.args = args 142 | 143 | def run(self): 144 | # initialize process 145 | self.init(*self.args) 146 | 147 | # fetch and do tasks 148 | while True: 149 | task = self.pending_tasks.get(block=True) 150 | 151 | if task is None: # end signal 152 | return 153 | 154 | try: 155 | result = self.do_task(task, *self.args) 156 | except Exception: 157 | printf("Task %s raised exception:\n", task, file=sys.stderr) 158 | traceback.print_exc() 159 | result = [] 160 | 161 | assert isinstance( 162 | result, list 163 | ), "do_task() should return a list of tasks" 164 | 165 | self.tasks_done.put(result) 166 | 167 | def init(self, *args): 168 | raise NotImplementedError 169 | 170 | def do_task(self, task, *args): 171 | raise NotImplementedError 172 | 173 | 174 | def process_tasks(initial_tasks, worker, jobs, args=(), tasks_done=None): 175 | """ Process tasks in parallel """ 176 | 177 | if not initial_tasks: 178 | return 179 | 180 | tasks_seen = set(tasks_done) if tasks_done else set() 181 | pending_tasks = multiprocessing.Queue() 182 | tasks_done = multiprocessing.Queue() 183 | num_pending_tasks = 0 184 | 185 | # add all initial tasks in the queue 186 | for task in initial_tasks: 187 | assert task is not None 188 | 189 | if task not in tasks_seen: 190 | pending_tasks.put(task) 191 | num_pending_tasks += 1 192 | tasks_seen.add(task) 193 | 194 | # initialize processes 195 | processes = [worker(pending_tasks, tasks_done, args) for _ in range(jobs)] 196 | 197 | # launch them all 198 | for p in processes: 199 | p.start() 200 | 201 | # collect task results 202 | while num_pending_tasks > 0: 203 | task_result = tasks_done.get(block=True) 204 | num_pending_tasks -= 1 205 | 206 | for task in task_result: 207 | assert task is not None 208 | 209 | if task not in tasks_seen: 210 | pending_tasks.put(task) 211 | num_pending_tasks += 1 212 | tasks_seen.add(task) 213 | 214 | # send termination signal (task=None) 215 | for _ in range(jobs): 216 | pending_tasks.put(None) 217 | 218 | # join all 219 | for p in processes: 220 | p.join() 221 | 222 | 223 | class DownloadWorker(Worker): 224 | """ Download a list of files """ 225 | 226 | def init(self, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 227 | self.session = requests.Session() 228 | self.session.verify = False 229 | self.session.headers = http_headers 230 | if client_cert_p12: 231 | self.session.mount(url, Pkcs12Adapter(pkcs12_filename=client_cert_p12, pkcs12_password=client_cert_p12_password)) 232 | else: 233 | self.session.mount(url, requests.adapters.HTTPAdapter(max_retries=retry)) 234 | 235 | def do_task(self, filepath, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 236 | if os.path.isfile(os.path.join(directory, filepath)): 237 | printf("[-] Already downloaded %s/%s\n", url, filepath) 238 | return [] 239 | 240 | with closing( 241 | self.session.get( 242 | "%s/%s" % (url, filepath), 243 | allow_redirects=False, 244 | stream=True, 245 | timeout=timeout, 246 
| ) 247 | ) as response: 248 | printf( 249 | "[-] Fetching %s/%s [%d]\n", 250 | url, 251 | filepath, 252 | response.status_code, 253 | ) 254 | 255 | valid, error_message = verify_response(response) 256 | if not valid: 257 | printf(error_message, url, filepath, file=sys.stderr) 258 | return [] 259 | 260 | abspath = os.path.abspath(os.path.join(directory, filepath)) 261 | create_intermediate_dirs(abspath) 262 | 263 | # write file 264 | with open(abspath, "wb") as f: 265 | for chunk in response.iter_content(4096): 266 | f.write(chunk) 267 | 268 | return [] 269 | 270 | 271 | class RecursiveDownloadWorker(DownloadWorker): 272 | """ Download a directory recursively """ 273 | 274 | def do_task(self, filepath, url, directory, retry, timeout, http_headers): 275 | if os.path.isfile(os.path.join(directory, filepath)): 276 | printf("[-] Already downloaded %s/%s\n", url, filepath) 277 | return [] 278 | 279 | with closing( 280 | self.session.get( 281 | "%s/%s" % (url, filepath), 282 | allow_redirects=False, 283 | stream=True, 284 | timeout=timeout, 285 | ) 286 | ) as response: 287 | printf( 288 | "[-] Fetching %s/%s [%d]\n", 289 | url, 290 | filepath, 291 | response.status_code, 292 | ) 293 | 294 | if ( 295 | response.status_code in (301, 302) 296 | and "Location" in response.headers 297 | and response.headers["Location"].endswith(filepath + "/") 298 | ): 299 | return [filepath + "/"] 300 | 301 | if filepath.endswith("/"): # directory index 302 | assert is_html(response) 303 | 304 | return [ 305 | filepath + filename 306 | for filename in get_indexed_files(response) 307 | ] 308 | else: # file 309 | valid, error_message = verify_response(response) 310 | if not valid: 311 | printf(error_message, url, filepath, file=sys.stderr) 312 | return [] 313 | 314 | abspath = os.path.abspath(os.path.join(directory, filepath)) 315 | create_intermediate_dirs(abspath) 316 | 317 | # write file 318 | with open(abspath, "wb") as f: 319 | for chunk in response.iter_content(4096): 320 | f.write(chunk) 321 | 322 | return [] 323 | 324 | 325 | class FindRefsWorker(DownloadWorker): 326 | """ Find refs/ """ 327 | 328 | def do_task(self, filepath, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 329 | response = self.session.get( 330 | "%s/%s" % (url, filepath), allow_redirects=False, timeout=timeout 331 | ) 332 | printf( 333 | "[-] Fetching %s/%s [%d]\n", url, filepath, response.status_code 334 | ) 335 | 336 | valid, error_message = verify_response(response) 337 | if not valid: 338 | printf(error_message, url, filepath, file=sys.stderr) 339 | return [] 340 | 341 | abspath = os.path.abspath(os.path.join(directory, filepath)) 342 | create_intermediate_dirs(abspath) 343 | 344 | # write file 345 | with open(abspath, "w") as f: 346 | f.write(response.text) 347 | 348 | # find refs 349 | tasks = [] 350 | 351 | for ref in re.findall( 352 | r"(refs(/[a-zA-Z0-9\-\.\_\*]+)+)", response.text 353 | ): 354 | ref = ref[0] 355 | if not ref.endswith("*") and is_safe_path(ref): 356 | tasks.append(".git/%s" % ref) 357 | tasks.append(".git/logs/%s" % ref) 358 | 359 | return tasks 360 | 361 | 362 | class FindObjectsWorker(DownloadWorker): 363 | """ Find objects """ 364 | 365 | def do_task(self, obj, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 366 | filepath = ".git/objects/%s/%s" % (obj[:2], obj[2:]) 367 | 368 | if os.path.isfile(os.path.join(directory, filepath)): 369 | printf("[-] Already downloaded %s/%s\n", url, filepath) 370 | else: 371 | 
            response = self.session.get(
372 |                 "%s/%s" % (url, filepath),
373 |                 allow_redirects=False,
374 |                 timeout=timeout,
375 |             )
376 |             printf(
377 |                 "[-] Fetching %s/%s [%d]\n",
378 |                 url,
379 |                 filepath,
380 |                 response.status_code,
381 |             )
382 |
383 |             valid, error_message = verify_response(response)
384 |             if not valid:
385 |                 printf(error_message, url, filepath, file=sys.stderr)
386 |                 return []
387 |
388 |             abspath = os.path.abspath(os.path.join(directory, filepath))
389 |             create_intermediate_dirs(abspath)
390 |
391 |             # write file
392 |             with open(abspath, "wb") as f:
393 |                 f.write(response.content)
394 |
395 |         abspath = os.path.abspath(os.path.join(directory, filepath))
396 |         # parse object file to find other objects
397 |         obj_file = dulwich.objects.ShaFile.from_path(abspath)
398 |         return get_referenced_sha1(obj_file)
399 |
400 |
401 | def sanitize_file(filepath):
402 |     """ Inplace comment out possibly unsafe lines based on regex """
403 |     assert os.path.isfile(filepath), "%s is not a file" % filepath
404 |
405 |     UNSAFE = r"^\s*fsmonitor|sshcommand|askpass|editor|pager"
406 |
407 |     with open(filepath, 'r+') as f:
408 |         content = f.read()
409 |         modified_content = re.sub(UNSAFE, r"# \g<0>", content, flags=re.IGNORECASE | re.MULTILINE)  # MULTILINE so ^ anchors each line
410 |         if content != modified_content:
411 |             printf("Warning: '%s' file was altered\n" % filepath)
412 |             f.seek(0)
413 |             f.write(modified_content)
414 |
415 |
416 | def fetch_git(url, directory, jobs, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None):
417 |     """ Dump a git repository into the output directory """
418 |
419 |     assert os.path.isdir(directory), "%s is not a directory" % directory
420 |     assert jobs >= 1, "invalid number of jobs"
421 |     assert retry >= 1, "invalid number of retries"
422 |     assert timeout >= 1, "invalid timeout"
423 |
424 |     session = requests.Session()
425 |     session.verify = False
426 |     session.headers = http_headers
427 |     if client_cert_p12:
428 |         session.mount(url, Pkcs12Adapter(pkcs12_filename=client_cert_p12, pkcs12_password=client_cert_p12_password))
429 |     else:
430 |         session.mount(url, requests.adapters.HTTPAdapter(max_retries=retry))
431 |     if os.listdir(directory):
432 |         printf("Warning: Destination '%s' is not empty\n", directory)
433 |
434 |     # find base url
435 |     url = url.rstrip("/")
436 |     if url.endswith("HEAD"):
437 |         url = url[:-4]
438 |     url = url.rstrip("/")
439 |     if url.endswith(".git"):
440 |         url = url[:-4]
441 |     url = url.rstrip("/")
442 |
443 |     # check for /.git/HEAD
444 |     printf("[-] Testing %s/.git/HEAD ", url)
445 |     response = session.get(
446 |         "%s/.git/HEAD" % url,
447 |         timeout=timeout,
448 |         allow_redirects=False
449 |     )
450 |     printf("[%d]\n", response.status_code)
451 |
452 |     valid, error_message = verify_response(response)
453 |     if not valid:
454 |         printf(error_message, url, "/.git/HEAD", file=sys.stderr)
455 |         return 1
456 |     elif not re.match(r"^(ref:.*|[0-9a-f]{40}$)", response.text.strip()):
457 |         printf(
458 |             "error: %s/.git/HEAD is not a git HEAD file\n",
459 |             url,
460 |             file=sys.stderr,
461 |         )
462 |         return 1
463 |
464 |     # set up environment to ensure proxy usage
465 |     environment = os.environ.copy()
466 |     configured_proxy = socks.getdefaultproxy()
467 |     if configured_proxy is not None:
468 |         proxy_types = {socks.PROXY_TYPE_HTTP: "http", socks.PROXY_TYPE_SOCKS4: "socks4h", socks.PROXY_TYPE_SOCKS5: "socks5h"}  # PySocks constant -> URL scheme
469 |         environment["ALL_PROXY"] = f"{proxy_types[configured_proxy[0]]}://{configured_proxy[1]}:{configured_proxy[2]}"
470 |
471 |     # check for directory listing
472 |     printf("[-] Testing %s/.git/ ", url)
473 |     response = session.get("%s/.git/" % url,
allow_redirects=False) 474 | printf("[%d]\n", response.status_code) 475 | 476 | 477 | if ( 478 | response.status_code == 200 479 | and is_html(response) 480 | and "HEAD" in get_indexed_files(response) 481 | ): 482 | printf("[-] Fetching .git recursively\n") 483 | process_tasks( 484 | [".git/", ".gitignore"], 485 | RecursiveDownloadWorker, 486 | jobs, 487 | args=(url, directory, retry, timeout, http_headers), 488 | ) 489 | 490 | os.chdir(directory) 491 | 492 | printf("[-] Sanitizing .git/config\n") 493 | sanitize_file(".git/config") 494 | 495 | printf("[-] Running git checkout .\n") 496 | subprocess.check_call(["git", "checkout", "."], env=environment) 497 | return 0 498 | 499 | # no directory listing 500 | printf("[-] Fetching common files\n") 501 | tasks = [ 502 | ".gitignore", 503 | ".git/COMMIT_EDITMSG", 504 | ".git/description", 505 | ".git/hooks/applypatch-msg.sample", 506 | ".git/hooks/commit-msg.sample", 507 | ".git/hooks/post-commit.sample", 508 | ".git/hooks/post-receive.sample", 509 | ".git/hooks/post-update.sample", 510 | ".git/hooks/pre-applypatch.sample", 511 | ".git/hooks/pre-commit.sample", 512 | ".git/hooks/pre-push.sample", 513 | ".git/hooks/pre-rebase.sample", 514 | ".git/hooks/pre-receive.sample", 515 | ".git/hooks/prepare-commit-msg.sample", 516 | ".git/hooks/update.sample", 517 | ".git/index", 518 | ".git/info/exclude", 519 | ".git/objects/info/packs", 520 | ] 521 | process_tasks( 522 | tasks, 523 | DownloadWorker, 524 | jobs, 525 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 526 | ) 527 | 528 | # find refs 529 | printf("[-] Finding refs/\n") 530 | tasks = [ 531 | ".git/FETCH_HEAD", 532 | ".git/HEAD", 533 | ".git/ORIG_HEAD", 534 | ".git/config", 535 | ".git/info/refs", 536 | ".git/logs/HEAD", 537 | ".git/logs/refs/heads/main", 538 | ".git/logs/refs/heads/master", 539 | ".git/logs/refs/heads/staging", 540 | ".git/logs/refs/heads/production", 541 | ".git/logs/refs/heads/development", 542 | ".git/logs/refs/remotes/origin/HEAD", 543 | ".git/logs/refs/remotes/origin/main", 544 | ".git/logs/refs/remotes/origin/master", 545 | ".git/logs/refs/remotes/origin/staging", 546 | ".git/logs/refs/remotes/origin/production", 547 | ".git/logs/refs/remotes/origin/development", 548 | ".git/logs/refs/stash", 549 | ".git/packed-refs", 550 | ".git/refs/heads/main", 551 | ".git/refs/heads/master", 552 | ".git/refs/heads/staging", 553 | ".git/refs/heads/production", 554 | ".git/refs/heads/development", 555 | ".git/refs/remotes/origin/HEAD", 556 | ".git/refs/remotes/origin/main", 557 | ".git/refs/remotes/origin/master", 558 | ".git/refs/remotes/origin/staging", 559 | ".git/refs/remotes/origin/production", 560 | ".git/refs/remotes/origin/development", 561 | ".git/refs/stash", 562 | ".git/refs/wip/wtree/refs/heads/main", 563 | ".git/refs/wip/wtree/refs/heads/master", 564 | ".git/refs/wip/wtree/refs/heads/staging", 565 | ".git/refs/wip/wtree/refs/heads/production", 566 | ".git/refs/wip/wtree/refs/heads/development", 567 | ".git/refs/wip/index/refs/heads/main", 568 | ".git/refs/wip/index/refs/heads/master", 569 | ".git/refs/wip/index/refs/heads/staging", 570 | ".git/refs/wip/index/refs/heads/production", 571 | ".git/refs/wip/index/refs/heads/development" 572 | ] 573 | 574 | process_tasks( 575 | tasks, 576 | FindRefsWorker, 577 | jobs, 578 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 579 | ) 580 | 581 | # find packs 582 | printf("[-] Finding packs\n") 583 | tasks = [] 584 | 585 | # use 
.git/objects/info/packs to find packs 586 | info_packs_path = os.path.join( 587 | directory, ".git", "objects", "info", "packs" 588 | ) 589 | if os.path.exists(info_packs_path): 590 | with open(info_packs_path, "r") as f: 591 | info_packs = f.read() 592 | 593 | for sha1 in re.findall(r"pack-([a-f0-9]{40})\.pack", info_packs): 594 | tasks.append(".git/objects/pack/pack-%s.idx" % sha1) 595 | tasks.append(".git/objects/pack/pack-%s.pack" % sha1) 596 | 597 | process_tasks( 598 | tasks, 599 | DownloadWorker, 600 | jobs, 601 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 602 | ) 603 | 604 | # find objects 605 | printf("[-] Finding objects\n") 606 | objs = set() 607 | packed_objs = set() 608 | 609 | # .git/packed-refs, .git/info/refs, .git/refs/*, .git/logs/* 610 | files = [ 611 | os.path.join(directory, ".git", "packed-refs"), 612 | os.path.join(directory, ".git", "info", "refs"), 613 | os.path.join(directory, ".git", "FETCH_HEAD"), 614 | os.path.join(directory, ".git", "ORIG_HEAD"), 615 | ] 616 | for dirpath, _, filenames in os.walk( 617 | os.path.join(directory, ".git", "refs") 618 | ): 619 | for filename in filenames: 620 | files.append(os.path.join(dirpath, filename)) 621 | for dirpath, _, filenames in os.walk( 622 | os.path.join(directory, ".git", "logs") 623 | ): 624 | for filename in filenames: 625 | files.append(os.path.join(dirpath, filename)) 626 | 627 | for filepath in files: 628 | if not os.path.exists(filepath): 629 | continue 630 | 631 | with open(filepath, "r") as f: 632 | content = f.read() 633 | 634 | for obj in re.findall(r"(^|\s)([a-f0-9]{40})($|\s)", content): 635 | obj = obj[1] 636 | objs.add(obj) 637 | 638 | # use .git/index to find objects 639 | index_path = os.path.join(directory, ".git", "index") 640 | if os.path.exists(index_path): 641 | index = dulwich.index.Index(index_path) 642 | 643 | for entry in index.iterobjects(): 644 | objs.add(entry[1].decode()) 645 | 646 | # use packs to find more objects to fetch, and objects that are packed 647 | pack_file_dir = os.path.join(directory, ".git", "objects", "pack") 648 | if os.path.isdir(pack_file_dir): 649 | for filename in os.listdir(pack_file_dir): 650 | if filename.startswith("pack-") and filename.endswith(".pack"): 651 | pack_data_path = os.path.join(pack_file_dir, filename) 652 | pack_idx_path = os.path.join( 653 | pack_file_dir, filename[:-5] + ".idx" 654 | ) 655 | pack_data = dulwich.pack.PackData(pack_data_path) 656 | pack_idx = dulwich.pack.load_pack_index(pack_idx_path) 657 | pack = dulwich.pack.Pack.from_objects(pack_data, pack_idx) 658 | 659 | for obj_file in pack.iterobjects(): 660 | packed_objs.add(obj_file.sha().hexdigest()) 661 | objs |= set(get_referenced_sha1(obj_file)) 662 | 663 | # fetch all objects 664 | printf("[-] Fetching objects\n") 665 | process_tasks( 666 | objs, 667 | FindObjectsWorker, 668 | jobs, 669 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 670 | tasks_done=packed_objs, 671 | ) 672 | 673 | # git checkout 674 | printf("[-] Running git checkout .\n") 675 | os.chdir(directory) 676 | sanitize_file(".git/config") 677 | 678 | # ignore errors 679 | subprocess.call( 680 | ["git", "checkout", "."], 681 | stderr=open(os.devnull, "wb"), 682 | env=environment 683 | ) 684 | 685 | return 0 686 | 687 | 688 | def main(): 689 | parser = argparse.ArgumentParser( 690 | usage="git-dumper [options] URL DIR", 691 | description="Dump a git repository from a website.", 692 | ) 693 | parser.add_argument("url", 
metavar="URL", help="url") 694 | parser.add_argument("directory", metavar="DIR", help="output directory") 695 | parser.add_argument("--proxy", help="use the specified proxy") 696 | parser.add_argument("--client-cert-p12", help="client certificate in PKCS#12") 697 | parser.add_argument("--client-cert-p12-password", help="password for the client certificate") 698 | parser.add_argument( 699 | "-j", 700 | "--jobs", 701 | type=int, 702 | default=10, 703 | help="number of simultaneous requests", 704 | ) 705 | parser.add_argument( 706 | "-r", 707 | "--retry", 708 | type=int, 709 | default=3, 710 | help="number of request attempts before giving up", 711 | ) 712 | parser.add_argument( 713 | "-t", 714 | "--timeout", 715 | type=int, 716 | default=3, 717 | help="maximum time in seconds before giving up", 718 | ) 719 | parser.add_argument( 720 | "-u", 721 | "--user-agent", 722 | type=str, 723 | default="Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0", 724 | help="user-agent to use for requests", 725 | ) 726 | parser.add_argument( 727 | "-H", 728 | "--header", 729 | type=str, 730 | action="append", 731 | help="additional http headers, e.g `NAME=VALUE`", 732 | ) 733 | args = parser.parse_args() 734 | 735 | # jobs 736 | if args.jobs < 1: 737 | parser.error("invalid number of jobs, got `%d`" % args.jobs) 738 | 739 | # retry 740 | if args.retry < 1: 741 | parser.error("invalid number of retries, got `%d`" % args.retry) 742 | 743 | # timeout 744 | if args.timeout < 1: 745 | parser.error("invalid timeout, got `%d`" % args.timeout) 746 | 747 | # header 748 | http_headers = {"User-Agent": args.user_agent} 749 | if args.header: 750 | for header in args.header: 751 | tokens = header.split("=", maxsplit=1) 752 | if len(tokens) != 2: 753 | parser.error( 754 | "http header must have the form NAME=VALUE, got `%s`" 755 | % header 756 | ) 757 | name, value = tokens 758 | http_headers[name.strip()] = value.strip() 759 | 760 | # proxy 761 | if args.proxy: 762 | proxy_valid = False 763 | 764 | for pattern, proxy_type in [ 765 | (r"^socks5:(.*):(\d+)$", socks.PROXY_TYPE_SOCKS5), 766 | (r"^socks4:(.*):(\d+)$", socks.PROXY_TYPE_SOCKS4), 767 | (r"^http://(.*):(\d+)$", socks.PROXY_TYPE_HTTP), 768 | (r"^(.*):(\d+)$", socks.PROXY_TYPE_SOCKS5), 769 | ]: 770 | m = re.match(pattern, args.proxy) 771 | if m: 772 | socks.setdefaultproxy(proxy_type, m.group(1), int(m.group(2))) 773 | socket.socket = socks.socksocket 774 | proxy_valid = True 775 | break 776 | 777 | if not proxy_valid: 778 | parser.error("invalid proxy, got `%s`" % args.proxy) 779 | 780 | # output directory 781 | if not os.path.exists(args.directory): 782 | os.makedirs(args.directory) 783 | 784 | if not os.path.isdir(args.directory): 785 | parser.error("`%s` is not a directory" % args.directory) 786 | 787 | # client certificate 788 | if args.client_cert_p12: 789 | if not os.path.exists(args.client_cert_p12): 790 | parser.error( 791 | "client certificate `%s` does not exist" % args.client_cert_p12 792 | ) 793 | 794 | if not os.path.isfile(args.client_cert_p12): 795 | parser.error( 796 | "client certificate `%s` is not a file" % args.client_cert_p12 797 | ) 798 | 799 | if args.client_cert_p12_password is None: 800 | parser.error("client certificate password is required") 801 | 802 | urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) 803 | 804 | # fetch everything 805 | sys.exit( 806 | fetch_git( 807 | args.url, 808 | args.directory, 809 | args.jobs, 810 | args.retry, 811 | args.timeout, 812 | http_headers, 813 | args.client_cert_p12, 
814 |             args.client_cert_p12_password
815 |         )
816 |     )
817 |
818 |
819 | if __name__ == "__main__":
820 |     main()
821 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools>=42"]
3 | build-backend = "setuptools.build_meta"
4 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | PySocks
2 | requests
3 | beautifulsoup4
4 | dulwich
5 | requests-pkcs12
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | name = git-dumper
3 | version = 1.0.8
4 | author = Maxime Arthaud
5 | author_email = maxime@arthaud.me
6 | license = MIT
7 | description = A tool to dump a git repository from a website
8 | long_description = file: README.md
9 | long_description_content_type = text/markdown
10 | url = https://github.com/arthaud/git-dumper
11 | keywords = dump, git, repository, security, vulnerability, ctf
12 | classifiers =
13 |     Development Status :: 5 - Production/Stable
14 |     License :: OSI Approved :: MIT License
15 |     Topic :: Security
16 |
17 | [options]
18 | py_modules = git_dumper
19 | python_requires = >=3.6
20 | install_requires =
21 |     PySocks
22 |     requests
23 |     beautifulsoup4
24 |     dulwich
25 |     requests-pkcs12
26 |
27 | [options.entry_points]
28 | console_scripts =
29 |     git-dumper = git_dumper:main
30 |
--------------------------------------------------------------------------------
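
For completeness, here is a minimal sketch of driving the dumper from Python instead of the `git-dumper` command line, using the `fetch_git` function defined in `git_dumper.py` above. The target URL, output directory and User-Agent are hypothetical placeholders; the numeric arguments mirror the defaults that `main()` passes for `--jobs`, `--retry` and `--timeout`.

```
# Minimal usage sketch (not part of the repository): call fetch_git() directly.
# The URL, output directory and User-Agent below are hypothetical placeholders.
import os
import sys

import urllib3

from git_dumper import fetch_git

if __name__ == "__main__":
    # main() also disables this warning, since the sessions use verify=False
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    url = "http://website.example/.git"          # hypothetical target
    directory = os.path.expanduser("~/website")  # hypothetical output directory
    os.makedirs(directory, exist_ok=True)        # fetch_git() expects an existing directory

    status = fetch_git(
        url,
        directory,
        jobs=10,    # default of --jobs
        retry=3,    # default of --retry
        timeout=3,  # default of --timeout
        http_headers={"User-Agent": "Mozilla/5.0"},  # placeholder user agent
    )
    sys.exit(status)
```

The `if __name__ == "__main__"` guard matters here because `fetch_git` spawns worker processes with `multiprocessing`. Also note that it changes the working directory and runs `git checkout .` on the dumped repository, so the remote-code-execution caveat from the README disclaimer applies equally to this sketch.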