├── .gitignore ├── LICENSE ├── README.md ├── git_dumper.py ├── pyproject.toml ├── requirements.txt └── setup.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | local_settings.py 55 | 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | 60 | # Scrapy stuff: 61 | .scrapy 62 | 63 | # Sphinx documentation 64 | docs/_build/ 65 | 66 | # PyBuilder 67 | target/ 68 | 69 | # IPython Notebook 70 | .ipynb_checkpoints 71 | 72 | # pyenv 73 | .python-version 74 | 75 | # celery beat schedule file 76 | celerybeat-schedule 77 | 78 | # dotenv 79 | .env 80 | 81 | # virtualenv 82 | venv/ 83 | ENV/ 84 | 85 | # Spyder project settings 86 | .spyderproject 87 | 88 | # Rope project settings 89 | .ropeproject 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Maxime Arthaud 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # git-dumper 2 | 3 | A tool to dump a git repository from a website. 4 | 5 | ## Install 6 | 7 | This can be installed easily with pip: 8 | ``` 9 | pip install git-dumper 10 | ``` 11 | 12 | ## Usage 13 | 14 | ``` 15 | usage: git-dumper [options] URL DIR 16 | 17 | Dump a git repository from a website. 
18 |
19 | positional arguments:
20 |   URL                   url
21 |   DIR                   output directory
22 |
23 | optional arguments:
24 |   -h, --help            show this help message and exit
25 |   --proxy PROXY         use the specified proxy
26 |   -j JOBS, --jobs JOBS  number of simultaneous requests
27 |   -r RETRY, --retry RETRY
28 |                         number of request attempts before giving up
29 |   -t TIMEOUT, --timeout TIMEOUT
30 |                         maximum time in seconds before giving up
31 |   -u USER_AGENT, --user-agent USER_AGENT
32 |                         user-agent to use for requests
33 |   -H HEADER, --header HEADER
34 |                         additional http headers, e.g `NAME=VALUE`
35 |   --client-cert-p12 CLIENT_CERT_P12
36 |                         client certificate in PKCS#12 format
37 |   --client-cert-p12-password CLIENT_CERT_P12_PASSWORD
38 |                         password for the client certificate
39 | ```
40 |
41 | ### Example
42 |
43 | ```
44 | git-dumper http://website.com/.git ~/website
45 | ```
46 |
47 | ### Disclaimer
48 |
49 | **Use this software at your own risk!**
50 |
51 | Be aware that if the repository you are downloading is controlled by an attacker,
52 | this could lead to remote code execution on your machine.
53 |
54 | ## Build from source
55 |
56 | Install the dependencies with pip:
57 | ```
58 | pip install -r requirements.txt
59 | ```
60 |
61 | Then run:
62 | ```
63 | ./git_dumper.py http://website.com/.git ~/website
64 | ```
65 |
66 | ## How does it work?
67 |
68 | The tool first checks whether directory listing is available. If it is, it simply downloads the `.git` directory recursively (essentially what you would do with a recursive `wget`).
69 |
70 | If directory listing is not available, it uses several methods to find as many files as possible. Step by step, git-dumper will:
71 | * Fetch all common files (`.gitignore`, `.git/HEAD`, `.git/index`, etc.);
72 | * Find as many refs as possible (such as `refs/heads/master`, `refs/remotes/origin/HEAD`, etc.) by analyzing `.git/HEAD`, `.git/logs/HEAD`, `.git/config`, `.git/packed-refs` and so on;
73 | * Find as many objects (sha1) as possible by analyzing `.git/packed-refs`, `.git/index`, `.git/refs/*` and `.git/logs/*`;
74 | * Fetch all objects recursively, analyzing each commit to find its parents;
75 | * Run `git checkout .` to recover the current working tree.
76 |
--------------------------------------------------------------------------------
/git_dumper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | from contextlib import closing
3 | import argparse
4 | import multiprocessing
5 | import os
6 | import os.path
7 | import re
8 | import socket
9 | import subprocess
10 | import sys
11 | import traceback
12 | import urllib.parse
13 |
14 | import urllib3
15 |
16 | import bs4
17 | import dulwich.index
18 | import dulwich.objects
19 | import dulwich.pack
20 | import requests
21 | import socks
22 | from requests_pkcs12 import Pkcs12Adapter
23 |
24 |
25 | def printf(fmt, *args, file=sys.stdout):
26 |     if args:
27 |         fmt = fmt % args
28 |
29 |     file.write(fmt)
30 |     file.flush()
31 |
32 |
33 | def is_html(response):
34 |     """ Return True if the response is an HTML webpage """
35 |     return (
36 |         "Content-Type" in response.headers
37 |         and "text/html" in response.headers["Content-Type"]
38 |     )
39 |
40 |
41 | def is_safe_path(path):
42 |     """ Prevent directory traversal attacks """
43 |     if path.startswith("/"):
44 |         return False
45 |
46 |     safe_path = os.path.expanduser("~")
47 |     return (
48 |         os.path.commonpath(
49 |             (os.path.realpath(os.path.join(safe_path, path)), safe_path)
50 |         )
51 |         == safe_path
52 |     )
53 |
54 |
55 | def get_indexed_files(response):
56 |     """ Return all the files in the directory index webpage """
57 |     html = bs4.BeautifulSoup(response.text, "html.parser")
58 |     files = []
59 |
60 |     for link in html.find_all("a"):
61 |         url = urllib.parse.urlparse(link.get("href"))
62 |
63 |         if (
64 |             url.path
65 |             and is_safe_path(url.path)
66 |             and not url.scheme
67 |             and not url.netloc
68 |         ):
69 |             files.append(url.path)
70 |
71 |     return files
72 |
73 |
74 | def verify_response(response):
75 |     if response.status_code != 200:
76 |         return (
77 |             False,
78 |             "[-] %s/%s responded with status code {code}\n".format(
79 |                 code=response.status_code
80 |             ),
81 |         )
82 |     elif (
83 |         "Content-Length" in response.headers
84 |         and response.headers["Content-Length"] == "0"  # header values are strings
85 |     ):
86 |         return False, "[-] %s/%s responded with a zero-length body\n"
87 |     elif (
88 |         "Content-Type" in response.headers
89 |         and "text/html" in response.headers["Content-Type"]
90 |     ):
91 |         return False, "[-] %s/%s responded with HTML\n"
92 |     else:
93 |         return True, True
94 |
95 |
96 | def create_intermediate_dirs(path):
97 |     """ Create intermediate directories, if necessary """
98 |
99 |     dirname, basename = os.path.split(path)
100 |
101 |     if dirname and not os.path.exists(dirname):
102 |         try:
103 |             os.makedirs(dirname)
104 |         except FileExistsError:
105 |             pass  # race condition
106 |
107 |
108 | def get_referenced_sha1(obj_file):
109 |     """ Return all the referenced SHA1 in the given object file """
110 |     objs = []
111 |
112 |     if isinstance(obj_file, dulwich.objects.Commit):
113 |         objs.append(obj_file.tree.decode())
114 |
115 |         for parent in obj_file.parents:
116 |             objs.append(parent.decode())
117 |     elif isinstance(obj_file, dulwich.objects.Tree):
118 |         for item in obj_file.iteritems():
119 |             objs.append(item.sha.decode())
120 |     elif isinstance(obj_file, dulwich.objects.Blob):
121 | pass 122 | elif isinstance(obj_file, dulwich.objects.Tag): 123 | pass 124 | else: 125 | printf( 126 | "error: unexpected object type: %r\n" % obj_file, file=sys.stderr 127 | ) 128 | sys.exit(1) 129 | 130 | return objs 131 | 132 | 133 | class Worker(multiprocessing.Process): 134 | """ Worker for process_tasks """ 135 | 136 | def __init__(self, pending_tasks, tasks_done, args): 137 | super().__init__() 138 | self.daemon = True 139 | self.pending_tasks = pending_tasks 140 | self.tasks_done = tasks_done 141 | self.args = args 142 | 143 | def run(self): 144 | # initialize process 145 | self.init(*self.args) 146 | 147 | # fetch and do tasks 148 | while True: 149 | task = self.pending_tasks.get(block=True) 150 | 151 | if task is None: # end signal 152 | return 153 | 154 | try: 155 | result = self.do_task(task, *self.args) 156 | except Exception: 157 | printf("Task %s raised exception:\n", task, file=sys.stderr) 158 | traceback.print_exc() 159 | result = [] 160 | 161 | assert isinstance( 162 | result, list 163 | ), "do_task() should return a list of tasks" 164 | 165 | self.tasks_done.put(result) 166 | 167 | def init(self, *args): 168 | raise NotImplementedError 169 | 170 | def do_task(self, task, *args): 171 | raise NotImplementedError 172 | 173 | 174 | def process_tasks(initial_tasks, worker, jobs, args=(), tasks_done=None): 175 | """ Process tasks in parallel """ 176 | 177 | if not initial_tasks: 178 | return 179 | 180 | tasks_seen = set(tasks_done) if tasks_done else set() 181 | pending_tasks = multiprocessing.Queue() 182 | tasks_done = multiprocessing.Queue() 183 | num_pending_tasks = 0 184 | 185 | # add all initial tasks in the queue 186 | for task in initial_tasks: 187 | assert task is not None 188 | 189 | if task not in tasks_seen: 190 | pending_tasks.put(task) 191 | num_pending_tasks += 1 192 | tasks_seen.add(task) 193 | 194 | # initialize processes 195 | processes = [worker(pending_tasks, tasks_done, args) for _ in range(jobs)] 196 | 197 | # launch them all 198 | for p in processes: 199 | p.start() 200 | 201 | # collect task results 202 | while num_pending_tasks > 0: 203 | task_result = tasks_done.get(block=True) 204 | num_pending_tasks -= 1 205 | 206 | for task in task_result: 207 | assert task is not None 208 | 209 | if task not in tasks_seen: 210 | pending_tasks.put(task) 211 | num_pending_tasks += 1 212 | tasks_seen.add(task) 213 | 214 | # send termination signal (task=None) 215 | for _ in range(jobs): 216 | pending_tasks.put(None) 217 | 218 | # join all 219 | for p in processes: 220 | p.join() 221 | 222 | 223 | class DownloadWorker(Worker): 224 | """ Download a list of files """ 225 | 226 | def init(self, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 227 | self.session = requests.Session() 228 | self.session.verify = False 229 | self.session.headers = http_headers 230 | if client_cert_p12: 231 | self.session.mount(url, Pkcs12Adapter(pkcs12_filename=client_cert_p12, pkcs12_password=client_cert_p12_password)) 232 | else: 233 | self.session.mount(url, requests.adapters.HTTPAdapter(max_retries=retry)) 234 | 235 | def do_task(self, filepath, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 236 | if os.path.isfile(os.path.join(directory, filepath)): 237 | printf("[-] Already downloaded %s/%s\n", url, filepath) 238 | return [] 239 | 240 | with closing( 241 | self.session.get( 242 | "%s/%s" % (url, filepath), 243 | allow_redirects=False, 244 | stream=True, 245 | timeout=timeout, 246 
| ) 247 | ) as response: 248 | printf( 249 | "[-] Fetching %s/%s [%d]\n", 250 | url, 251 | filepath, 252 | response.status_code, 253 | ) 254 | 255 | valid, error_message = verify_response(response) 256 | if not valid: 257 | printf(error_message, url, filepath, file=sys.stderr) 258 | return [] 259 | 260 | abspath = os.path.abspath(os.path.join(directory, filepath)) 261 | create_intermediate_dirs(abspath) 262 | 263 | # write file 264 | with open(abspath, "wb") as f: 265 | for chunk in response.iter_content(4096): 266 | f.write(chunk) 267 | 268 | return [] 269 | 270 | 271 | class RecursiveDownloadWorker(DownloadWorker): 272 | """ Download a directory recursively """ 273 | 274 | def do_task(self, filepath, url, directory, retry, timeout, http_headers): 275 | if os.path.isfile(os.path.join(directory, filepath)): 276 | printf("[-] Already downloaded %s/%s\n", url, filepath) 277 | return [] 278 | 279 | with closing( 280 | self.session.get( 281 | "%s/%s" % (url, filepath), 282 | allow_redirects=False, 283 | stream=True, 284 | timeout=timeout, 285 | ) 286 | ) as response: 287 | printf( 288 | "[-] Fetching %s/%s [%d]\n", 289 | url, 290 | filepath, 291 | response.status_code, 292 | ) 293 | 294 | if ( 295 | response.status_code in (301, 302) 296 | and "Location" in response.headers 297 | and response.headers["Location"].endswith(filepath + "/") 298 | ): 299 | return [filepath + "/"] 300 | 301 | if filepath.endswith("/"): # directory index 302 | assert is_html(response) 303 | 304 | return [ 305 | filepath + filename 306 | for filename in get_indexed_files(response) 307 | ] 308 | else: # file 309 | valid, error_message = verify_response(response) 310 | if not valid: 311 | printf(error_message, url, filepath, file=sys.stderr) 312 | return [] 313 | 314 | abspath = os.path.abspath(os.path.join(directory, filepath)) 315 | create_intermediate_dirs(abspath) 316 | 317 | # write file 318 | with open(abspath, "wb") as f: 319 | for chunk in response.iter_content(4096): 320 | f.write(chunk) 321 | 322 | return [] 323 | 324 | 325 | class FindRefsWorker(DownloadWorker): 326 | """ Find refs/ """ 327 | 328 | def do_task(self, filepath, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 329 | response = self.session.get( 330 | "%s/%s" % (url, filepath), allow_redirects=False, timeout=timeout 331 | ) 332 | printf( 333 | "[-] Fetching %s/%s [%d]\n", url, filepath, response.status_code 334 | ) 335 | 336 | valid, error_message = verify_response(response) 337 | if not valid: 338 | printf(error_message, url, filepath, file=sys.stderr) 339 | return [] 340 | 341 | abspath = os.path.abspath(os.path.join(directory, filepath)) 342 | create_intermediate_dirs(abspath) 343 | 344 | # write file 345 | with open(abspath, "w") as f: 346 | f.write(response.text) 347 | 348 | # find refs 349 | tasks = [] 350 | 351 | for ref in re.findall( 352 | r"(refs(/[a-zA-Z0-9\-\.\_\*]+)+)", response.text 353 | ): 354 | ref = ref[0] 355 | if not ref.endswith("*") and is_safe_path(ref): 356 | tasks.append(".git/%s" % ref) 357 | tasks.append(".git/logs/%s" % ref) 358 | 359 | return tasks 360 | 361 | 362 | class FindObjectsWorker(DownloadWorker): 363 | """ Find objects """ 364 | 365 | def do_task(self, obj, url, directory, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None): 366 | filepath = ".git/objects/%s/%s" % (obj[:2], obj[2:]) 367 | 368 | if os.path.isfile(os.path.join(directory, filepath)): 369 | printf("[-] Already downloaded %s/%s\n", url, filepath) 370 | else: 371 | 
            response = self.session.get(
372 |                 "%s/%s" % (url, filepath),
373 |                 allow_redirects=False,
374 |                 timeout=timeout,
375 |             )
376 |             printf(
377 |                 "[-] Fetching %s/%s [%d]\n",
378 |                 url,
379 |                 filepath,
380 |                 response.status_code,
381 |             )
382 |
383 |             valid, error_message = verify_response(response)
384 |             if not valid:
385 |                 printf(error_message, url, filepath, file=sys.stderr)
386 |                 return []
387 |
388 |             abspath = os.path.abspath(os.path.join(directory, filepath))
389 |             create_intermediate_dirs(abspath)
390 |
391 |             # write file
392 |             with open(abspath, "wb") as f:
393 |                 f.write(response.content)
394 |
395 |         abspath = os.path.abspath(os.path.join(directory, filepath))
396 |         # parse object file to find other objects
397 |         obj_file = dulwich.objects.ShaFile.from_path(abspath)
398 |         return get_referenced_sha1(obj_file)
399 |
400 |
401 | def sanitize_file(filepath):
402 |     """ Inplace comment out possibly unsafe lines based on regex """
403 |     assert os.path.isfile(filepath), "%s is not a file" % filepath
404 |
405 |     UNSAFE = r"^\s*fsmonitor|sshcommand|askpass|editor|pager"
406 |
407 |     with open(filepath, 'r+') as f:
408 |         content = f.read()
409 |         modified_content = re.sub(UNSAFE, r"# \g<0>", content, flags=re.IGNORECASE | re.MULTILINE)  # MULTILINE so ^ anchors each line
410 |         if content != modified_content:
411 |             printf("Warning: '%s' file was altered\n" % filepath)
412 |             f.seek(0)
413 |             f.write(modified_content)
414 |
415 |
416 | def fetch_git(url, directory, jobs, retry, timeout, http_headers, client_cert_p12=None, client_cert_p12_password=None):
417 |     """ Dump a git repository into the output directory """
418 |
419 |     assert os.path.isdir(directory), "%s is not a directory" % directory
420 |     assert jobs >= 1, "invalid number of jobs"
421 |     assert retry >= 1, "invalid number of retries"
422 |     assert timeout >= 1, "invalid timeout"
423 |
424 |     session = requests.Session()
425 |     session.verify = False
426 |     session.headers = http_headers
427 |     if client_cert_p12:
428 |         session.mount(url, Pkcs12Adapter(pkcs12_filename=client_cert_p12, pkcs12_password=client_cert_p12_password))
429 |     else:
430 |         session.mount(url, requests.adapters.HTTPAdapter(max_retries=retry))
431 |     if os.listdir(directory):
432 |         printf("Warning: Destination '%s' is not empty\n", directory)
433 |
434 |     # find base url
435 |     url = url.rstrip("/")
436 |     if url.endswith("HEAD"):
437 |         url = url[:-4]
438 |     url = url.rstrip("/")
439 |     if url.endswith(".git"):
440 |         url = url[:-4]
441 |     url = url.rstrip("/")
442 |
443 |     # check for /.git/HEAD
444 |     printf("[-] Testing %s/.git/HEAD ", url)
445 |     response = session.get(
446 |         "%s/.git/HEAD" % url,
447 |         timeout=timeout,
448 |         allow_redirects=False
449 |     )
450 |     printf("[%d]\n", response.status_code)
451 |
452 |     valid, error_message = verify_response(response)
453 |     if not valid:
454 |         printf(error_message, url, "/.git/HEAD", file=sys.stderr)
455 |         return 1
456 |     elif not re.match(r"^(ref:.*|[0-9a-f]{40}$)", response.text.strip()):
457 |         printf(
458 |             "error: %s/.git/HEAD is not a git HEAD file\n",
459 |             url,
460 |             file=sys.stderr,
461 |         )
462 |         return 1
463 |
464 |     # set up environment to ensure proxy usage
465 |     environment = os.environ.copy()
466 |     configured_proxy = socks.getdefaultproxy()
467 |     if configured_proxy is not None:
468 |         proxy_types = {socks.PROXY_TYPE_HTTP: "http", socks.PROXY_TYPE_SOCKS4: "socks4h", socks.PROXY_TYPE_SOCKS5: "socks5h"}  # PySocks constant -> URL scheme
469 |         environment["ALL_PROXY"] = f"{proxy_types[configured_proxy[0]]}://{configured_proxy[1]}:{configured_proxy[2]}"
470 |
471 |     # check for directory listing
472 |     printf("[-] Testing %s/.git/ ", url)
473 |     response = session.get("%s/.git/" % url,
allow_redirects=False) 474 | printf("[%d]\n", response.status_code) 475 | 476 | 477 | if ( 478 | response.status_code == 200 479 | and is_html(response) 480 | and "HEAD" in get_indexed_files(response) 481 | ): 482 | printf("[-] Fetching .git recursively\n") 483 | process_tasks( 484 | [".git/", ".gitignore"], 485 | RecursiveDownloadWorker, 486 | jobs, 487 | args=(url, directory, retry, timeout, http_headers), 488 | ) 489 | 490 | os.chdir(directory) 491 | 492 | printf("[-] Sanitizing .git/config\n") 493 | sanitize_file(".git/config") 494 | 495 | printf("[-] Running git checkout .\n") 496 | subprocess.check_call(["git", "checkout", "."], env=environment) 497 | return 0 498 | 499 | # no directory listing 500 | printf("[-] Fetching common files\n") 501 | tasks = [ 502 | ".gitignore", 503 | ".git/COMMIT_EDITMSG", 504 | ".git/description", 505 | ".git/hooks/applypatch-msg.sample", 506 | ".git/hooks/commit-msg.sample", 507 | ".git/hooks/post-commit.sample", 508 | ".git/hooks/post-receive.sample", 509 | ".git/hooks/post-update.sample", 510 | ".git/hooks/pre-applypatch.sample", 511 | ".git/hooks/pre-commit.sample", 512 | ".git/hooks/pre-push.sample", 513 | ".git/hooks/pre-rebase.sample", 514 | ".git/hooks/pre-receive.sample", 515 | ".git/hooks/prepare-commit-msg.sample", 516 | ".git/hooks/update.sample", 517 | ".git/index", 518 | ".git/info/exclude", 519 | ".git/objects/info/packs", 520 | ] 521 | process_tasks( 522 | tasks, 523 | DownloadWorker, 524 | jobs, 525 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 526 | ) 527 | 528 | # find refs 529 | printf("[-] Finding refs/\n") 530 | tasks = [ 531 | ".git/FETCH_HEAD", 532 | ".git/HEAD", 533 | ".git/ORIG_HEAD", 534 | ".git/config", 535 | ".git/info/refs", 536 | ".git/logs/HEAD", 537 | ".git/logs/refs/heads/main", 538 | ".git/logs/refs/heads/master", 539 | ".git/logs/refs/heads/staging", 540 | ".git/logs/refs/heads/production", 541 | ".git/logs/refs/heads/development", 542 | ".git/logs/refs/remotes/origin/HEAD", 543 | ".git/logs/refs/remotes/origin/main", 544 | ".git/logs/refs/remotes/origin/master", 545 | ".git/logs/refs/remotes/origin/staging", 546 | ".git/logs/refs/remotes/origin/production", 547 | ".git/logs/refs/remotes/origin/development", 548 | ".git/logs/refs/stash", 549 | ".git/packed-refs", 550 | ".git/refs/heads/main", 551 | ".git/refs/heads/master", 552 | ".git/refs/heads/staging", 553 | ".git/refs/heads/production", 554 | ".git/refs/heads/development", 555 | ".git/refs/remotes/origin/HEAD", 556 | ".git/refs/remotes/origin/main", 557 | ".git/refs/remotes/origin/master", 558 | ".git/refs/remotes/origin/staging", 559 | ".git/refs/remotes/origin/production", 560 | ".git/refs/remotes/origin/development", 561 | ".git/refs/stash", 562 | ".git/refs/wip/wtree/refs/heads/main", 563 | ".git/refs/wip/wtree/refs/heads/master", 564 | ".git/refs/wip/wtree/refs/heads/staging", 565 | ".git/refs/wip/wtree/refs/heads/production", 566 | ".git/refs/wip/wtree/refs/heads/development", 567 | ".git/refs/wip/index/refs/heads/main", 568 | ".git/refs/wip/index/refs/heads/master", 569 | ".git/refs/wip/index/refs/heads/staging", 570 | ".git/refs/wip/index/refs/heads/production", 571 | ".git/refs/wip/index/refs/heads/development" 572 | ] 573 | 574 | process_tasks( 575 | tasks, 576 | FindRefsWorker, 577 | jobs, 578 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 579 | ) 580 | 581 | # find packs 582 | printf("[-] Finding packs\n") 583 | tasks = [] 584 | 585 | # use 
.git/objects/info/packs to find packs 586 | info_packs_path = os.path.join( 587 | directory, ".git", "objects", "info", "packs" 588 | ) 589 | if os.path.exists(info_packs_path): 590 | with open(info_packs_path, "r") as f: 591 | info_packs = f.read() 592 | 593 | for sha1 in re.findall(r"pack-([a-f0-9]{40})\.pack", info_packs): 594 | tasks.append(".git/objects/pack/pack-%s.idx" % sha1) 595 | tasks.append(".git/objects/pack/pack-%s.pack" % sha1) 596 | 597 | process_tasks( 598 | tasks, 599 | DownloadWorker, 600 | jobs, 601 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 602 | ) 603 | 604 | # find objects 605 | printf("[-] Finding objects\n") 606 | objs = set() 607 | packed_objs = set() 608 | 609 | # .git/packed-refs, .git/info/refs, .git/refs/*, .git/logs/* 610 | files = [ 611 | os.path.join(directory, ".git", "packed-refs"), 612 | os.path.join(directory, ".git", "info", "refs"), 613 | os.path.join(directory, ".git", "FETCH_HEAD"), 614 | os.path.join(directory, ".git", "ORIG_HEAD"), 615 | ] 616 | for dirpath, _, filenames in os.walk( 617 | os.path.join(directory, ".git", "refs") 618 | ): 619 | for filename in filenames: 620 | files.append(os.path.join(dirpath, filename)) 621 | for dirpath, _, filenames in os.walk( 622 | os.path.join(directory, ".git", "logs") 623 | ): 624 | for filename in filenames: 625 | files.append(os.path.join(dirpath, filename)) 626 | 627 | for filepath in files: 628 | if not os.path.exists(filepath): 629 | continue 630 | 631 | with open(filepath, "r") as f: 632 | content = f.read() 633 | 634 | for obj in re.findall(r"(^|\s)([a-f0-9]{40})($|\s)", content): 635 | obj = obj[1] 636 | objs.add(obj) 637 | 638 | # use .git/index to find objects 639 | index_path = os.path.join(directory, ".git", "index") 640 | if os.path.exists(index_path): 641 | index = dulwich.index.Index(index_path) 642 | 643 | for entry in index.iterobjects(): 644 | objs.add(entry[1].decode()) 645 | 646 | # use packs to find more objects to fetch, and objects that are packed 647 | pack_file_dir = os.path.join(directory, ".git", "objects", "pack") 648 | if os.path.isdir(pack_file_dir): 649 | for filename in os.listdir(pack_file_dir): 650 | if filename.startswith("pack-") and filename.endswith(".pack"): 651 | pack_data_path = os.path.join(pack_file_dir, filename) 652 | pack_idx_path = os.path.join( 653 | pack_file_dir, filename[:-5] + ".idx" 654 | ) 655 | pack_data = dulwich.pack.PackData(pack_data_path) 656 | pack_idx = dulwich.pack.load_pack_index(pack_idx_path) 657 | pack = dulwich.pack.Pack.from_objects(pack_data, pack_idx) 658 | 659 | for obj_file in pack.iterobjects(): 660 | packed_objs.add(obj_file.sha().hexdigest()) 661 | objs |= set(get_referenced_sha1(obj_file)) 662 | 663 | # fetch all objects 664 | printf("[-] Fetching objects\n") 665 | process_tasks( 666 | objs, 667 | FindObjectsWorker, 668 | jobs, 669 | args=(url, directory, retry, timeout, http_headers, client_cert_p12, client_cert_p12_password), 670 | tasks_done=packed_objs, 671 | ) 672 | 673 | # git checkout 674 | printf("[-] Running git checkout .\n") 675 | os.chdir(directory) 676 | sanitize_file(".git/config") 677 | 678 | # ignore errors 679 | subprocess.call( 680 | ["git", "checkout", "."], 681 | stderr=open(os.devnull, "wb"), 682 | env=environment 683 | ) 684 | 685 | return 0 686 | 687 | 688 | def main(): 689 | parser = argparse.ArgumentParser( 690 | usage="git-dumper [options] URL DIR", 691 | description="Dump a git repository from a website.", 692 | ) 693 | parser.add_argument("url", 
metavar="URL", help="url") 694 | parser.add_argument("directory", metavar="DIR", help="output directory") 695 | parser.add_argument("--proxy", help="use the specified proxy") 696 | parser.add_argument("--client-cert-p12", help="client certificate in PKCS#12") 697 | parser.add_argument("--client-cert-p12-password", help="password for the client certificate") 698 | parser.add_argument( 699 | "-j", 700 | "--jobs", 701 | type=int, 702 | default=10, 703 | help="number of simultaneous requests", 704 | ) 705 | parser.add_argument( 706 | "-r", 707 | "--retry", 708 | type=int, 709 | default=3, 710 | help="number of request attempts before giving up", 711 | ) 712 | parser.add_argument( 713 | "-t", 714 | "--timeout", 715 | type=int, 716 | default=3, 717 | help="maximum time in seconds before giving up", 718 | ) 719 | parser.add_argument( 720 | "-u", 721 | "--user-agent", 722 | type=str, 723 | default="Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0", 724 | help="user-agent to use for requests", 725 | ) 726 | parser.add_argument( 727 | "-H", 728 | "--header", 729 | type=str, 730 | action="append", 731 | help="additional http headers, e.g `NAME=VALUE`", 732 | ) 733 | args = parser.parse_args() 734 | 735 | # jobs 736 | if args.jobs < 1: 737 | parser.error("invalid number of jobs, got `%d`" % args.jobs) 738 | 739 | # retry 740 | if args.retry < 1: 741 | parser.error("invalid number of retries, got `%d`" % args.retry) 742 | 743 | # timeout 744 | if args.timeout < 1: 745 | parser.error("invalid timeout, got `%d`" % args.timeout) 746 | 747 | # header 748 | http_headers = {"User-Agent": args.user_agent} 749 | if args.header: 750 | for header in args.header: 751 | tokens = header.split("=", maxsplit=1) 752 | if len(tokens) != 2: 753 | parser.error( 754 | "http header must have the form NAME=VALUE, got `%s`" 755 | % header 756 | ) 757 | name, value = tokens 758 | http_headers[name.strip()] = value.strip() 759 | 760 | # proxy 761 | if args.proxy: 762 | proxy_valid = False 763 | 764 | for pattern, proxy_type in [ 765 | (r"^socks5:(.*):(\d+)$", socks.PROXY_TYPE_SOCKS5), 766 | (r"^socks4:(.*):(\d+)$", socks.PROXY_TYPE_SOCKS4), 767 | (r"^http://(.*):(\d+)$", socks.PROXY_TYPE_HTTP), 768 | (r"^(.*):(\d+)$", socks.PROXY_TYPE_SOCKS5), 769 | ]: 770 | m = re.match(pattern, args.proxy) 771 | if m: 772 | socks.setdefaultproxy(proxy_type, m.group(1), int(m.group(2))) 773 | socket.socket = socks.socksocket 774 | proxy_valid = True 775 | break 776 | 777 | if not proxy_valid: 778 | parser.error("invalid proxy, got `%s`" % args.proxy) 779 | 780 | # output directory 781 | if not os.path.exists(args.directory): 782 | os.makedirs(args.directory) 783 | 784 | if not os.path.isdir(args.directory): 785 | parser.error("`%s` is not a directory" % args.directory) 786 | 787 | # client certificate 788 | if args.client_cert_p12: 789 | if not os.path.exists(args.client_cert_p12): 790 | parser.error( 791 | "client certificate `%s` does not exist" % args.client_cert_p12 792 | ) 793 | 794 | if not os.path.isfile(args.client_cert_p12): 795 | parser.error( 796 | "client certificate `%s` is not a file" % args.client_cert_p12 797 | ) 798 | 799 | if args.client_cert_p12_password is None: 800 | parser.error("client certificate password is required") 801 | 802 | urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) 803 | 804 | # fetch everything 805 | sys.exit( 806 | fetch_git( 807 | args.url, 808 | args.directory, 809 | args.jobs, 810 | args.retry, 811 | args.timeout, 812 | http_headers, 813 | args.client_cert_p12, 
814 |             args.client_cert_p12_password
815 |         )
816 |     )
817 |
818 |
819 | if __name__ == "__main__":
820 |     main()
821 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools>=42"]
3 | build-backend = "setuptools.build_meta"
4 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | PySocks
2 | requests
3 | beautifulsoup4
4 | dulwich
5 | requests-pkcs12
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | name = git-dumper
3 | version = 1.0.8
4 | author = Maxime Arthaud
5 | author_email = maxime@arthaud.me
6 | license = MIT
7 | description = A tool to dump a git repository from a website
8 | long_description = file: README.md
9 | long_description_content_type = text/markdown
10 | url = https://github.com/arthaud/git-dumper
11 | keywords = dump, git, repository, security, vulnerability, ctf
12 | classifiers =
13 |     Development Status :: 5 - Production/Stable
14 |     License :: OSI Approved :: MIT License
15 |     Topic :: Security
16 |
17 | [options]
18 | py_modules = git_dumper
19 | python_requires = >=3.6
20 | install_requires =
21 |     PySocks
22 |     requests
23 |     beautifulsoup4
24 |     dulwich
25 |     requests-pkcs12
26 |
27 | [options.entry_points]
28 | console_scripts =
29 |     git-dumper = git_dumper:main
30 |
--------------------------------------------------------------------------------
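
For completeness, here is a minimal sketch of driving the dumper from Python instead of the `git-dumper` command line, using the `fetch_git` function defined in `git_dumper.py` above. The target URL, output directory and User-Agent are hypothetical placeholders; the numeric arguments mirror the defaults that `main()` passes for `--jobs`, `--retry` and `--timeout`.

```
# Minimal usage sketch (not part of the repository): call fetch_git() directly.
# The URL, output directory and User-Agent below are hypothetical placeholders.
import os
import sys

import urllib3

from git_dumper import fetch_git

if __name__ == "__main__":
    # main() also disables this warning, since the sessions use verify=False
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    url = "http://website.example/.git"          # hypothetical target
    directory = os.path.expanduser("~/website")  # hypothetical output directory
    os.makedirs(directory, exist_ok=True)        # fetch_git() expects an existing directory

    status = fetch_git(
        url,
        directory,
        jobs=10,    # default of --jobs
        retry=3,    # default of --retry
        timeout=3,  # default of --timeout
        http_headers={"User-Agent": "Mozilla/5.0"},  # placeholder user agent
    )
    sys.exit(status)
```

The `if __name__ == "__main__"` guard matters here because `fetch_git` spawns worker processes with `multiprocessing`. Also note that it changes the working directory and runs `git checkout .` on the dumped repository, so the remote-code-execution caveat from the README disclaimer applies equally to this sketch.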