├── .gitignore
├── LICENSE
├── README.md
├── Slides
    ├── README.txt
    ├── iStock_engine.jpg
    └── iStock_matrix.jpg
└── git_log.py


/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

#Ipython Notebook
.ipynb_checkpoints

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2011, Glen Jarvis, LLC.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of Glen Jarvis, LLC nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL GLEN JARVIS, LLC BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Exploring Git internals using Python
## Let's write `git log` in Python

Git is a powerful tool for source control. It's often misunderstood and abused.
Under the surface Git is an elegant and simple data structure. When you don't
understand that data structure, you don't really understand Git. It is flexible
enough to give you all the rope that you need to hang yourself in Git hell.
However, if you understand it, you are released from Git hell.

Python is an elegant programming language heavily influenced by ABC "a teaching
language, a replacement for BASIC...." [1] It's a perfect tool that looks like
pseudo-code but executes. However, even with its simplicity, it is one of the
most powerful programming languages that exists. It is a perfect language to
document and run the Git data structure as we explore it.

In this talk, we start with a simple explanation of the Git data structure on
disk. We use Python to read those data structures and reconstruct a `git log`
command for any arbitrary git repository. When finished, we should have our
own working command that does the same thing as `git log` for any arbitrary
repository, on any branch. We'll simply start at `HEAD` and work our way down
the data structure.

Although it is not *useful* to have a Python version of Git, it is *fun*. Also,
this exploration helps you understand the Git tool on a much deeper level. When
you can program something, you can understand it. And, understanding Git helps
you be a better developer and collaborator.

[1] http://python-history.blogspot.com/2009/02/early-language-design-and-development.html

## About the Speaker

Glen Jarvis has been programming Python for over 8 years and has been
programming in different languages for over twenty years. He has been certified
in Linux/Unix administration by UC-Berkeley. Before that, he gained the highest
certification available for Informix database administration and supported
administrators. He is also certified in MongoDB as developer and administrator.
He is currently working on his AWS certification.

He has worked for companies such as IBM, UC-Berkeley, Sprint and many Silicon
Valley Start-ups. He has worked in the fields of Databases, Data Science,
Bioinformatics and Web Technologies.
He has been working exclusively in DevOps
for the past year.

Glen has been working for almost three years at RepairPal, a successful start-up
that gives you free estimates for what your car repair *should* cost [1]. He is
currently putting the "Dev" in "DevOps" using Ansible (and Ruby). He additionally
owns a consulting and training company, Glen Jarvis, LLC, that mentors budding
programmers. Some of his training videos include How to create a free AWS
instance, Ansible Hands-On Training, and An introduction to Test Driven
Development. He has also been an open source contributor and a member and
co-organizer of the Bay Area Python Interest Group (BayPIGgies) [2].

[1] http://repairpal.com/

[2] http://baypiggies.net/



## PyBay

This talk was given on 20-Aug at PyBay: http://www.pybay.com/


## BayPIGgies / Silicon Valley Python MeetUp

This talk was given on 24-June as a collaboration between Silicon Valley Python MeetUp [1] and the Bay Area Python Interest Group (BayPIGgies) [2].

[1] http://www.meetup.com/silicon-valley-python/

[2] http://baypiggies.net/


## Slides

https://docs.google.com/presentation/d/1d1x2FsYEGsmZ662USFCloG4Aad1rXaXdvFHWgFu-clY


## Videos

June, 2016: Bay Area Python Interest Group (BayPIGgies): http://baypiggies.net
https://www.youtube.com/watch?v=CB9p8n3gugM

## Disclaimer

The code in this repository was meant to be a toy example. I should have
embraced that and ignored handling GPG signatures when parsing the commit (we do
successfully handle that case but it added complexity).

We did not, however, handle the complexity of properly handling all history for
merges. We naively just picked one parent (even if there were two like one
would see in a merge). This means we skip one branch of history.
In other words, imagine this scenario in a new directory:

```
mkdir git_demo
cd git_demo
git init
touch 1
git add 1
git commit -m "Add 1"
git branch branch1
touch 2
git add 2
git commit -m "Add 2"
git checkout branch1
touch 3
git add 3
git commit -m "Add 3"
git checkout main
git merge branch1
git log
```

Then the "Add 2" commit wouldn't be shown in the history even though all files
would be present. This is currently intentional. I only wish I was disciplined
enough to also ignore the GPG case (which I did handle).

--------------------------------------------------------------------------------
/Slides/README.txt:
--------------------------------------------------------------------------------
Slides are available from this link:

https://docs.google.com/presentation/d/1d1x2FsYEGsmZ662USFCloG4Aad1rXaXdvFHWgFu-clY/edit?usp=sharing

--------------------------------------------------------------------------------
/Slides/iStock_engine.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/glenjarvis/explore-git-internals/4177cd0b249e7e4fcba071851f452febe2d5b7f2/Slides/iStock_engine.jpg
--------------------------------------------------------------------------------
/Slides/iStock_matrix.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/glenjarvis/explore-git-internals/4177cd0b249e7e4fcba071851f452febe2d5b7f2/Slides/iStock_matrix.jpg
--------------------------------------------------------------------------------
/git_log.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# Allow print() to have parentheses:
# pylint: disable=C0325

"""
Supporting code for the "Explore Git internals using
Python" talk:

    http://www.meetup.com/silicon-valley-python/events/228160092/

For simplicity in presentation of the code, we will keep these in a
single file.

The ParsedCommit class is provided and it is a subclass of dictionary.
This allows one to take the raw format of a commit and process it as in
this example:

Example raw commit:

    > tree 0ea3ee5e56e3123de49422ac3315b1cee3d74910
    > parent 3d5f868982cc7ccf4dddcfd14560d1f25507dc1d
    > parent 39f0875dfc705ced8250155e61801554198e0d5f
    > author Glen Jarvis 1465164241 -0700
    > committer Glen Jarvis 1465164241 -0700
    >
    > Merge pull request #4 from glenjarvis/get_branch_commit

    >>> from git_log import ParsedCommit
    >>> ParsedCommit(raw_commit)

    {'author': 'Glen Jarvis ',
     'author_datetime': datetime.datetime(2016, 6, 5, 15, 4, 1,
                                          tzinfo=),
     'committer': 'Glen Jarvis ',
     'committer_datetime': datetime.datetime(2016, 6, 5, 15, 4, 1,
                                             tzinfo=),
     'message': 'Merge pull request #4 from glenjarvis/get_branch_commit\n\n
                 Get commit pointed to by branch pointed to by HEAD',
     'parent': '39f0875dfc705ced8250155e61801554198e0d5f',
     'tree': '0ea3ee5e56e3123de49422ac3315b1cee3d74910'}


You can copy this single file to any location. Although, you may need to
ensure you have any packages installed not supplied by the standard
library.

If you wish to use this as a command, ensure the PATH environment
variable points to the directory where this is contained.
For example, if this is placed in the $HOME/bin directory and one is using
the bash shell:

    export PATH=$HOME/bin:$PATH
"""

import datetime
import os
import pytz
import subprocess


# E.g.: Tue Jun 21 00:57:59 2016 -0700
DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def swap_sign(sign):
    """Given a sign (+ or -), return opposite"""

    assert sign in ["+", "-"]
    if sign == "+":
        return "-"
    if sign == "-":
        return "+"


def epoch_to_utc(epoch_seconds):
    """Given seconds from the Epoch, return tz aware UTC datetime"""

    return pytz.utc.localize(
        datetime.datetime.utcfromtimestamp(float(epoch_seconds))
    )


def zone_from_offset(offset_string):
    """Given Git offset string, return pytz timezone

    Example:
        >> offset_string = "-0700"
        >> time_zone = zone_from_offset(offset_string)
        >> type(time_zone)


    Note: The swapped sign "GMT+7" vs "-0700" is intentional:

        https://en.wikipedia.org/wiki/ISO_8601#Time_offsets_from_UTC

    Feedback regarding the philosophy of GMT, UTC, and Zulu here
    and suggestions for more clarity appreciated.
    """
    # Only the sign and the hours are used; the minutes part of the
    # offset (e.g. the "30" in "+0530") is ignored.
    sign = offset_string[0]
    hour = offset_string[1:3]
    return pytz.timezone("Etc/GMT{0}{1}".format(swap_sign(sign),
                                                int(hour)))


class GitError(RuntimeError):
    """A Git Error Exception"""


class ParsedCommit(dict):
    """Commit Parser used to parse raw commit messages

    There are only two sections of a raw commit. The first is headers
    and the second is a message typed by users. Three states are needed
    because multi-line header values (e.g., gpgsig) may be encountered
    while parsing the headers.
    """

    # PyLint, it's okay that this has so many public methods
    # pylint: disable=R0904

    HEADERS_STATE = 1
    GPGSIG_HEADER_STATE = 2
    MESSAGE_TEXT_STATE = 3

    def __init__(self, raw_commit, sha):
        self.message = []
        self.gpgsig = []
        self.raw_commit = raw_commit
        dict.__init__(self)
        self.current_state = self.HEADERS_STATE
        self.parse_commit()
        self["commit"] = sha

    def start_gpg_parse(self, line):
        """Take gpg start line from raw commit, fix state

        Take gpg start line from raw commit (a multi-lined value) which
        looks similar to the following subset of a potential commit message:

            gpgsig -----BEGIN PGP SIGNATURE-----
            Version: GnuPG v1

            iQIcBAABAgAGBQJXVJ/8AAoJEJF24vsEed5c44IP/R+4nVUKjRBmEyEnmWEI9NuA
            [snip]
            =bnDB
            -----END PGP SIGNATURE-----

        Return a tuple of values:
          * The new state `GPGSIG_HEADER_STATE`
          * The first line of the PGP signature
        """
        _, first_line = line.split(" ", 1)
        return (self.GPGSIG_HEADER_STATE, first_line)

    def handle_headers_state(self, line):
        """Handle the HEADERS_STATE when receiving line of data

        Assume the HEADERS_STATE is the current state in the tiny
        parsing state machine. Expect input to be in key value format
        (with a space between key and value) as in this example:

            tree 0ea3ee5e56e3123de49422ac3315b1cee3d74910

        This input is parsed and stored in self (dictionary).

        HOWEVER, if the line encountered is a `gpgsig` key instead of
        `tree` or others, then switch to the GPGSIG_HEADER_STATE state
        and begin parsing the first line of the multi-line value output.
        The gpgsig key is only present when commits are signed (not the
        default case).
        """

        if line.startswith("gpgsig"):
            self.current_state, pgp_header = self.start_gpg_parse(line)
            self.gpgsig.append(pgp_header)
        # `elif` keeps the gpgsig start line out of the generic
        # key/value branch below
        elif len(line) == 0:
            self.current_state = self.MESSAGE_TEXT_STATE
        else:
            key, value = line.split(" ", 1)
            self[key] = value

    def handle_message_text_state(self, line):
        """Handle the MESSAGE_TEXT_STATE when receiving line of data

        Assume the MESSAGE_TEXT_STATE is the current state in the tiny
        parsing state machine. Expect input to be lines of text (a
        description of the commit).
        """
        self.message.append(line)

    def handle_gpgsig_header_state(self, line):
        """Handle the GPGSIG_HEADER_STATE when receiving line of data

        Assume the special GPGSIG_HEADER_STATE is the current state in
        the tiny parsing state machine. Expect lines to be lines of text
        (the multi-line PGP signature) and transition back to the
        HEADERS_STATE when all lines have been found.
        """

        if "END PGP SIGNATURE" in line:
            self.current_state = self.HEADERS_STATE
        self.gpgsig.append(line.strip())

    def update_field_datetime(self, field):
        """Given field name update post-parsing

        Some fields (i.e., author|committer) contain date-time
        information. For example:
            > Glen Jarvis 1466463764 -0700

        Process field, moving the date time information to its own field
        and converting it to an aware datetime.
        For example:

            'author': 'Glen Jarvis '
            'author_datetime': datetime.datetime(
                2016, 6, 20, 16, 2, 44, tzinfo=)
        """
        components = self[field].split()
        epoch, offset = components[-2:]
        # Remove time component from field:
        self[field] = " ".join(components[:-2])

        utc_date_time = epoch_to_utc(epoch)
        time_zone = zone_from_offset(offset)

        # Add new field {field}_datetime:
        self["{0}_datetime".format(field)] =\
            utc_date_time.astimezone(time_zone)

    def update_timestamps(self):
        """Update author and committer fields after basic commit is parsed"""

        self.update_field_datetime("author")
        self.update_field_datetime("committer")

    def parse_commit(self):
        """Parse raw commit as given by `git cat-file -p`

        Given a raw large string that represents the raw commit as given by
        a `git cat-file -p ` command, return a dictionary of values
        for the headers, including the commit message as typed by the user.

        Potential keys for dictionary are: author, committer, gpgsig,
        parent, tree and message.
        """

        for line in self.raw_commit.split("\n"):
            if self.current_state == self.HEADERS_STATE:
                self.handle_headers_state(line)
            elif self.current_state == self.GPGSIG_HEADER_STATE:
                self.handle_gpgsig_header_state(line)
            elif self.current_state == self.MESSAGE_TEXT_STATE:
                self.handle_message_text_state(line)

        if self.gpgsig:
            self["gpgsig"] = "\n".join(self.gpgsig)

        self["message"] = "\n".join(self.message)
        self.update_timestamps()
        return self


def check_base_case(cwd, potential):
    """Stop looking if .git not found at /

    Check the current working directory (cwd) to see if it is at the
    root of the filesystem.
    If it is, and there is not a .git directory there, then we have
    reached the top of the hierarchy and have to stop looking.
    """
    if cwd == "/" and not os.path.exists(potential):
        print("fatal: Not a git repository (or any of the parent " +
              "directories): .git")
        exit(128)


def git_root(cwd=None):
    """Return the full path to the .git directory"""

    if cwd is None:
        cwd = os.getcwd()
    git_dir = os.path.join(cwd, ".git")

    if os.path.exists(git_dir):
        return git_dir
    else:
        check_base_case(cwd, git_dir)
        return git_root(os.path.dirname(cwd))


def big_head(root=None):
    """Return contents of git HEAD variable"""

    if root is None:
        root = git_root()

    with open(os.path.join(root, "HEAD"), "r") as head_file:
        head = head_file.read().strip()

    return head


def parse_head(head_contents):
    """Given contents of HEAD, return branch head path relative to .git

    Example head_contents include:
        ref: refs/heads/main

    Given the above example, the following would be returned:
        refs/heads/main
    """

    if head_contents.startswith("ref: "):
        return head_contents.replace("ref: ", "", 1)
    else:
        return head_contents


def branch_head_filename():
    """Return the full path to the branch filename pointed to by HEAD"""

    return os.path.join(git_root(), parse_head(big_head()))


def branch_head():
    """Return commit being referenced by branch referenced by HEAD"""

    with open(branch_head_filename(), "r") as branch_head_file:
        head = branch_head_file.read().strip()

    return head


def get_commit_contents(commit):
    """Return commit contents for commit"""

    # universal_newlines=True returns text (str) rather than bytes on
    # Python 3, which the line-based parser expects
    output = subprocess.check_output(["git", "cat-file", "-p", commit],
                                     universal_newlines=True)
    return ParsedCommit(output, commit)


def print_formatted_commit(commit):
    """Print commit in roughly same format Git does by default"""

    print("commit {0}".format(commit["commit"]))
    print("Author:\t{0}".format(commit["author"]))
    print("Date:\t{0}".format(commit["author_datetime"].strftime(DATE_FORMAT)))
    print("\n")
    for line in commit["message"].split("\n"):
        print("    {0}".format(line))
    print("\n")


def git_log():
    """Equivalent of `git log`"""
    current = get_commit_contents(branch_head())
    print_formatted_commit(current)
    while "parent" in current:
        current = get_commit_contents(current["parent"])
        print_formatted_commit(current)


if __name__ == "__main__":
    git_log()

--------------------------------------------------------------------------------
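The README's Disclaimer notes that `git_log.py` follows only one parent of a merge, so one branch of history is skipped. As a rough sketch of what following every parent could look like (not part of the original talk code): `parent_shas`, `walk_history`, the `read_raw_commit` callable and the `FAKE_COMMITS` data below are hypothetical names introduced only for illustration, and the short "SHAs" are made-up stand-ins for real 40-character hashes.

```python
def parent_shas(raw_commit):
    """Return every `parent` SHA found in the headers of a raw commit."""
    parents = []
    for line in raw_commit.split("\n"):
        if line == "":
            # A blank line ends the headers; the commit message follows.
            break
        if line.startswith("parent "):
            parents.append(line.split(" ", 1)[1])
    return parents


def walk_history(start_sha, read_raw_commit):
    """Yield each reachable commit SHA exactly once, following all parents.

    `read_raw_commit` is a hypothetical callable mapping a SHA to raw
    commit text (for example, a wrapper around `git cat-file -p`).
    """
    seen = set()
    pending = [start_sha]
    while pending:
        sha = pending.pop()
        if sha in seen:
            # Already reached through another parent; do not visit twice.
            continue
        seen.add(sha)
        yield sha
        pending.extend(parent_shas(read_raw_commit(sha)))


# Made-up stand-ins for the README scenario (Add 1, Add 2, Add 3, merge).
FAKE_COMMITS = {
    "add1": "tree t1\n\nAdd 1\n",
    "add2": "tree t2\nparent add1\n\nAdd 2\n",
    "add3": "tree t3\nparent add1\n\nAdd 3\n",
    "merge": "tree t4\nparent add2\nparent add3\n\nMerge branch 'branch1'\n",
}

if __name__ == "__main__":
    for sha in walk_history("merge", FAKE_COMMITS.__getitem__):
        print(sha)
```

With this traversal the "Add 2" commit from the README scenario is visited as well. The commits come out in traversal order, not sorted by date the way `git log` presents them, so a fuller version would still need to order the results by commit time.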