# Exploring Git: From git init to a KV store


#### This is a talk I gave at [ColumbusRB](http://columbusrb.com); the slides can be found [here](http://slides.com/bobbygrayson/deck-1/live#/)

## This eventually evolved into [this gem](https://www.github.com/ybur-yug/gkv)

## Why?
It was a good excuse to get to know git's innards a bit better, as well as work on something
that, while somewhat useless, is technically functional and interesting.

## For Who?
Anyone with a casual knowledge of git shouldn't get too lost and will hopefully learn something.
Ambitious beginners are more than welcome, and [tweet me](https://www.twitter.com/yburyug) if something comes up
that you think could make it better :)

## Beginning
```bash
mkdir git_exploration
cd git_exploration
git init
```

This gets us a git repo created. Let's check out what we have:

```
ls -a
.  ..  .git

```

And if we check out the `.git` subdirectory with `tree`:

```
$ tree .git

.git
├── branches
├── config
├── description
├── HEAD
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── prepare-commit-msg.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

9 directories, 13 files
```

A note: you may need to install `tree` depending on your OS. On Ubuntu, I used

`sudo apt-get install tree`

I imagine it is about the same on Mac with `brew`.
I have no idea on Windows, as I barely know how to list
a directory in PowerShell (sorry).

Okay, so this doesn't look too crazy. Let's open up some of the stuff we have on initialization in
here.

`.git/info/exclude`
```
# git ls-files --others --exclude-from=.git/info/exclude
# Lines that start with '#' are comments.
# For a project mostly in C, the following would be a good set of
# exclude patterns (uncomment them if you want to use them):
# *.[oa]
# *~
```

Okay, so it appears that this opens up with the command that the system would use to govern this
behaviour. Knowing a bit about git, one can reasonably infer that this works in tandem
with the `.gitignore` file that one can use to ignore certain files. The difference is that
patterns in here stay local to this clone and are never committed.

`.git/config`
```
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
```
It would appear this is just some general configuration for a boilerplate initialized repo.

`.git/description`
```
Unnamed repository; edit this file 'description' to name the repository.
```
Here it seems we can name our little project.

`.git/HEAD`
```
ref: refs/heads/master
```
This is `HEAD` itself: a symbolic reference to the branch we currently have checked out.

#### HEAD
HEAD is a reference to the last commit in the currently checked-out branch. It usually gets there
indirectly, by naming a branch ref as above. In a brand-new repo that branch has no commits yet, so
`refs/heads/master` does not exist until we make the first commit.
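
To make the two shapes of `HEAD` concrete, here is a minimal Ruby sketch. The helper name `parse_head` is hypothetical (it is not part of git or of the gem), and it assumes `HEAD` holds either a `ref: ...` line, as above, or a bare commit hash, which is what git writes when you check out a commit directly (a "detached HEAD").

```ruby
# Hypothetical helper: interpret the text found in .git/HEAD.
# Assumes the file holds either "ref: <refname>" or a bare commit hash.
def parse_head(contents)
  line = contents.strip
  if line.start_with?("ref: ")
    # Symbolic ref: HEAD names a branch, and the branch names a commit.
    { type: :symbolic, ref: line.delete_prefix("ref: ") }
  else
    # Detached HEAD: the file contains the commit hash itself.
    { type: :detached, sha: line }
  end
end

head = parse_head("ref: refs/heads/master\n")
puts head[:ref]  # prints refs/heads/master
```

In a real repo you would feed it `File.read(".git/HEAD")` instead of a string literal.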

## Adding A File

```bash
echo "# Git Exploration" > README.md
git add README.md
git commit -m 'initial commit'
```

Once we do this, we can check out a new directory structure:

```
$ tree .git
.git
├── branches
├── COMMIT_EDITMSG
├── config
├── description
├── HEAD
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── prepare-commit-msg.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   └── update.sample
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 1b
│   │   └── f567a9cee63cd3036628c1519b818461905b27
│   ├── 9d
│   │   └── aeafb9864cf43055ae93beb0afd6c7d144bfa4
│   ├── c1
│   │   └── 2d7c0ed49ad9c7aa938743ba6fdee54b6b7fe1
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

15 directories, 21 files
```

It appears we have some simple additions from adding one file. To start, `.git` now includes a
`logs` directory. We also have several subdirectories inside of our
`objects` directory now, each containing a file named after the rest of a hash. The refs subdirectory
`heads` now includes a `master` file, and we have also added `COMMIT_EDITMSG` and `index` at the root level of `.git`.

The three hashes in our `objects` directory represent 3 data structures git utilizes. These are
a `blob`, a `tree`, and a `commit`. We will go into these more in-depth later.

If we examine `COMMIT_EDITMSG` we see:

```
initial commit

```

Logging our commit message.
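
If you are curious which of the three files is the `blob`, the `tree`, and the `commit`, a rough Ruby sketch for peeking inside one follows. It assumes git's loose-object layout (zlib-compressed bytes of the form `<type> <size>\0<body>`); `decode_loose_object` is a hypothetical helper name, and the sketch builds a stand-in object in memory so it runs without a repo.

```ruby
require "zlib"

# Hypothetical helper: split a loose object into type, size, and body.
# Assumes the layout "<type> <size>\0<body>", compressed with zlib.
def decode_loose_object(compressed)
  raw = Zlib::Inflate.inflate(compressed)
  # Only split on the first NUL; tree bodies contain more of them.
  header, body = raw.split("\0", 2)
  type, size = header.split(" ")
  { type: type, size: size.to_i, body: body }
end

# Stand-in object built in memory. For a real one, pass in
# File.binread(".git/objects/9d/aeafb9864cf43055ae93beb0afd6c7d144bfa4").
fake = Zlib::Deflate.deflate("blob 5\0test\n")
obj = decode_loose_object(fake)
puts "#{obj[:type]} (#{obj[:size]} bytes)"  # prints "blob (5 bytes)"
```

In day-to-day use, `git cat-file -t <hash>` answers the same question without any of this.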



## Making A Branch
Let's create a new branch to further expand this interesting `.git` directory.

```bash
git checkout -b my_feature_branch
```

What this does is use the `git checkout` command and the `-b` flag to create and check out a new branch
named whatever follows `-b`. We have created a branch called `my_feature_branch`. The reason I have
called it a feature branch specifically is that this is a common flow for managing an application's
development with multiple authors. Let's see what changed:

```bash
$ tree .git
.git
├── branches
├── COMMIT_EDITMSG
├── config
├── description
├── HEAD
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── prepare-commit-msg.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   └── update.sample
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           ├── master
│           └── my_feature_branch
├── objects
│   ├── 1b
│   │   └── f567a9cee63cd3036628c1519b818461905b27
│   ├── 9d
│   │   └── aeafb9864cf43055ae93beb0afd6c7d144bfa4
│   ├── c1
│   │   └── 2d7c0ed49ad9c7aa938743ba6fdee54b6b7fe1
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   ├── master
    │   └── my_feature_branch
    └── tags

15 directories, 23 files
```

Now, if you look at `.git/refs/heads/` we can see we have added `my_feature_branch`. If we
look at our `HEAD` files, we will see an addition to them as well.

`.git/logs/HEAD`
```
... 1433706112 -0400 commit (initial): initial commit
... 1433796852 -0400 checkout: moving from master to my_feature_branch
```

`.git/HEAD`
```bash
ref: refs/heads/my_feature_branch
```

It has logged our checkout and pointed us at the new branch. It is also worth noting we have not created
any new objects. This is one of the finer pieces of git: a branch is just a tiny file holding a commit hash,
so creating one copies nothing, unlike saving `my_documentv1`, `my_documentv2`, `my_documentvN` etc.

Let's add another commit by creating a directory in here with a file in it.

```
mkdir test && echo "test" > test/file.txt
cd test
git status
# => ./
```

Okay, let's add this directory and commit. Any files will do here: feel free to substitute a generated
project from anything like Volt, Rails, Django, or Meteor. It doesn't really matter for our studies here.

```bash
cd ..
git add test
git commit -m 'add file + dir'
```

Now, let us further check out our changes in the git file tree:

```
.git
├── branches
├── COMMIT_EDITMSG
├── config
├── description
├── HEAD
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── prepare-commit-msg.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   └── update.sample
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           ├── master
│           └── my_feature_branch
├── objects
│   ├── 1b
│   │   └── f567a9cee63cd3036628c1519b818461905b27
│   ├── 2b
│   │   └── 297e643c551e76cfa1f93810c50811382f9117
│   ├── 5e
│   │   └── c1f4ac6015a50b5d8462582d7ae50d7029d012
│   ├── 70
│   │   └── cc10cfcc770f6b0ea11cdd9a876ee1a3184d77
│   ├── 9d
│   │   └── aeafb9864cf43055ae93beb0afd6c7d144bfa4
│   ├── c1
│   │   └── 2d7c0ed49ad9c7aa938743ba6fdee54b6b7fe1
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   ├── master
    │   └── my_feature_branch
    └── tags

18 directories, 26 files
```

Now, if we look at `COMMIT_EDITMSG`

```
add file + dir
```

And again it is our latest message. The other major change is that we have three new objects
(`2b`, `5e`, and `70`).
Just to see what happens, let's check out master and see if anything changes:

```bash
git checkout master
```

We get the same tree, but we can check out our `HEAD` item in the `.git` directory.

So we have the same thing, but our `HEAD` file reads:

```
ref: refs/heads/master
```

So we can now see this is our constant anchor as we navigate changes.

## Objects
```bash
$ find .git/objects
.git/objects/pack
.git/objects/info
.git/objects/b7
.git/objects/b7/37ff03e6f22c28bc4786f4b11925f2d864e00
...
.git/objects/4c
.git/objects/4c/2be36223ca4d07cbd7ce8c28419ba1c4144334
```

Here we see a butt ton of what look like SHA-1 hashes. So what is git doing with all of
these?

Let's create a clean directory and start an empty git repo in it.

`cd .. && mkdir git_testing && cd git_testing`

Initialize a repository

`git init`

And now, let's try creating one of these hash objects. We do this with the git command `hash-object`.
If we simply use some bash, we can do this without even needing a file. We will pipe
an echoed statement into the `hash-object` command through stdin and receive our hash on stdout.

`echo 'test content' | git hash-object -w --stdin`

```bash
d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

So, it took the string 'test content', hashed it, and spat the SHA-1 back out. The `-w` flag also
writes the object into the store; without it we would only get the hash. Cool.

We should examine this further.

`echo "test" > test.txt`

`git hash-object -w test.txt`

`vim test.txt`
```txt
test
test 2
```

If we save this change and run the command again:

`git hash-object -w test.txt`

`6375a2690c50e28c8c351fc552e2fd8a24b01031`

And if we check out our objects directory we can now see what git has done:

```bash
bobby@bobdawg-devbox:~/code/git_test/test$ find .git/objects/ -type f
.git/objects/9d/aeafb9864cf43055ae93beb0afd6c7d144bfa4
.git/objects/63/75a2690c50e28c8c351fc552e2fd8a24b01031
```

It has a hash for each of our objects saved. Woo! But wait. We haven't committed anything. How is git
tracking all this?

Well, it turns out `hash-object -w` writes straight into git's object store, no commit needed. Git
prepends a small header (the type and size) to the content, takes the SHA-1 of the result, and stores a
compressed copy under that hash. So each of these objects is a complete snapshot of
some blob of our data at a given state. Git only squeezes them down to differences later, when it packs objects.

If we dive in to do the reverse of this, we can look up our input using git's `cat-file` command, which
takes a hash (the `-p` flag pretty-prints the content).

`git cat-file -p 6375a2690c50e28c8c351fc552e2fd8a24b01031`

```
test
test 2
```

Now, if we make another change on this, we will be able to see the new version.

`vim test.txt`
```
test "one"
```

If we delete everything and replace it with this line, then do our typical:

`git hash-object -w test.txt`

We get a new hash, which when called with `cat-file` will output the new value,
while our old object is still kept in the object store.

What this really is at its core is a key:value store. Using this, we can build
a very simple database that relies on a single pair of key/value types (symbol, string)
to store any data we need and look it up. So, let's move on.

## Aside - Git: A Directed Acyclic Graph
In the broadest of terms, git is a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). This sounds quite fancy and/or
scary depending on how hard in the paint you go with mathematics, but it truly isn't that crazy.
Let's ignore Wikipedia's terse entry, and instead break it down on our own. Directed means every
edge points one way (a commit points at its parent, never the reverse), and acyclic means that
following those edges never loops back to where you started.

## Storage
In its most basic state, git functions to make one of these graphs connecting a series of objects. These
objects also have a handful of types.

### Types

#### Blob
A `blob` is a blob of bytes. It usually is a file, but can also be a symlink or a myriad of other things.
It is all simply semantics as long as there is a pointer to the `blob`.

#### Tree
Directories are represented by a `tree` object. They refer to `blobs` and other `trees`. When one of these nodes
(a `tree` or a `blob`, in this case) points to another in the graph, it *depends* on that node. It is a connection
that cannot be broken. Git can garbage collect (`git gc`), check the store (`git fsck`), and a myriad of other things,
but we do not truly need to know more than this: a node that nothing refers to
is essentially useless, as it is disconnected.
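
To make the "key" half of this key:value idea less magical, here is a small Ruby sketch of where a blob's hash comes from. It assumes git's blob format, a header of `blob <size>` plus a NUL byte followed by the content; `blob_id` is a hypothetical helper name, not a git or gem API.

```ruby
require "digest/sha1"

# Hypothetical helper: the ID git would assign to a blob with this content.
# Assumes the header format "blob <size in bytes>\0" in front of the content.
def blob_id(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

# echo appends a newline, so these mirror the earlier shell commands.
puts blob_id("test content\n")  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
puts blob_id("test\n")          # 9daeafb9864cf43055ae93beb0afd6c7d144bfa4
```

Because the key is derived purely from the content, hashing the same bytes twice lands on the same object, and any change to the bytes produces a brand-new key.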

#### Commit
A `commit` refers to a `tree` that represents the state of a group of `blobs` at the time of that
commit. It also refers to some number `X` of other commits that are its parents. More than one parent means a merge,
no parent means an initial commit, and a single parent just means it's a regular old commit. As we saw earlier,
the body of a commit is its message.

#### Refs
Refs have two functions: storing `HEAD`s, and `branches`. They are essentially notes left on a given
node. These notes can be moved around freely, aren't stored in history, and aren't directly transferred
between repositories. They are simply a means to namespace 'I am working here'.

#### Visualizing It
![A Typical remote/local DAG](http://eagain.net/articles/git-for-computer-scientists/git-history.6.dot.svg)

As you can see, these nodes form a graph of history shared between master and a remote, with a few merges
thrown in (any of the nodes with 2 parents).

## Git as a Key:Value Store

### Note: Do not use this for real software

### Addendum: Apparently [crates.io](http://crates.io) does this, and those guys are wicked smart, so maybe it's a good idea, but definitely not at the capacity we are building

Since the `cat-file` and `hash-object` pattern functions simply as a key:value store for git, we
can utilize this to our advantage. Normally, storing large strings in-memory in Ruby can get quite
taxing, but if we simply store the string of the SHA-1 hash under a given key, we can greatly reduce
the memory footprint of our master dictionary and allow it to grow far larger in size (theoretically).
So, let's code up a pseudo-class for this and fill it in after we get that far.

`vim git_database.rb`
```ruby
module GitDatabase
  class Database
    def initialize
      # set initializers and master dictionary
    end

    def set
      # set a given key to a value
    end

    def get
      # get a given key's value
    end

    def hash_object
      # hash a given input that is coerced to a string
    end

    def cat_file
      # cat out a given file based on SHA-1 hash
    end
  end
end
```

We can now tackle this piece by piece.

#### Initializers

`vim git_database.rb`
```ruby
...
  class Database
    attr_accessor :items

    def initialize
      @items = {}
      `git init`
    end
...

```

Simple enough. We ensure we have a git repository initialized, and we ensure that we set up our
master dictionary.

#### Hashing
`vim git_database.rb`
```ruby
...
    def hash_object(string)
      # What do we do?
    end
...
```

Well, to start, let's fire up irb and see what we can do calling git from Ruby.

```
irb
irb(main):001:0> string = "test"
=> "test"
irb(main):002:0> `echo #{string}`
=> "test\n"
irb(main):003:0> `echo #{string} | git hash-object -w --stdin`
=> "9daeafb9864cf43055ae93beb0afd6c7d144bfa4\n"
irb(main):004:0> `echo #{string} | git hash-object -w --stdin`.strip!
=> "9daeafb9864cf43055ae93beb0afd6c7d144bfa4"
```

So, it appears we can essentially call exactly what we were calling before. We can now reasonably change the
function to be:

```ruby
...
    def hash_object(data)
      # strip rather than strip!, which returns nil when there is nothing to strip
      `echo #{data} | git hash-object -w --stdin`.strip
    end
...
```

And this will get that blob hashed up and stored for us.
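
One caveat with the backtick version: `data` passes through the shell, so quotes, `;`, or `$` in a value get interpreted rather than stored, and `echo` tacks a newline onto everything. A hedged alternative, assuming `git` is on your `PATH` and you are inside a repo, is to hand the data to git over stdin with Ruby's standard `Open3` library. The name `hash_object_safely` is hypothetical and not what the gem uses.

```ruby
require "open3"

# Hypothetical variant of hash_object that bypasses the shell entirely.
# Assumes `git` is on the PATH and the current directory is inside a repo.
def hash_object_safely(data)
  # Arguments are passed as a list, so nothing in `data` is shell-interpreted,
  # and stdin_data delivers the bytes exactly as given (no added newline).
  out, status = Open3.capture2("git", "hash-object", "-w", "--stdin",
                               stdin_data: data.to_s)
  raise "git hash-object failed" unless status.success?
  out.strip
end

# Usage, from inside a repo:
#   hash_object_safely("it's a value; with $pecial characters")
```

The rest of this walkthrough sticks with the simpler backtick version, which is fine for toy values like the ones used here.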
Now, notice we get the exact hash here, but if
we do a

`find .git/objects -type f`

and look at a sampling of what we get:

```
.git/objects/e4/ea753518a47496350473b8eb0972ad2985d964
```

You might notice that `objects` has subdirectories of seemingly random 2-letter combos. These are the first 2
characters of the hash; git splits them off so that no single directory ends up holding every object. So, if looking in the git directory for hashes,
you must prepend the parent directory's name to the file name to get the entire SHA-1.

#### Cattin'
Since the prior method returns us a hash directly, we can use the same command as earlier and interpolate.

```ruby
...
    def cat_file(hash)
      `git cat-file -p #{hash}`
    end
...
```

And now we just need a way to map keys to the hashes we have saved.

#### Set
```ruby
...
    def set
      # get key, data
      # hash data
      # save key to SHA-1 hash in @items
    end
...
```

This is a reasonably fleshed-out idea of a simple set implementation. So, first we need to take in a key and a value:

```ruby
...
    def set(key, value)
      # store the SHA-1 hash (not the value itself) under the key
      @items[key.to_s] = hash_object(value.to_s)
    end
...
```

And now, we can move on to a get implementation.

#### Get
To get, we have a little more to do. We will have a key, and that gets us a SHA-1 hash. However,
a hash cannot be reversed, so we still need to look up the content stored under it using our `cat_file` function. So, if we pseudocode this out:

```ruby
...
    def get
      # find hash by key
      # cat-file hash
    end
...
```

So, with our functions already set up we can simply go in and do this:

```ruby
...
    def get(key)
      cat_file(@items[key.to_s])
    end
...

```

And now, we have a finished class that can function as a reasonable minimal database. Consider
it an equally scrappy but more interesting version of the good ol' CSV store.

### Accessing Object History
Currently we are only returning the latest version of a given item. However, we have already stored it at every
state it has ever been hashed. So, if we were to add in some functionality for grabbing versions, it would be
quite simple.

```ruby
module GitDatabase
  class Database
    attr_accessor :items

    def initialize
      `git init`
      @items = {}
    end

    def set(key, value)
      # each key maps to a list of SHA-1 hashes, oldest first
      (@items[key.to_s] ||= []) << hash_object(value)
    end

    def get(key)
      # the most recent hash is the current value
      cat_file(@items[key.to_s].last)
    end

    def get_version(key, version)
      # 0 = oldest, higher numbers = newer
      cat_file(@items[key.to_s][version])
    end

    def versions(key)
      @items[key.to_s].count
    end

    private

    def hash_object(data)
      `echo #{data.to_s} | git hash-object -w --stdin`.strip
    end

    def cat_file(hash)
      `git cat-file -p #{hash}`
    end
  end
end
```

Now, we can do something like:

```ruby
db = GitDatabase::Database.new
db.set("Apples", "12")
db.get("Apples")
# => "12\n" (echo appended the newline)
db.set("Apples", "10")
db.get_version("Apples", 0)
# => "12\n"
db.get("Apples")
# => "10\n"
```

And to use this, we can make a very simple Sinatra API to take input remotely:

```ruby
...
# below the class
require 'sinatra'
require 'json'

DB = GitDatabase::Database.new

post '/set' do
  begin
    DB.set(params['key'], params['value'])
    { result: 'ok' }.to_json
  rescue StandardError
    { error: 'please send key and value parameters' }.to_json
  end
end

get '/get/:key' do
  { result: DB.get(params['key']) }.to_json
end
```

This is a very simple wrapper, but it gives the general idea of where you could take this with a toy application.

## Happy Hacking, and check out [Gkv](http://github.com/ybur-yug/gkv) to actually use this in a small app